Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe


Formal Metadata

Title
Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe
Number of Parts
542
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We manage a 200TB glusterfs cluster in production. While managing it, we learnt some key points. In this session, we will share with you:
- The minimal health checks that are needed for a glusterfs volume to ensure high availability and consistency.
- The problems we experienced with the current cluster expansion steps (rebalance) in glusterfs, how we managed to avoid the need for rebalancing data for our use case, and a proof of concept for a new rebalance algorithm for the future.
- How we schedule our maintenance activities such that we never have downtime, even if things go wrong.
- How we reduced the time to replace a node from weeks to a day.
As the number of clients increased we had to scale the system to handle the increasing load; here are our learnings scaling glusterfs:
- How to profile glusterfs to find performance bottlenecks.
- Why the client-io-threads feature didn't work for us, and how we improved applications to achieve 4x throughput by scaling mounts instead.
- How to improve the incremental heal speed, and the patches contributed to upstream.
- The road map for glusterfs based on these findings.
Transcript: English(auto-generated)
Please welcome Sanju and Pranith. Thank you guys. Good morning guys. I'm Sanju and he's Pranith. We work at PhonePe.
Today we are going to discuss the lessons we learned while managing a GlusterFS cluster at this scale, some of the problems we have faced, and the solutions we came up with. PhonePe is the leading Indian digital payments and technology company,
headquartered in Bangalore, India, and it uses the Unified Payments Interface, which was introduced by the Government of India. So in India, if you are thinking of any payment, you can do it using the PhonePe app. This is how our PhonePe app home screen looks.
We see 800k RPS on our edge layer every day and we do 130 million daily transactions. This generates lots of records and documents that we have to store, and,
as per the regulations in India, we have to store all of them in India only. So PhonePe has a private cloud where we store all these things, and we need a service to store and retrieve files from that cloud. We have developed a service called DocStore,
which writes data to GlusterFS and fetches data from GlusterFS. So coming to the question: why did we choose GlusterFS? We didn't want a metadata server, because we have lots of small files and we didn't want to store all that metadata, and GlusterFS has no metadata server.
So we went ahead with it, and our team had earlier success with the GlusterFS project, so they were confident that GlusterFS would work for our use case. So here we are. This is the data flow to and from GlusterFS. All the traffic is fronted by a CDN and the request is forwarded to Nginx, and
Nginx sends the request to the API gateway. The API gateway can choose to store or retrieve a file itself, or it can send the request to a backend service. Now if the backend service wants to store a
file or fetch a file, it can be a POST or GET request, that is, it can store or it can retrieve. It sends the request to DocStore. DocStore then stores or retrieves the data from the GlusterFS servers. DocStore also uses
Elasticsearch to store some of the metadata, Aerospike to store the auth-related info and some of the rate limiting features, and RMQ for asynchronous jobs like deletions and batch operations. And this is our team.
Today's agenda is an introduction to GlusterFS, and then we will discuss the different problems that we have faced, the solutions that we are using, and some proposals as a roadmap. What is GlusterFS? GlusterFS is a distributed file system.
That means whenever you do a write, the data is distributed across multiple servers. These servers have directories, which we call bricks, and this is where the data actually gets stored.
So this is a typical GlusterFS server. Each server can have multiple bricks. The bricks have an underlying file system where the data is stored, and in the root partition we store the GlusterFS configuration. This is how a 3x3 GlusterFS volume looks. When I say 3x3,
the first three is about distribution. What is a mount point? We can mount a GlusterFS volume on any machine over the network and read and write from that machine. Now, from the client where the mount happens,
whenever a write comes, it is distributed across three sub-volumes based on the hash range allocation. We will talk more about the hash range in the coming slides.
The other three is the replica count: the data is replicated three times. So whenever a write comes, it chooses one of the sub-volumes, and a sub-volume is a replica set. Here it is three, so the write is replicated thrice.
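(For reference, a 3x3 volume like the one described can be created roughly as follows; the volume name, hostnames and brick paths here are invented, and the three servers are assumed to already be in one trusted pool. With replica 3, every consecutive group of three bricks on the command line becomes one replica set, giving three distribute sub-volumes of three replicas each.)

    gluster volume create docs replica 3 \
        server1:/bricks/b1/docs server2:/bricks/b1/docs server3:/bricks/b1/docs \
        server1:/bricks/b2/docs server2:/bricks/b2/docs server3:/bricks/b2/docs \
        server1:/bricks/b3/docs server2:/bricks/b3/docs server3:/bricks/b3/docs
    gluster volume start docs
    mount -t glusterfs server1:/docs /mnt/docs    # FUSE mount on any client machine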
Over to Pranith. Hello. Yeah, so let's look at some numbers that we see at PhonePe for the DocStore service, and therefore for GlusterFS. In a day, we see about 4.3 million uploads and 9 million downloads,
with a peak upload RPS of 200 and a peak download RPS of 800. The aggregate upload size per day is just 150 GB, not a lot, but the download size is 2.5 TB. So it is a completely read-heavy workload, and this is after a CDN is fronting it. That means only when the file is not available in the CDN
does the call come to GlusterFS; the file is pulled onto the CDN and then served. And this is how the RPS (requests per second) is distributed throughout the day.
The uploads are reasonably uniform from 6 a.m. to 5 in the evening, then taper off for the rest of the day, whereas the downloads follow a bimodal distribution with one peak at around 12 p.m. and another at around 7 p.m. The latencies are a function of the size of the file.
So the upload (POST) latencies have a mean of about 50 ms and a p99 of around 250 ms. Similarly, for GETs the mean is around 10 ms and the p99 is around 100 ms.
Let's look at the configuration that we use at PhonePe for GlusterFS. We have 30 nodes in the cluster. Each node contributes two bricks, and each brick is 10 TB backed by a ZFS pool.
So 30 times 20 TB, that is, 600 TB of raw capacity, and we use replica 3, so the usable size is 200 TB, out of which 130 TB is in use at the moment. Let's now go to the problems that we faced and how we solved them. I'll start off with the capacity expansion problem that we solved.
Then Sanju will take over and talk about the data migration problem that we solved. I'll talk about how we debug performance issues and how we solved problems using that method. Then Sanju will finish with the maintenance activities that we do to prevent problems.
Before we talk about the capacity expansion problem, let's try to understand a bit about the distribution. The data is distributed across the servers based on hashes.
In this diagram we have three distribute sub-volumes, and each sub-volume is a replica 3. When you create a directory, the copy of that directory in each of these three replica sets gets a hash range, and whenever you create a file or try to read a file, GlusterFS
computes the hash of the file name, figures out which of these directories in the three sub-volumes holds that hash range, and gets or stores the file on that node. So for folks who are well versed with databases, this is a lot like sharding,
but the entity that is getting sharded here is the directory, based on the file names. All right, so files can have varying sizes; for example, in our setup the minimum size is less than a KB but the maximum is around 26 GB. So you will run into the problem where
some of the shards, or distribute sub-volumes, fill up before the others, and you need to handle that as well. There is a feature in GlusterFS called min-free-disk: once a brick hits that level, then the next time you create the directory, the hash range will not be allocated to the sub-volumes that have crossed the threshold. So for example here,
even though there are three distribute sub-volumes, data is going to only two, because the middle one has crossed the threshold. So the hash range is distributed between the remaining two, 50% and 50%, instead of the one third each you would expect normally.
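(The option behind this is cluster.min-free-disk; the value below is only an example and the volume name is a placeholder.)

    gluster volume set docs cluster.min-free-disk 10%   # per the behaviour described above, sub-volumes below 10% free space stop receiving new hash ranges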
So let's talk about the actual process of increasing the capacity and why it didn't work for us. When you want to increase the capacity, that is, bring in more distribute sub-volumes or shards, the way you do it is: first you run gluster peer probe,
which brings the new machines into the cluster. Then you do another operation called add-brick, which adds the new bricks to your volume. Then you have to run gluster volume rebalance to redistribute the data among the nodes equally.
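(In command form, the standard expansion flow is roughly the following; host and volume names are placeholders, and adding three bricks to a replica-3 volume forms one new replica set, that is, one new distribute sub-volume.)

    gluster peer probe new-server1
    gluster peer probe new-server2
    gluster peer probe new-server3
    gluster volume add-brick docs \
        new-server1:/bricks/b1/docs new-server2:/bricks/b1/docs new-server3:/bricks/b1/docs
    gluster volume rebalance docs start      # the step that caused the problems described next
    gluster volume rebalance docs status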
So what are the problems that we faced? When we did the benchmark, the rebalance had an application latency impact of up to 25 seconds in some cases, and as I mentioned, most of our p99 latencies are just in milliseconds. So this would be like a partial timeout, a partial outage for us.
So this is not going to work for us. The other thing we noticed is that for large volumes the rebalance may take up to months, and at the moment GlusterFS rebalance does not have pause and resume, so we can't restrict the maintenance activity to off-peak hours.
That is one more problem. The other one we have seen is about the data migration itself: when you go from one distribute sub-volume, or shard, to two, you would expect 50% of the data to be transferred. That's all right. But when you go from 9 shards (distribute sub-volumes) to 10, you want to migrate only about 10% of the data,
yet GlusterFS still transfers about 30% to 40%, irrespective of the number of sub-volumes. So the rebalance itself may take so much time with our
workload that by the time we want to do the next capacity expansion, the previous rebalance may not even have completed. So that is also not going to work for us. These are the three main problems that we have seen. This is the solution that we are using now, and then there is a proposal as well.
Since we know that the hash range allocation is based on both the number of sub-volumes and the number of sub-volumes with free space, what we are doing in our DocStore application is that every night we create
new directories. The directory structure is something like the namespace the clients are going to use, slash year, slash month, slash day. So each day you create new directories, and based on the space that is available, only the sub-volumes that have space get a hash range allocation.
So you never really run into the problem of having to rebalance that much, because we have seen that with our workloads reads are distributed uniformly, and as we have seen it is a read-heavy workload and writes are just a few, so we were okay with this solution in the interim.
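(A minimal sketch of that nightly job, assuming a FUSE mount at /mnt/docs and an invented client namespace; DocStore does this internally, so the paths here are purely illustrative.)

    MOUNT=/mnt/docs
    NS=invoices                                               # hypothetical client-facing namespace
    mkdir -p "${MOUNT}/${NS}/$(date -d tomorrow +%Y/%m/%d)"   # pre-create the next day's directory

(Because the directory is brand new each day, its hash ranges are handed out according to the free space available at that moment, which steers new writes away from full sub-volumes.)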
But long term, the solution that we have proposed, which is yet to be accepted but for which we did a proof of concept, is that if you use jump consistent hashing instead of the current hashing, then when you go from 9 sub-volumes to 10, only about 10 percent of the data gets rebalanced.
So that is what we want to get to, and it is something we are focusing on this year. All right, over to you Sanju. Yeah, so let's look at the problems that we faced while migrating data.
So we had a use case where we wanted to move the complete data present on one server to another server. In GlusterFS the standard way of doing this is to use the replace-brick operation. When you do a replace-brick operation,
there is a process called the self-heal daemon which copies all the data present on the old server to the new server. To copy 10 TB of data it takes around two to three weeks, which is a huge amount of time. We wanted to reduce this time, so we came up with a new approach.
Let us understand a few aspects of GlusterFS before we jump to the solution, so that we understand our approach better. The write flow in GlusterFS is something like this. Whenever a write comes, based on the hash range allocation Pranith just spoke about, it chooses one of the sub-volumes.
The data goes to all the replicas, that is, all the servers in that sub-volume. Now let's say we have chosen replica set 0:
the write goes to all the machines in that sub-volume. It is client-side replication, so the client sends the write to all the machines and waits for the success responses to come back. The client assumes the write is successful only when a quorum number of
success responses has come. Now let's say one of the nodes is down, in our case server 2; it can be the node being down, or the brick process being unhealthy and unresponsive at times. So something happened:
the write came to one of the sub-volumes and went to all three replica servers, but server 2 didn't respond with a success response. Server 1 and server 3 responded with success, so the client assumes the write is successful. Now when server 2 is back up,
to keep the data consistent, server 2 should get the data it missed while it was down. Who takes care of this job? It is the shd. The shd, the self-heal daemon, is a daemon process which reads the pending heal data, that is, whatever data was missing; we call it a pending heal.
It reads from one of the good copies; in our case server 1 and server 3 are the good copies and server 2 is the bad copy, so the shd reads the data from one of the good copies and writes it to
server 2. So server 2 has all the data once the self-heal finishes healing. We will use this as part of our approach as well. Our approach is: we kill the brick which we want to migrate; say we want to migrate from server 3 to server 4.
So we have to copy all the data, right? And self-heal takes two to three weeks. In our case we kill that brick, and since we are using the ZFS file system,
we take a ZFS snapshot and transfer this snapshot from server 3 to server 4, that is, from the old server to the new server, and then we perform the replace-brick operation. While we are performing the replace-brick operation, server 4, the new server,
already has all the data which server 3 had. Once the replace-brick operation is performed, server 4 is part of the sub-volume, and the heals take place from server 1 and server 2 to server 4. So now we have reduced the amount of data that we are healing.
Previously we were copying all the data, around 10 TB, from server 3 to server 4, but here we are healing only the data which came in after killing the brick and
before doing the replace-brick operation. So the data we heal is reduced hugely. With this approach it now takes only around 50 hours to complete: if we are using spinning disks it takes 48 hours to transfer the
snapshot of 10 TB and 2 hours for healing the data, but it is only 8 to 9 hours if we are using SSDs; with SSDs it takes about 8 hours to transfer the snapshot and around 40 minutes to complete the heals. So we came
from two to three weeks down to one or two days, or nine hours we can say. We are using the netcat utility to transfer the snapshot; it gave us very good performance, roughly a 60 percent improvement, and we have in-flight checksums at both ends, on the old server and on the new server,
so that we can check whether we transferred the snapshot correctly and did not lose any data, and it reduces the time. I have kept the exact commands that we used in this link, and we also have a rollback plan.
So let's say we have started this activity but we haven't performed the replace-brick yet, because once the replace-brick is performed the sub-volume will already have server 4 as a part of it. Before we perform the replace-brick,
that is, while we are at this point, if we decide we don't want to do this anymore, all we need to do is start the volume with force so that the brick process we killed comes back up. Once it is up, the shd copies the data from the good copies to
the bad copy, the old server, so that we have consistent data across all our replicated servers. Yeah, it is that easy, and we want to popularize this method so that it helps the community.
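(The exact commands used are in the link mentioned above; as a rough sketch only, with invented dataset, volume and host names and an arbitrary netcat port, the flow looks something like this.)

    # On the old server: find and kill only the brick process being migrated
    gluster volume status docs                        # note the PID of old-server:/tank/b1/docs
    kill <brick-pid>
    # Stream a ZFS snapshot of the brick dataset to the new server (run checksums on both ends)
    nc -l 9000 | zfs receive tank/b1                  # run on the new server first
    zfs snapshot tank/b1@migrate                      # on the old server
    zfs send tank/b1@migrate | nc new-server 9000
    # Swap the brick; shd then heals only the writes that arrived after the kill
    gluster volume replace-brick docs old-server:/tank/b1/docs new-server:/tank/b1/docs commit force
    # Rollback (only valid before the replace-brick): restart the killed brick and let shd heal it
    gluster volume start docs force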
Yeah, over to Pranith. We will now talk about the performance issues that we faced and how we solved them. This is the graph that we saw in our prod setup while doing this migration,
when something happened that we did not account for. The latencies shot up to one minute here, and I've said that they are supposed to be only milliseconds, so this is horrible; there was about two hours of partial outage because of this. So let's see how these things can be debugged and fixed. We have a method called gluster volume profile in
GlusterFS. What you do is you start profiling on the volume, then you run your benchmark or whatever your workload is, and then you keep executing gluster volume profile info incremental.
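(For reference, the profiling workflow is just the following; the volume name is a placeholder.)

    gluster volume profile docs start
    # ... run the benchmark or wait for the suspect workload ...
    gluster volume profile docs info incremental    # per-brick stats since the previous 'info' call
    gluster volume profile docs stop                # when you are done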
It keeps giving you stats about what is happening to the volume during that time. For each of the bricks in the volume you get an output like this, where for that interval (in this case interval nine), for each block size you see the number of reads and writes that came in, and for all of the internal file operations
you see on the volume you get the number of calls, the latency distribution (min, max, average latency), and the percentage of latency taken by each file operation internally. What we saw when this ZFS issue happened is that the lookup call was taking more than a second, which is not what we generally see, so we knew something was happening
during the lookup operation. So we ran strace on the brick and found that there is one internal directory (.glusterfs/indices) where listing just three entries was taking 0.35
seconds. So imagine this: you do ls, it shows you just three entries, but it takes 0.35 seconds and sometimes even a second. After looking at this we found that ZFS has this behaviour where if you create a lot of files in one directory,
like millions, and then you delete most of them, an ls can take up to a second. This bug has been open for more than two years, I think, so we didn't know whether ZFS would fix the issue anytime soon. So in GlusterFS we patched it by caching this information, so that
we don't have to keep doing this operation. Now you wouldn't see this if you are using any of the latest GlusterFS releases. But yeah, this is one issue that we found and fixed.
The second issue is about increasing the RPS that we get from the volume. There was a
new application getting launched at the time, and the RPS they wanted was more than what we were giving: they wanted something like 300 to 360 RPS,
but when we did the benchmark we were getting only around 250 RPS. So we wanted to figure out what was happening. We ran benchmarks on the prod cluster itself and saw that one of the threads was getting saturated. There is a feature in
GlusterFS called client-io-threads, where multiple threads take the responsibility of sending requests over the network, so we thought let's just enable it and it will solve all our problems. We enabled it and it made things worse: from 250 RPS it went down. So we realized there is a contention problem on the client side that we are yet to fix. For now, what we
did is, on the DocStore containers where we were doing only one mount, we now do three mounts and distribute the uploads and downloads over them. (Audience question.)
So the question is which GlusterFS client we are using: the answer is the FUSE client, and the thread that is saturating is the FUSE thread. What we are doing is we have created multiple mounts on the container and we distribute the load in the application itself: uploads go to all three and downloads also go to all three. That is one thing we did to solve the CPU saturation problem.
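(The multi-mount workaround is just the same volume mounted several times, each mount getting its own FUSE client process; hostname and paths below are illustrative.)

    mount -t glusterfs server1:/docs /mnt/docs-1
    mount -t glusterfs server1:/docs /mnt/docs-2
    mount -t glusterfs server1:/docs /mnt/docs-3

(The application then spreads uploads and downloads across /mnt/docs-1, -2 and -3 so that no single FUSE thread becomes the bottleneck.)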
The other thing we noticed, and this is again part of the gluster volume profile output, which tells you for each block size the number of reads and writes, is that most of the writes were coming in as 8KB. When we later looked at the DocStore Java application, we saw that the I/O block size Java was using was the default of 8KB, so we just increased
it to 128KB. These two changes combined gave us 2x to 3x the numbers, and we also increased the number of VMs that we use to mount the client, so all put together we got something like
a 10x performance improvement compared to earlier, so we are set for maybe two or three years. All right, let's now move on to health checks.
For any production cluster, some health checks are needed, so I will talk about the minimal health checks needed for a GlusterFS cluster. GlusterFS already provides POSIX health checks: there is a health thread which does a 1KB write every 15 or 30 seconds, and there is an
option to set the time interval at which you want to do this. If you set it to zero, that means you are disabling the health check; you can set it to 10 seconds or so. It sends a write and checks whether the disk is responsive enough and the brick is healthy.
If it doesn't get a response within a particular time, it kills the brick process, so that we get to know that something is wrong with that brick.
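(The interval being referred to is the storage.health-check-interval volume option, in seconds, with 0 disabling the check; for example, with a placeholder volume name:)

    gluster volume set docs storage.health-check-interval 10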
The rest of the checks are things we keep externally: we have a script and some config. The POSIX health checks are the ones that come with the GlusterFS project; the cluster health checks are ours. We have a config where we specify the number of nodes in the cluster, that is, the
expected number of nodes, and using the gluster peer status or gluster pool list command we check the number of nodes actually present in the cluster. We check whether the two are equal; if not, we raise an alert saying
that something unexpected is happening. We also check whether each node is in the Connected state. In a GlusterFS cluster the nodes can be in different states, Connected, Rejected or Disconnected, depending on how
the GlusterFS management daemon is doing. The expectation is that all nodes are in the Connected state, so we check whether the nodes are connected, and if they are not we get an alert saying
that one of the nodes is not in the Connected state. We have some health checks for the bricks as well. We keep the number of bricks present in each volume in the config, and from the gluster volume info output you get how many bricks are actually present in that volume, and we check that they are equal. Another check we
have on the bricks is that if a brick is not online, we get to know it from the gluster volume status command, and you get an alert saying that one of your bricks is down. Whenever a server or a brick is down there will be some
pending heals, and you can check the pending heals using the gluster volume heal info command; if there are any pending heals you will see entries. If the entry count is non-zero, you get an alert saying you have pending heals in your cluster,
which means something unexpected or unwanted is going on; that can be a brick down, a node down, anything. And we always log profile info incremental to our debug logs from the health check, so that whenever we see an issue, like the ones Pranith just spoke
about which can be solved by looking at the profile info output, that output is available. We always ship these to our log backup servers, and the exact commands that we are using are listed in this link.
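(A condensed sketch of such an external check script, with an invented volume name, expected counts and alerting hook; the real script and exact commands are in the link above, and the output parsing may need adjusting for a given GlusterFS version.)

    #!/bin/bash
    # Minimal cluster health checks, roughly as described in the talk.
    VOL=docs
    EXPECTED_NODES=30
    EXPECTED_BRICKS=60
    alert() { echo "ALERT: $*" >&2; }    # placeholder for the real alerting hook

    # 1. Node count and peer state
    nodes=$(gluster pool list | tail -n +2 | wc -l)
    [ "$nodes" -eq "$EXPECTED_NODES" ] || alert "expected $EXPECTED_NODES nodes, found $nodes"
    bad_peers=$(gluster pool list | tail -n +2 | awk '$NF != "Connected"')
    [ -z "$bad_peers" ] || alert "peers not in Connected state: $bad_peers"

    # 2. Brick count and brick processes online
    bricks=$(gluster volume info "$VOL" | grep -c '^Brick[0-9]')
    [ "$bricks" -eq "$EXPECTED_BRICKS" ] || alert "expected $EXPECTED_BRICKS bricks, found $bricks"
    offline=$(gluster volume status "$VOL" | awk '/^Brick/ && $(NF-1) == "N" {print $2}')
    [ -z "$offline" ] || alert "bricks offline: $offline"

    # 3. Pending heals
    pending=$(gluster volume heal "$VOL" info | awk '/Number of entries:/ {sum += $NF} END {print sum+0}')
    [ "$pending" -eq 0 ] || alert "$pending pending heal entries on $VOL"

    # 4. Keep incremental profile output around for later debugging
    gluster volume profile "$VOL" info incremental >> /var/log/gluster-profile.log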
We also have some maintenance activities, because things can go bad sometimes. We have a replica 3 setup in our production, so at any point of time a quorum number of brick processes should be up so that reads
and writes can go on smoothly. So whenever we are doing something which might take some downtime of a brick process, or which puts extra load on a particular server, we do it only on one server from each replica set, so that even if
that server goes down, or the brick process running on that server goes down, we won't have an issue, because there are two other replica servers which can serve all the reads and writes. We do a few activities in this way. One is ZFS scrubbing: ZFS scrubbing
is about checksumming the data to see whether it is in proper condition or not. We do migrations in this way as well, doing them on one server
at a time from each replica set, so that even if it is down for some time or something doesn't work out, we are in a good place. Upgrades we also do in the same manner.
We have also made some contributions. The data migration part that I spoke about is production-ready; we have used it in our production. Pranith has given some sessions covering many internals of GlusterFS, which are very useful for any GlusterFS developer who wants to learn about the many translators we have in GlusterFS. Recently we
fixed a single point of failure that was present in the geo-replication feature; it was merged upstream very recently, last week. And this year we are looking at another thing, the hashing strategy that Pranith has proposed: once it is accepted by the community,
we will take it up and develop it. Yeah, that's all we had, folks. Thank you. Yeah, I just want to let you guys know, about the production-ready part: we actually migrated in total
375 TB using the method that Sanju talked about, so it is ready and you guys can use it. I think it should work even with Btrfs, basically any file system that has a snapshot feature. Thank you guys. I think we have a few minutes for questions if you have any; otherwise you can catch us there.
(Audience question.) So the question is how do we handle a disk failure. Basically, the problem I showed you, where we had the ZFS
issue with minutes of latency, that was the first time it happened in production for us, and initially we were waiting for the machine itself to be fixed so that it would come back, and that went on for a week or so, and the amount of data
that needed to be healed became so much that it coincided with our peak hours. So the standard operating procedure we have come up with after this issue is: if a machine goes down or a disk goes down, we can just get it back online in nine
hours, so why wait? We just consider that node dead, get a new machine, do what Sanju mentioned using the ZFS snapshot migration, and bring it up. Do we have the ZFS backup somewhere? The answer is no; you have
the ZFS data on the active bricks, so you take a snapshot on one of the good active bricks and transfer that snapshot. Any other questions? I think that's it. Thank
you guys, thanks a lot.