
A 'Thin Arbiter' for glusterfs replication


Formal Metadata

Title
A 'Thin Arbiter' for glusterfs replication
Number of Parts
490
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Maintaining consistency in replication is a challenging problem involving locking of nodes, quorum checks and reconciliation of state, all of which impact performance of the I/O path if not done right. In a distributed system, a minimum of 3 nodes storing metadata is imperative to achieve consensus and prevent the dreaded split-brain state. Gluster has had solutions like the trusted 3-way replication or the '2 replica + 1 arbiter' configuration to achieve this. The latest in the series is a 'Thin Arbiter (TA)', which is more minimalist than the existing '1 arbiter', targeted at container platforms and stretch cluster deployments. A TA node can be deployed outside a gluster cluster and can be shared by multiple gluster volumes. It requires zilch storage space and does not affect I/O path latencies in the happy case. This talk describes the design, working and deployment of TA and the potential gotchas one needs to be aware of while choosing this solution. The intended audience is sysadmins/dev-ops personnel who might want to try out the thin-arbiter volume and troubleshoot any operational issues that may arise. The Thin Arbiter (TA) is different from normal arbitration logic in the sense that even if only one file is bad in one of the copies of the replica, it marks that entire replica unavailable (despite it having other files in it that are healthy), until it is healed and syncs up with the other good copy. While this might seem like a very bad idea for a highly available system, it works very well to prevent split-brains due to intermittent network disconnects rather than a whole node going off-line indefinitely. In talking about this feature, the talk will cover: an introduction to how synchronous replication in gluster works; the role of quorum in preventing split-brains; a brief description of the working of replica 3 and arbiter volumes; the basic idea behind thin-arbiter based replication; the state machine behind the thin-arbiter transaction model; and how it can be installed and used.
Transcript: English (auto-generated)
Okay. Everyone, let's welcome Ravi Shankar. Yeah. With the thin arbiter for GlusterFS. All right. Thank you, guys. Yeah, so my name is Ravi Shankar. I'm a senior software engineer with Red Hat. I've been working with GlusterFS for about seven years,
mostly on the replication component, but I've also worked on other areas of Gluster like the CLI, the Gluster daemon and the POSIX translators. So anyway, so this talk is mainly about the thin arbiter for GlusterFS replication. So the agenda for today is I'll spend the first few minutes on the first three bullet points
where I talk about what the GlusterFS architecture is and how it achieves replication using the automatic file replication or AFR translator. And then we will discuss how quorum logic is important in preventing split-brains when you're writing to files from multiple clients. Once we have an idea of these three things,
then we can actually go to the topic, which is the thin arbiter for GlusterFS replication. All right. So this is the architecture of Gluster. So on the right-hand side, you see many green boxes. They are all servers, server one to server N. Each of the servers hosts a GlusterFS process, which is composed of many translators, starting from the server translator at the top and going down to POSIX at the bottom.
So all of these servers are connected together to form a trusted storage pool, and this is what the volume is comprised of. And on the other side, you have the client, which is basically accessing the volume via different mechanisms like FUSE or NFS-Ganesha, or there's also a libgfapi binding where you can write your own application
using those bindings to access the volume. So most of the logic in GlusterFS is done by translators. So each translator has a specific job. The replication translator sits here on the client side, and it has many children. So depending on the replication factor, each client talks to the respective bricks,
and it does the replication. So the synchronous replication in Gluster is mainly client-driven, meaning the client connects to all the bricks, and the updates are sent synchronously to all the servers, and we wait for the responses from all the servers before sending back the response to the application. So it follows a strong consistency model,
unlike geo-replication where the consistency is eventual. Here, the moment you do the write, because we propagate it to all the servers, you get the writes immediately on the disk. And the writes follow a transaction model, because you also have multiple clients accessing the same file, and you need to have a transaction to prevent stale data or partial writes going from one client to one brick
and the other client to the other brick. The reads are served from one of the replicas, and the slowest brick also dominates the write performance, because we are winding the writes to all the bricks. So there's also a feature of self-healing, where when an update from the client does not go on all the bricks,
the self-healing keeps track of what files need heal, and when the brick comes back up, it automatically does the healing. So to that effect, you have CLI commands to monitor the status of the pending heals, and you also have commands to resolve split-brains in case of replica 2.
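For reference, those heal-monitoring and split-brain commands look roughly like this; "testvol" is a placeholder volume name and the file path is hypothetical:

    # list files that still have pending heals
    gluster volume heal testvol info
    # list files that are in split-brain
    gluster volume heal testvol info split-brain
    # resolve a split-brained file by picking a policy, e.g. keep the copy with the latest mtime
    gluster volume heal testvol split-brain latest-mtime /dir/file.txt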
So I was telling you that the write follows a transaction model. So there are basically five steps for a write when a client does a write. The first is the lock. So lock is when the client takes the lock on all the participating replicas. You need to do this because there are multiple clients accessing the same file, and while doing the write, you need a lock to prevent out-of-order writes. So once you get the lock,
then you do something called the pre-op. It's basically a setxattr call that's done over the wire. So you mark an extended attribute on the file saying, hey, I'm about to do the write. Let's mark something called a dirty bit. So then after that, you do the actual write operation, and if the write is successful, you clear the dirty bit, and if the write transaction fails on some of the bricks,
then the good bricks will actually mark another extended attribute blaming the bad bricks, saying that there is something pending heal, so that when the brick comes back online, it can start healing. And then finally, we do the unlock.
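To make the dirty flag and the pending-heal markers concrete, this is roughly what they look like when you query a file directly on a brick; the volume name "testvol", the brick path and the values shown are illustrative only:

    # dump the AFR extended attributes of a file on a brick, in hex
    getfattr -d -m . -e hex /bricks/brick1/file.txt
    # trusted.afr.dirty=0x000000000000000000000000            <- dirty flag, cleared after a successful write
    # trusted.afr.testvol-client-1=0x000000020000000000000000 <- non-zero: this brick blames the second brick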
So the reads are very simple. You basically serve the read from one of the good bricks. So AFR uses the extended attributes to know which is the good and which is the bad brick, and it is always ensured that you serve from a good brick. So which brick does the read get served from? That's configurable. You have various policies. So the default policy which is used is the hash of the GFID of the file. So that means that even though there are multiple clients,
if they are accessing the same file, they will go to the same brick. But you can also load balance it using other strategies, like mixing the hash of the GFID and the client PID. The client PID is unique to each client, so you can distribute the reads too. So the self-heal daemon, as I was telling you, it is the responsibility of the self-heal daemon to ensure that the missed writes are actually healed onto the bricks when they come back up. So the self-heal daemon runs on every node of the cluster and it heals both data, metadata and entries that were missed when one of the bricks was down. So there are two ways to do the heal. One is to crawl the entire file system, right?
That's like a really stupid way of doing it. So what AFR does is, it maintains the list of failures in a special directory, the .glusterfs/indices folder on the brick. So whenever a write transaction fails on some of the bricks, the good bricks record these GFIDs inside this .glusterfs/indices folder, and when the bricks come back up, the self-heal daemon crawls this folder and just gets the list of files that need to be healed, and then it does the heal. So the self-heal daemon does the healing under the presence of locks, because clients can also be writing to the same file when the healing is going on. So you will have to take locks for mutual exclusion from the client I/O.
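A minimal way to see that index, assuming a brick at /bricks/brick1 and a volume named testvol:

    # GFIDs of files with pending heals are recorded under the brick's index directory
    ls /bricks/brick1/.glusterfs/indices/xattrop/
    # the same information, resolved to file names, via the CLI
    gluster volume heal testvol info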
So the traditional way of replicating had earlier been replica 2, but the problem, as you guys might already know, is that replica 2 is prone to split-brains. So there can be two types of split-brains, split-brain in time and split-brain in space. Split-brain in time is when the write from the same client succeeds on one brick and fails on the other, and the next write succeeds on the opposite brick.
So here we have brick 1 success, brick 2 failure. Here the write on brick 1 failed and brick 2 succeeded. So when both bricks come back up, the client doesn't know which is the good copy and you cannot resolve it. The other one is split-brain in space, where clients can partially see the bricks. So there are two clients, client 1 and client 2.
Each of them can see only one brick and you still allow the write because there is no concept of quorum in replica 2. So you can end up in a split-brain in that state also. So how do you avoid split-brains? You have to have a notion of quorum, which means that you need to go at least for replica 3 or basically an odd number of replicas. So the general thing is that for a 2n plus 1 replica,
you can at most tolerate failure of n nodes. So that means if you have a replica 3, you can tolerate failure of one node which is going down. So the thing to note here is that just because you have replica 3 and two nodes online, you cannot say that you are always guaranteed to serve a read.
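On regular replica volumes this client-side quorum is controlled through volume options; a rough sketch, with "testvol" as a placeholder:

    # 'auto' allows writes only when a majority of the bricks (including the first brick, for replica 2) is up
    gluster volume set testvol cluster.quorum-type auto
    # or demand a fixed number of bricks to be up
    gluster volume set testvol cluster.quorum-type fixed
    gluster volume set testvol cluster.quorum-count 2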
So there is a problem that if the only good brick which had witnessed all the writes goes down, then you still have to fail the IO. So let's just look at that with a diagram. So here you have a client which is trying to write to all the three bricks. The first write did not succeed on the third brick
and the second write did not succeed on brick 2. So now we have brick 1, which is the only one which has witnessed all the writes. So when the third write comes, even though the client is connected to brick 2 and 3, we cannot allow the write because the only good brick which witnessed all the writes previously, that is brick 1, is down.
So that's one of the things in any replication system. So even if you have a quorum number of bricks, if the good bricks are down, then you still cannot serve the reads. So you guys must have already used or known about the arbiter feature also. So arbiter is also a type of replica 2 plus, the third brick is used as an arbiter where it stores only the namespace. It doesn't store the contents of the file, so they are all zero byte files. So I was telling that AFR uses extended attributes to figure out which brick is good and which brick is bad, right? So in case of replica 2, because there are only two copies, you cannot break the tie with the extended attributes; you don't have three copies of them.
So the arbiter kind of overcomes that problem by storing the file name alone, so only the namespace is captured, and you store the extended attributes on those respective files. So now, since we have three copies of the metadata information, we can prevent ending up in a split-brain state. So this was the arbiter.
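For comparison, a classic arbiter volume is created like this; host names and brick paths are placeholders, and the third brick stores only file names and metadata:

    gluster volume create arbvol replica 3 arbiter 1 \
        server1:/bricks/brick1 server2:/bricks/brick1 server3:/bricks/arbiter-brick
    gluster volume start arbvol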
So why did we go with the thin arbiter? So let's see what thin arbiter is. So thin arbiter is essentially a replica 2 volume plus a lightweight thin arbiter process. So if you look at the normal arbiter, it's actually a full blown process in the sense that there is one arbiter for every replicate sub-volume and it stores all the files of that particular volume
but thin arbiter is not like that. It is actually lying outside the trusted storage pool, which means that it's not a part of the cluster at all. So you can host this in a cloud environment somewhere where the management daemon of GlusterD is not running at all. So the node is not here
and it's not managed by GlusterD. So if you look at the volume information, you will still see that it is depicted as a 1x2, which is a replica 2 volume, but you will also see an extra line saying it's a thin arbiter. So that's how you will identify that the volume is a thin arbiter volume.
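The output of gluster volume info for such a volume looks something like the sketch below; the exact field names vary between GlusterFS versions, so treat it as illustrative only:

    Volume Name: tavol
    Type: Replicate
    Number of Bricks: 1 x 2 = 2
    Brick1: server1:/bricks/brick1
    Brick2: server2:/bricks/brick1
    Thin-arbiter-path: ta-node:/bricks/ta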
The advantage of thin arbiter is that you can host multiple replica 2 volumes with the same thin arbiter node. So if you look at the diagram here, you see that there are different trusted storage pools. You have TSP1, TSP2 and they host different volumes. Some of them are thin arbiter volumes, some of them are normal volumes and they all use the same thin arbiter
which can be hosted separately in the cloud. So all the clients which are in the respective, which belong, which access the respective volumes also talk to the thin arbiter. So one caveat here is that we must use volume names which are unique across the different trusted storage pools. So the reason is because the thin arbiter has some ID file
which we will see in the next slide. So that uses the name of the volume to identify which replica it belongs to. So if you have the same volume name across multiple trusted storage pools then the thin arbiter ID file uniqueness is lost. So that's why as long as the volume names are unique you can use the same thin arbiter for multiple storage pools.
So what exactly is the thin arbiter process? So it is essentially a lightweight brick process. So you have all the standard translators which you see in the GlusterFS brick process like starting from protocol server and ending with the POSIX. But you also have one additional translator called the thin arbiter here
which is sitting just above POSIX. So I was telling you that the thin arbiter contains only one file and that file is used for quorum to determine which brick is good and which brick is bad, right? So the only operations that come on the thin arbiter are first creating the ID file, which happens only once during the life cycle of a volume, and then the actual setxattr calls
which AFR uses to track which brick is good and bad. So any other op which comes on this has to be denied. So that's the job of the thin arbiter translator. So it allows only the create and the xattrop FOP to go through. Anything else would be denied.
And the other thing is that, as I was telling you, you can run the thin arbiter process on a node which does not have GlusterD. So if you know GlusterD, GlusterD is a management daemon which is used for spawning all the brick processes, the self-heal daemons and all that. So if you restart a node, it is GlusterD which ensures that the brick process comes back up. So without GlusterD, how does it actually work?
So if you have mounted a Gluster volume, you know the way the mount logic works: when you issue the mount -t command with the server name and the volume name, the client initially talks to GlusterD, gets the information from the volfile, and then connects to each of the bricks using a particular port number.
But because on the thin arbiter node we are not hosting any GlusterD process, we need to hard-code a port number. So we currently use 24007, because that's the port of GlusterD. So because GlusterD is not running, when you mount the volume the client will directly connect to this thin arbiter process on this port.
If you want to change it to some other port, there are volume options available, so you can configure it to use a different port. So let's look at how thin arbiter works for writes and reads. So let's assume that the application is writing something on file 1 and you have brick 1, brick 2 and the thin arbiter
and let's say the write failed on the second brick and succeeded on the first brick. So what AFR does is before sending back the response to the application it marks on the first brick that there is some pending operation on the second brick. So it essentially marks that brick 2 is bad. So that information is marked both on the first brick
and also on the thin arbiter. So after marking that, it also stores in memory which brick is good and which brick is bad. The reason why it does this is that the thin arbiter does not need to be within the 5 millisecond latency limit which is there for normal Gluster brick processes.
So in order to not contact the thin arbiter for every file operation, you try to maintain in memory the information about the bad brick. So the client saves in memory that brick 2 is bad and then it responds with success to the application. Now when the write 2 comes on the same file
as long as the write succeeds on the previously known good copy of the brick we say that it is a success to the application. So if you look at the diagram from here write 2 comes on file 1 and it succeeds on the first brick but fails on the second brick. So because we already know from the previous write
that brick 1 is the good copy and brick 2 is bad we can return success to the application without actually contacting the thin arbiter. So that's how it's not actually participating in the IO path. So let's see what happens when a write comes and it fails on the
opposite replica. So write 3 comes on the same file, and this time it fails on the first brick but succeeds on the second brick. Now, because we know that if we allow this as a success we can end up in a split-brain state, we don't return success and we actually fail the FOP. So this is the essence of thin arbiter.
So to summarize how the write works: if the write fails on both the data bricks, then you can obviously say that the write failed. If the write fails on one brick and it is already a known bad brick, then you can return success to the application, but if the write fails on the brick which was so far the good one, then you will have to fail the write to the application.
Alright, so let's look at how reads work. So let's look at each case. So the case when the client is connected to both the bricks. So when we have this state let's say that we already are in a state where the brick 1 is marking the brick 2 as bad.
So brick 1 is good, brick 2 is bad and the client is connected to both the bricks. In this case we don't have to query the thin arbiter, because the AFR extended attributes on both the bricks already tell us which brick is good and which brick is bad. So you don't have to actually contact the thin arbiter to know whether the results that you interpret
from the extended attributes are valid or not. You can trust them and you can serve the read. The only reason you need to contact the thin arbiter is when... okay, before that, let's go to the second case. So case 2 is when we have the client connected to the good brick
and it is disconnected from the bad brick here. So if this good brick already blames the second one with an extended attribute, then we can be sure that you don't have to contact the thin arbiter node, because the extended attribute state is known and the extended attribute here is already blaming this guy. So you can directly allow the read to go through.
But let's say the client is connected only to this brick which is bad, and not to the first one which is good. So if this guy doesn't blame anybody, we cannot blindly say that we'll serve the read from it, because it does not contain any extended attributes; you will have to query the thin arbiter.
So that is the case where the client actually has to query the thin arbiter, and if the thin arbiter doesn't blame the brick which the client is connected to, then you can serve the read; otherwise you will not be able to. So to summarize what we discussed in the previous slide, if both the data bricks are up
then you serve the read from a good copy; both can be good. If one of them is down, then you will have to query the brick which is up, and if it doesn't blame the brick which is down then you can surely serve it, but otherwise you will have to contact the thin arbiter and get the information from the thin arbiter to see which is good and which is bad.
Okay, so the next two slides are a bit of an implementation detail. Yeah, so I was telling you that the client maintains in memory which brick is good and which brick is bad, right? So the self-heal daemon actually also heals the files when the bricks come up. So how does the client invalidate its in-memory information
when the cell field daemon heals the file? So for that it makes use of upcalls. So the locks translator in GlusterFS provides this notion of an upcall for locks. So when there is a conflicting lock from another client the locks translator will send a notification to the one which is currently holding the lock
and it is up to that client to release it so that the conflicting client can take the lock. So locks translator also supports taking lock on the same file from the same client on multiple domains. So for example client one can take a lock on file one say from offset zero to ten
on domain one it will be granted. If it is trying to get the lock on the same file using a different domain it will still be granted. So the locks translator has this distinguishing feature of domains wherein if the offset and the range of a lock on a given file is same but the domain is different you will allow the locks to go through.
So AFR actually uses these two features to invalidate the in-memory information. So we will see how that works. So when the first failure happens while writing from a client, during the post-op phase AFR takes two locks, one in the notify domain and the other one in the modify domain, and then it marks on the thin arbiter saying that this brick is good and this brick is bad.
After doing the marking, it releases only one lock, which is in the modify domain. So the notify domain lock still resides on the thin arbiter. So for every client that is connected to the thin arbiter, you have a bunch of notify domain locks which are residing in the brick process.
So how that is used is what we will see. So when the self-heal daemon starts to heal the files in the volume, it will attempt to take both the notify and the modify locks, and because of the lock contention feature available in the locks translator, it will send an upcall notification back to the client. So if there are ongoing writes in the client,
it will complete the writes and then release the notify lock. So then the self-heal daemon can actually get the lock and it will proceed with healing the file. So the thing to note is that if IO fails during the heal, the client will again mark the bad brick, and it will basically invalidate its in-memory information.
So this is how the locks translator's upcall infrastructure and multiple-domain locks are used for maintaining the in-memory information. So installation and usage is pretty simple. So on the thin-arbiter node you will have to install the server RPMs
and you will have to run a script to start the thin-arbiter process. Once you have that done, then you can peer probe and create the volume, which is a replica 2, using the gluster volume create syntax. I will show this in a demo. So if you are using it in a standalone mode, you can use this method.
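As a rough end-to-end sketch of that standalone setup; host names, brick paths and the volume name "tavol" are placeholders, and the script location can differ between packages:

    # on the thin-arbiter node (server RPMs installed, no GlusterD needed)
    setup-thin-arbiter.sh -s        # asks for the brick path and starts the thin-arbiter process
    # on one of the data nodes
    gluster peer probe server2
    gluster volume create tavol replica 2 thin-arbiter 1 \
        server1:/bricks/brick1 server2:/bricks/brick1 ta-node:/bricks/ta
    gluster volume start tavol
    # on the client
    mount -t glusterfs server1:/tavol /mnt/tavol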
If you are using containers and a provisioner for containers, there is something called kadalu.io which also recently added support for thin-arbiter. You can try that out. So the things still to do are: we currently do not have support for the add-brick and remove-brick CLI. So if you have existing Gluster volumes, if you want to replace a brick
or convert existing replica 2 or replica 3 volumes to thin-arbiter volumes, it is currently not possible. So those are the things that we need to work on. Also for the reads: I was telling that the writes have in-memory information on which brick is good and which brick is bad, but the reads do not have that information. So every time, they query the thin-arbiter to get the information.
So that is something which we need to work on, to optimize the reads by using the in-memory information about the bricks. And we also have to fix bugs. If you guys try it out and report bugs, we will be happy to fix them. So I will just show you a demo now. I have recorded it already. I will just play it out.
So we have, I hope the font is visible.
So we have four VMs here like 1, 2, 3 and 4. Ravi 1 and Ravi 2 I am going to use for hosting the actual volume and Ravi 3 I am going to use for hosting the thin-arbiter process and the fourth machine would be the client. So let's first start with installing the thin-arbiter on VM 3.
So you have to run the script called setup-thin-arbiter.sh and you have to run it with the -s flag. So it will basically ask you for the brick path, and you enter where you want the brick to be hosted, say bricks/brick-ta, and then the thin-arbiter has been started. So if you check whether the process is running, you can see it. And I was telling you that there is no GlusterD on the thin-arbiter node,
which means that the process management has to be done automatically by systemd. So we have integrated this with systemd so that even when the thin-arbiter node gets rebooted or the process crashes it will automatically start it. So the unit file takes care of that so let's try to kill the process
and see, you can see that it is spawned again with a different PID. So you don't really need GlusterD on this node.
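A quick way to see that behaviour; the systemd unit name may vary between versions, so check your installation:

    pgrep -af glusterfsd                  # note the PID of the thin-arbiter brick process
    kill <PID>                            # kill it (placeholder PID)
    pgrep -af glusterfsd                  # it is respawned with a new PID
    systemctl status gluster-ta-volume    # the unit that keeps the thin-arbiter process running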
So having started the thin-arbiter, let us try to create a replica 2 volume using the first two VMs. So I will just export some environment variables with the IP addresses of the VMs, and then I am going to create a thin-arbiter volume. So the syntax is gluster volume create, the volume name, replica 2, thin-arbiter 1, then you give the list of bricks which form the data bricks of the volume, and at the end you mention the thin-arbiter. So VM3 is the thin-arbiter here, so we say VM3 bricks/brick-TA and that's it.
So we will start the volume. Okay, so now let's see on the second node whether the bricks are up and running. It is, so now we are good to actually mount the volume and start doing IO on it.
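In command form, that is roughly the following, with "tavol" as the placeholder volume name:

    gluster volume start tavol
    gluster volume status tavol    # shows whether the data bricks are online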
So we will go to node 4 now. So before mounting the volume, let me just show you the ID file. So if you look at the thin-arbiter brick, it is currently empty, there is nothing here. So I was telling you that the ID file is created when you first mount the volume. So if you issue the mount command and then come back and check here,
there is this ID file which is created. It is a 0-byte file, and this ID file is used for capturing the good and bad brick information for all the files in this replica volume. So if you do a write to a file, you see that the file contents are getting replicated to both data bricks.
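For example, a quick check that a write from the mount lands on both data bricks; all paths are placeholders:

    echo hello > /mnt/tavol/file1      # on the client
    md5sum /bricks/brick1/file1        # on server1
    md5sum /bricks/brick1/file1        # on server2 - same checksum
    ls -l /bricks/ta/                  # on the thin-arbiter node: only the 0-byte ID file, no data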
So let's kill the second data brick and try to write something. So the write is successful, and you are also able to read it, because the brick we killed is now the only bad brick. So if you now look at the extended attributes on the thin-arbiter, if you do a getfattr and see,
you will see that it contains certain extended attributes which blame the client 1. Client 1 is the second brick. So client 0 is first and client 1 is second. So because we killed the bricks in VM 2 it is saying that there is a pending data heal on the second brick.
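On the thin-arbiter node this looks roughly as follows; the attribute names follow the volume name and the values shown are illustrative:

    getfattr -d -m . -e hex /bricks/ta/*
    # trusted.afr.tavol-client-0=0x000000000000000000000000   <- the first brick is not blamed
    # trusted.afr.tavol-client-1=0x000000010000000000000000   <- the second brick has a pending heal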
So the thin-arbiter essentially captures this information. The client also has it in memory that now the second brick is bad and I should not allow any writes which might fail on the first brick. So let's kill the first brick and see what happens. So now we are going to kill the brick which is good and we will bring the second VM back up.
So now you see that the first brick, which witnessed the write, was killed and now the second brick is up. But even now, when you try to access the volume from the client, you will see that the client fails both reads and writes. So ls and reads fail with an input/output error and writes also fail with an input/output error.
So because the only good brick is down, we are not allowing the writes anymore. So if we bring the brick back up, so we will restart GlusterD on the first node so that the brick comes back up. Now the self-heal daemon would have automatically healed the file by now.
And you can see that the file is again accessible from the mount. And if you look at the thin-arbiter, before that let's look at the contents of the file from both bricks. So the contents of the file are the same on both bricks. And if you do a getfattr now and look at the extended attribute that AFR maintains, it has now been reset to all zeros.
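After the heal, the same checks show a clean state again; a short sketch:

    gluster volume heal tavol info           # Number of entries: 0 on both bricks
    getfattr -d -m . -e hex /bricks/ta/*     # trusted.afr.tavol-client-* is back to all zeros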
So earlier it was like blaming the second brick. Now because the self-heal has happened, it's reset it and now you can continue with the IO as usual. Yeah, so that's pretty much it. Guys have any questions?
Alright then, thank you.