pg_paxos: Table Replication through Distributed Consensus
PGCon 2016, talk 11 of 34
License: CC Attribution 3.0 Unported
DOI: 10.5446/48470
Transcript
00:01
I'm Marco from Citus Data. I'm going to talk about pg_paxos. A word of warning: if you're looking for a completely production-ready solution, this is probably not the right talk to be at, but hopefully it's going to be quite interesting. So this talk will have three parts. The first part, I'm going to talk about what is Paxos? How does it work?
00:24
Why does it work? And how can you use that to implement database replication? And then I'll go through some of the internals of pg_paxos, an extension for Postgres that uses the Paxos algorithm to do replication. And finally, I'll show you a demo of how you set up pg_paxos for an actual application.
00:45
So the problem that Paxos addresses is that of distributed consensus. So let's say you have a number of servers, 3, 10, 100. You need exactly one of them to do something. So they all need to agree who gets to do the job, who gets to be the leader.
01:06
Or you have a number of replicas of some data, and there's updates to that data, and they all need to agree on the order in which these updates are applied. Turns out this is an impossible problem if there are failures, which there are in distributed systems.
01:23
You cannot solve consensus. You cannot write an algorithm that always reaches consensus. Let's say you have two nodes: you cannot get them to agree on who gets to be in charge under all failure scenarios. So Paxos is a probabilistic algorithm for achieving consensus. There's basically two classes of
01:43
probabilistic algorithms. One kind always finishes, but may not succeed, and the other kind always succeeds, but may never finish. And Paxos is in the latter category. So under certain failure scenarios, Paxos may take forever.
02:00
But most of the time it does not. So Paxos has built up a bit of a reputation as a very hard and complicated algorithm, and there's a few factors contributing to that reputation. One of them is the original paper on Paxos, which was called The Part-Time Parliament. And the abstract of the paper is:
02:21
"Recent archaeological discoveries on the island of Paxos revealed that the Parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers." So the whole paper was a metaphor using some fictional kind of parliamentary system that never really existed as far as we know,
02:44
to try to explain this algorithm, and no one really understood it. Like, if you read this paper and from that can figure out how to implement Paxos, you'd probably be a genius. But Lamport, the inventor of the Paxos algorithm, wrote another paper a few years later where the entire abstract was: the Paxos algorithm, when presented in plain English, is actually very simple. And it actually is quite simple.
03:09
So this is essentially the Paxos algorithm. So imagine you have a group of nodes, some well-defined group where they all know about each other, and each node has a function called Paxos with a key and a value. And
03:24
for a given key, this function always returns the same value on all nodes in the group, and the value is one of the inputs. So one node might call it with value 1, one node might call it with value 5, one node might call it with value 10, and then maybe it returns 5 on all of them, or 10, or 1.
03:45
It depends on a lot of factors, but it always returns the same value on all nodes. That's useful. So it runs in two phases. If you call this function, you become a proposer.
04:01
So you basically propose a new value. But first you need to ask nodes to participate. So you ask nodes, please prepare for a new proposal, and then you need to get a majority of nodes to participate. The concept of a majority is quite important in Paxos. If you cannot get a majority to participate in your proposal for any reason whatsoever,
04:23
failures, other things, you start over. If you do get a majority, you go to phase 2, and you request that these nodes accept a value. And if the majority of nodes at this point accept a value, you have consensus, and Paxos completes. If anything goes wrong, you start over.
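To make the two phases a bit more concrete before zooming in, here is a rough sketch of the acceptor side of the protocol, written as PL/pgSQL against a local state table. This is an illustration for this write-up only, not code from pg_paxos, and the table and function names (acceptor_state, handle_prepare, handle_accept) are made up; the next paragraphs walk through exactly these cases.

    -- Illustration only: per-key acceptor state on one node (not pg_paxos's schema).
    CREATE TABLE acceptor_state (
        key          text PRIMARY KEY,
        promised_num bigint NOT NULL DEFAULT -1,  -- "don't accept proposals below this"
        accepted_num bigint,                      -- proposal number of the accepted value, if any
        accepted_val text                         -- the accepted value itself, if any
    );

    -- Phase 1: "please prepare for proposal i"
    CREATE FUNCTION handle_prepare(p_key text, i bigint,
                                   OUT ok boolean, OUT promised bigint,
                                   OUT prev_num bigint, OUT prev_val text) AS $$
    BEGIN
        INSERT INTO acceptor_state (key) VALUES (p_key) ON CONFLICT (key) DO NOTHING;
        SELECT promised_num, accepted_num, accepted_val
          INTO promised, prev_num, prev_val
          FROM acceptor_state WHERE key = p_key;
        IF i > promised THEN
            UPDATE acceptor_state SET promised_num = i WHERE key = p_key;
            ok := true;   -- I'll participate, and I report any value I already accepted
        ELSE
            ok := false;  -- I already promised a competing proposal with a higher number
        END IF;
    END;
    $$ LANGUAGE plpgsql;

    -- Phase 2: "please accept value v for proposal i"
    CREATE FUNCTION handle_accept(p_key text, i bigint, v text, OUT ok boolean) AS $$
    BEGIN
        UPDATE acceptor_state
           SET promised_num = i, accepted_num = i, accepted_val = v
         WHERE key = p_key AND promised_num <= i;
        ok := FOUND;      -- false means a higher-numbered proposal got here first
    END;
    $$ LANGUAGE plpgsql;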
04:43
And there's no guarantee that this doesn't take forever. Like, you could keep starting over, but usually it doesn't take forever. So zooming in on phase 1, the question that the proposer asks the other nodes, and the proposer is usually one of the nodes itself,
05:02
but it could even be separate. The question is: please don't accept proposals with a lower number than i. So i is some sequence number, it usually just starts at 0, and the nodes receiving this message can then respond saying, okay, I'll participate.
05:22
Or, well, actually I already received a competing proposal from someone else, and his sequence number was greater than yours. So I promised I would not participate in proposals with lower sequence numbers. Now they could be using the same sequence number. Actually, very often they are, but then you look at something else like node identifiers, something that's unique
05:44
across the proposals. So in that case, the acceptor sends back, hey, I already got this proposal with a higher number, and then the proposer says, oh darn, I'm going to start over, but this time I'm going to set my sequence number to be greater than j.
06:03
So it says, next time I will get my proposal through, hopefully. And the third response that the acceptors could give here is actually: I already accepted a value before. So someone already reached phase 2, and in phase 2 I accepted a value, and
06:22
then the acceptor sends back this value to the proposer. And when the proposer gets a value from another node, from someone else's proposal, it has to start using this value instead of its own. And it picks the value with the highest proposal number.
06:43
So now it throws away its own value, and it uses someone else's value. And this is kind of the trick that makes Paxos work, and I'll show you why in just a sec. So in the second phase of Paxos, I'm going to send messages to all the nodes that are participating: please accept this value, which could be my value or someone else's value, for this proposal number i.
07:07
And then the acceptors could say, okay, I accept your value. Or they could say, well, I already received a competing proposal with a higher sequence number, someone else ran phase 1, and I promised I would not participate in proposals with lower sequence numbers,
07:23
so sorry. And then the proposer starts over with a higher sequence number, and it could take a while. And if anything goes wrong, if for some reason you do not get the majority to say okay, you also start over. And
07:40
finally, there's usually a phase 3, because you don't want to go through all this stuff again. So if you manage to get confirmation that the majority of nodes, so that's like two out of three, or three out of four, or three out of five, accepted the value, you just tell everyone else that they don't have to rerun Paxos.
08:00
But again, there could be failures here, and so it's not actually a critical step. You could even skip this step; it's more of an optimization. So why does this work? So actually this algorithm is sufficient to provide the guarantee that it will only ever return the same value on all the nodes, and the value will be one of the inputs.
08:22
So why will it never return a different value? Well, if you kind of think of it backwards, and Paxos was actually kind of derived backwards from the notion of consensus, so it was more of a derived algorithm than an invented algorithm. So if a majority of nodes accept my proposal,
08:45
that means that no other node has completed phase one since I did. Because if another node had managed to complete phase one, I wouldn't have gotten all my okays. One of them would have said, no, I will not participate. Because,
09:00
like if someone completed phase one, they would have needed a majority, and at least one node is going to be in common with my majority. Right, that's the trick with majorities: two majorities always have at least one node in common. So because I completed
09:20
phase two, it means that any other proposal still needs to complete phase one, and thus will see my value, because they will run into it, right, because they still need to complete phase one. So they will definitely run into my value on at least one node. And it's also guaranteed that my value has the highest proposal number that got accepted,
09:43
since no one else has managed to complete phase one since I did. So every other proposal that got to phase two had a lower proposal number. So basically just getting three accepts or getting a majority of accepts in Paxos tells me
10:01
this is now the value that is forever going to be returned by Paxos for a particular key. How is that useful? Like, this is a write-once algorithm: you can set the value once and then it never changes. Well, actually I can do something that distributed systems people call building a replicated state machine,
10:28
where the idea is if you have a bunch of nodes which all have the same state machine and the same initial state and you feed them all the same inputs, they will be in the same final state. So what I can do with Paxos is
10:42
replicate a log of changes, right? A log is a write-once thing, an append-only thing. So with Paxos, I can use the key as kind of the sequence number in the log, and my first item could be set x = 6, and
11:04
then the second item could be set y = 7, and then the third item could be set y = 9. So if I know about all three of these items, I know that the value of x is 6 and the value of y is 9.
11:22
So what I can do with Paxos is implement kind of this replicated log and I can always make sure that I know what's in the log just by running Paxos for all these values. And this technique is actually called multi-Paxos because you run Paxos multiple times.
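As a tiny illustration of that idea, here is what such a replicated log might look like; this is just a sketch for this write-up, not pg_paxos's actual log schema.

    -- Each round number maps to exactly one agreed-upon command.
    CREATE TABLE replicated_log (
        round_num bigint PRIMARY KEY,
        command   text NOT NULL
    );

    INSERT INTO replicated_log VALUES
        (0, 'set x = 6'),
        (1, 'set y = 7'),
        (2, 'set y = 9');

    -- Any node that knows all three entries and replays them in order
    -- ends up with x = 6 and y = 9, no matter which node it is.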
11:43
So the way I write to this log is I run Paxos for whatever I believe to be the current round number, basically the end of the log, and then I run it, and maybe Paxos has already completed and I just get back someone else's value. Then I say, okay, well,
12:02
I'll try with the next highest round number or what I now believe to be the highest round number. And maybe someone else comes in, completes phase one before I do, my proposal loses and then I get back his value from Paxos. Okay, I try again with a higher round number.
12:21
Eventually, hopefully I'm gonna get my value appended to the log. So my command is gonna have some round number where it has reached consensus. To perform a read, what I need to do is I'm gonna ask all the nodes or at least the majority of nodes what's the highest round number for which you accepted a value
12:44
knowing that that is the highest round number which could have consensus at the time that my read started. And if I make sure I know all the values in the log up to that round number, I'm good. Like, my log is entirely up-to-date, up to the point
13:04
when my read started. I'm gonna see any preceding write, and get fresher information than any preceding read. But I do need to provide some value when I call Paxos, like it takes a value as a parameter.
13:23
So what I could do is just give an empty string as the value when I want to do a read. Worst case, something went wrong somewhere and no value actually reached consensus in the round and it could even be that my empty string will actually reach consensus in that round.
13:42
That's fine. Like nothing bad happens. This just means nothing changes. So sometimes you could have like empty entries in the log because of this. And so each node has its own copy of the log and it could be arbitrarily far behind. If a node fails and doesn't get updates for an hour
14:03
its log is gonna be far behind. But it can always just run the consistent read algorithm. Say node C comes back and says, okay, I want to do a consistent read on my state, or let's say my table now. I'm gonna ask nodes A and B: what's the highest round number for which you accepted a value? Node A will say
14:23
2, and let's say node B will say 1. So node C runs Paxos for those rounds. Somehow node C actually already accepted a value for round 1. Maybe someone tried an insert and then crashed or whatever, something went wrong, and actually another value reached consensus. So when node C runs Paxos for round 1, it'll get back that value. When it runs Paxos for round 2,
14:47
it'll get back this value. I mean it already accepted this value, but didn't know whether there was consensus probably. So at that point node C is completely up-to-date with any preceding writes, so it can
15:00
return a result. So I implemented this in Postgres as an extension called pg_paxos. So it's an extension you can just install in Postgres. I think it works on 9.4 and 9.5, and possibly works on 9.6, I haven't tried.
15:22
And it provides consistent, fault-tolerant multi-master table replication where you can tolerate a minority of nodes going down. So if you have three nodes, you can tolerate one going down. The trade-off that you're making is that the write throughput is going to be incredibly low,
15:42
because you cannot really have concurrent writes. You can only have one write at a time for a particular log item. Otherwise, they're going to compete and that just makes both of them take longer. So the write throughput is low, think maybe a hundred per second, and
16:02
the latency is generally going to be high. There's actually no guarantee on how high it could be. It could be very high, like minutes. If two nodes are down, let's say, and there are three nodes in total, then there's no way to get a majority, so Paxos will take forever until one of the nodes comes back.
16:20
So that's the trade-off you're making, but there's no other way to get consistent, fault-tolerant multi-master table replication. So it's not really an alternative to other replication solutions like streaming replication or logical replication. They have their own particular purposes, and this doesn't replace them in any way.
16:41
It also doesn't sort of magically make Postgres distributed. There will never really be such a thing as distributed Postgres. In distributed systems, there are not really solutions, there are just different trade-offs you can make. But it can be a very useful building block, especially for distributed systems. In the same way that you wouldn't do your own
17:02
data storage, like, you know, most of the time you don't want to write your own serialization algorithms to write things to disk and take care of truncation and concurrency and whatnot, you use a database. Similarly, there are some very hard problems that Paxos solves
17:20
which you probably don't want to solve yourself, and you can use pg_paxos for those. I'll give you some examples later. So pg_paxos is available on GitHub under the PostgreSQL license. It contains a very basic implementation of Paxos and multi-Paxos, which is the state machine part, written in PL/pgSQL.
17:45
Which sounds a bit weird, but it's a surprisingly suitable language for implementing distributed algorithms. Because a lot of the time, one of the things that makes Paxos hard is you really need to be very careful about your local state and
18:01
making sure you write that to disk in case there's a crash before you give any response, making sure you deal with truncation, making sure you serialize data, and actually all of that comes for free in PL/pgSQL. I just do an insert on a table and then let PostgreSQL do the work. And PL/pgSQL has a networking API, dblink. It's not so great, but it works.
18:22
And it does make it easy to do remote procedure calls. And the other part of pg_paxos is basically C code, which uses the planner and executor hooks in PostgreSQL to provide replication for particular tables that you want to replicate. So you can mark certain tables as
18:44
replicated, and then any insert, for example, you do is going to be replicated to the other nodes using Paxos. And I should say it is somewhat experimental. The most useful thing, if you want to contribute to this project, is actually testing.
19:02
I don't feel like we're missing a lot of features. I've done a lot of testing, but the industry standard here for these types of systems is actually Jepsen, and it's quite a lot of work to do Jepsen tests. But any type of testing you can do would be very useful in making people confident that,
19:22
you know, this is a robust system. So as I said, PL/pgSQL is actually surprisingly suitable for implementing distributed algorithms. You get transactional semantics. It's easy to, you know, collect a bunch of responses in a table and do some query on them.
19:40
There's a simple networking API. RPC, remote procedure call, becomes pretty simple: I can just do create function, which is the one I call on the remote end, and then create another function which uses dblink to call the other function. So it's pretty simple. And because it's PL/pgSQL, and Amazon RDS and Heroku both have dblink installed, I could even run this on Amazon RDS. I couldn't do the sort of transparent table replication, but the Paxos functions, they would work.
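For example, the RPC pattern looks roughly like this; dblink itself is a real contrib extension, but the function names here are made up for illustration, and both functions would be created on every node.

    CREATE EXTENSION IF NOT EXISTS dblink;

    -- The function that actually runs on the remote node.
    CREATE FUNCTION remote_ping(msg text) RETURNS text AS $$
    BEGIN
        RETURN 'pong: ' || msg;
    END;
    $$ LANGUAGE plpgsql;

    -- A local wrapper that performs the "remote procedure call" over dblink.
    CREATE FUNCTION call_remote_ping(node_connstr text, msg text) RETURNS text AS $$
    DECLARE
        result text;
    BEGIN
        SELECT reply INTO result
          FROM dblink(node_connstr, format('SELECT remote_ping(%L)', msg))
               AS t(reply text);
        RETURN result;
    END;
    $$ LANGUAGE plpgsql;

    -- Usage: SELECT call_remote_ping('host=node-b dbname=postgres', 'hello');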
20:01
I could even run this on Amazon RDS. I couldn't do the sort of transparent table replication, but the Paxos functions, they would work. So, if you create extension PTPaxos, you will get a bunch of tables. Kind of boring, but just for the concepts. There's a group table, which is the Paxos groups in which this nodes participates.
20:25
So a node can be in many groups at the same time. The host table is just the members of the group. The round table is basically the log, so this contains the actual replicated SQL queries. And there's a replicated tables table which keeps track of which tables should be replicated using pg_paxos.
20:46
And the extension, pg_paxos, also adds some functions. One is just the Paxos function that I've already described; it pretty much works exactly like that. The apply log function is pretty much the consistent read function I've described.
21:02
So if you give it like the max round number in the group, it just makes sure that the log is up to date. Apply and append is the write function I described. So it keeps trying to get consensus on its query for a particular round, but if it fails it goes to the next round.
21:22
But you don't really need to worry about these functions so much unless you want to implement your own replication solution or log. Most of the time what you do is you create a table, and then on the first node in the Paxos group you do a create group, where you specify your own host name such that other nodes know how to connect to you. And then you do replicate table.
21:47
So that refers to the table you created. And after this, any insert or update on the table will get replicated, added to the Paxos log, and any select will make sure that your log is up to date before the select is actually executed.
22:06
So on the first node you do those three steps, and then on the second and third and whatever node you do join group. So you specify one of the existing servers in the group, usually the original one, and then your own IP.
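Put together, the setup looks something like this. The function names are written the way they have been described here (create group, replicate table, join group), and the group, host and table names are just examples; the exact function names and arguments may differ, so check the pg_paxos README for the real signatures.

    -- On the first node (names below are illustrative):
    CREATE TABLE locks (key text PRIMARY KEY, owner text);
    SELECT paxos_create_group('lock_group', 'node-a.example.com');   -- your own host name
    SELECT paxos_replicate_table('lock_group', 'locks');

    -- On every additional node, after creating the same table:
    CREATE TABLE locks (key text PRIMARY KEY, owner text);
    SELECT paxos_join_group('lock_group',
                            'node-a.example.com',   -- one of the existing servers
                            'node-b.example.com');  -- your own host name / IP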
22:23
And what this actually does is quite interesting. If a node joins Paxos, the world kind of changes: before the node joins, maybe my majority was two nodes, but after the node joins, now my majority is three nodes. So I need to be aware of that.
22:41
So the way that works in Paxos is that the fact that this node joins is added to the log, such that any subsequent query will actually see this new node. So what happens when there is an update? As a user, you just do an update on the table, set something to whatever,
23:03
then in the background pg_paxos runs this Paxos apply-and-append function with more or less your query. It kind of deparses it from the Postgres query plan, but more or less it will log that. And then it, you know, keeps trying to append, and then when it learns, okay, now I've appended it at log position
23:27
51, so I'm gonna make sure that items 0 to 50 have been executed and then I'm gonna execute the update such that my update works on a consistent copy of the table. And this makes basically all writes completely serializable, they're all executed in the same order on all the machines.
23:46
When you do a select, basically you run the read algorithm. So in the background, hidden, transparent to the user usually, the Paxos max group round function talks to the other nodes, asks them
24:02
what's the highest round number for which you accepted a value and then it gets its log up-to-date up to that point. And it knows that when this select started, like any round number higher than that could not have had consensus because at least one node in the majority would have
24:21
accepted a value for it. There's some subtlety when new nodes join, then you have to repeat the max group round, but this is the general principle. So I'd like to do a little demo. Are there any questions so far?
24:46
So what you can do, and this is more of a manual approach, oh sorry, the question was how do you deal with transactions. Essentially, the transparent table replication is more or less single-statement, so you can't really do begin,
25:03
statement, rollback or something. What you could do, just using the Paxos function, is log the whole string of your transaction, like begin, update, delete, insert, commit. You could log that as the entire string and then it'll get executed in order along with the other writes.
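Sketching that workaround: the apply-and-append function name and arguments below are just how the write function has been described in this talk, and the group and table names are placeholders, so treat this as an illustration rather than the exact API.

    -- Log the whole transaction as one string; it gets appended to the Paxos log
    -- and replayed in the same order on every node.
    SELECT paxos_apply_and_append(
        'lock_group',
        $q$BEGIN;
           UPDATE mytable SET value = 1 WHERE id = 42;
           DELETE FROM mytable WHERE id = 7;
           COMMIT;$q$);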
25:24
Yeah, that'd be the more practical way of doing it. It would also save you on network traffic, but yeah, that's entirely possible. Alright. So I want to show you a demo of a distributed locking system.
25:41
So I don't actually need pg_paxos and the nodes on which I'm doing work to be the same thing. Oh, sorry, Jeff.
26:25
No, I mean it'll end up let's say when you do a consistent read and there's a bunch of items in the log that you need to execute, it'll end up just doing those in one transaction basically. Is that the question?
27:02
Well, any state that's recorded during the call to Paxos is actually done over dblink. So, like, the Paxos function doesn't even store state on itself. It's assumed to be completely external. So it just talks to the other nodes as if, you know, they were external. It could be talking back to itself.
27:24
But so, let's see, oh, this is pretty far back. Let's say when this node responds saying, okay, I accept the value, I mean, that's a separate transaction. And so actually the Paxos function doesn't record any state at all. Everything happens in those
27:43
remote procedure calls. The round number needs to be persistent.
28:02
You need to keep track of how much of the log you have executed. So whenever you do an execute, you also update the round number of, like, I've executed this much of a log.
28:22
Yeah, and you just find that from everyone else. You don't, yeah.
28:43
Yeah, there's a separate table for keeping track of which part of the log is applied, and then there's a table for the log itself, which contains all the items. So I'm going to show you a demo where pg_paxos is used to implement a distributed locking service. And the motivation here is doing distributed cron.
29:04
This is actually quite a tricky problem if you have let's say a job that absolutely must happen every minute, but you don't want it to happen twice. So you could set up one node and then it runs this job every minute, but if that node fails
29:21
then it doesn't happen. You could set up two nodes that both run it every minute, but that's probably not what you want. So you want exactly one of them to do it. So this is kind of a locking problem. Of course, you could set up a database which has a table of locks, but then if the database goes down, you still have the same problem.
29:42
But with pg_paxos you can actually solve this. So I've set up a little pg_paxos cluster using very small instances. I'm running this on EC2, so I'm using t2.micros. Paxos doesn't have very high throughput anyway, so you might as well just use small instances. I put a load balancer in front of it such that the cron node will just get
30:02
routed to whichever Paxos node is up and running. And then they both try to do an insert into the same locks table, which has a primary key on the name of the lock, and only one of them will succeed. The other one will fail and then knows that the other guy has the lock.
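The locks table and the lock grab itself are just plain SQL. A minimal sketch, matching what the demo uses (a key and an owner, with the key as the primary key); the lock and node names here are placeholders:

    CREATE TABLE locks (
        key   text PRIMARY KEY,  -- the name of the lock
        owner text               -- whoever grabbed it
    );

    -- Both cron nodes run something like this; because the table is replicated
    -- and the key is the primary key, only one of the two inserts can succeed.
    INSERT INTO locks (key, owner) VALUES ('some-lock-name', 'node-a');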
30:22
So, yeah, as I said, I set up my cluster on EC2. There's three pg_paxos nodes and two slightly bigger cron nodes. They're not going to have to do a lot of work, but that's just how you...
30:59
Yes, yeah. Yeah.
31:01
The application I'm going to do is much simpler. I'm going to send a tweet every minute. But in that example, yeah, I mean, let's say there could be some additional database in which that is stored, or they do it locally. It's kind of independent of pg_paxos itself. I guess technically you could actually store the accounts in pg_paxos as well.
31:31
I mean, it's technically possible. I mean, usually you want these guys to probably have some computing power, depending on what they want to do.
31:44
I mean, like, you can, but you don't have to, let's say. I mean, you could have the cron nodes be co-located with the pg_paxos nodes, or you could run them separately. It's kind of all the same. And here, because I use these super tiny instances, the pg_paxos cluster costs almost nothing anyway.
32:07
Yes, I can, I'll show you the Paxos log. Yeah, but at least if a node goes down, you don't have to actually immediately react to it. So you can automate a lot, probably.
32:22
So I have a bunch of nodes set up here. This is one of my pg_paxos nodes, so it has the pg_paxos extension. I have a locks table which has, like, a key and an owner, and the key is the primary key. So I can only have one owner of the key at a time.
32:42
And on this node, I will create a Paxos group, and then replicate a table within that Paxos group. Now, that's not very interesting, so I'm going to join some additional nodes. This is one of my other Paxos nodes.
33:05
Okay, and the number it's showing is actually the round in the Paxos log in which it's joined. It's kind of an internal thing. So at this point I can do an insert, let's say on this node, and
33:21
then if I go to one of my other nodes and try to do the same insert, it's going to give me a primary key violation because they're now looking at a replicated table, so only one insert will succeed. So if I, this is the third node, I can see there's now one item in my log. I can make it a bit more challenging.
33:43
Let's say I do an insert on one node, and then immediately select from the other node. I should then get that insert back, right? I inserted A B, and now I see A B on the other node when I select from the table. This is guaranteed by Paxos. So I set up a load balancer in front of my
34:04
pg_paxos nodes, so this just connects me to any of the nodes. I don't know which one; I can't actually really distinguish them. They all have the same content. And I have a little script which will just send a tweet.
34:21
And I have this same script on two cron nodes, and only one of them will send it. So I'll activate the script. I just run this, and then cron runs, you know, when the number of seconds reaches zero.
34:44
So I can quickly open the Twitter account. No tweets yet. So it runs with cron, once a minute. Okay, so someone grabbed the lock, node
35:05
0.171 grabbed the lock, so I should now be getting a tweet. Yes. And both nodes are running the script, but only one of them will succeed. And, like, the clocks in Amazon data centers are really well synced.
35:25
So they run cron at almost exactly the same time. So it kind of tends to alternate between both of the nodes if they're both up and running. So one thing I can do now is maybe stop one of my nodes, my Paxos nodes.
35:44
If I do this now, then I get a bunch of warnings. I still need to kind of clean up the output of pg_paxos because it shows all these stack traces from PL/pgSQL. But the point is that it did manage to do the select even though one of the nodes was down.
36:01
It'll make one more connection attempt, and then after that it will just, you know, pretend it doesn't exist. Which actually makes it a bit faster, surprisingly enough, because now it only needs to connect to two nodes. So there should be another tweet by now.
36:21
Right, there's another tweet. Obviously, I can also take down this node. So I stopped the cron job over here. So at some point I should be seeing, in 26 seconds, I should be seeing that the other node still manages to send the tweet.
36:43
I'm not sure if Twitter shows that kind of thing in a live way. But so I've taken down one of the Paxos nodes. I've stopped the cron job on one of my cron nodes. So I could model this as, let's say, I put this in three different data centers, three different availability zones. I could lose an entire data center and things will just work fine.
37:02
Which is quite nice about pg_paxos. Right, so now the other node did the tweet. So, not to completely spam the PGCon hashtag, I'm gonna shut off the cron job here.
37:29
Any questions about the demo? All right. So what could you use this for? So this is an example of distributed locks.
37:41
I didn't go into this. I kind of used a bit of a cheap trick here, where the name of the lock that these cron jobs are using is actually the number of minutes since 1970. It doesn't actually rely heavily on synchronized clocks. If one clock is ahead, then most of the time that one will just win because it will be the first one to grab the
38:03
lock. But then if that one fails, and let's say the other guy is several minutes behind on his clock, it might take a while before he starts tweeting again. But in Amazon data centers, clocks are usually synchronized up to the millisecond. So it's not really an issue.
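For reference, one way to compute such a minute-granularity lock name in SQL; this is a sketch of the trick, not the exact demo script.

    -- Number of whole minutes since 1970, used as the lock name for this run.
    SELECT (extract(epoch FROM now())::bigint / 60)::text AS lock_name;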
38:22
So pg_paxos works well for any application that has low read-write volumes, but strong consistency requirements and no strong latency requirements. So distributed locks is one. If you have this type of cron job,
38:41
any type of resource management, like we can only have one node access a certain resource at a time, it works quite well for that. Also for managing cluster membership. I mean, pg_paxos manages its own cluster, but it can also manage other clusters. And this is actually one of the most common applications for Paxos.
39:01
And this, for example, also includes deciding which node is the primary. And so you could imagine an application where you use Paxos to decide which node out of a streaming replication group gets to be the primary. And then Paxos can tolerate one of the nodes failing. You could use it as a job scheduler.
39:21
Alvaro recently did a blog post showing how you could use pg_paxos to implement a job queue which has exactly-once semantics, whereas most queuing systems have at-least-once semantics, so the job could happen multiple times. Migrations, if you want to move over a lot of data from one place to another, even schema migrations,
39:42
or as a source for metadata. Often you have a system where, let's say, all nodes have a copy of the metadata, and there's one authoritative source. But if you don't want to be subject to that source going down, you would have to use something like pg_paxos to make sure that you have a replicated, consistent copy of the metadata.
40:05
So why not Raft? I get this question a lot. Well, one reason is that Paxos can be implemented in PL/pgSQL, which makes it quite suitable for implementing in Postgres, because it's actually really simple, and there's almost a one-to-one mapping of
40:22
sentences in the paper to SQL queries in the source code, because you don't have to worry about transactions and truncation and serialization. That's a good question. The question is, what is Raft? So Raft is a slightly newer consensus protocol. It
40:43
works by kind of electing a master, basically. It has quite a good algorithm for quickly making one of the nodes the master, and if that one goes down, very quickly another one becomes the master. And well, one of the differences
41:00
is that it is a recent paper, and the paper is very instructive to engineers. It tells you exactly what to do, whereas the Paxos papers are a bit more mathematical. So Raft has picked up a lot of steam lately in newer distributed databases, because the paper is easier to implement, in some sense.
41:22
So actually, Paxos as an algorithm is much simpler. The minimal implementation that works is a lot simpler. You can kind of play with it more to adapt it to your requirements. I really wanted it to be multi-master, so I could do writes from anywhere. And mathematically, it's very elegant. Like, in a way, Lamport didn't invent it,
41:43
he just derived it from the notion of consensus, like, what would I need to do? And it's actually also optimal for achieving consensus. But actually, the more realistic answer probably is: I knew Paxos quite well, I knew PL/pgSQL quite well, and I figured, I think I could implement Paxos in PL/pgSQL.
42:04
So, that's my story. Hope you enjoyed it. Are there any questions?
42:28
No, because pg_paxos keeps its state in tables. So on the secondaries, yeah, I suppose it could work for logical replication, or if you create another database cluster
42:42
that's co-located with these nodes. Yeah, unfortunately, it doesn't work on secondaries.
43:02
Yeah, well, I mean, I would typically recommend running it in this type of setup, where pg_paxos is actually an external component running on tiny hardware. Because then it's just already up and running when you bring up your big data production cluster, so you don't really have the problem. And you could, you know, reuse the pg_paxos cluster for
43:24
other stuff. It's a bit similar to the way people use ZooKeeper: they just have this external ZooKeeper deployment which solves a similar problem. So the question is, what is adding a node like? And it's essentially just, you set up
43:46
Postgres, you install the pg_paxos extension, and you also need to add it to shared_preload_libraries in postgresql.conf. But once you have a node, it's pretty much just, you call this join group. Oh, actually, there's one step missing here.
44:00
You first need to create the same tables. It doesn't replicate the DDL yet; that seems to be a common problem with replication solutions. But then you just call Paxos join group, and this actually more or less clones the state of an existing node. So any data that's already in the table, it copies over. The membership
44:20
table, it copies over. At this point it might be interesting to show this. This is actually the Paxos log that I've been talking about. So these are the queries that have been executed on the table. But yeah, joining a node is mostly setting up the tables that are replicated, and then calling join group, and then
44:44
join group also makes sure that all the other nodes know about this node. And there's a corresponding leave group, which gives some errors, well, warnings, because the other node is down. But now it's left the group.
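Summing up the node-addition steps mentioned here: the config line and extension name are as described in the talk, the table is an example, and the join-group call is written the way it has been described above, so check the README for the exact signature.

    -- postgresql.conf on the new node:
    --   shared_preload_libraries = 'pg_paxos'

    -- then, in the database:
    CREATE EXTENSION pg_paxos;

    -- create the same replicated tables yourself (DDL is not replicated yet),
    -- then join the group:
    CREATE TABLE locks (key text PRIMARY KEY, owner text);
    SELECT paxos_join_group('lock_group',
                            'node-a.example.com',   -- an existing member
                            'node-c.example.com');  -- this node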
45:11
It depends, let me see actually. So it's an Elastic Load Balancer.
45:21
So the way Elastic Load Balancer works, this is an Amazon service, is it sets up an EC2 instance more or less in every availability zone in which you have instances, and then uses Route 53 to set up DNS records pointing to each of these things.
45:41
So let's say one data center goes down, like, you know, I took down one of my Paxos nodes, one of my cron nodes; one of these guys will also go down, but I still have two left. So my clients can just talk to any of those two, and ELB is constantly doing health checks, so it'll be able to detect, oh, this Paxos node is no longer running, don't send any traffic there.
46:02
So actually the single point of failure is kind of taken care of in ELB, but yeah, I mean it's trickier in your own kind of architecture to do that. Probably the simplest, yeah, pretty much
46:31
it depends, so Amazon or the load balancer actually is a load balancer, so any new connection will typically go to a different one of these machines.
46:43
And it, I mean, it performs constant health checks to figure out who's up and running, but there's no, like, you know, Amazon doesn't really pick, like, I'm always going to send my traffic to this machine. Every time you connect you actually get connected to a different machine. Well, if you try with an ELB and you
47:26
disconnect and reconnect and disconnect and reconnect, pretty much always it connects to a different one. It actually just picks a node based on your source port, kind of uses a hashing function of that source port; at least that used to be the algorithm.
47:44
I used to be at Amazon, so I know what it was like two years ago. I don't know what it is now.
48:05
Yeah, no, it does. So the question is, there's no conceptual, technical reason that you couldn't replicate DDL, and I'm trying to remember, I think actually if you do something like create index, it actually does replicate it, because it's kind of easy.
48:21
Alter table is a bit tricky because pg_paxos kind of sits just before execution, when the planning has already been done, and if there's an alter table happening before that, then the execution no longer makes sense, or the plan is no longer correct. So there's some limitations around that, but doing something like create table
48:45
actually isn't a problem, but then there's nothing to activate the Paxos log. Like, once I query the locks table, that's when I start replaying the log. But if I don't have a locks table, then how do I start? I mean, I could just create a function for that or something, but that's not there at the moment.
49:12
Good question. Is the Paxos log rotated? Currently it is not. It will grow infinitely. And yeah, there's also no reason it couldn't be rotated.
49:21
Pretty much, if you know all your nodes have reached a certain point, then you can just drop everything that came before that point. There's just currently no logic to do that. All right, that's my time. Thank you for listening. And if you have any other questions, I'll be around.