Future(s) of PostgreSQL (Multi-Master) Replication

Video in TIB AV-Portal: Future(s) of PostgreSQL (Multi-Master) Replication

Formal Metadata

Future(s) of PostgreSQL (Multi-Master) Replication
Alternative Title
Replication Futures
Title of Series
Number of Parts
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
BiDirectional Replication. In the course of the BDR (BiDirectional Replication) project we have worked on delivering robust, feature-full and fast asynchronous multi-master replication for Postgres. In addition we have started the UDR project, sharing most of the code and infrastructure with BDR, which provides unidirectional logical replication for the many cases where multi-master replication is not required. To implement BDR a lot of features have already been integrated into core PostgreSQL (9.4). Now that 9.4 is released and BDR/UDR is in production in several complex environments, there are some important discussions to be had about what can and what cannot be integrated into core PostgreSQL. We will discuss: which features are in core Postgres; which features BDR/UDR provides on top of that; what can be integrated into core PostgreSQL and how; future features; problems found during the development.
Can everyone hear me? Right, good. I'm Andres, and I work for Citus Data these days. I mostly develop Postgres and am a committer in the Postgres project.
What I'm talking about today is my and others' work on logical replication in Postgres. It's important to say that a lot of the work here has been done not at Citus Data but at 2ndQuadrant, to some degree by me while I was working there, but also by others at 2ndQuadrant. I say that just for fairness. If you want the slides in front of you, the URL shown there has them. The very basic distinction between logical replication and physical replication is that physical replication works on the block level. It doesn't look at the meaning of the data very much; it says "this block has changed" or "we have a new version of this block" and then replicates that to the other side. Due to that low-level nature it's very hard to make it more flexible.
So what we worked on over the last few years was to implement a logical replication solution for PostgreSQL. At the beginning I'm going to talk about what we now have with both UDR and BDR, and what those two things are. Then I'm going to talk about what the existing infrastructure pieces are and what we newly integrated for 9.5. And then, for me, the interesting question, where your input is welcome, is to discuss a bit where we go from there, because that's why the title talks about different versions of the future. UDR just stands for unidirectional replication; it's very unimaginative. You can have a primary with a set of tables, and you can have a second database, that's the thing on the right, and you want it to also have all the contents of the primary. The model we came up with is that nodes can subscribe to the primary, and all the data they subscribe to will then also be moved to the standby and kept up to date.
What happens when you start subscribing is that it copies all the existing data. That can obviously take a long while, because you might have a terabyte of data, and copying terabytes of data isn't particularly fast. After that's done, it starts to stream out all the pending changes, and that works in a way that you don't lose any changes between the copying step of the tables and the changes that have happened since then.
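The copy-then-stream idea can be pictured with a small toy model. This is only a sketch under stated assumptions, not how UDR is implemented: the `Primary` and `Subscriber` classes, the in-memory change log and the "log position" are all hypothetical stand-ins. The point it illustrates is that the snapshot remembers the log position it corresponds to, so a write that happens during the copy is picked up by the later catch-up.

```python
class Primary:
    """Toy primary: a table plus an append-only log of changes."""

    def __init__(self):
        self.table = {}
        self.log = []  # (key, value) changes in commit order

    def write(self, key, value):
        self.table[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A consistent copy of the data together with the log position
        # that this copy corresponds to.
        return dict(self.table), len(self.log)


class Subscriber:
    """Toy subscriber: loads a snapshot, then replays later changes."""

    def __init__(self):
        self.table = {}
        self.position = 0  # index of the next log entry to apply

    def initial_copy(self, snapshot, position):
        self.table = dict(snapshot)
        self.position = position

    def catch_up(self, primary):
        # Apply everything that was logged after the snapshot position.
        for key, value in primary.log[self.position:]:
            self.table[key] = value
        self.position = len(primary.log)


primary = Primary()
primary.write("row1", "hello")

sub = Subscriber()
snap, pos = primary.snapshot()
primary.write("row2", "written during the copy")  # arrives while copying
sub.initial_copy(snap, pos)
sub.catch_up(primary)
print(sub.table == primary.table)
```

Because the subscriber starts replaying exactly at the position the snapshot was taken, nothing falls into the gap between the two steps.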
So I think we can just start by doing a little demo of that. Let me start [unclear], because that makes running things easier.
We're on PostgreSQL version 9.4 here, because that's what is released.
Because it's simpler on my laptop, I'm just going to connect two databases inside the same Postgres installation, because then I don't have to specify ports and everything. Obviously the intention is not that you connect two databases in the same cluster, but that you connect databases on several machines or something like that. So what I'm going to do is create two databases: one of them is the UDR primary and the other the UDR secondary.
Then on the primary, just to make it slightly more interesting, we create a table, an initial data table, and insert one row into it, so we actually have some data which, if you remember the earlier slide, will then be copied when we subscribe to the master, or to the primary as I said here. I'm now going to the secondary.
With \c in psql you connect to a different database. You can see that I'm now creating two extensions. The btree_gist extension is just a prerequisite of our extension; unfortunately Postgres doesn't allow dependencies of extensions to be installed automatically, so you have to do that manually. The interesting part is that we then run CREATE EXTENSION bdr, and you might notice that it says BDR instead of UDR. That's because it's the same code, it just has a different user interface; maybe we should have an alias for that.
What I'm now going to do is the subscription step I talked about earlier. We call a BDR function, bdr_subscribe, which goes to the other node. First we have the local node name: for some things every node has to have a name, and we identify them by an arbitrary string; in this case we just name it the demo secondary. Then we say we are subscribing to another node and give it a normal connection string for Postgres. It connects to the same host, but we go to a different database, in this case the UDR primary. Then there's one last piece of required information: we also have to give it a connection string to the local database, the one we're currently connected to. That's because we want to restore data into the database, and we have to start a program that connects to the database, and that's currently only easily possible when you have a connection string. So let me just execute that.
That failed. You don't only have to create the extension on the standby but also on the primary; that's what I forgot.
So let me create the extension there as well. And again I did that in the wrong database. So I'm back on the secondary and I'm trying again.
OK, so this started the subscription process, which means that in the background all the tables are now getting copied. Our great initial table had one row, so that's hopefully not taking long. We have a function that waits for the copy of the subscription to complete, that is, for the node join to be ready, and that has completed now. So we should in theory see that this time we have the row there. [unclear]
The other row is there because I accidentally used the postgres database to create things earlier; that's just on the side, forget that. But a tool that just copies the database is not that interesting; the more interesting part is that it also keeps up with changes. So I reconnect to the primary and insert another row. My spelling earlier wasn't very great, so I'm trying to use actual capitalization this time, and hopefully that worked.
So it works. I think this is already quite a bit easier than a lot of other replication solutions out there, because you can do everything from SQL. Whether all the details of the user interface are absolutely perfect, I'm doubtful; I wrote a large part of it, so probably not. But I think it's a pretty good start.
That was just a very quick overview of how the simplest case works. Obviously there are many more features. You can use a base backup to create a clone, which for a very large database is obviously much better than using a logical dump, which takes ages and can cause load and so on. Another interesting part is that this feature works across versions from 9.4 onwards. So once 9.5 is released you can install the UDR extension there, replicate between 9.4 and 9.5, and then fail over to 9.5 to have a very, very short downtime for the upgrade. The only thing you have to do is call the promote command on the standby and make sure that you redirect all your applications. If you are very careful you can do that in around a second, and all that happens is that the clients have to reconnect.
You may have noticed that we only had a primary and a secondary, and that the secondary did not replicate changes back to the primary. That is normal if you have an active/passive, primary/standby type of system. But there are use cases where it is very useful to be able to write to multiple nodes in the system, perhaps because the latency to get to the actual master is too high. The initial motivating feature was that we wanted to build asynchronous multi-master systems, where changes obviously flow in multiple directions. Again we start with one database; in this case it is not named primary, because there won't be a primary. We have node 1, and then we say: OK, we now want to set up a replicated system.
So we create a multi-master setup that at the beginning only contains one node, which is obviously not a very interesting setup. That's what we call creating a group of nodes. At the beginning that group has one node, and then further nodes can join that group.
Each node that joins the group will replicate with the existing nodes in both directions. As you join more, you'll see that it builds connections between all nodes, up to a fully meshed setup. It always does a fully meshed setup. We are thinking about allowing more complex topologies, but for now you always have to have a fully meshed setup. There are actually good reasons for that at lower node counts: if you have a single node in between somewhere and that connection is down, none of the changes flow around it anymore, and that can be problematic because you then can't reach a quorum and similar things.
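To get a feeling for what "fully meshed" means as the group grows, here is a small counting sketch. It is not BDR code; `full_mesh` and `join` are hypothetical helper names, and a "connection" is modelled simply as a directed (sender, receiver) pair. With n nodes there are n * (n - 1) such pairs, and each joining node adds two connections per existing member.

```python
from itertools import permutations


def full_mesh(nodes):
    """Return every directed (sender, receiver) pair between distinct nodes."""
    return list(permutations(nodes, 2))


def join(group, new_node):
    """Return the enlarged group and the connections the join adds."""
    added = [(new_node, n) for n in group] + [(n, new_node) for n in group]
    return group + [new_node], added


group = ["demo1"]  # a freshly created group has a single node
group, added = join(group, "demo2")
group, added = join(group, "demo3")

print(len(full_mesh(group)))  # directed connections in a three-node mesh
print(added)                  # what the third node's join had to set up
```

The quadratic growth is one reason a full mesh is most comfortable at lower node counts, which matches the remark above.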
So, yes, you guessed it: we're going to do this again as a demo.
Let me start from a clean cluster, so we're starting afresh.
Except that BDR, if you have it loaded in the configuration file, always creates a supervisor database that you can't actually connect to. Just forget about that, but it exists here. For simplicity reasons, and screen size reasons, we are only going to set up a network of two systems. So let me just create two databases.
We now connect to the first database, and again what we do first is create the extension, to actually be able to invoke all these commands.
As I said earlier, we start by creating a group of nodes that only consists of the first node. If you're only at the first node, it can't actually join another node, because there are no other nodes. So the first command is not a join. Sorry, I also want to create the same initial data again.
So the first command we have is "create a group of nodes". We give it the local node name; this is the first node, so we name it demo 1 or whatever. And, very importantly, we have to tell it how all the nodes in the future will be able to reach this node. That's because you don't want to configure all connections to all existing nodes every time you join one node to the network. New nodes have to be able to go to one node and collect all the connection information from there. So you tell it how to connect to this node, and in this case that is just the database name. If you had several machines you'd also have host equals whatever, or an IP address, you could configure the user name, just everything you can configure in a connection string.
So let's connect to the second database and create the extensions there.
We first created the group, and now we have something to join. We have demo 2, and that's its local node name. Again we have to tell it how the nodes can connect to this one, because when it joins, all of the other nodes also have to set up connections back in this direction. So we have to say how the other nodes can reach it, and we say from which node we are getting the initial information. If you have, say, 10 nodes, you can choose any of those 10 nodes to join via, and the data is initially cloned from that one. Normally you'd use a node that's geographically close, because that will be cheaper.
Again this starts the process in the background, so again we have to wait until it all completes. It always takes 500 ms even if there's nothing to do; here it's about 2 milliseconds of actual work, and the rest is sleeping. We could optimize that, but normally, when there is a real database to copy, a 500 ms wait doesn't matter much. It's not nice, though. So now we are on node demo 2.
Again the row from node 1 is there. The interesting part now is that we have a setup where the data is supposed to replicate in both directions. Unfortunately I forgot to show you earlier that with UDR you can't currently replicate DDL automatically, which means that if you just do a CREATE TABLE or ALTER TABLE it will give you an error. There is a function for that; let me also show what the function looks like.
This function allows you, in UDR and BDR, to just take a command as a string and execute it on all the nodes at the right moment. You'll notice, or maybe not, that I have schema-qualified the table; it enforces that, because otherwise the result would depend on the search_path. Everything that's not in pg_catalog has to be specified explicitly. For BDR we actually have a nicer solution, because there we had the opportunity to extend Postgres a bit more; I'll come to that later.
With BDR, on node 2, which is not the node where we initially created the initial data table, I can now just create another table, and this will hopefully also be created on node 1.
So this table has automatically been replicated to the other node.
But if you have a distributed system where you can insert rows on multiple nodes at the same time, and they are not synchronous, you have the problem that it's not easy to generate keys. Given data that hasn't got a natural primary key, you have the difficulty that you can't just use a normal sequence, because the values would conflict between the nodes. One solution we have for that is that we have extended the CREATE SEQUENCE command so that you can specify USING bdr, and that will use a special kind of sequence that does distributed voting to coordinate chunks of values between the systems. You may have noticed that I didn't have a default value for the primary key of the table here, and now I'm changing the table's default to the nextval of the sequence, which uses this special kind of voting.
Just to demonstrate that we actually allocate ranges: on this node you'll see that the first value it actually uses is 5001, and I presume that the voting process will have given the other node a different range. [unclear] You can see that these number ranges are distinct and always coordinated between the two; there will never be any overlapping values. And we pre-allocate these values, so even if the network is down you have some reserves to continue with.
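The chunk idea can be sketched in a few lines. This is a deliberately simplified model and not BDR's algorithm: the distributed voting is replaced by a single hypothetical `ChunkAllocator` object, and the chunk size of 5000 is an assumption chosen only so that the numbers resemble the demo. What it shows is that each node draws values from a locally reserved, disjoint range and only needs the other nodes when that reserve runs out.

```python
class ChunkAllocator:
    """Stand-in for the voting step: hands out disjoint chunks of values."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.next_start = 1

    def allocate(self):
        start = self.next_start
        self.next_start += self.chunk_size
        return range(start, start + self.chunk_size)


class NodeSequence:
    """Per-node sequence that serves values from its reserved chunk."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.values = iter(())  # no chunk reserved yet

    def nextval(self, network_up=True):
        try:
            return next(self.values)
        except StopIteration:
            if not network_up:
                raise RuntimeError("local reserve exhausted, no new chunk available")
            self.values = iter(self.allocator.allocate())
            return next(self.values)


allocator = ChunkAllocator(chunk_size=5000)  # assumed chunk size
node1 = NodeSequence(allocator)
node2 = NodeSequence(allocator)

a = node1.nextval()                  # node1 reserves the first chunk
b = node2.nextval()                  # node2 reserves the next, disjoint chunk
c = node2.nextval(network_up=False)  # still served from the local reserve
print(a, b, c)
```

A real implementation has to agree on chunks without a central allocator, which is where the voting mentioned in the talk comes in.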
So I've inserted something, and I'm going to the other node just to show it. I inserted on node demo 1, I'm now connected to demo 2, and I'm updating the row here, just to prove that changes actually go in both directions. We can just select from the session table; right now it has my session marked as valid.
I can now just invalidate the session on node 2, and we can make sure that it's now also invalid on the other node. Great, that worked.
But now you might say: this is all basically instantaneous; are you sure you're not just lying to us, and it's the same database and you just changed the name there? That would be easy to fake. So what I'm going to do now is simulate a more complex setup. You can configure BDR so that it behaves as if we had a 30 second delay on the wire. 30 seconds is not really realistic, but it's just much easier to demonstrate if we have a couple of seconds of time. We have to reload the configuration.
I'm now going to insert another row into the session table, for someone else, and I'm now connecting to the other node, and you'll see that right now the row has not got there yet. I use the psql \watch command, which just executes the command every 2 seconds, and I sure hope that at some point we're going to see the other row being replicated.
It's pretty obvious that if you do this kind of thing, where you have a 30 second delay between machines and they update rows at the same time... great, so that works. [Audience question about the delay.] Yes, it's a delay on everything. We're basically saying: we see the change, but we only apply it once it was generated more than 30 seconds in the past. It happens on a per-transaction basis. [Audience question about batching.] No, every transaction is delayed by 30 seconds; even if we receive it earlier, we just sleep until those 30 seconds have passed. With this kind of delay, and even with smaller ones, you can have conflicts between the systems. What BDR does is resolve them by default using last-update-wins. That means that, independent of the node on which it happens, the last update wins, and we make sure these updates are coordinated between the systems so that each node resolves the conflict in the same way.
So let me connect to node 1 and update the session, setting it to valid again. Now imagine that a couple of seconds later, on node 2, it gets set to invalid again.
I hope I didn't take longer than 30 seconds, because otherwise there would be no conflict, which would make this pointless. At some point we'll now see that the other node's update arrives, and that will be resolved using last-update-wins. The node will see: OK, the other node has an update, but wait, that one is older than mine, so my own version is more important. If you do this kind of thing you obviously have to be very careful that these semantics are OK for you. It's also very interesting to see what actually happened, so we have a table that logs all these conflicts.
It logs two conflicts, because I tried it earlier.
You see that the resolution of the conflict was last-update-wins, keep local, because the local version was later. The local tuple had valid = false and the remote tuple had valid = true.
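The following sketch shows why a last-update-wins rule lets both nodes end up with the same row even though each one sees the conflict from its own side. It is an illustration only: the `Version` type, the `resolve` function, the action names and the tie-break on node name are hypothetical choices, not BDR's actual data structures or log values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: bool       # the "valid" column of the session row
    commit_ts: float  # commit timestamp of the writing transaction
    origin: str       # name of the node where the write happened


def resolve(local, remote):
    """Keep the version with the later commit timestamp.

    Ties are broken on the node name so that every node picks the same
    winner regardless of which side it looks from.
    """
    if (remote.commit_ts, remote.origin) > (local.commit_ts, local.origin):
        return remote, "apply_remote"
    return local, "keep_local"


on_node1 = Version(value=True, commit_ts=100.0, origin="node1")
on_node2 = Version(value=False, commit_ts=105.0, origin="node2")

# Each node receives the other's update and resolves the conflict locally.
winner_at_node2, action_at_node2 = resolve(local=on_node2, remote=on_node1)
winner_at_node1, action_at_node1 = resolve(local=on_node1, remote=on_node2)

print(action_at_node2, action_at_node1, winner_at_node1 == winner_at_node2)
```

As in the demo, the node with the later write keeps its local tuple, the other node applies the remote one, and both converge on valid = false.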
obviously there's like 20 more columns but it was hard to show on the scope so this was basically
just what you can do with br there's lots and lots more but I don't want to go too deep into is because more general about how can we make replication better person so
with at the moment the way this all fits together is that imposed is a we have added a bunch of features to allow this kind of thing to happen very efficiently so we have added logical decoding to post that's what I spend like far too much time to ever admit to anybody talk to myself really and we have and that allows you to do some cool things it allows you to extract all changes that happen in a database in could consistent fashion even if Kevin says that this more consistent and we have added background workers which are very useful because you can add to the database something that happens in the background you can add additional processors to the database and then Robert has continued making and even much more about but for the and actually we now use the version that Robert has improved because they're much more flexible so on top of that we have built an extension that's you the art works against top and it can do this the 1st demo and the 1st slide it can subscribe changed and replicate changes from primary to stand by its rather efficient we've been able to do uh 0 over what 28 thousand TPS with bench without letting the standby which is I think pretty good and from minus 4 onwards it's the across version and what I think souls pretty cool you can initialize it from a base that can even if you take like what you can do is take a base back out from under 4 then run PG upgrades then uh and then catch up using logic replication of all the pending changes and then the program use that as a the new master which allows you to operate of very large data it is somewhat reasonable of time then on top of that we have added the yard and unfortunately at the moment of some of these features required modifying call prospects for example the DDL replication where just execute a DL and was automatically forget to the other nodes requires that we make the DLO between is that we are able to catch the DL naked ambiguous and going back to that in a 2nd we have to add 
conflict resolution, and we wanted to have these sequences that are distributed, and for that there's no way to hook into Postgres, so we added that. But this is all a modified version of Postgres: it's the same license, and it lives in a Postgres repository. The important part is that all of these changes that make BDR possible have been submitted upstream, and, what I think is pretty cool, the majority of those are now in upstream
Postgres, in 9.5. That's why for 9.4 we still need the modified version. So, just to go back for a second: in 9.4 we have background workers and logical decoding. I don't want to go
into too much detail about logical decoding, but just very quickly: it allows you to get all the changes in the database, and you get them in commit order. The important part is that you can write output plugins, and these output plugins allow you to transform the changes from the way they happen in Postgres, in Postgres's internal format, into any format you want, if you write the right code. You can output JSON, you can output SQL, you can output something in binary because you think it's more efficient, you can output them as something like Protocol Buffers (somebody has actually written that), and there are lots of different formats. It's not just useful for replication; it's also useful for things like synchronizing or feeding changes into other systems, or for some kinds of auditing. Yes, well, auditing has the problem that you can roll back changes. It's callback based: your output plugin defines a couple of callbacks and gets the data, and it can then stream out those changes in whatever format the plugin has been written for. At the moment this interface is available via SQL and as an addition to the replication protocol that we use for physical replication. So now, in
9.5, there are a bunch of additional features. We have commit timestamps, which allow us to do this conflict resolution: if you have an xid, which is the transaction identifier, we know at which time that transaction committed, or whether it has not committed, and then we can determine on which database the change happened last. An important part of it: that might lead to the wrong database winning, but it will win on all nodes, so there is no danger that you get an inconsistent database; it's just that the wrong one wins. And, more powerful, we added that you can do replication more efficiently and that you can keep track of the progress. I'm going to come back to that in a second.
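The last-update-wins idea described here can be sketched in a few lines. This is only an illustrative model, not BDR's code: the `RowVersion` type, the `resolve` function, and the tie-break on the origin name are assumptions made for the example. The point it shows is that every node, given the same two versions and their commit timestamps, picks the same winner regardless of the order in which it saw them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowVersion:
    value: str
    commit_ts: float  # commit timestamp of the transaction that wrote this version
    origin: str       # node on which the change was originally made


def resolve(local: RowVersion, remote: RowVersion) -> RowVersion:
    """Last update wins. Equal timestamps are broken by origin name
    (an assumption of this sketch) so that the choice is deterministic."""
    return max(local, remote, key=lambda v: (v.commit_ts, v.origin))


# Both nodes updated the same row concurrently.
v1 = RowVersion("written on node1", commit_ts=100.0, origin="node1")
v2 = RowVersion("written on node2", commit_ts=100.5, origin="node2")

winner_on_node1 = resolve(local=v1, remote=v2)
winner_on_node2 = resolve(local=v2, remote=v1)
```

If node2's clock were running ahead, `v2` would still win on both nodes: possibly the "wrong" winner, but the same one everywhere, which is the consistency property the talk relies on.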
So if you do replication in a system in both directions and there's a change coming in, it would come to node 1 and then be replicated to node 2, and that replication will happen by using logical decoding. But if you want to do it bi-directionally, the change that was applied there will just be decoded again and then sent back to node 1, and guess what's happening now: not very good. So you have to have a method of preventing that from happening. So what we did is
that we added the feature that you can invent generic names for individual nodes, and then you can say: this session at the moment is not actually doing original changes, it's replaying changes from another node. I can say this node is replaying on behalf of the other node, and then the output plugin of logical decoding gets the information from which node these changes originate.
So again, we insert, we replicate, and then node 2 will get the rows in the output plugin, but the output plugin can say: oh, the change originated at node 1, there is no need to stream that back, and can just say, I'm forgetting that. That also allows you to do more complex replication setups. But that's not the only thing that infrastructure is useful for.
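A toy model of that origin filter, to make the loop prevention concrete. Nothing here is the PostgreSQL or BDR API: `Node`, `Change`, `write`, and `stream_to` are hypothetical names, and the in-memory `log` list merely stands in for the decoded change stream. Each change carries the name of the node where it was first made, and the sender skips changes that originated on the receiver.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    row: str
    origin: str  # name of the node where the change was first made


@dataclass
class Node:
    name: str
    rows: list = field(default_factory=list)
    log: list = field(default_factory=list)  # stand-in for the decoded change stream

    def write(self, row, origin=None):
        # origin=None: a local, original change.
        # Otherwise this session is replaying a change on behalf of `origin`.
        self.rows.append(row)
        self.log.append(Change(row, origin or self.name))

    def stream_to(self, peer):
        """Send pending changes to peer, skipping those that originated there."""
        sent = 0
        for change in self.log:
            if change.origin == peer.name:
                continue  # the peer made this change itself; do not send it back
            peer.write(change.row, origin=change.origin)
            sent += 1
        self.log.clear()
        return sent


node1, node2 = Node("node1"), Node("node2")
node1.write("a")                        # original change on node 1
sent_forward = node1.stream_to(node2)   # replicated to node 2
sent_back = node2.stream_to(node1)      # filtered: it originated on node 1
```

Without the `change.origin == peer.name` check, the row would bounce back to node 1, be logged again there, and circulate indefinitely.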
Imagine that you're replicating just from a primary to a standby, without any fancy bi-directional stuff. If you insert a couple of things into the primary, at some point, because you're doing asynchronous replication, they will get sent to the standby. It might be just the first row that gets sent in one TCP packet; it might also be several at once. In this case let's imagine just one is sent, then the third insert happens, and we replicate the other two in one TCP packet. But if you just replay them, we can have
the standby crash at some point. So let's imagine we
have sent the first one, and then the other two in one packet, but between the inserts for 2 and 3 the standby crashed. So you can't just say, I'm forgetting on the primary about all the changes that I have sent out, because they can get lost on the wire, and they can get lost while you're applying. So you have to keep track of how far you have replicated. So imagine there was a crash in between; it might be that node 1 crashed, or node 2, the primary, the standby, the replication solution, or anything. We have to know from which point to start again, that is, how far we have replicated. A very easy solution to that would be to just have a table and update a row in there, but the problem is that that row then gets updated for every transaction, which gets costly very fast. So what we have done is extend the replication origin system we saw earlier to also keep track of replication progress. So you
don't just say, my changes originate on node 1; you also say, for every transaction you're replaying, that the transaction on the source system happened at this address, an LSN, which is how we define progress in the database. So you set that up, saying, for example, this is the address, and let's just imagine that it happened at this time, and then you do the insert and all the other changes in this transaction, and then you commit. If you at some point crash and want to say, oh, I don't know what I actually succeeded in replicating last, then you can call a function, pg_replication_origin_progress, and that will give you the address right up to where you replicated. It will do so with very low overhead; depending on what exactly the workload is, it's from 0 to 2 percent of WAL overhead, at the high end if you do very, very many small transactions. And it even works if you use asynchronous commit on the standby. You might
want to use asynchronous commits on the standby, because there is no need to commit synchronously. So even if you replayed a transaction that committed, it might still not have persisted. But even in that situation, after the crash, the replication progress will give you
the correct answer, and it does so by hooking into the WAL replay of Postgres and keeping track of which rows actually persisted after a crash. So the other big
feature, which I think is one of the core features for making replication painless, is that we added a feature to hook into every DDL statement that happens and normalize it. For example, if you add a column to table blub, that table blub might exist in more than one schema, and then you have to figure out which one was actually meant. So you need to normalize that, to fully qualify everything. And there are other cases: if you add a column to a table as a serial column, you end up with something like five statements: it creates a sequence, creates the table, alters the table to set the default, alters the sequence to set its owner to the table. So this kind of stuff needs to be tracked.
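As a rough illustration of what "normalizing" a serial column means, the sketch below spells out the individual, schema-qualified statements that the shorthand stands for. The helper `expand_serial` is hypothetical and only builds strings; it does not quote identifiers and is not how BDR or PostgreSQL implement DDL capture. The sequence name follows PostgreSQL's documented `table_column_seq` convention, and the default is split into its own ALTER TABLE to mirror the list given in the talk.

```python
def expand_serial(schema, table, column):
    """Return the fully qualified statements that a serial column stands for.

    Illustration only: identifiers are not quoted, and a real system
    would work from the parsed command rather than from strings.
    """
    seq = f"{schema}.{table}_{column}_seq"
    return [
        f"CREATE SEQUENCE {seq};",
        f"CREATE TABLE {schema}.{table} ({column} integer NOT NULL);",
        f"ALTER TABLE {schema}.{table} "
        f"ALTER COLUMN {column} SET DEFAULT nextval('{seq}');",
        f"ALTER SEQUENCE {seq} OWNED BY {schema}.{table}.{column};",
    ]


# "CREATE TABLE blub (id serial)" executed with schema "public" first in the search path:
statements = expand_serial("public", "blub", "id")
for statement in statements:
    print(statement)
```

Because every object name carries its schema, replaying these statements on another node does not depend on that node's search path.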
And in 9.5 the basics for that are integrated. Unfortunately the normalization code wasn't ready for 9.5, but now you don't have to modify Postgres anymore; you can do that as an extension. So there are a couple of
features missing, but we'll work on them for 9.6. So what I
think where we are at now is that in 9.5 you have a very good toolbox for implementing a replication solution; you have lots of components. But in core Postgres all of these are just there without you noticing; they don't do anything, and not everyone wants to develop their own replication solution. So I think at this point we have to decide where to go from here. We can say,
yeah, we continue to improve that toolbox and build replication solutions externally, which is the position we have taken for logical replication solutions for a long time. Or we can say, hey, this is so important, we want to integrate something like this
into core Postgres. I think there are valid points for both. You can say, hey, we've never integrated a logical replication solution, we don't really need to do it now; it works well to just make Postgres better so that replication solutions become more efficient. That's fine: it reduces what we have to maintain in core; everyone else stays on the same footing, whereas otherwise people would just pick the replication solution in core Postgres because it's easier than all the other products; and we also might get the interface wrong, which I think is a pretty valid concern. We have not gotten everything right for hot standby; there is certainly stuff we would do differently if we were doing it today. And some people have argued that it's just not part of what we should have in core Postgres. The other side, which I think is the best point, is that pretty much every other data store has a logical replication solution. I don't think we can really afford not to make this as easy as possible. If you have to compile it yourself, get some other piece of code from some repository, get it installed, and configure it in different places, I don't think that's something we can really afford to continue doing. So I think we just have to get something in. A lot of these arguments on the negative side were also made for hot standby, and I think one of the best things that happened to Postgres is integrating hot standby, because we wouldn't be where we are today without hot standby. I think the case is also that users just trust stuff that's in core much, much more. So I think there's a lot of
questions on what to integrate and which parts of it. Do we want to just get unidirectional replication? Do we want the full bi-directional replication? Do we want the other things? I think there are a lot of valid positions on this. And I think one of the biggest questions is how we want to design the interface for core, because for UDR and BDR we had to basically use a function-based interface, because it's very hard to extend PostgreSQL's syntax from an extension. Is that the best solution? I don't know. I think functions are fine there, but I think that's one of the discussions we're going to have. I think stuff like saying this sequence uses a different mechanism, or this table should be in this replication set, should be DDL, and the control of the replication itself should be functions. But I'm very happy to discuss that, and that's why I'm giving the presentation today, so let's discuss that. The other part is who is actually going to work on this. I'm going to do some work, other people are going to do some work, but I think this is a very, very large project, so please help out. Unfortunately I don't have the time to get to the last point I
wanted to make, but yeah, there is a bunch of resources; just check it out and play around with it. It's in production on a bunch of systems, but it still has some rough edges around a bunch of features, so there's definitely user experience needed, because we don't necessarily find those rough edges anymore; we can walk around them blindly by now. So please help us out. Any questions? Not right now, but it would not be very hard to make it so; it just wasn't the priority. [inaudible] At the moment it's just a constant in the code; we just have to invent the interface for that. That was actually the last point I wanted to make, that this is something we need to improve: at the moment you just have to stream a very large transaction out, which means that smaller transactions happening concurrently will be delayed in apply. It will work, but it will increase your replication lag for a while. It's the same with any of the other logical replication solutions, but I think we definitely need to improve on that. It does that on disk: it spills to disk when needed, and otherwise it's in memory. [inaudible] That's how we detect conflicts, on a per-row basis: if there is a unique index and you violate that, that's how we say this row is the same as the one affected on the other side. Then, depending on whether we can resolve the conflict, either last update wins again, or it gets logged to the conflicts table and then you have to resolve it yourself. But I think it's not generally the case that you want to do actual full multi-master with many, many writers everywhere. You can just say: most of my data is actually just going to be modified on this continent, this other data is going to be modified on that continent, and I just have a set of shared data in between. You have to do that yourself, but you don't get inconsistent results if the
clocks are off; basically the wrong node will be the last one, but it will be the same on all systems, because we use the commit timestamp to resolve conflicts, and that will be the same on all nodes. Basically you can configure replication sets and say, I want to replicate these tables. The only problem is that in the initial clone we copy all the tables at the moment; other than that, it is supported. Replicating only a part of a table, only some of its rows, is not there at the moment, because it's just not that interesting. What you can do is say, I don't want to replicate deletes in one direction, so you can build an archive over here and keep only a small set of data here. But partial replication of rows is not supported. If you partition stuff then it works, for example, and I think that's the way you have to take, because otherwise you get into very strange territory, and that will be weird. I think partitioning the data along the lines of what you want to replicate is going to be better. But if you have a great idea for how to implement it, please open an issue and communicate with us. No, indeed, it uses logical decoding, which uses Postgres's WAL, which is already written anyway, and then picks up the changes from there. OK, so that's an option right now, and for some reason we changed the default to false. I don't know why, but I think we should revert that; it should be on by default. We have the option, but it's not on right now. But I think we're out of time anyway, so I'd like