
Future(s) of PostgreSQL (Multi-Master) Replication


Formal Metadata

Title
Future(s) of PostgreSQL (Multi-Master) Replication
Alternative Title
Replication Futures
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Production Place
Ottawa, Canada

Content Metadata

Abstract
BiDirectional Replication: In the course of the BDR (BiDirectional Replication) project we have worked on delivering robust, feature-full and fast asynchronous multi-master replication for Postgres. In addition we have started the UDR project, which shares most of the code and infrastructure with BDR and provides unidirectional logical replication for the many cases where multi-master replication is not required. To implement BDR, a lot of features have already been integrated into core PostgreSQL (9.4). Now that 9.4 is released and BDR/UDR is in production in several complex environments, there are some important discussions to be had about what can and what cannot be integrated into core PostgreSQL. We will discuss:
- Which features are in core Postgres
- Which features BDR/UDR provides on top of that
- What can be integrated into core PostgreSQL, and how
- Future features
- Problems found during the development
Transcript: English (auto-generated)
Can everyone hear me all right? Good. So I'm Andres. I work for Citus Data these days. I mostly develop Postgres, and I'm a committer for the Postgres project. What I'm talking about today is my and others' work
on logical replication in Postgres. It's important to say that lots of the work here has been done not by Citus Data but by 2ndQuadrant — to some degree by me while I was working there, but also by others at 2ndQuadrant. I'm saying that just for fairness. And if you want to get the slides,
just to have them in front of you, the URL there will have them. So as a very quick distinction between logical replication and physical replication is that physical replication works on the block level. It doesn't look at the meaning of the data very much.
So it says this block has changed, or we have a new version of this block. And then that's replicated to the other side. Due to that low-level nature, it's very hard to make it a bit more flexible.
So what we worked on over the last few years was to implement a logical replication solution in Postgres. And I'm, at the beginning here, going to talk about what we now have with both UDR and BDR and what those two things are. And then I'm going to talk about what
the already-existing infrastructure pieces are and what we newly integrated for 9.5. And then, for me, the interesting question is to discuss a bit — and I'm welcoming your input — where we go from there, because that's why I named the talk this way: it's about different versions of the future.
So with UDR, which just stands for unidirectional replication — very unimaginative — you can have a primary with a set of tables. Those are the darker blue things; I'm very bad at graphics, sorry for that. And you can have a second database.
That's the thing on the right. And you want to make it also have all the contents of the primary. So what we came up with the model is where other nodes can subscribe to the primary. And then all the data they subscribe to will then be also moved to the standby
and kept up to date there. So what will happen is that when you start subscribing, it will copy all the existing data. And then as soon as that's done, which obviously can take a long while, because you might have a terabyte of data and copying a terabyte of data isn't particularly fast,
after that's done, it will start to stream out all the pending changes. And that happens to work in a way that you don't lose any changes between the copying step of the tables and the changes that have happened since then. So I think we can just start by doing a little demo of that.
So let me start the server, actually, because that makes running things easier. So I think, yes. So we are here on Postgres version 9.4,
because that's what is released. And because it's simpler on my laptop, I'm just going to connect two databases inside the same Postgres installation, because I don't want to specify ports and everything. But obviously, the intention is not that you connect two databases in the same cluster,
but that you connect several databases on several machines, or something like that. So what I'm going to do is create two databases, where one of them is the UDR primary and the other the UDR secondary. And then on the primary, just to make
it slightly more interesting, we create a table, initial_data, and insert one pink row into it so we actually have some data, which, if you remember the earlier slide, we will then copy when we subscribe to the master,
or to the primary, as I called it here. So I'm now going to the secondary — with \c in psql you connect to a different database. And if you look at it, I'm now creating not one extension but
two extensions. The btree_gist extension is just a prerequisite of our extension, and unfortunately Postgres doesn't allow dependencies of extensions to be installed automatically, so you have to do that manually. But then the interesting part is
that we do create extension BDR. And you might notice that it says BDR, not UDR. That's because it's the same code. It just has a different user interface. Maybe we should have an alias for that. I don't know. So what I'm now going to do is do this subscription step
I talked about earlier. We call a BDR function, BDR subscribe, which goes to the other node. Then we have the local node name. Every node, for some things, has to have a name. And we identify them by an arbitrary string.
In this case, we just name it demoSecondary. And we say we are subscribing to another node. And we give a normal connection string for Postgres. We connect to the otherwise same host, but we go to a different database, in this case the UDR primary. And then there's one last required piece of information.
We have to also give a connection string to the local database, to the one we are currently connected to. That's because we want to restore data into the database. And we have to start a program that connects to the database. And that's currently only possible easily
when you have a connection string. So let me just execute this. Oh, that's not good. Oh, I know why.
Yes, I forgot the extension. You don't only have to create the extension on the standby, but also on the primary; that's what I forgot. Oh, and I also forgot to create the demo data in the table.
So I'm creating the demo data again. I did that in the wrong database. So I'm back to the secondary. And I'm trying again.
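To recap what was typed in this first demo, here is a rough sketch of the whole UDR setup. The function and parameter names are those of the 0.9-era BDR/UDR extension as I remember them and may differ in your version, and the table definition is made up for illustration:

    -- On the primary database (udr_primary): the extension must exist on both sides.
    CREATE EXTENSION btree_gist;
    CREATE EXTENSION bdr;
    CREATE TABLE public.initial_data (id serial PRIMARY KEY, colour text);
    INSERT INTO public.initial_data (colour) VALUES ('pink');

    -- On the secondary database (udr_secondary): install the extensions and subscribe.
    CREATE EXTENSION btree_gist;
    CREATE EXTENSION bdr;
    SELECT bdr.bdr_subscribe(
        local_node_name  := 'demo_secondary',
        subscribe_to_dsn := 'dbname=udr_primary',    -- connection string to the node we subscribe to
        node_local_dsn   := 'dbname=udr_secondary'); -- how the apply worker reaches this database

    -- Wait until the initial copy is done; this is the wait function mentioned just below.
    SELECT bdr.bdr_node_join_wait_for_ready();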
So this started the subscription process, which means that now, in the background, all the tables are getting copied. Now, our initial data table had one row, so that's hopefully not taking very long. So we have a function that waits for the copy and
subscription to complete — that is, for the node join to be ready. And that completed now. So we should, in theory, see that, yes, we have 0.
Wait, that was only 1, I think. No, what? Did you create a row in the secondary also? No, no, I accidentally used the Postgres database to create. So that's just somewhere on the side.
Let's forget that. But it's obviously easy to write a tool that just copies a database. The more interesting part is whether it also keeps up with changes. So I reconnected to the primary. And now I'll insert again — yeah, my spelling earlier there
wasn't very great, so I'm trying to use actual capitalization. And that hopefully worked.
So I think it worked. And I think this already is quite a bit easier than lots of other replication solutions out there, because you can do everything from the SQL. Whether all the details of the user interface are absolutely perfect, I'm doubtful. And I wrote large parts of it, so probably not.
But I think it's a pretty good starting base. So that was just to give a very quick overview of how the simplest case works. Obviously, there are many more features. We can use a base backup to create the clone, which,
if you have a very large database, is obviously much better than using a logical dump, which takes ages and can cause bloat and stuff like that. So that can help. One other interesting part is that this feature works across versions from 9.4 onwards.
So if you have 9.5, you can install the UDR extension and then replicate between 9.4 and 9.5, play with 9.5, and then fail over to 9.5 to have a very, very short downtime for upgrades.
Basically, the only thing you have to do is to call the promote command on the standby and make sure that you redirect all client applications. If you are very careful, you can do that in around a second to three seconds. And all that happens is that the clients have to reconnect.
So now you might have noticed that we had a primary and a secondary, and that the secondary didn't replicate changes back to the primary, which is normal if you have an active passive or something like primary standby type of replication system.
But there's use cases where it's very useful to be able to write to multiple nodes in the system. Perhaps the latency to get to the actual master is too high. So the initial motivating feature is that we wanted to build an asynchronous multi-master system.
And there, obviously, the changes need to flow into multiple directions. So again, we start with one database. That database, in this case, is not named primary because there won't be a primary. We have node one. And then we'll say, OK, we now want to set up a replicated system.
So we create a multi-master setup. And initially, that multi-master setup only contains one node, which is obviously not a very interesting setup. But once we have done that, and that's what we call creating a group of nodes.
So at the beginning, that group has one node. And then we can join that group with further nodes. And between each node that joins the group, they will replicate in both directions. If you join more nodes, you'll see it builds connections between all nodes
up to a fully meshed setup. It always does a fully meshed setup. We are thinking about allowing more complex topologies. But for now, you will always have to have a fully meshed setup. And there's actually good reasons for that for lower node numbers. Because if you have single nodes in between somewhere,
if that connection is down, none of the changes flow around anymore. And that can be problematic, because we then can't reach quorums and similar things.
So yeah, you guessed it. We are going to do this again as a demo. I'm trying, at least. So I'm cleaning the cluster up now.
So we are starting afresh. And except that BDR, if you load it in the configuration file, always has to create a supervisor database that you can't actually connect to. That's like, forget about that. But it exists here. So we are going to, for simplicity reasons and time
reasons, I'm going to set up a network of two systems. So let me just create two databases. Oh, damn it.
I wanted to show something a little bit. So we now connect to the first database. And again, what we are going to do first is create the extensions, to actually be able to invoke all these commands.
So we create the extension. And then, as I said earlier, we start by creating a group of nodes that only consists of one node. If you only have the first node, it can't actually join another node, because there's no other node. So the first command is not about joining a node. Oh, let me, sorry, I also wanted to
create the same initial data again. So the first command — oh, that was too
quick — creates a group of nodes. We give the node a name; that's the first node, with the identity demo1, whatever. And what's very important: I have to say how other nodes will be able to reach this node in the future,
because every time you join a node to the network, you don't want to configure the connections to all existing nodes by hand. They have to be able to get to one node and collect all the connection information from it. So you tell it how to connect to this node. In this case, that's just the database
name; if the databases were on different hosts, you would have host equals whatever or an IP, you could configure the user name — just everything you can configure in a connection string. So let's connect to database two.
Again, creating the extensions. And what I'm going to do now, we first created the
group, and now we have something to join to. We have the demo node two, and we name it locally as demo two, and again, we have to tell other nodes how to connect to this one, because if you join the other node, the other node will obviously have to set up a connection in the direction back.
And so we have to tell how the other nodes can reach us, and we tell from which node we are getting the initial information. If you have like 10 nodes, you can just choose any of those 10 nodes to join, where you clone the data initially. Normally, you will use a node that's geographically close, because that will be cheaper to clone.
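Putting the two-node BDR setup together, a rough sketch — again with function and parameter names as in the 0.9-era BDR extension; check the documentation of your version:

    -- On the first database (bdr1): create a node group that initially has one member.
    CREATE EXTENSION btree_gist;
    CREATE EXTENSION bdr;
    SELECT bdr.bdr_group_create(
        local_node_name   := 'demo1',
        node_external_dsn := 'dbname=bdr1');   -- how other nodes will reach this node

    -- On the second database (bdr2): join the existing group, cloning from demo1.
    CREATE EXTENSION btree_gist;
    CREATE EXTENSION bdr;
    SELECT bdr.bdr_group_join(
        local_node_name   := 'demo2',
        node_external_dsn := 'dbname=bdr2',    -- how other nodes will reach this node
        join_using_dsn    := 'dbname=bdr1');   -- node to clone the initial data from
    SELECT bdr.bdr_node_join_wait_for_ready();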
So again, this starts this process in the background. So we all, again, have to wait until all that completes, whatever.
It always takes 500 milliseconds, even if there's nothing to do, but yeah. The two milliseconds is what it actually does. The rest is sleeping. Perhaps you could optimize for that, but normally, your database won't be two rows, so 500 milliseconds normally doesn't matter much, but yeah. It's not nice.
So we're in demo node two, and again, it clones the one row. But the interesting part is now that we have a setup where the data is supposed to replicate in both directions. I unfortunately earlier forgot to show you that with UDR,
you can't currently replicate DDL, which means if you just do a CREATE TABLE or ALTER TABLE, it will give you an error. There's a function that works around that; I'll show what the function looks like, since I forgot to show it.
So this function allows you, with UDR and BDR, to just take a string and execute it on all the nodes at the right moment. And you'll see that I've schema-qualified the table.
It forces that, because otherwise it would depend on the search path. Everything that's not in pg_catalog has to be specified explicitly. But for BDR, we actually have a nicer solution, because we had the opportunity to extend Postgres a bit more.
I'll come to that later. So with BDR, I can just now, on the node 2, which was not the node we initially created for the initial data table, I can now create another table.
And this will hopefully be also created on node 1. So this table has automatically been replicated to the other node.
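For reference, the UDR workaround mentioned a moment ago looks roughly like this; the function name is the one from the 0.9-era extension, and the DDL inside is just an example. With BDR, the plain CREATE TABLE shown above is captured and replicated automatically instead:

    -- UDR/BDR helper: take a DDL string and execute it on all nodes at the right
    -- point in the replication stream. Note the schema qualification: anything
    -- outside pg_catalog must be fully qualified, because the command does not
    -- run with your local search_path.
    SELECT bdr.bdr_replicate_ddl_command(
        'ALTER TABLE public.initial_data ADD COLUMN extra text;');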
But if you have a distributed system where you can insert rows on several nodes at the same time, and they're not synchronous, you have the problem that it's not easy to generate keys. If your data doesn't have a natural primary key, you
have the difficulty that you can't easily use a plain sequence, because the values will conflict between the nodes. One solution we have for that in BDR is that we have extended the CREATE SEQUENCE command so that you can specify USING bdr.
And that will use a special kind of sequence that does distributed voting to coordinate chunks of values between the systems. You'll notice that I didn't add a default value for the primary key above; I'm now changing the table so that the
id column defaults to nextval() of a sequence that uses this special kind of voting. So just to demonstrate that we actually allocate distinct ranges: on this node, you'll see that the
first value it actually uses is 5001. And I presume that the voting process will have given the other node a different range. Yeah, it started from two. For some reason the first few values always get used up, but whatever. So you can see that these ranges of numbers are
distinct, and the nodes will always coordinate between the two. There will never be any overlapping values. And we pre-cache some of these values, so even if your network is down, you have some reserve to continue inserting.
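A hedged sketch of the sequence setup just shown; the USING bdr clause is BDR-specific syntax (it needs the patched 9.4 that BDR ships with), and the sequence name here is made up:

    -- A global sequence: the nodes vote to hand out distinct chunks of values, so
    -- concurrent inserts on different nodes never produce the same id.
    CREATE SEQUENCE public.session_id_seq USING bdr;

    -- Use it as the default for the demo table's primary key column.
    ALTER TABLE public.session
        ALTER COLUMN id SET DEFAULT nextval('public.session_id_seq');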
So I inserted something, and I'm now going to the other node, just to show that I inserted on node demo one, on
the database demo one, and now I'm connected to demo two, and now I'm updating it here, just to prove that changes actually go in both directions. So we can just go and select from the session table. Right now, it has marked my own session as valid.
And we can now just say, OK, I'm invalidating my session, whatever, on node two, and we can just make sure that it's now also on the other node.
So great, that worked. But now, you might say, this is all basically instantaneous. Are you sure that you're not just lying to us, that it's the same database and you just changed the name there? That would be easy to hack. So what am I going to do now? To simulate more complex setups, you can configure BDR so that we pretend we had a 30-
second delay on the wire. Obviously, 30 seconds is not really realistic on earth, but it's just much easier to demonstrate if you have a couple of seconds to type. We have to reload the configuration.
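The knob being used here is, as far as I recall, the bdr.default_apply_delay setting, which is given in milliseconds; a hedged sketch of the configuration step (you can equally edit postgresql.conf and reload):

    -- Pretend there are 30 seconds of latency on the wire by delaying apply.
    ALTER SYSTEM SET bdr.default_apply_delay = 30000;  -- milliseconds
    SELECT pg_reload_conf();                           -- the reload step from the talk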
So now — I inserted another row into the session table, just for someone else, and I'm now connecting to the other node. And you'll see that right now, the row is not yet there.
I used the \watch command in psql, which just executes the query every two seconds. And I sure hope that at some point here, we're going to see the other row being replicated. But it's pretty obvious that if you do this kind of thing where you have like a 30-second delay between each
machine, and they update rows at the same time — at some point, oh, it showed up, great. So that worked; I didn't lie completely. Yes? Does it collect the transactions and batch them? No, it's a delay on every transaction. So we're basically saying, we receive these changes, but
we only apply them once they were generated more than 30 seconds in the past. So it happens on a per-transaction basis, but we can still send them in batches. It doesn't collect everything for 30 seconds and send it in one go? No, no, no. Every transaction is delayed by 30 seconds, and if we receive it earlier, we just say, OK, we're
sleeping until these 30 seconds have passed. Let me abort that. So as I said, with this kind of delay, and obviously even smaller ones, you can have conflicts between the systems. And what BDR does is resolve them by default using last-update-wins. That means that independent of which node it happens on, the last
update wins, and we'll make sure that we coordinate these updates between the systems so that each node resolves them the same way. So let me connect to node one and update my session. I'm setting it to valid again.
But now imagine that a couple seconds later on node two, it's set to invalid again. Wait, there should be an update there. Was it an update? No.
So I hope I didn't take longer than 30 seconds, because otherwise there will be no conflict, which would make this a bit awkward. But we did an update here, and because we'll now at some point receive that the other node did an update as well.
And that will be resolved using last update wins. And so we'll see, OK, the other node did an update, but wait, that's older than what I have done. So my own version is more important. And if you do this kind of thing, you obviously have to be very careful whether the semantics of that are OK. And it's also very interesting to see how often
that happens. So we have a table that logs all these conflicts. Yes. It logged two conflicts, because I tried it earlier. And you see that the conflict resolution was last-update-wins, keep local, because the
local version was later. The local tuple had valid = false and the remote tuple had valid = true, but it says keep local. Obviously, there are like 20 more columns, but it's a bit hard to show on this kind of screen. So this was basically just what you can do with BDR.
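For reference, the conflict log just shown can be inspected with a plain query. The table and column names below are the ones I remember from the 0.9-era BDR extension and may differ in your version:

    -- Each resolved conflict is recorded, including how it was resolved and which
    -- tuple won, so you can audit how often last-update-wins actually fired.
    SELECT conflict_type, conflict_resolution, local_tuple, remote_tuple
    FROM bdr.bdr_conflict_history
    ORDER BY conflict_id DESC
    LIMIT 10;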
There's lots and lots more, but I don't want to go too deep into it, because this talk is more generally about how we can make replication better in Postgres. So at the moment, the way this all fits together is that in Postgres 9.4, we have added a bunch of features to allow this kind of thing to happen very efficiently.
So we have added logical decoding to Postgres. That's what I spent far more time on than I'll ever admit to myself, really. And that allows you to do some cool things. It allows you to extract all changes that happen in a database in a consistent fashion, even if Kevin says
that there's a more consistent version. And we have added background workers, which are very useful, because you can add to the database something that happens in the background. You can add additional processes to the database. And then Robert has continued making them much better.
And actually, we now use the version that Robert has improved, because it's much more flexible. So on top of that, we have built an extension, UDR, that works against stock Postgres. And it can do what was in the first demo and the first slide: replicate changes from primary to standby.
It's rather efficient: we've been able to do over 28,000 TPS with pgbench without lagging the standby, which is, I think, pretty good. From 9.4 onwards, it works across versions.
And what I think is pretty cool is that you can initialize it from a base backup. What you can do is take a base backup from 9.4, then run pg_upgrade, then catch up using logical replication of all the pending changes, and then use that as the new master, which allows
you to do upgrades of very large databases in a somewhat reasonable amount of time. Then on top of that, we have added BDR. And unfortunately, at the moment, some of these features required modifying core Postgres. For example, the DDL replication, where I just could execute a DDL, and it was automatically replicated
to the other node, requires that we are able to catch the DDL and make it less ambiguous. I'm coming back to that in a second. We had to add conflict resolution. And we wanted to have these sequences that are distributed. And there's no syntax for that in Postgres.
And there's no way to hook into that in Postgres. So we added that. But this is all in a modified version of Postgres. It's the same license. It lives on a git.postgresql.org repository. And the important part is all of these changes that are
making BDR possible have been submitted upstream. And what I think is pretty cool is that the majority of those are now in upstream Postgres, but only in 9.5. That's why for 9.4 we still need the modified version. So just to go back for a second, in 9.4, we have background workers and logical decoding.
I don't want to go into much detail about logical decoding. But just very quickly, it allows you to get all the changes in the database. You get them in commit order. And the important part is that you can write output plugins. And these output plugins allow you to transform the
changes, the way they happen in Postgres — the Postgres-internal format — into any target format. If you write the right code, you can output JSON. You can output SQL. You can output something in binary, because you think it's more efficient. You can output them as protocol buffers.
Somebody has actually written that. And there are lots of different formats. It's not just useful for replication; it's also useful for synchronizing other caches, feeding changes into Kafka to keep other systems coherent, and other stuff. Auditing? Yes.
Well, for auditing, you have the problem that you can roll back changes. And it's callback-based: your output plugin defines a couple of callbacks, they get the data, and they can then stream out those changes. And they can output that data to any
output method that we have written. At the moment, there's an interface via SQL, and there's output over the replication stream that we also use for physical replication.
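As a concrete illustration of that SQL interface, here is a minimal, self-contained session using the test_decoding example output plugin that ships with Postgres 9.4 and later (slot and table names are made up):

    -- Create a logical replication slot that uses the test_decoding output plugin.
    SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    CREATE TABLE decoding_demo (id serial PRIMARY KEY, payload text);
    INSERT INTO decoding_demo (payload) VALUES ('hello');

    -- Fetch the decoded changes for this slot, in commit order.
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    -- Clean up.
    SELECT pg_drop_replication_slot('demo_slot');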
So now, in 9.5, we have a bunch more additional features. We have commit timestamps, which allow us to do these conflict resolutions. If you have an xid, which is the transaction identifier, we know at which time that transaction committed, or whether it has not committed. And we can then determine on which database the change happened last. The important part is that if you have clock drift, that can lead to the wrong database winning.
But it will win on all nodes, so there is no danger that you get an inconsistent database; it's just that the wrong node wins. We also added features so that you can do the replication more efficiently and so that you can keep track of the progress.
I'm going back to that in a second. So if you do replication between two systems in both directions and there's a change coming in, that will come to node one. And then we replicate that to node two. And that replication will happen by using logical decoding. But if you want to do bidirectional, the change on
node two will happen. It will just be decoded again and then sent back to node one. And guess what happens then: the change keeps going around in circles. Not very good. So you have to have a method of preventing that from happening.
So what we did is that we added a feature that you can invent generic names for individual nodes. And then you can say, this session at the moment is not actually doing original changes. It's replaying changes from another node. And I can say this node is playing the role of the other node.
And then the output plug-in for logical decoding gets the information from which node did these changes originate. So again, we insert, we replicate. And then node two will get the rows to the output plug-in. But the output plug-in says, oh, the change originated on node one. There's no need to stream that back.
And it can just say, I'm forgetting that change, which allows more complex replication setups. But that's not the only thing that infrastructure is useful for. Imagine that you're replicating just from a primary to a standby without any fancy bidirectional stuff. If you insert a couple of things into the primary, at
some point, because we're doing asynchronous replication, they will get sent to the standby. And it might just be the first row that will get sent in one TCP packet. It might also be several at once. But in this case, we imagine it's just one. Then a third insert happens.
And we'll now send the other two in one TCP packet. But if you just replay this, the standby can crash at some point. So imagine we have sent first the first insert and then the
other two in one packet, but between applying inserts two and three, the database crashed. You can't just say, I'm forgetting on the primary about all the changes that have been sent out, because while a change is on the wire it can get lost, and while you're applying it, it can get lost.
So you have to keep track of how far you have replicated. So imagine there was a crash between those. It might be that node one is down, the primary is down, the standby is down, the replication solution crashed, or anything. We have to know, after we start again, how far we
replicated. A very easy solution to that would be to just have a table and update a row in there. But the problem is that if you have an OLTP system, that row gets updated for every transaction. That's not very fun. So what we have done is extended the replication origin system we saw earlier to also keep track of
replication progress. So you don't just say, my changes originate on node one. You also say, for every transaction, hey, I'm replicating a transaction, and the transaction on the source system happened at this address (LSN) in the WAL.
That's how we define progress in the database state. You say, for example, this is the address, and let's just imagine that it happened at this time. And then you do insert and all the other changes in the transaction, and then you commit. And if you, at some point, crash and you want to say, oh,
I don't know what actually succeeded to replicate last, then you can call a function, pg_replication_origin_progress, and that will give you the point up to which you replicated. And it will do so with very low overhead. Depending on what exactly your workload is, it's from
0.2% WAL overhead up to 1.5% if you have very, very many small rows. And it even works if you do the apply on the standby with asynchronous commit — which you might want to use there, because there's no need for synchronous commit on the apply side.
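A hedged sketch of the 9.5 functions just described, roughly the way an apply process would use them (run as superuser; the LSN and the demo table are made-up examples):

    CREATE TABLE origin_demo (id int PRIMARY KEY, val text);

    -- Once: register an origin name for the upstream node.
    SELECT pg_replication_origin_create('demo_origin');

    -- In the apply session: mark everything we do as replayed from that origin.
    SELECT pg_replication_origin_session_setup('demo_origin');

    BEGIN;
    -- Record at which WAL position (and when) this transaction happened upstream.
    SELECT pg_replication_origin_xact_setup('0/16B3748', now());
    INSERT INTO origin_demo VALUES (1, 'replayed from upstream');
    COMMIT;

    -- After a crash or restart: up to which upstream LSN did we durably replicate?
    SELECT pg_replication_origin_progress('demo_origin', true);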
So even if you replayed a transaction and it committed, it might still not be persistent. Even in that situation, after a crash, pg_replication_origin_progress will give you the correct answer. And it does so by hooking into the WAL replay of
Postgres and keeping track of which rows actually persisted after a crash. So the other big feature, which I think is one of the core features of making replication more seamless, is that we added a feature to hook into every DDL that
happens and normalize that. Because, for example, if you do add column to a table blarg, that table blarg might exist in three schemas. And then you have to figure out which was actually meant. And so you need to normalize that to fully
qualify everything. There are other cases: if you add a sequence to a table as a serial column, you end up with five statements — a CREATE SEQUENCE, the CREATE TABLE, an ALTER TABLE ... SET DEFAULT, an ALTER SEQUENCE ... OWNED BY the table, and so on. So this kind of stuff needs to be tracked.
And in 9.5, the basics are integrated for that. Unfortunately, the normalization code wasn't ready for 9.5, but now you don't have to modify Postgres anymore. You can do that as an extension. So there's a couple features missing, but yeah, we'll work on them in 9.6.
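To make the 9.5 hooks concrete: an event trigger can see each DDL command after it runs, via pg_event_trigger_ddl_commands(). This is a minimal sketch that only prints the command; a replication extension would capture, normalize and forward it instead:

    CREATE OR REPLACE FUNCTION public.show_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        cmd record;
    BEGIN
        FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
            RAISE NOTICE 'captured DDL: % on %', cmd.command_tag, cmd.object_identity;
        END LOOP;
    END;
    $$;

    CREATE EVENT TRIGGER show_ddl_trigger ON ddl_command_end
        EXECUTE PROCEDURE public.show_ddl();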
So what I think we are at now is that in 9.5, you have a very good toolbox for implementing a replication solution. You have lots of components, but in core Postgres they just sit there: without knowing C, you can't make them do anything. And not everyone wants to develop their own replication solution.
So I think at this point, we have to decide where to go from here. We can either say, yeah, we continue to improve that toolbox so you can easier build replication solutions externally, which is a position we have taken for logical replication solutions for a long time. Or we can say, hey, this is so important, we want to
integrate something like this in core Postgres. And I think there are valid points for both. You can say, hey, we've never integrated a logical replication solution, so we don't really need to do it now; it has worked well to improve Postgres so that external replication solutions can be more efficient.
That's fine. It reduces what we have to maintain in core Postgres, and everyone is on the same footing: nobody can say, hey, this replication solution is in core Postgres, so it has an easier time than some other product. And we also might get the interface wrong, which is, I think, a pretty valid concern.
We haven't gotten everything right for hot standby; there is certainly stuff we would do differently today. And I think some people have argued that it's just not part of what we should have in core Postgres.
And the other part, what I think is the plus point: pretty much every other data store has a logical replication solution. I don't think we can really afford not to make that as easy as possible. If you have to go get some other piece of code that's not in your repository, you have to
install it, you have to configure it in different places — I don't think that's something we can really afford to continue doing. So I think that we just have to get something in. And I think lots of the arguments on the negative side were also made against hot standby. And I think one of the best things that happened to Postgres is integrating hot standby, because we wouldn't
be where we are today without hot standby and streaming replication. It's also that users just trust stuff that's in core much, much more. So I think there are a lot of questions about what to integrate, and which parts.
Do we want just unidirectional replication? Do we want full bidirectional replication? Do we want other things? There are, I think, lots of valid positions on this. And I think one of the biggest questions is, how do we want to design the interface for in core?
Because for BDR, we basically had to use a function-based interface, since it's very hard to extend Postgres syntax from an extension. But whether that's the best solution, I don't know. I think the functions are fine. But I think that's one of the big discussions we're
going to have. I think the control of the replication should be functions. Stuff like saying this sequence uses a different mechanism is actually DDL; saying this table should be in this replication set, that should be DDL. The control of the replication itself should be functions. But I'm very happy to discuss that.
And that's why I made the presentation today. So let's discuss that. And the other part is, who's actually going to work on this? I'm going to do some work. 2ndQuadrant is going to do some work. But I think this is a very, very large project. So please help. So unfortunately, I don't have the time
to tackle the last point I wanted to make. So there's a bunch of resources around this. Just check it out, play around with it. It's in production on a bunch of systems. But it still has rough edges around a bunch of features. So there's definitely user experience needed,
because we don't necessarily find those rough edges anymore, because we can walk around blindly and not hit anything. So please help us. Any questions?
Not right now, but it would not be very hard to make it. So it just wasn't the priority. We have like 15 chunks of 1,000 values at the moment. It's just constant in the code right now. We just have to invent DDL for that.
That was one of the additional points I wanted to make: that's something we need to improve. At the moment, you just stream a very large transaction out as one unit, which means that if you have smaller transactions happening concurrently, their apply will get delayed.
It will work, but it will increase your replication lag for a while. And that's the same with any of the other logical replication solutions. But I think we definitely need to improve on that. It does that.
On disk. It's spilled to disk when needed, otherwise it's in memory. Logical.
That's how we detect conflicts, on a per-row basis: if there's a unique constraint that you violate, that's how we decide that this row is the same as the one on the other side. And then, depending on whether the conflict can be resolved,
it will either get resolved using last-update-wins or get logged to the conflict table, and then you have to resolve it yourself. But it's not that you generally want to do actual, full multi-master. In many, many workloads, you can just say, most of my data is actually just going
to be modified on this continent. The rest of data is going to be modified on this continent. I just have a set of shared data between those. You have to do that yourself, but you don't get an inconsistent result if the clocks drift.
It's just that, basically, the wrong node will count as the last one. But it will be the same on all systems, because we use the commit time to resolve conflicts, and that is the same on all nodes.
It's basically, in there, you can configure replication sets and say, I only want to replicate these tables. The only problem is that during the initial clone, we clone all the tables at the moment. Other than that, we support that.
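A heavily hedged sketch of what that configuration looks like; the function name and signature are from memory of the 0.9-era BDR extension, so check your version's documentation:

    -- Put a table into a named replication set; only connections configured to
    -- carry that set will replicate changes to this table.
    SELECT bdr.table_set_replication_sets('public.session', ARRAY['important']);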
Part of the table, yeah, that's not there. And I don't think there's a roadmap at the moment, because it's just not that interesting. What you can do is say, I don't want to replicate deletes in one direction. So you can build an archive server here and only have a small set of data here. But partial replication at the moment is not supported. And I don't think that's, if you partition stuff,
then it would work, for example. I think that's the way you have to tackle it. Because otherwise, you get into very strange scenarios. Configuring that will be weird. I think partitioning the data among this data you want to replicate is going to be better. But maybe you have a great idea how to implement it. Please, open an issue and communicate with us.
No, it uses logical decoding, which
uses the Postgres WAL that's already written anyway, and then picks up the changes from there. And then we just filter what's happening there.
OK, it's an option right now. And for some reason, we changed the default to false; I don't know why. I think we should revert that — the default should be on. We have the option, but it's not on right now.
OK, I think I'm more than out of time anyway.