
SELECT * FROM changes; - Part 2


Formal Metadata

Title
SELECT * FROM changes; - Part 2
Subtitle
Logical Decoding
Alternative Title
SELECT * FROM changes; -- Changeset Extraction
Number of Parts
31
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place
Ottawa, Canada

Content Metadata

Abstract
9.4 saw a great deal of development around a feature (now) called changeset extraction. This talk will explain what the feature does, which areas of use we see, what the state of the feature in 9.4 is, and which additional features around it we want to see in future releases of Postgres.

Use cases for the changeset extraction feature are:
- Replication solutions
- Auditing
- Cache invalidation
- Federation
- ...

Changeset extraction is the ability to extract a consistent stream of changes in the order they happened, which is very useful for replication and auditing, among other things. But since that's a fairly abstract explanation, how about a short example?

    -- create a new changestream
    postgres=# SELECT * FROM create_decoding_replication_slot('slot', 'test_decoding');
     slotname | xlog_position
    ----------+---------------
     slot     | 0/477D2398
    (1 row)

    -- perform some DML
    postgres=# INSERT INTO replication_example(data) VALUES('somedata');
    INSERT 0 1

    -- and now, display all the changes
    postgres=# SELECT * FROM decoding_slot_get_changes('slot', 'now', 'include-timestamp', 'yes');
      location  |   xid   | data
    ------------+---------+----------------------------------------------------------------------
     0/477D2510 | 1484040 | BEGIN 1484040
     0/477D2628 | 1484040 | table "replication_example": INSERT: id[int4]:1 data[text]:somedata
     0/477D2628 | 1484040 | COMMIT 1484040 (at 2014-01-20 01:18:49.901553+01)
    (3 rows)

All this works with a low overhead and a configurable output format.
Transcript: English (auto-generated)
Hi, I'm Andres, and I work at 2ndQuadrant. I spent the last two years working on the feature on the wall here, and I think it's a very helpful and good addition to Postgres, so I'm going to try to explain why I did it, what it does, and how you can use it.
As you might notice from the initial slide, the feature went through quite a few different names. I think I initially named it logical replication, then I renamed it myself to changeset extraction, then I think Robert suggested logical decoding, though I don't remember exactly who. The marketing material now names it data change streaming, MS SQL has it as change data capture, and another unnamed database has it as streams. So, yeah, anyway, what is this feature with the many names?
What it allows us to do is get all modifications performed to a database, that is to say inserts, updates and deletes, in a format that you can use for other things. The important part is doing that in an order that's consistent. What a consistent order means is that if you take all those changes and put them into another database, you will be able to replay them; it won't be a random order, so you don't get uniqueness conflicts or similar problems. What's also very important, and that was a crucial point during the whole discussion on the list, is that the output that comes out of the whole feature is completely configurable; initially it was all hard-coded, but now you can make it do anything. And what I found very important is that it's possible to use the whole feature even if you do things like ALTER TABLE or DROP TABLE, because that's where the existing logical replication solutions for Postgres like Slony, like Londiste, like Bucardo have big, big problems, and they break. And users strangely don't like that very much. It's also very low overhead in comparison to the existing solutions. The existing solutions basically created triggers for every action and queued all the changes into a queue table, and that meant at least double the amount of writes and double the amount of write-ahead log, which is relatively expensive; and it gets even more expensive if you have something like a timestamp or float data type, because converting the binary representation of a timestamp to text is very expensive.
So just to start off, how do you actually use the feature? A very important concept in the whole thing is replication slots: basically, one stream of data can be streamed from one slot.
So what I'm doing here, I'm saying SELECT * FROM pg_create_logical_replication_slot. What that does is create one replication stream; my stream is named my_stream, and the output format will be determined by the second parameter. How exactly that works I will come back to in a second, and then it gives you back what it did.
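As a concrete sketch, the call from the slide looks roughly like this (the slot and plugin names follow the talk; in 9.4 the function returns the slot name and the WAL position decoding starts from):

    -- create a logical slot named my_stream using the test_decoding output plugin
    SELECT * FROM pg_create_logical_replication_slot('my_stream', 'test_decoding');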
And we have a nice overview of which replication slots are already created, and that's the pg_replication_slots view. Obviously in this example I had only created one. There are also other kinds of slots, but I'm not going to talk about those today. So what happens if I actually do things, perform DML and DDL? Let's say I have a table talks; I create the table, the talk name is the primary key because I don't like surrogate keys, and there's a description. I insert a talk into the talks table: the talk name is 'SELECT * FROM changes;', and the description explains what logical decoding is. And then, to show off that it actually works with DDL, I'm going to add a new column, I'm going to update the row to add a review for my talk, which says whatever, and then I'm going to drop the table.
And now I want to get all those changes. There's a very simple way to do that, which is useful for many things, and that's the pg_logical_slot_get_changes function. That is a set-returning function, and it just returns all the changes that happened since the last time it was called. If I call that on the replication slot I previously created, the my_stream one, and only select the location and data columns it returns, it will show the changes that happened.
So the first transaction, the first data modification that happened, was the insert. We see that there's an INSERT into the table: the column talk_name of type text is 'SELECT * FROM changes;', and the description explains logical decoding. And then you can see there's a COMMIT of that transaction. Later, if you recall, we updated the table, added the review column and filled it with text, so now we see the update and the new text. But if you look closely at the xid column, the transaction ID of the commit, there's a gap in between; that's where I did the ALTER TABLE, because at the moment we only decode changes to the table's contents, not changes to the table structure. It didn't fit on the slide, but there's an empty transaction in between here.
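For reference, a sketch of that whole sequence as SQL (the table layout and values follow the slides; the slot is the one created above):

    CREATE TABLE talks (
        talk_name   text PRIMARY KEY,
        description text
    );
    INSERT INTO talks VALUES ('SELECT * FROM changes;', 'Explains what logical decoding is');
    ALTER TABLE talks ADD COLUMN review text;
    UPDATE talks SET review = 'whatever';
    DROP TABLE talks;

    -- return everything decoded since the slot was last consumed
    SELECT location, data
    FROM pg_logical_slot_get_changes('my_stream', NULL, NULL);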
And how does the data flow in this? We have the database, and the data flows from there to the so-called output plugin; that was the test_decoding I used. If I have several replication slots, it does the same thing for each. Every replication slot can have a different output plugin, and the output then goes to the user, or the consumer, or another database, or similar. Output plugins are what determine what the output data looks like. The idea is that some people want logical replication to stream the data only in the most efficient format, which is Postgres' internal format. The Slony people, for example, want a format that's compatible with their existing one. Other people want SQL, because they want to just replay the changes into a completely different database, and SQL is the simplest way to do that. And there are lots of other things you can imagine; sometimes you only need the primary key and nothing of the actual table data. We want all of that to be configurable. But since we didn't want one format where we'd have to add a myriad of options, we said that's user-defined code, and that user-defined code can then format the data itself.
That works as follows: an output plugin consists of a couple of callbacks. There's one callback that gets called when the plugin should be initialized. Then the important ones: every time a transaction is decoded, the begin callback is called first. Then for every single change that happens in that transaction, be it an insert, an update, or a delete, the change callback is called. Even if you delete 100,000 rows at once with one DELETE, the change callback will be called 100,000 times, once for every single row. And then when the transaction commits, the commit callback is called. If you look closely, there's no abort callback here; that's because aborted transactions are never, ever decoded. You can't look at aborted data here. Only once the commit has happened in the source database can you get the contents of the transaction. And then when we stop decoding, a shutdown callback is called; that's usually only used to free memory and similar things.
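A quick way to see that ordering, and that rolled-back transactions never show up, is with the SQL interface and the test_decoding plugin (a sketch; the replication_example table is the one from the abstract's example, and the slot is the one created earlier):

    BEGIN;
    INSERT INTO replication_example(data) VALUES ('will be decoded');
    COMMIT;

    BEGIN;
    INSERT INTO replication_example(data) VALUES ('will never be decoded');
    ROLLBACK;

    SELECT location, xid, data
    FROM pg_logical_slot_get_changes('my_stream', NULL, NULL);
    -- Only the committed transaction appears, as a BEGIN row, one INSERT row,
    -- and a COMMIT row; the rolled-back transaction produces no rows at all.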
And what's very interesting is that whenever a change callback is called, you have access to the transaction that the change belongs to. So you know this transaction has this xid and committed at this timestamp, which is something you could never do before, and the callbacks are always called in the order the commits happened internally. That might differ from the commit order perceived by the user, because sometimes the network is slower for one connection than for another.
And there are already a couple of output plugins. The original commit included the test_decoding plugin; that's the output you saw earlier. It includes the column names, it includes the column types, and it has a text format. Admittedly, that's not very useful for everyone, but it's very useful for testing, because it shows most of the information that's available. Then Euler Taveira has written an output plugin that outputs JSON, and that's already quite useful. I think there's a fair amount of discussion to be had about the way database changes are reflected in the JSON; I'm not 100% happy with the output format there. I think we probably want a JSON format in core, in my opinion, but I think it has to look a bit different. And Michael Paquier, who is unfortunately not here, has written an output plugin that just emits everything as SQL statements. So if you update a row, you get an UPDATE, but that means if you originally had one UPDATE statement updating the whole table, you now get one UPDATE statement for every single row, because everything is row-based here.
No. The way it works is that the data is initially only stored inside the write-ahead log of Postgres; the output plugin never gets called before the data is inside the write-ahead log. The output plugin gets as an argument the HeapTuple type of Postgres, which is just the internal representation of the data. You don't really need to know where it's coming from, but every time it gets streamed out, it gets converted anew. There's no caching of anything going on; it just gets streamed out.
So the replication slots, that was the pg_replication_slots view I selected from earlier: they are there to reserve resources we still need. If there's a checkpoint, and that checkpoint would remove write-ahead log that's still needed by logical replication, it will say, oh, I can't remove that, I'll reserve it. And the same mechanism can be used for normal streaming replication: you can have a slot that prevents removal of any write-ahead log that's still needed by a standby. So that's the reason these slots exist; it's just to reserve resources we need for some form of replication.
Yes, so unfortunately I've been... There are additional columns there that say the oldest WAL that's still required is at this LSN, and you can monitor that.
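One way to monitor that from SQL (9.4 function and column names; retained_bytes is just a computed alias here):

    SELECT slot_name,
           plugin,
           restart_lsn,
           pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) AS retained_bytes
    FROM pg_replication_slots;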
So that's the base name of a shared object. What happens when you create a replication slot with test_decoding is that it looks in libdir for a shared object with that name, and that shared object has one callback that loads the configuration: basically it says which callbacks exist and what the function pointers are, and then the plugin gets initialized. That is explained in the documentation; I don't think it's particularly interesting.
But you might say: what if my table doesn't have a primary key, or what if I also want the whole old version of the row, because I want to compute differences between rows? Yes, there's a new feature for that called the replica identity. You can say ALTER TABLE ... REPLICA IDENTITY, and there are these options. If you set it to DEFAULT and the table has a primary key, then the old primary key is available to the output plugin when you do an update; if there's no primary key, you won't have access to the old version of the row, but the new version is available. Then you also can say: I explicitly want to use another index as the identity. Often enough you have several indexes on your table, say an id column, and also a username column that is NOT NULL and unique. So you can say, my external solution doesn't care about the internal id, I want the username; then you just name the username_idx index, and it will use that from there on. Then you can say: I want the full row. That's obviously going to need a bit more space in the write-ahead log, because suddenly an update needs to store the old version and the new version. But it also might be that in your use case you only need the new version, because you don't care about the old data; then you can just say NOTHING, and the overhead will be a bit lower.
And what might be interesting there is that if you update a row, and you have set it to DEFAULT or USING INDEX, and there's no change to the primary key, we won't actually store the primary key again, because we can just get it from the new tuple, since it will be there.
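The four settings map to the 9.4 ALTER TABLE syntax roughly like this (the table and index names are just illustrative, taken from the examples above):

    ALTER TABLE talks REPLICA IDENTITY DEFAULT;                   -- old key columns, if there is a primary key
    ALTER TABLE talks REPLICA IDENTITY USING INDEX username_idx;  -- a chosen unique, NOT NULL index
    ALTER TABLE talks REPLICA IDENTITY FULL;                      -- log the complete old row
    ALTER TABLE talks REPLICA IDENTITY NOTHING;                   -- log no old-row information at all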
So what do we think this is going to be useful for? The primary reason we want this is replication. Currently, if you use streaming replication, you can't have any write activity on the standby. That's fine for many high-availability use cases, you fail over and that's okay, but you can't create temporary tables, you can't have additional tables, all of that isn't possible. So we want a logical replication solution that allows all that activity on the standby. There are additional reasons why streaming replication isn't suitable for every case; primarily, it doesn't allow crossing versions. If you upgrade from 9.3 to 9.4, you're out of luck, because you can't use streaming replication for that, and if you have a multi-gigabyte or terabyte database, using pg_dump or pg_upgrade has its problems. So that's why, at the moment, people use Slony, why people use Londiste. But those have a performance problem, so we want to build this to make that possibly faster.
Another very common use case is cache invalidation. Say you don't trust your Postgres database to handle all the requests, so instead you put a key-value store in front of it and cache data, or you put a caching proxy in front of your website; then you might want to invalidate the data that's in that cache. So we can create a replication stream, write an output plugin that only outputs the keys of the changed data, and then in the cache just purge everything that has been changed.
Another very common use case we hope to solve, or at least support, with it is auditing. Many installations require that all changes be visible in some usable format, and using the statement log, for example, doesn't really help with that: it isn't in a consistent order, you can't query it because it's just text, it doesn't handle aborted transactions, and all that. It's also very, very expensive to log every single query.
Another thing is aggregation and rollup: you have several databases that don't have the same schema, or for some odd reason you still have a database other than Postgres in there, and you want to update it with new data from Postgres because it's old software that's still around, or similar. All of that should be much easier with this support.
Will auditing also make settings like the current username and such available? Not yet. I think we're going to get there; we need to be able to attach user-configurable information to every transaction, because obviously the username is going to be helpful. What you can do today is create a statement trigger that, once per transaction, adds the user into some special table; that's obviously not very nice. I think we want more, but there was a limit to what we could do, and I wanted Robert not to kill me.
So, as I just showed before, there's a very simple SQL interface: it's just SELECT * FROM give-me-all-the-changes, and that's very simple. There are a couple of variations of it. There's a variation that says: peek into the data stream, don't consume it, which is very useful if you're playing around. You can just say, what are the next hundred changes, show me them, without consuming them. There's also a version that supports outputting the data in binary, which might be useful if your output plugin streams data in binary. The problems with this interface are that it's relatively expensive to call the function. It's okay if you call it to stream out half a gigabyte of data, but if you call it for every single change, initiating the logical decoding process takes some time and might read data from disk repeatedly, so you probably shouldn't call it a hundred times a second. The next problem is that, obviously, an SQL function can't do any streaming; you can't say, give me all the rows as they come in. That's pretty much not solvable with an SQL-callable function at the moment. So those are the caveats.
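As a sketch, the peek and get variants in 9.4 look like this (the third argument limits how many changes are returned; there are also *_binary_changes variants that return bytea):

    -- look at up to 100 pending changes without consuming them
    SELECT location, data
    FROM pg_logical_slot_peek_changes('my_stream', NULL, 100);

    -- the consuming variant: the same changes won't be returned again next time
    SELECT location, data
    FROM pg_logical_slot_get_changes('my_stream', NULL, NULL);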
What we also have is the so-called walsender interface that's already used for streaming replication and for pg_basebackup and things like that.
But now there's also a command that allows you to get the data in the logical format. It's pretty much the same thing you can do from SQL, except it's not pg_create_logical_replication_slot but CREATE_REPLICATION_SLOT ... LOGICAL, in uppercase letters. The advantage of that is that it streams the data as it is produced on the primary, which is very useful because you get the data very fast and there are no repeated startup costs. And, very interestingly, it supports synchronous replication, which is something none of the trigger-based solutions has figured out how to do.
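On a replication connection (for example one opened with replication=database in the connection string), the 9.4 commands look roughly like this; the slot and plugin names are the ones from earlier:

    CREATE_REPLICATION_SLOT "my_stream" LOGICAL test_decoding
    START_REPLICATION SLOT "my_stream" LOGICAL 0/0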
There is a command-line utility to receive those changes, called pg_recvlogical; that's another thing that was repeatedly renamed, whatever. The problem with it is that it uses the same protocol to get the data out that streaming replication uses, and that's basically a binary protocol. So you need to be able to say, okay, those eight bytes are an LSN, those eight bytes are a date, and that is a bit of fiddling. I really hope that we can get a small Python library, a small Ruby library, and so on, that just gets you the data in a more convenient format, but we don't have that yet.
Yes, I have done benchmarks.
With pgbench, streaming out the changes, the overhead was something like 6% or so. That was with streaming the output; I hadn't compared it to streaming replication, I was just comparing not doing this against doing it. For comparison, we tried both Slony and Londiste, and they completely fell apart because they couldn't keep up with the load. We monitored the backlog, and within 12 seconds the backlog just stopped decreasing in any meaningful manner, because their overhead is so large that they just couldn't keep up.
When I streamed pgbench, one decoding process could keep up; I had run it on a system with 32 CPU cores, and a single decoding process could keep up with that. I think there's potential for improvement there, and it very heavily depends on what your output plugin does. If your output plugin converts everything to text, that's different from just streaming out the data in binary format, because the conversion from the internal format to the textual format has costs. So there are trade-offs to be made. You also can do things in the output plugin like saying: I don't need the changes to this table, I'll just skip it.
So what are the problems with the support that's now committed and coming as part of 9.4? The biggest issue probably is that only committed changes are streamed. If you have a very, very large transaction, say you just restored your entire database and it's one terabyte large, nothing will happen while that terabyte is being written; the data will only be streamed once the commit arrives. And obviously you're then going to be busy for a while streaming out that terabyte, because even on a modern system that takes a while. I think there's a very good case to be made that we want an interface that optionally allows streaming data out while it is being written. The problem is that that makes the apply side very, very complex, because you need to be able to deal with n open transactions at the same time, and all of those might abort at some later point. And Postgres, on the receiving end, currently doesn't have any infrastructure for a single process having multiple transactions in progress at the same time. So if we get proper autonomous transaction support in Postgres, this might get easier to use, and that might be sufficient reason to implement it. I don't think the sender side is very complex to implement; it's how the interface looks for the user, and how to apply the changes coming from it, that isn't going to be easy.
The next biggest problem, which I personally find to be much bigger, is that we don't replicate DDL changes themselves. As you saw before, there was no ALTER TABLE in that stream. What we need is the ability to get a command that says: this table has been altered, the new schema looks like such and such, and the SQL you need to execute to transform the one into the other is this. Luckily, Alvaro, and before him Dimitri, worked on that, and we have for 9.5 a patch that implements DDL replication for most commands. The things we don't support are CREATE DATABASE, because the whole decoding only works at a database level, so it doesn't really make sense, and CREATE ROLE and so on, and CREATE TABLESPACE. Basically, we don't support any of the operations that are bigger than the current database, that have cluster-level scope. So it's going to be interesting to see whether that gets into 9.5. I'm going to fight for it, and I think a couple of other people will too, you'll see.
There are a couple of features this could very well use.
For example, it would be very interesting if we had support for decoding two-phase commit transactions separately. You would get the contents of the transaction when the two-phase PREPARE is done, stream it to the other side, do the PREPARE on the other side, and only do the COMMIT PREPARED once the other side has successfully applied the transaction. What that allows you to build is multi-master systems that have optionally synchronous apply, in a sense optimistically locked synchronous apply. I think that can be the basis of a very powerful set of features, but it needs to be written.
I think there can also be some interesting additional options, like what we have with the replica identity, that allow specifying on a per-column basis which of the old row's columns are logged, to tailor the overhead to whatever your table and your replication solution need. But those are the ones I see.
So, why did we go through nearly two years of work on this?
The reason is that we are working on a project called BDR, which is asynchronous multi-master; it's open source, and we want it to be available for Postgres. We want to put as much as we are allowed to into core Postgres, because we think many of the scenarios that are coming up are much harder to handle if you need to build all of that yourself. For now, all the conflict resolution we're working with is last-update-wins. That's good enough for many usage scenarios. There obviously are other scenarios where you can't use last-update-wins; that's why we support conflict handlers which can say: okay, if there has been a conflict between this update and that update from two different nodes, this function gets called, and the function can say, this is an error, I won't replicate any more, this is all broken; or it can say, use this tuple as the result and the conflict is resolved.
It's all open source and, as I said, we are trying to get it integrated into Postgres. I have a very small demo of that. As you can see, this is one shell, and there's another shell. Since I didn't want to rely on the network here, it's all on my computer, but as you can see, one connects to port 5441 and the other to 5440. Right now there's a table and a sequence. So what I'm going to do first, this is left over from when I tried it and everything actually worked: I'm going to drop the table. I dropped it on this node, and now we can check: did it replicate?
Yes, it did replicate, and even the sequence was dropped as well. You can also create a new sequence, and as you see, there's a new piece of syntax here, USING bdr. What that does is say we want a distributed sequence that works across a multi-master system. Then we create a table; it doesn't come out very well on screen, but here we create the table talks again. This time we have a key that's a bigint, and it defaults to nextval of the sequence we just created; it's the primary key. We also have the talk name and the description. Now I'm going to say... hmm, that wasn't what I was intending.
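For reference, a rough reconstruction of what that part of the demo created (the USING bdr clause for distributed sequences is BDR-specific syntax as shown in the talk; names and exact syntax are approximate):

    CREATE SEQUENCE talks_id_seq USING bdr;

    CREATE TABLE talks (
        id          bigint DEFAULT nextval('talks_id_seq') PRIMARY KEY,
        talk_name   text,
        description text
    );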
That worked before. Yeah. So now I'm going to change the owner; supposedly that should make the sequence owned by the table, so that they get dropped together. I'm not sure why it's not working anymore; it did when I tried it before. So I've now inserted two rows, very arbitrary ones, and now it updated those. As you can see, the values the voting algorithm hands out for the sequence don't start at zero, but that's pretty much not a problem. So now I can insert into talks the values of... oh, thanks. Oh, and I've hit the same quoting problem; very inconvenient that the screen doesn't show everything. And now it's inserted, and it should also be visible over here. Obviously we messed up the quoting, but it should have the same data.
And as you can see, it uses different ranges of values, because they get assigned differently on each node. So that's basically how it works. The configuration, I think we are going to want to change at some later point before we can convince anybody to put this into core Postgres; at the moment it's configured by setting some GUCs. So we have bdr.connections: there are two connections it makes to other nodes, at 5440 and 5442, so you can tell there's a third node that I didn't have space to show on the screen, and then there's the connection string that specifies where to connect to. And if you look here, it's very much the same, except it obviously doesn't list itself, because this one is 5440 and it won't replicate to itself; so it just has the other two nodes of the three-node configuration. And that's basically all you need to do to replicate the data between the nodes. I think we are going to want something like top-level CREATE-style SQL commands that connect to the other node, but there's some infrastructure required for that, and we don't want to patch core PostgreSQL any more than we need to.
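A hypothetical sketch of what that GUC-based setup looks like in postgresql.conf; the parameter names here are assumptions based on the description above, not necessarily the exact ones current BDR releases use:

    # list of named connections to the other nodes
    bdr.connections = 'node2, node3'
    # connection string for each named connection (ports as in the demo)
    bdr.node2_dsn = 'port=5440 dbname=postgres'
    bdr.node3_dsn = 'port=5442 dbname=postgres'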
So that's the configuration we have right now, because you don't need to change anything else for it. Yeah, and I'm already getting close to the end. I want to thank Intel Security and McAfee very much for allowing us at 2ndQuadrant to work on this; without somebody funding it, it wouldn't have been possible to work for two years on one feature. And I think that's the reason why all the existing logical replication solutions have used the queue table with triggers: it just takes an enormous amount of time to implement anything that's better than that, and it's a very convenient solution. I also want to thank the community for review. I started two and a half years ago by sending an email saying, here's the feature I want to work on, I want to work on multi-master, and I think in the next two months or so there were something like 400 messages, and there was lots of arguing.
But the result got very, very much better than my first approach; my first approach was kind of ugly, that's what I wanted to say, and we improved on that. There was a prototype, and it went through lots and lots of review. I want to particularly thank Robert here, because without his review it's rather unlikely that the feature would have made it into 9.4, so very many thanks for that. I also want to thank the people that worked on output plugins before the feature was even committed. Steve Singer, for example, provided very early feedback on the API, and I have my doubts that without that feedback, showing that a solution other than the one I had in mind has a chance of using this, it might not have gotten in. So thanks. I also have to thank the two authors of the output plugins, but they aren't here, unfortunately. And 2ndQuadrant, for letting me spend far, far, far too much work on it. Yeah. Yes.
Does the sequence number allocation work for multi-master? It's a voting protocol. What? A voting protocol, a majority voting protocol. Can you explain the sequence numbers? So we have a new AM for sequences; those allow alternative implementations of sequences. That sequence AM then internally says: I need a new chunk of values, because at the moment I have no allocated values. That request gets transferred to all the other nodes in the system. They either say, oh, I am allocating a conflicting value, you're not allowed to use that; or they say, I don't care, use it. And once the majority says, okay, you're allowed to use that, it starts using the chunk. At the moment the chunk size is hard-coded to a thousand values at a time, and at some point we want to provide something like ALTER SEQUENCE ... SET bdr.batch_size.
Then you can set different batch sizes. How do the servers know which servers are sharing a particular sequence? DDL is replicated, so it's just the same sequence name on all nodes. I'm not sure if I answered your question there, but this is the bdr.connections setting I showed: it uses the same cluster of nodes that's configured there. I think you're right; I'm afraid there will be a couple of bigger complaints.
Another approach that you might want to consider would be to start the sequences at different numbers and use a larger increment. The problem with that is that it doesn't work very well if you add new nodes to the cluster, because then you have rebalancing problems, and you need some coherency mechanism to agree on a new interleaving. But yes, you can just use the built-in sequences, set the increment with ALTER SEQUENCE, and do that. Well, I would start with a hundred.
As a test, I now have a gapless sequence implementation that doesn't allow any gaps in a local sequence, because people asked for that and it was a nice test. The sequence AM API is pretty much nextval, setval and all those operations, so you can do what you want.
The advantage of doing something like the voting approach is that you can say this locally acquired chunk is only valid for a minute, and then you have mostly coherent sequences across nodes. So you can do an ORDER BY on the sequence values and get a very reasonable ordering globally, and that can be very helpful; it's also what we've been asked to implement. But yeah, it's only interesting if you want some correlation. I think I shouldn't have shown that we have distributed sequences.
No, that was done on the flight to the last PGCon and the flight back, I think. But yeah, they reconnect.
So yeah, and then we re-transfer it. When you use the walsender protocol, you can say: give me all changes that happened after this LSN, and on the receiving side you remember which transactions you have already replayed. Then, if you disconnect, you say: replay everything from there. We have an extension proposed for core Postgres that integrates that on the replay side automatically, so the local commit carries the information about which remote commit it was replaying, and crash recovery can replay that information. There's one particular person in this room who has doubts about that approach, you'll see; I think it's very neat and I like it, but that's because I wrote it. So that's basically the approach: you remember how far you replayed, and then you request all the changes since the last time. And that works by, every now and then, when you use the walsender protocol, sending a message that says, I have received up to here; after that you can't get any changes that are older. You should do it more frequently if you can. Basically, at the moment it even asks you: please send me a feedback message. The walsender protocol has that built in for streaming replication, because that also wants that information; we use exactly the same protocol, and that's also what allows us to implement synchronous replication without any further changes. There wasn't anything we had to change for that. But yeah, that's how it works. Any other questions?
Yes, let me... unfortunately my computer crashed before and I lost three slides, and that information was actually on one of them. It's the BDR user guide; I'll put the link up in a second, that doesn't work well. So okay, there's the information: it has the manual, it has the git repository. I think at this point you shouldn't use the bdr branch but the bdr-next branch, because we've made some progress since the last release. It will officially, well, more officially, be released at CHAR, what year do we have, 14? And yes, okay, right to the minute.