
On the road to in-core logical replication!? Part 1: Architecture of Changeset Extraction


Formal Metadata

Title
On the road to in-core logical replication!? Part 1: Architecture of Changeset Extraction
Series title
Number of parts
25
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, change and reproduce the work or content in changed or unchanged form for any legal, non-commercial purpose, distribute it and make it publicly available, provided you credit the author/rights holder in the manner they specify and pass the work or content on, including in changed form, only under the terms of this license.
Identifiers
Publisher
Year of publication
Language
Production location
Ottawa, Canada

Content Metadata

Subject area
Genre
Abstract
Logical replication in the context of Postgres to this date consists of several independent out-of-core solutions. While some of these solutions are great, the existence of so many of them also causes problems like code duplication and a lack of trust, features, reliability and performance. As part of a proposal to include one logical replication solution in core PostgreSQL, we submitted the changeset generation/extraction part as core infrastructure for Postgres, in a way that is usable by all the existing replication solutions and for lots of other use cases. This talk is about: the architecture of the committed/proposed changeset generation mechanism (2/3 of the talk); an overview of further proposed patches (9.4+) to get a whole logical replication solution into core Postgres (1/3 of the talk).
Transcript: English (automatically generated)
We can start. Can you hear me? Okay, I'm Andres Freund and I'm talking about how we propose to implement the features necessary to build logical replication on Postgres in a non-hacky way. And yeah, that's what we propose.
So what we want to do first, to make that easier, is to provide a neater way to get the changes necessary to replicate data in a logical way. And logical here means that we have the tuples in a way that they mean something.
In contrast to that, when we do streaming replication or hot standby in Postgres, that's just shipping bytes around; nobody knows what they mean. So all changeset extraction basically does is that one session performs, or multiple sessions perform, any form of DML.
They insert, they update, they delete, potentially at some point they truncate relations. And we want to take those actions and stream them to some other node and do something on that other node, whatever it is we want to do. And there are several use cases. The primary one, which is why I'm mostly doing this, is that we want to build a replication solution around it.
But there are also quite a few use cases around auditing, because if it's in the write-ahead log, which is Postgres' consistency method, it's not a persistent datum in the normal relation.
So it doesn't necessarily need to lock. For now, yes. I'm not saying we provide everything, so that's the basis.
And then there are obvious other use cases like you want to replicate 10 different databases into one, or you want to process that stream and analyze patterns in it. Or you want to integrate it into some caching proxy and invalidate the pages that are relevant. There are quite many use cases.
So the high-level architecture we chose to do this is that a normal Postgres backend just, as normal, writes the write-ahead log, which is what Postgres uses for crash safety and for streaming replication.
And that's all there. And then we call a so-called output plugin that gets the data from the write-ahead log and formats it in the way you want it for your specific use case. So if you want to replace that part of old-style Slony, you would produce SQL snippets.
If you want to do it for a newer-style solution, you would provide some array of the columns. And if you want to do another replication solution, you might want to send the tuples in a binary format, because the text conversion is inefficient. There are countless ways.
And those output plugins are rather easy to write, and I'll show how that works. And then you actually want to get the changes. So for some use cases, especially for debugging and development, it's very useful to get the changes as a table in SQL.
So we have a set-returning function that just says, OK, give me all changes since the last time I called this function, and then gives you a table-like stream of those changes. Or you can use the walsender that we use for streaming replication and say, oh, I want logical data instead of physical data.
And it will give you that, and that has the advantage that it's an actual stream protocol, so you never have to have more than one tuple in memory at the same time. So how do we work with that logical replication? There are three high-level steps to manage all that.
The first is you say, OK, I want to do logical replication. And if you do it on the SQL level, you say init_logical_replication with some parameters; I'll show them. Or you can do it on the walsender level, and then it's a different command.
The most used one then will be start_logical_replication, which then just starts streaming all the changes. If you use the SQL level, it will give you all the changes up to now. If you use the streaming format, it will just stream out changes until you say, oh, I don't want anything anymore.
And if your replica goes down and you remove it, you can say, OK, I don't need that replication anymore. So if we, unfortunately, that went too wide, I thought. OK.
So on the SQL level, you can do SELECT * FROM init_logical_replication(...), give that replication slot a name and give it an output plugin. And that output plugin will be responsible for formatting the output. And then it tells you, OK, I created a slot of that name and it starts at that xlog position, because it's based on the write-ahead log.
And that's just the position from where we can stream data onwards. So let's see how we can get the changes. Let's suppose that after we have done the init_logical_replication, we do a create table and insert two rows, one row containing a one and one containing a five.
And then we can do SELECT * FROM start_logical_replication(...) with the slot name. And we say, OK, we want all the changes up to now. Or we can just say, give me all the changes up to some other position, if we don't want to get too many changes at once.
And then it will give you the data. So we have the first transaction from here, which is the create table. Create table doesn't insert any data itself. So it's a transaction that gets a commit and no data. And the next one is, OK, another begin and then two rows and then commit again.
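Put as a rough SQL sketch (the function names and the output format follow the prototype shown in the talk and may differ from what eventually ships):

```sql
-- Prototype API as described in the talk; names, signatures and output
-- lines are approximate.
SELECT * FROM init_logical_replication('my_slot', 'test_decoding');
--  slotname | xlog_position
-- ----------+---------------
--  my_slot  | 0/190E2F8           (example values)

CREATE TABLE demo(id int);
INSERT INTO demo VALUES (1), (5);

-- Fetch everything decoded so far from that slot.
SELECT * FROM start_logical_replication('my_slot', 'now');
-- BEGIN 703                        (the CREATE TABLE: a commit, no data rows)
-- COMMIT 703
-- BEGIN 704
-- table "demo": INSERT: id[int4]:1
-- table "demo": INSERT: id[int4]:5
-- COMMIT 704
```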
Now, we can do that, but it's very hard to do anything sensible with them. So we don't do DDL replication yet.
There's one person in here who has worked on it. And there's another one who has worked on it a bit differently. And I think we are on a good way to get there, but we aren't there yet.
At some point, we will have the event trigger thing. And that will allow you to replicate DDL as well. We are just not there yet. I'm presenting what we actually have here.
I think that's already useful for quite a few things. Because normally you will do the create table, and then you will do the create table on all nodes. Or you do the create table and insert the create table statement into a queue table; that will then be replicated, because it's a normal insert, and then you execute all that.
Yes, but you can also replicate the SQL statement for doing that. But it works.
Yes, so you want to see the alter table. We don't replicate DDL itself. It will replicate the data how it is on disk. So it will have the default value filled out.
So we stream that, and now suppose we do an insert and then we drop the table. Which is actually an interesting case. But as you can see here, we do the start_logical_replication again. And it will now stream the data even though the table isn't there anymore at present.
You will see a bit later why that is not that easy. So as you can see, or not that well seen, is that I now passed a parameter to the function here, hideXIDs. Which if you compare the statements before, the text contained the XID.
Now it doesn't anymore. That's just a generic way to pass parameters to output plugins. So now I remove the node again because I don't need it anymore. So the same thing can be done in a streaming fashion using the replication protocol.
And that's just our tool for that. At the moment that's pg_receivellog, which is possibly not the best name. So we tell it which host to connect to, which slot to use and where to write the data out. We just write it to standard output here.
And then you say, oh, we init, and we start streaming, and then it starts streaming the data. And as you can see, in this case we updated the primary key. And the old primary key was the ID at one, and then the new one is minus one. So it gives you the data to do that.
So as you can see here, when you created this, we had defined the name. And suppose we have several nodes to replicate to. All those will have a different state up to where they're replicated. So we need to keep track which data do we need to retain on the primary to be able to stream from that specific position.
Yeah, it's exactly the handle, which is persistent across restarts of Postgres and everything. So you define it, and that ensures that you have enough information to stream data from there on.
It also has information like which database are we currently streaming and which output plugin are we using, because as you can see, we only defined the output plugin here, the test_decoding output plugin, on the init. We don't do that on the start_logical_replication anymore, which you normally do multiple times.
Because that output plugin can change parameters of what's needed. That is what I want to do.
It's not implemented yet. But we have that in a shared memory segment, so it's theoretically trivial to implement. I just want to do the same for physical replication. So I referred frequently to output plugins, and that's what people who will be implementing a new replication solution will have to write.
Because they want to have their own output format. So currently you have to compile a shared object and that shared object has to define five callbacks. And they are pretty straightforward from the definition.
You have one init callback that just says: okay, you can allocate memory, those are the parameters passed to you, you're working on that database, do whatever you want to do. And then, whenever a transaction is streamed out, when the beginning of a transaction is streamed out, you get the begin callback, and then a change comes in.
Then change and commit; and cleanup is just, if the replication stream goes down, you call cleanup. So here's the example callback from the aforementioned test_decoding output plugin. It gets passed the context, it gets passed information about the transaction, and it gets passed the information where we committed.
So we do some scaffolding, and then the important part is we say, okay, we may want to write out data, so allocate enough space to stream out our changes.
Then we can check some options which we passed earlier, okay, do you want to write XIDs or not, and then you can just use the StringInfo infrastructure for writing out data. You can write out binary data as well; you just need to use a different appendStringInfo variant.
So in one case we write out the commit with the XID and in one not. Obviously you can also write out the timestamp and similar things. And then, if we are really interested, we can write out the data for real. And that will then go out and write the data necessary for either the SQL function to receive it or for the walsender to receive it.
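As a rough C sketch of the shape being described here, with callback names, signatures, and the registration mechanism left illustrative rather than copied from the actual patches:

```c
/* Illustrative sketch only: the callback names and signatures mirror the
 * talk's description (init/begin/change/commit/cleanup), not the exact API
 * of the committed patches.  How the callbacks are registered is omitted. */
#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"

typedef struct MyDecodingState
{
    bool        include_xids;   /* option parsed from the slot's parameters */
} MyDecodingState;

/* init: allocate private state, inspect the options the slot was created with */
static void
my_init(void **private_data, List *options)
{
    MyDecodingState *state = palloc0(sizeof(MyDecodingState));

    state->include_xids = true; /* in reality parsed from 'options' */
    *private_data = state;
}

/* begin: called when a transaction starts being streamed out */
static void
my_begin_txn(void *private_data, StringInfo out, TransactionId xid)
{
    MyDecodingState *state = (MyDecodingState *) private_data;

    if (state->include_xids)
        appendStringInfo(out, "BEGIN %u", xid);
    else
        appendStringInfoString(out, "BEGIN");
}

/* commit: called at the end of the transaction; binary output would use
 * appendBinaryStringInfo() instead of the text variants used here */
static void
my_commit_txn(void *private_data, StringInfo out, TransactionId xid)
{
    MyDecodingState *state = (MyDecodingState *) private_data;

    if (state->include_xids)
        appendStringInfo(out, "COMMIT %u", xid);
    else
        appendStringInfoString(out, "COMMIT");
}
```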
Or if somebody wanted to implement another method of receiving those changes, you could do that in two functions. So the probably more interesting case is where we actually have a change. So again, similar thing, we get the context, the transaction.
We get also now the relation in which the change was performed and we get information about the change. So again, we do the trick with prepare write, which is unfortunately necessary. And then we do a switch which says, okay, which action was the change doing?
We have an insert, we have an update or a delete. Those are all the cases we currently have. Yes? Yeah?
elog, the Postgres-internal error code. You just throw an error, that will then jump upwards, and you just use the normal Postgres code for error handling.
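For instance, raising an error from inside a callback is just the standard Postgres error API, nothing specific to the decoding patches:

```c
/* Inside any callback (assumes the usual postgres.h include): throw an
 * error; it unwinds through the normal Postgres error-handling machinery
 * and the replication stream / slot cleans up after itself. */
ereport(ERROR,
        (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
         errmsg("output plugin cannot handle this change type")));
```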
If you want to implement a new replication solution, or you want to do auditing in a specific format and nobody has written the things for your format, you would provide these callbacks. Or you just use an output plugin which somebody else has written.
There exist two already, but those might not suit your needs, because if you look back, the format here isn't really useful for anything except demonstration, because I don't think you can easily parse that and use it for anything. For some solution you want to do, you'd create SQL there, whatever.
For example, what you can do if you get an insert is this. Okay, we got an insert, then just write the tag insert and then convert the tuple we get. We get a normal tuple as it's used in Postgres.
We can call completely normal Postgres functions on it, like the datatypes' output functions; you can do anything with it. I wrote a function to convert a tuple to a StringInfo, because there are some ugly things involved that aren't interesting
and that will scare most of you away, so I don't show them. That will format the data.
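A sketch of what such a change callback can look like; the enum, the callback signature and the tuple_to_stringinfo() helper are illustrative stand-ins for what the talk describes:

```c
/* Illustrative change callback; names and signatures are hypothetical. */
#include "postgres.h"
#include "access/htup.h"
#include "lib/stringinfo.h"
#include "utils/rel.h"

typedef enum { CHANGE_INSERT, CHANGE_UPDATE, CHANGE_DELETE } ChangeAction;

/* hypothetical helper: prints each attribute via its type's output function */
static void tuple_to_stringinfo(StringInfo out, TupleDesc desc, HeapTuple tup);

static void
my_change(void *private_data, StringInfo out, Relation relation,
          ChangeAction action, HeapTuple oldtuple, HeapTuple newtuple)
{
    /* the relation gives us the table name and the tuple descriptor */
    appendStringInfo(out, "table \"%s\": ",
                     RelationGetRelationName(relation));

    switch (action)
    {
        case CHANGE_INSERT:
            appendStringInfoString(out, "INSERT: ");
            tuple_to_stringinfo(out, RelationGetDescr(relation), newtuple);
            break;
        case CHANGE_UPDATE:
            appendStringInfoString(out, "UPDATE: ");
            tuple_to_stringinfo(out, RelationGetDescr(relation), newtuple);
            break;
        case CHANGE_DELETE:
            appendStringInfoString(out, "DELETE: ");
            /* for deletes we only have the old key/tuple */
            tuple_to_stringinfo(out, RelationGetDescr(relation), oldtuple);
            break;
    }
}
```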
Only, at the moment you get all columns passed except unchanged toast columns. We could pass them as well, but it would increase the write volume and we don't need it at the moment. We could make it an option. On both sides, actually.
There's another problem. If you initiate a replication stream like we did here, you get all the changes from that position onwards.
How do you create a second database that has all the contents up to that point? Because if you do a normal pg_dump, you don't necessarily have a consistent state, because the changes might be from earlier, they might be from later. If you do the init_logical_replication on the lower level,
which you normally would do from code, you connect to the database, say, OK, we're connecting via the replication protocol, and do the init_logical_replication thingy. It will give you, OK, from where are we starting, but it will also give you a snapshot name. In a separate session, you can do BEGIN TRANSACTION and then SET TRANSACTION SNAPSHOT with that snapshot.
From that point on, you see exactly the state after which you would get all the changes from your slot. The init waits for a point in time from which it's possible to stream all changes henceforth.
Here, if we had a --snapshot parameter for pg_dump, we could actually dump the database exactly as of that point. Or you can just start a copy, and you can do the same thing in multiple backends. You can start ten backends, all doing the copy of the relations defined,
and then you can load them on the other side. Afterwards, you can do a start_logical_replication from that slot and it will stream all the necessary changes. So that's what is interesting from the perspective of somebody implementing a replication solution.
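Put together, the consistent-initial-copy dance looks roughly like this (the exported snapshot name is an example value; SET TRANSACTION SNAPSHOT itself is ordinary Postgres SQL):

```sql
-- On the replication connection: create the slot; besides the start position
-- it hands back an exported snapshot name (example value below), which stays
-- importable as long as that connection keeps it alive.

-- In one or more ordinary sessions, import the snapshot and copy the data:
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';
COPY my_table TO STDOUT;   -- or SELECTs, or a pg_dump that supports --snapshot
COMMIT;

-- Afterwards, starting logical replication on that slot streams exactly the
-- changes made after that snapshot: nothing missing, nothing doubled.
```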
So, if somebody has questions from what that all is useful for, go ahead. Yes, well, if you do the init logical replication,
as soon as you send the next command over that replication connection, or if you end the replication connection, you can't import it anymore. Because the way the exportable snapshots work in Postgres is that the source connection, transaction, session, whatever, needs to be alive. So, you need to wait until you did this to send the next command or disconnect.
No, you mean how do we release in that session the snapshot? That's independent. As soon as you start it, it's independent from the source transaction where you imported it from,
and just the normal begin commit. It's stored in a file, but it's opaque to you, you don't know anything about it.
I'm not sure I can follow right now.
Yes, if you start the logical replication, it will give you the name for that, and then you can start a modified pg_dump or whatever code you want to use to copy the database at that point. I'm not sure what you mean by querying the database in that case.
No, the point is that when you start replication, it needs to find a point in time where it has enough information to decode changes from that point onwards. So, you can't start at every location because why that is, I will explain in the following slide,
but you need to wait for that specific point, for those specific points. So, you need to first start the replication and then dump.
Yes, exactly, and it will tell you from where on it can decode the changes, and since that is a point you can't influence, you provide a snapshot to get the information from that.
So, how do you coordinate this? At the moment you don't. So, there's a bit more intelligence needed to do that. Yes, exactly.
I have something like this, but it's just a bit too much coordination, and I think we need to start somewhere.
Yes, but you need to coordinate the starting of the... You need to make sure that the min recovery point is after that XID, so there are some complications involved.
That's the discussion we had on the mailing list. So, the idea is that you need to either don't do that, or you will get an error if you do, and then you can retry.
Or just lock it before, or if you want to do it, get really fancy, you can do in the init callback, you can do like install an event trigger preventing DDL, if you really want to.
Some of us have a lot of users who are kind of like Brownian motion; I just assume that they are all doing a random walk.
It seems that any kind of strategy which relies on users not doing things will be very painful for us, just in the operational world, and we'll make sure that the event triggers will be able to do that. I think you just need to install an event trigger preventing DDL, or you just lock pg_class so it can't be updated.
Does that work? Yes. I tried that. It obviously has some disadvantages, because you stop vacuum from working and stuff like that,
which might not be what you want, but it works. But I think doing the DDL, installing a trigger preventing any changes, or just acquiring locks beforehand,
because you prevent create table. We could add an option. Yeah, create temp table blocks, which might surprise some users.
Yeah, I think event triggers are way much more flexible there, so I think that's the way to go. I think that's possible right now. Yeah, but you can't do that after you quite re-lock and prevent a DDL.
It's just DDL.
Okay, so now the following parts might interest a bit of you people, but anyway. So, what do write-ahead log records look like? Some of you will know. For a simple case where we have a single insert, it's like: okay, at a specific position in the write-ahead log,
this is the LSN, the address in the write-ahead log, the log sequence number, which is basically an offset in the file. And we have a transaction number, and we know, okay, it's a heap manipulation, an insert, into a specific relation, at a specific point in that relation, and after that follows data.
And the data that follows is basically nearly the same as we store in a normal table. So, after fiddling a bit, it looks exactly the same as if you would do a sequential scan, or a heap scan in some form, via an index or whatever, just like that.
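Conceptually, such an insert record carries something like the following; the field names are simplified for illustration and are not the actual Postgres structs:

```c
#include <stdint.h>

/* Simplified, conceptual layout of a heap-insert WAL record.  The real
 * structures (XLogRecord, xl_heap_insert, ...) are laid out differently;
 * this just shows what information the record carries. */
typedef struct ConceptualInsertRecord
{
    uint64_t    lsn;        /* log sequence number: byte position in the WAL */
    uint32_t    xid;        /* transaction id doing the insert */
    uint32_t    rel_node;   /* which relation, identified by file node, not name */
    uint32_t    block;      /* which page of that relation */
    uint16_t    offset;     /* where on the page the tuple went */
    /* ... followed by the tuple data, nearly as stored in the table ... */
} ConceptualInsertRecord;
```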
So, there are some problems when you want to do all that decoding from the write-ahead log, because it's not really originally made to produce logical changes; it's just for recovery. So, one of the interesting cases is: okay, we start several transactions at the same time.
Then we create a savepoint in one transaction. We insert a record in one transaction, then in another session; then we say, okay, that savepoint is released, insert another row, and then commit both transactions.
The way that will look in the write-ahead log is: okay, in transaction 704, we started, and we did an insert. Then in transaction 705 (huh? that's a separate transaction) we did only a savepoint. But the way it works in Postgres, a savepoint is actually a separate transaction. It looks like one, somewhat.
So, okay, and then we see another one, oh, I forgot to copy one here, okay. Then we, no, I didn't. So, we have now here the, yeah, I forgot to copy one, sorry. So, you now see the insert from here, which should be, no, it's all right, it's okay.
So, we see the first commit, which is the S2 commit here. Then we see a commit of the 703, and if you look at it there, it tells you: okay, we have several sub-transactions, and a sub-transaction with XID 704.
That's the first point where you know, okay, that XID we saw earlier belongs to the top-level transaction. So, now you need to make sure that you can get all those together in some sensible way. There's also another problem, you might have, at this point, you might have a rollback, sorry, rollback.
You might have a rollback here, and if the transaction rolled back, you don't want to stream out those changes, because they obviously don't exist. And we actually can, but that's another matter. So, what we do is, we basically reassemble the individual items back into a full transaction.
We do that by collecting all those changes from the first time we see an XID until we see a commit record involving that XID; we store those in memory or on disk; we collect them,
and then when we see a commit record, like here, we know: okay, we have all the rows from XID 703, but we also have the ones from 704. So what we do is a merge between all those different transactions,
which we have either in memory or on disk, and since they are all LSN ordered, we don't need to pre-sort them; we can just walk through them using a binary heap merge, and for every change we see, we call the callback from earlier:
when we start with the transaction, we call begin; when we see a change, for every change, we call change; and then at the end, a commit.
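A toy sketch of that reassembly step: each (sub)transaction's change stream is already LSN-ordered, so the merge just repeatedly picks the stream whose next change has the lowest LSN. All names are made up, and a linear scan stands in for the binary heap the real code uses:

```c
#include <stdint.h>
#include <stddef.h>

/* One decoded change, as accumulated per (sub)transaction. */
typedef struct Change
{
    uint64_t lsn;           /* position in the WAL; each stream is LSN-sorted */
    /* ... decoded tuple data would live here ... */
} Change;

/* One per-(sub)transaction stream: an LSN-ordered array of changes. */
typedef struct TxnStream
{
    Change  *changes;
    size_t   nchanges;
    size_t   pos;           /* next unconsumed change */
} TxnStream;

typedef void (*change_cb)(const Change *change);

/* Replay the merged top-level transaction in LSN order: begin, every change
 * from all subtransactions interleaved correctly, then commit. */
static void
replay_transaction(TxnStream *streams, size_t nstreams,
                   void (*begin_cb)(void), change_cb change,
                   void (*commit_cb)(void))
{
    begin_cb();

    for (;;)
    {
        TxnStream *best = NULL;

        /* pick the stream whose next change has the smallest LSN */
        for (size_t i = 0; i < nstreams; i++)
        {
            TxnStream *s = &streams[i];

            if (s->pos < s->nchanges &&
                (best == NULL ||
                 s->changes[s->pos].lsn < best->changes[best->pos].lsn))
                best = s;
        }

        if (best == NULL)
            break;          /* every stream is exhausted */

        change(&best->changes[best->pos++]);
    }

    commit_cb();
}
```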
No, you can't, because we cannot easily be sure that all the data in there is consistent. And the rollback will be in the write-ahead log, so you might see,
if you have done a rollback, you would see a transaction thing, but it would say, okay, here's the rollback, and then we say, oh, we see a rollback, we throw away all the changes we accumulated from that transaction, and continue on.
Exactly, so the output plugin, because we can't actually call output plugins for aborted transactions. Why? We'll come to that in a second. Yes, so we just throw away that XID's changes.
You can only get changes from one declaration. So you need, because we do, I'll explain, ask the question in five minutes.
It will stream out the changes when we see the commit record. Yes, you get the changes streamed in commit order, which is a useful property,
which is the order replication solutions like Slony try to get, and they use a hack and don't quite get there fully; but it's a very useful property for replication. There are other interesting orderings; you might ask Kevin Grittner about that,
and nobody will be able to follow him. But there are other orderings; this is what you can easily do, and it is sufficient for most replication scenarios, short of providing actual serializability across multiple nodes.
According to Kevin, right? No, he said earlier that he doesn't want to do that. So if there's a rollback, it doesn't set at all? Yes, you don't even get the output plug-in call. Yeah, you don't get the data for any of the things you're getting rolled back. So we actually, it's not that hard to add a streaming mode,
where we stream out the changes, and then you could get a rollback also streamed out, but that actually turns out to be very annoying.
So it's possible; it's just that your replication solution needs to get more complicated, because it suddenly needs to deal with several transactions in parallel.
Oh no, it's important to make that target insane. It's also important for things like if you have a very huge transaction, and you stream that at once, it might take a while. So at some point we might want to say, okay, the transaction is bigger than 5,000 changes already accumulated,
switch to streaming mode, tell the output plugin this is a streaming transaction, if it enables that feature, and then stream it out. But there are some more changes needed for that. I'm not sure who is interested in this,
but currently the syscache, and the invalidations we need to make the syscache work correctly, are only logged at commit, not earlier, and we would need them earlier. So if that is enabled, we need to do some interesting things. I was wondering if you started replicating the binary copies with transactions.
But the rollback will get rolled back on the new node anyway, restarted up for the first time. And if you start, any transaction that is in progress will be marked as a rollback.
Unless you count prepared transactions, you can't have, oh yeah, believe me,
there is no problem in supporting them in the end. If you want to do that, I didn't care so far. So why is this all not that easy? So if we do a create table, then we do an insert,
and then we do an alter table, and then we do another insert. Since we are decoding all that asynchronously, when we decode the data for that insert, we need to know what the table looks like in order to decode it. If you go back to the record here, the data here doesn't know anything about its contents.
It just knows, okay, I have that many columns, but it doesn't know which type the columns have, doesn't know about anything like the name, it's just a block of bytes. So you need to know the table definition to do that decoding. So if you decode here, if we are in this point of time,
but decode the insert, we would try to decode the 1, and look at the table, we would see, okay, it's a type text, and try to decode a 1 as a text datum, which will cause a segfault. Because it will think, oh, there's a toast pointer at the address 1.
No, not really. So that's not nice. So what you do is, the solution for that is, while we go through that stream of changes, we have a snapshot of the catalog that looks like,
okay, when we decode this one here, it looks like, okay, the table definition is, you have an integer and another integer. When we decode the next insert, it looks like, oh, we have a text and the integer. So it really provides a look,
a snapshot that looks like the catalog at that point in time. Could you just figure it out from the differences?
Yeah, if you are able to get something useful out of that, sure. But the problem is, there might be multiple alters involved in between, and then you don't see that anymore. So you would need to make a comparison every time.
No, every single insert in there makes a difference. You might have ten alters of the same table. Consider somebody doing three alter tables in a row,
and the middle one might be an ALTER TABLE ... USING. For example, with ALTER TABLE ... USING, you won't be able to detect that in the catalog. So doing it via catalog differences isn't really going to work.
Maybe, yeah. So how do we build that snapshot continuously? The one interesting property is, we only need the snapshot to be able to look at catalog. We won't be able to look at the normal user tables
because they are not necessary for decoding. There are some edge cases Robert likes to mention. But for now, you would need some special things. You would say, this table is also a catalog table, and then it will be decodable. He doesn't like that, but that's what I can give you.
So when we read the write-ahead log, and we do the init_logical_replication, so we establish a logical replication slot, which is this name tag, we read the write-ahead log until we see either a shutdown checkpoint, or an xl_running_xacts record, which was built for hot standby and has information about
all the transactions that are running at that moment. With that information, we can actually build a snapshot that's sufficient for decoding, for looking at the catalog. But we can only do that after all the transactions that are mentioned in there have finished.
So we wait till all those XIDs have committed or aborted. And we don't want to do that every time, so every time we evolve the snapshot, we write it to disk. Because, for example, in hot standby you have the problem that if you shut down a hot standby node and start it up, you might need to wait three minutes
till you can access it again if there's a high amount of write traffic on the primary. That isn't really good, and you cannot just skip intermediate changes for replication, because obviously a replication solution which skips 10 inserts every now and then might not be so useful.
So what we do is we save the snapshot every now and then, at specific points aligned with specific things in the write-ahead log: every time we see a checkpoint record, every time we see an xl_running_xacts record, we save the current snapshot to disk, which is mostly like half a kilobyte or so.
And we remember when we last decoded; we did that once, and so we just start reading from the last snapshot we logged and go on from there. So that's how that works at a very, very high level.
So that actually is a 2,500 line C file, which has quite some complications. That's the most complex part of all this. So this is what we have for change set extraction itself. But since we actually want to build
multi-master replication, there are some more features we have in various stages of completion. So for conflict resolution, when you want to implement a last-update-wins strategy, it's very useful to know when a specific XID committed. So we have the commit timestamp module,
which you can enable. So: show commit_timestamp. When it's enabled, we can do a start transaction and then ask, when did transaction 695 commit, and it will give you the timestamp of that. And it's crash-safe.
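In SQL, the demo looks roughly like this; the parameter and function names are the prototype's as described in the talk, not necessarily what a released version will call them:

```sql
-- Prototype names as described in the talk; illustrative only.
SHOW commit_timestamp;                          -- the feature has to be enabled

BEGIN;
SELECT pg_get_transaction_committime('695');    -- commit time of XID 695
COMMIT;
```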
That's very useful for us, and I think I've seen others asking for that kind of thing. It's also useful. Yeah, yeah. It's real. Alvaro wrote the code for that in like two hours. So it's just not very hard.
Then the next thing we have is an abstraction built on top of sequences because in a multi-master environment, using plain sequences is not that easy to use because you need to, either you get conflicts on every insert, which is not very nice, or you do stuff like managing the increment
on every node individually and saying, okay, aligning the start value so you never get conflicts between nodes, but that sucks if you add new nodes. So not very nice. So what we want to do is provide another way sequences work, and since at the moment sequences are quite hard-coded in Postgres,
we added an abstraction for sequence AMs. So just like we have different index access methods, we now have sequence access methods. So we do create extension bdr, which is bi-directional replication, which is our multi-master thing,
and say, okay, create a sequence using bdr. So we specify which implementation we want to use. Otherwise, it just uses the plain old implementation. Then we can pass options to it like, okay, you want sequences that acquire a thousand IDs for every node,
and when those thousands are used up, it gets new ones, but that's just our implementation. You can do lots of other things. You can implement gap-less sequences if you really want to. I don't, but some people seem to want to. Yeah, so at the moment, they are not transactional.
So if you really need them, you can say, okay, create sequence using a gapless sequence AM, and then it would obviously involve more locking, but you could do it. Postgres license. It's on git.postgresql.org.
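As a sketch of the proposed syntax (sequence access methods were a proposal at this point, not committed core syntax, and the gapless AM below is hypothetical):

```sql
-- Proposed syntax, per the talk; not part of core PostgreSQL.
CREATE EXTENSION bdr;

-- Use the bdr sequence access method instead of the built-in one, e.g.
-- handing out chunks of ids per node to avoid multi-master conflicts.
CREATE SEQUENCE global_id_seq USING bdr;

-- A hypothetical gapless implementation could be selected the same way,
-- at the price of much heavier locking.
CREATE SEQUENCE invoice_seq USING gapless;
```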
It's live-consulated. So what we really want to do is the multi-master with conflict resolution implementation. That's why we are funded to work on the whole change set extraction thingy. So what we're building is logical replication.
We want asynchronous multi-master because our use case is globally distributed nodes and doing synchronous globally distributed multi-master. Yeah, somebody else. My life is hard enough. And for now, we implement a last update win strategy
because that's good enough for us; but the next thing would be to allow specific conflict handlers to be called on every conflict, and then, to make sure we have a consistent state, we log the afterimage and make sure it resolves the same way on all nodes, so: eventual consistency.
What other possible... Yeah, yeah, yeah. Is it possible? Yes, you need to set a flag to do that
because it involves more data being collected, but yes. Well, I think it's possible that we can get the change set extraction into the next release. There are some people interested in that.
I think we might be able to get the sequence stuff in, possibly even the commit timestamp. There are some things around that which might be problematic, but the rest of the multi-master thing doesn't actually need to modify any core routines
in the way we have it now. So we just distribute it as an independent tool for now when it seems apparent... No. Yeah, so we just have our own routines that do the heap-insert, heap-lock,
and on a low level, and about the same as in any normal Postgres. You need to have... No, you can't do that in Postgres. No.
You can't have more than one gigabyte per column. You can't have more than one gigabyte per column at the moment, and that needs to fit into memory. And if you have ten of those, they all need to fit at the moment. Yeah, for us it only has to fit there once,
so we might actually be better off than normal Postgres. Testing?
The code, there are some areas of the code which aren't up to something I want to use in production. No. So there will be some months to use that in production. Also, at the moment, it depends on patches of Postgres
that aren't in core, so you need to run a patch to Postgres, and I'm not sure you want to do that in production. So, not yet.
Do you want to say something about the timeframe, or...? Because I didn't have a life for one year.
Not for five. We hope that the change... We were at the developer meeting yesterday.
We hope that we can get a patch ready very... hopefully very early on, because patches getting committed at the end of a release cycle tend to be problematic and tend to involve emotions to a degree that isn't comfortable.
So, it won't get into 9.3, and it wasn't ready at all for 9.3, and I hate that it isn't in 9.3, but it is the right decision. So, yeah, definitely.
Yes. So, for example, I had the first draft of that
presented at last PGCon, and since then, I don't think too many lines of that code remain. I think mostly comments. No, they change.
No, absolutely not. You can do that, and you probably want to do that for many use cases, but you get passed the relation, which contains the relation name.
You can do any... there are just lookups, catalog lookups, and if you declare your configuration tables as being timetravel-accessible, you have even those; you can declare a SkyTools or a Slony table as: okay, I can look at that in the output plugin, and then you see, okay, at that point of time,
we did decide to ship that, or not. Yes. Yes. Only for catalog tables, or tables declared as being catalog tables.
Yes, we are about three times as fast as PgQ, just streaming out the changes. I didn't do the other comparisons so far, and, yes, there are cases where the performance gap is big:
if you have like one timestamp column and you do binary output in the output plugin, it's like 15 times faster. That's just because the output functions for timestamps are very expensive. If you do something like PostGIS, it's the same. As soon as you go to stream out binary changes,
it gets much faster. Yes. Have you measured the overhead? Yes, I have. It depends a bit. At first, it was impacted quite a bit, but at the moment it's like three to five percent when you have one active slot, and the increase in the amount of data
in the write-ahead log during a pgbench run is like six percent. Yes. Can you get what it looks like next time? Yes.
You do need to do that because otherwise you can't call output functions, and because, for example, if you call record_out, you know what it does? It does syscache lookups. There is no way to do it without the syscache.
No, the snapshot will be, every time you see a change, you get passed a snapshot that is consistent with the one you would have had when the change was made, except that those at creation time are actual snapshots now.
You possibly have more inconsistencies there than at the current point in time. But Robert is going to fix that. Actually, it seems like you might actually have fewer snapshots.
No, there aren't any snapshots transmitted. You compute them on the fly. Okay, so it's great. Yes. That was one of the very hard requirements from various people that we... Yes.
You need to see the holes in my wall. Yeah, I agree, totally. We're very happy. Yeah, I am very happy. I just wasn't happy back then.
But that's the nature of being told what you did is wrong. So, any other questions? Yes.
There's a system view where you can check, okay, which logical replication slots haven't been streamed for a long time, and you need to include that in your monitoring setup, just like you need to do the same for streaming replicas using hot standby feedback, or if you need to monitor long-running transactions,
you need to monitor prepared transactions, all that. No, it uses the normal catalog and because we are in an MVCC database, we don't immediately delete data.
We prevent vacuum from removing those rows from the catalog tables, not from normal tables but from catalog tables, so we can still... Yes. Yes, for example, if you do like a massive create-temporary-table load, that is going to be noticeable.
I hope we can get changeset extraction in. I hope we have a chance of getting in the commit timestamps and the sequences, because those are the parts that any of those solutions will need in order to work without being tied to the core release cycle,
because at the moment all that is moving too fast for core, in my opinion. So, those are the ones I want to be able to get and there are some smaller technical patches that aren't appropriate to discuss here. Because you're sitting in the middle.
Because you're in the middle and you're standing; you're the only one standing up. I'll say it isn't just me. So, one more thing: there are quite a few people I have to thank. So, many thanks to all those mentioned
and to all the people I didn't mention, because I didn't feel like going through several hundred emails; I think it's way over a thousand, that folder. And by now Peter isn't at 2ndQuadrant anymore, but still, he has helped and he possibly will help further.
So, definitely deserve to mention. So, yeah.