Warm standby done right
Formal Metadata

Title: Warm standby done right
Number of Parts: 29
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/19152 (DOI)
Production Place: Ottawa, Canada
Transcript: English (auto-generated)
00:00
Hi, my name is Heikki Linnakangas. Nowadays, I work for Pivotal. We're working on open-sourcing Greenplum, if anyone hasn't heard yet. But what I'm going to talk about today is something I've been hacking on for 9.5. In this presentation, I'm going to talk about some
00:22
improvements that we did in 9.5 and some changes that are hopefully improvements in 9.5. This is all about the built-in tools and the server-side behavior that we have in Postgres. I'm not going to cover anything about repmgr, WAL-E, pgBackRest, or all of these other tools.
00:42
I'm also not going to cover how to do monitoring or heartbeats, anything like that. Those are all very important things, you need those. But I'm not going to cover those important parts today. I'm going to talk about the Postgres internals. I've split this presentation into a few parts that are kind of loosely coupled.
01:02
I'm going to present some problems that we have had in previous releases, then I'm going to explain how you can now work around them or how they have been fixed in 9.5. So, let's get started. Part one, continuous archiving. How many of you use continuous archiving?
01:22
How many of you have set up a WAL archive? And so, yeah, quite a few. So there are a few gotchas that you might not be aware of. The simple procedure for setting up a WAL archive is: you set wal_level to archive, then you set archive_mode to on, and then you specify the archive_command. And this is where it gets more complicated.
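As a sketch, those three settings side by side in postgresql.conf (the archive_command here is a placeholder, not a recommendation):

```
# postgresql.conf -- minimal continuous-archiving setup (9.5-era)
wal_level = archive                  # or hot_standby; anything above 'minimal'
archive_mode = on
archive_command = '/path/to/your-archiver %p %f'   # placeholder script
```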
01:41
What do you put in your archive command? Stephen? pgBackRest. That is a good answer. Let's see if pgBackRest does what I'm going to talk about and if it actually does that correctly. We'll come back to that. All right. So when you set up a WAL archive,
02:01
the Postgres user manual says that it's very flexible. The archive command might store the data, might copy it over SSH. You might use something like pgBackRest and I don't know what the hell that does. Maybe it uses SSH to copy it somewhere else or I don't actually know. Does anyone know? Yeah, okay. Well, you can set it up multiple different ways, right?
02:21
You can set it up to copy to a local directory and then asynchronously. Okay, there you go. Send it off, or you can just send it off directly from underneath the archive command. Personally, I like the async mode for this because, I mean, at the time we were generating over a terabyte of WAL in like an hour or so. You know, it's different, so. That makes sense.
02:40
So you copy it to a local directory first. Yeah, that makes sense. So. Make sure you sync that, but yeah. Yes. So you can copy it anywhere, burn it on a CD and whatnot but what the manual says that you have to make sure is that it returns zero if it succeeds and it must not return zero if it does not succeed.
03:02
Another important point that the manual makes is that it must not, it should refuse to overwrite any existing files. It's a pretty important safeguard. I think things will work if you don't do that, if you will overwrite existing files but it just sounds very dangerous. So, you know, it's a good safety measure not to.
03:23
Postgres should work if you do, but. As long as it's from the same cluster. Yeah, it's very easy to get confused if you have multiple clusters pointed at the same directory, and oops. Yeah, but assuming you do everything correctly, Postgres doesn't require for correctness that you refuse to overwrite files
03:42
but it's just a very good administration practice. As the manual says, if you make an administrator error, it will protect you from that. So then the manual goes on and gives this example which is like, you just use the copy command. And it even says that it's an example, not a recommendation.
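For reference, the manual's example is essentially a `test ! -f` guard plus `cp`, where Postgres expands `%p` to the path of the segment and `%f` to its file name; the guard is what implements the refuse-to-overwrite rule. The snippet below exercises the same pattern with throwaway stand-in paths:

```shell
#!/bin/sh
# The manual's example archive_command looks like:
#   archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
# Exercise the same pattern with temporary stand-in paths:
archivedir=$(mktemp -d)     # stand-in for /mnt/server/archivedir
segment=$(mktemp)           # stand-in for a WAL segment (%p)
echo "wal data" > "$segment"

first=0
test ! -f "$archivedir/seg" && cp "$segment" "$archivedir/seg" || first=$?
# first is 0: the copy happened

second=0
test ! -f "$archivedir/seg" && cp "$segment" "$archivedir/seg" || second=$?
# second is 1: refused, the file already exists
```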
04:02
Because it might not work on all platforms depending on what these functions do. Does it work on any? I think it works. But does it work correctly on any? I don't know. So yeah, here's the first gotcha you got there.
04:20
cp won't do everything. So if you use this example command to just copy it to a local directory, or if you write your own tool that just copies it to a local directory, there is no guarantee that when your program exits, it's really on disk. So it's entirely possible that right after
04:41
your command has returned, Postgres decides, okay, this has been safely archived, I can delete it. And then you crash. And now you have the file deleted from Postgres side, but it's not actually stored on disk yet. So you lose the file. No, it's not as serious as it sounds, because it's only the latest file you lose that way, or maybe a few latest.
05:01
So it's not the end of the world, but it's still pretty annoying. And it does mean that if you then continue running, you'll have a missing file in the middle of your archive, and then you can't restore that backup any further than that. So yeah, it's annoying. I said I wouldn't talk about monitoring tools, but that is one very good test
05:22
that your monitoring tools should be checking. Do you have all the WAL files? They are sequentially numbered, so I recommend that you, every now and then, poll your archive and make sure that you have all the files in there. Who knows, you might have problems like this or something else that makes one file go missing. And if you're missing a file in the middle,
05:41
all the rest of the files that you have after that are useless. So the second gotcha that comes, this arises from the fact that if you follow the good administration practice that you never overwrite an existing file, there is actually no guarantee that Postgres
06:01
won't try to archive the same file twice, because it's entirely possible that after you've already archived a file, you crash. And the file was already archived, but Postgres didn't get the memo. It doesn't know that it was safely archived, so after you restart, it will try to archive it again. Now, if you wrote your archive command
06:23
the way the manual says, it will throw an error because it's already there and now your archiving is failing. Hopefully, again, you have these cool monitoring tools that will detect that and then you will go and fix it. But it's something that can happen. So, yeah.
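Coming back to the earlier suggestion of polling the archive for missing segments: since the file names are sequential, a gap check can be scripted. This is only a sketch, and it assumes the default 16 MB segment size, post-9.3 naming (the low byte of the name runs 00 through FF with no skips), and a single timeline:

```shell
#!/bin/bash
# Report gaps in the WAL segment sequence of an archive directory.
check_wal_gaps() {
    local dir=$1 prev="" cur f
    for f in $(ls "$dir" | grep -E '^[0-9A-F]{24}$' | sort); do
        # Segment counter: middle 8 hex digits * 256 + low 8 hex digits
        # (the low field only uses 00..FF with 16 MB segments, 9.3+).
        cur=$(( 16#${f:8:8} * 256 + 16#${f:16:8} ))
        if [ -n "$prev" ] && [ $(( cur - prev )) -ne 1 ]; then
            echo "gap before $f"
        fi
        prev=$cur
    done
}

# Example: a missing ...0003 segment is reported.
dir=$(mktemp -d)
touch "$dir/000000010000000000000001" \
      "$dir/000000010000000000000002" \
      "$dir/000000010000000000000004"
check_wal_gaps "$dir"
```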
06:50
Yes, I would recommend exactly that. You should compare the old and the new file, and if they are byte-to-byte identical, return success instead. So if you try to archive the same file twice,
07:02
there's no point in archiving the same file twice. So I would recommend, if you're writing your own archive command, or something like pgBackRest or some other complicated tool, it's worthwhile to do this check, I think. So, how to write a robust archive command?
07:22
Only return zero on success, as the manual says. As the manual also says, refuse to overwrite an existing file, unless it's identical to the one that exists. In that case, return success. And if you're writing to a local disk, always use fsync to make sure it actually hits the disk.
07:40
If you're copying it somewhere else, then I guess it depends whether you wanna, I mean, is it enough that you SSH, copy it over to another server? Maybe you trust that it won't crash at the same time. I don't know, but I would, yeah. Yeah, still fsync. Why not?
08:13
Yes, that's true, that's true. So if you just write it, if you, and then move it into place. Yeah, actually, that's a good point. I should have added that there. So also make sure that copying the file into place
08:24
is atomic: use something like a temporary file name, fsync, then rename it into place. And do you have the fsync after that too? Yeah. Yeah, you should. You really should. Because again, why not? More fsyncs are always better, until it becomes slow.
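Putting the checklist so far into code, a sketch of such a command might look like the following. The temp-file-then-rename dance and the identical-duplicate check are the points just discussed; the paths, the helper's name, and the reliance on GNU dd's `conv=fsync` are my assumptions, not anything Postgres mandates:

```shell
#!/bin/bash
# Sketch of a robust local archive_command:
#  - return 0 only on success
#  - refuse to overwrite, unless the existing file is byte-identical
#  - write to a temp file, fsync, then rename into place atomically
archive_one() {    # usage: archive_one <source-path> <archive-dir>
    local src=$1 dest="$2/$(basename "$1")" tmp
    if [ -f "$dest" ]; then
        # Re-archiving after a crash: succeed only if byte-identical.
        cmp -s "$src" "$dest" && return 0
        echo "ERROR: $dest exists and differs from $src" >&2
        return 1
    fi
    tmp="$dest.tmp.$$"
    # GNU dd's conv=fsync flushes the copy to disk before dd exits.
    dd if="$src" of="$tmp" conv=fsync 2>/dev/null || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$dest" || { rm -f "$tmp"; return 1; }
    sync   # crude stand-in for fsyncing the directory entry as well
}

# Example: archive a fake segment twice; the second call is a no-op success.
srcdir=$(mktemp -d); destdir=$(mktemp -d)
echo "wal payload" > "$srcdir/000000010000000000000001"
archive_one "$srcdir/000000010000000000000001" "$destdir"
archive_one "$srcdir/000000010000000000000001" "$destdir"
```

Wired into postgresql.conf this would be something like `archive_command = 'archive.sh %p /mnt/archive'`, where `archive.sh` is a hypothetical wrapper that calls `archive_one`.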
08:43
So I think the manual, it makes it look deceptively easy to write a robust archive command, because it's not actually that easy. We should do something like a pg_archivecommand tool that gets these things right. Or maybe not, maybe we trust pgBackRest and Barman. There's a lot of these tools that plug into archive
09:03
command and hopefully get this right. While writing this presentation, it occurred to me that we should maybe add these things to the documentation as well. Which we should.
09:20
So, the part two of my presentation. I'm gonna talk about a new feature in 9.5. It's called archive_mode = always. So in previous releases, you could set archive_mode to on or off. And now there's a third option. It's like a three-state boolean.
09:41
It's off, on, or always, which means that it's also on in a standby server. So in previous releases, if you set archive_mode to on, and you have streaming replication running, or just point-in-time recovery, or any kind of recovery system, it's only the master, the one generating the WAL, that does the archiving.
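In configuration terms, the new third state is just this (a sketch; the command itself is a placeholder):

```
# standby's postgresql.conf (9.5+): archive replayed/streamed WAL too
archive_mode = always
archive_command = '/path/to/your-archiver %p %f'   # placeholder
```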
10:00
But if you set archive_mode to always, also the standby server, or if you're doing point-in-time recovery, then also the recovery server will try to archive everything that it has already restored. And the use case for this is, if you wanna have this setup where you have a master and a standby,
10:22
and you want them both to have a separate wall archive, you don't wanna share it, that's the scenario where you wanna set Archive Mode to always. It makes sense if they're in different data centers or whatnot. It's definitely something you should be doing. But it was not really possible in previous releases
10:41
to set up the archive command that way; nowadays it is possible. So I'm gonna leave that for a moment. archive_mode = always lets you do this separate-archive thing. There are more tricks you can do with that, and I'm gonna get back to that later. So part three is,
11:01
how do you set up continuous archiving with a shared archive? So the previous slide was about having separate archives for the master and the standby. But this is probably more common, I think. People only wanna have one wall archive if they're all in the same data center or whatnot. And they only wanna archive these files once.
11:21
They don't wanna keep a separate archive for every standby. So how do you do this? How many people are actually trying to do this today? Hands up. There's a few. Yeah, there you go. Let's see if you got this right. So the naive approach is to set archive_mode to on and then put the same archive_command
11:41
on both of your servers. And it's not quite good enough. I'm gonna explain why. So let's see what happens when you promote your standby server, okay? So the meteor hits your master server. Hopefully your WAL archive was not on the same server as the master, because then you're screwed.
12:02
So you had your archive somewhere else and it's safe and you have your standby. You promote that to become the new master. At this point, the standby server will, first of all, replay all the WAL it had already streamed from the master. Then it's going to create a new timeline.
12:23
How many of you know what the timeline is? Okay, good. Hopefully everyone who raised their hands before will also know what the timeline is. Then what it's gonna do next, the new timeline means that it's gonna bump the number in the WAL files it generates so that they don't clash
12:42
with the ones generated in the first server. So you get kind of two sets of WAL going forward. Anyway, what it's gonna do next, it's gonna copy the last segment it was writing to or recovering, and rename it onto the new timeline, and continue going from there.
13:00
And then it's gonna restart the archiving because you had archive_mode set to on. So you have this situation. You have your master server and you have these three WAL files there. You have the standby server, and you have had streaming replication set up and you streamed all of these three files to the standby, and you also have the WAL archive.
13:20
The master has called the archive command for all of these three files and they're all in the archive. So they're all in sync. Oops. Now, as time goes on, there's more activity in the master. It will generate more WAL. So you see a few extra files there and they also get streamed to the standby.
13:42
And you're always gonna have this one last segment that isn't full yet. It's currently writing to that file. So whenever you insert something, it goes to that last file that's still active. And if you use streaming replication, your standby is also keeping that file in sync.
14:01
It might be lagging behind, but the point is that it's also writing this partial file that isn't full yet. And Postgres won't try to archive that file until it becomes full, because you don't want these half-full files in your archive. So what's going to happen when you crash and you promote your server?
14:23
In 9.4 and below, what will happen is that the standby will, first of all, archive this partial file and it's gonna have a name that looks just like any other file, which is pretty confusing because you don't know it's a partial one. And I'm gonna do a little detour on this. So you have, in 9.4 and below,
14:43
you have this segment in the WAL archive. Let's go back to the previous one. You have this segment in the WAL archive, the one circled, that looks just like any other WAL file in the archive, but it's not actually complete yet. Half of it contains garbage. It's not a problem on its own,
15:02
but it's very confusing. If an administrator looks at that, how do you tell that it's not a complete file? If you try to open that in Notepad, you can try to open in Notepad and see if you see it.
15:21
Yeah, there's a tool. Yeah, there's the tool. pg_xlogdump, that's it. So you can run that. But from the pg_xlogdump output... oh, your backup tool can do it. How? Okay, so are there any backup tools
15:41
that can do that today out there? There's one that has a feature planned to do exactly that. Okay, so there's a planned feature to do that. Okay, that's good, that's good. That's actually a very nice feature. It would be nice to run that feature on all your WAL files to make sure that they are all complete. Oh, and maybe check the CRCs while you're at it?
16:01
Yes, that would also be nice. Yeah. Anyway, well, while waiting for that tool to appear, there is no really easy way to tell that it's a partial segment. You could use pg_xlogdump, but again, reading the output of that to determine whether it's full is not trivial.
16:23
I could probably do it, but no, yeah. So there are some problems with this. If the master actually continues running, if it wasn't the meteor that struck it, but it's still running, it might later try to archive the same file
16:41
after it's actually complete. But now you have this partial file with the same name in the archive, and you're trying to overwrite it with the complete one, your archive command will fail, and now you never get the complete segment archived. You might not care because you already failed over the standby. Maybe you just wanna throw it away. But then again, you usually don't wanna lose valid transactions
17:02
if you ever might wanna go back and look at it. So the second problem is that what might also happen is that the master actually had already archived it before you failed over, but the standby was lacking behind. So now what will happen is that the standby,
17:22
after the failover, the new master, will try to archive the partial segment with the same name, and that will fail. And now it's gonna keep failing and failing, and your new master will never make any progress with its archiving, and it's gonna get stuck. Now an administrator has to go and fix it. And the administrator is gonna be awfully confused
17:42
because there is the same file with the same name, yes. Yeah, and there's a bit of a problem: with a shared archive, you actually want different behavior for a planned failover than you do for an emergency failover, which is kind of a problem. Yeah.
18:00
And one way to avoid this is to actually use like a different directory for the master and the standby so that even though it's kind of the same archive, but you can then have both copies of the files. But again, then the restore command gets more complicated and it's really not ideal. So this has been changed in 9.5.
18:21
Whenever the standby now archives this partial segment that is not complete, it will rename the file, adding a .partial suffix. So in 9.5, if you do failovers, you will start to see these .partial files in your archive. Don't be alarmed, that's perfectly normal. Usually you don't need to care about these partial files.
18:43
Postgres doesn't care about them. If you copy them into pg_xlog, Postgres will just ignore them unless you rename them to remove the suffix. So this is just for, you know, emergency situations. The partial segment was always like that. It was never really needed for Postgres itself, but it's good to have it in case of an emergency
19:04
if the administrator has to go and, you know, do something in panic. So say that the master,
19:23
say that the standby crashed immediately after the promotion and it didn't get around to archive the next segment yet. It's the new segment on the new timeline. Then you might want to use this partial segment to recover up to the point just before the failover
19:43
because there is no other copy of that WAL, in that case, in the archive. But as soon as the standby, you know, archives the first segment on the new timeline, you have it there as well, so you won't need it. But there are, I don't really know whether that's gonna happen and I don't think anyone, you know, does this happen?
20:04
I'm not sure, but it's there if you need it. Anyway, this is much better than the previous situation where you couldn't tell that it's a partial segment. So that was about the partial segments.
20:21
Now let's get back to what happens at the promotion. So after the promotion has completed, you will have this situation. In the master that crashed, you will have the segment without the .partial suffix. That's in the pg_xlog directory. You don't need to look at that. It was the file it was writing to. In the standby, it has been renamed to .partial
20:42
and it also gets archived by the standby as the .partial. And now the standby continues running. It generates these WAL segments on the new timeline; you see the little number two there that indicates the timeline. And now everything is good. Except if you look carefully at this situation,
21:01
you will see that segments 18 and 19 were never archived. And that's bad because you now have this gap in your archive and there is no way to get past that point if you try to do point-in-time recovery or anything. Now this, you don't see this very often in production,
21:22
but it can happen. If, for example, the master hadn't yet archived those 18 and 19 files, then they won't be there. They will never get there. Usually, hopefully, your archive command keeps up so it doesn't happen. But there's always going to be this small window where it can happen. And especially if your archive command is slow or something,
21:42
it definitely can happen. And you are pretty badly screwed at that point. Because, yeah, it can happen, and it basically means that you can't restore from backup to any point after that promotion. The WAL that we archived here on the second timeline
22:01
is pretty much useless. Who knew that? Yeah, three people. Yeah, this is pretty shocking. I mean, I kind of knew this always. But when I really started to dig into this, it's pretty shocking that this happens.
22:23
Because this is a pretty standard setup. You have the archive command set up the same in master and standby, and you kind of would expect this to work. So what are we going to do about this? I worked for a while on this for the 9.5 release cycle.
22:43
I was imagining an option called archive mode shared, where we would kind of try to detect the gaps and then try to archive them in the standby. I ran out of time for 9.5. And I realized that, actually, with archive_mode = always, you can use that to work around this.
23:03
It's a little bit complicated, but at least there is now, it's now at least possible to fix this. So if you remember, with the archive mode always, the standby will also try to archive everything it receives. Now you can take advantage of that by having
23:21
the same archive command in master and the standby, but you have to set them up so that they're not trying to kind of compete which one archives the file first. But you can now put in the logic there to check if the file already exists and then automatically move it in place if it doesn't, and if it's identical, then you can make it work.
23:43
You want to make sure it's identical, but you have to actually compare the files. Yes, you do. And it's not easy to write this command because of those race conditions: you're suddenly going to have two servers trying to do the exact same thing at the same time, which you don't normally have with an archive command.
24:02
But it's theoretically possible to write that. I have not actually tried to write that command, but it should be possible. Someone has to work on that. That would be great. Yeah. So now there's a way out. Maybe we should get back to having the real archive mode shared where it will,
24:21
the system will take care of this in some other fashion, but now there's a workaround. So, summary on the shared archive problem. You can work around it: you can use archive_mode = always. Make sure your archive command is idempotent if you do that, and that it handles duplicates,
24:42
which means check that the files are identical, and so forth. In 9.5, if you start to see these .partial files, just ignore them; if you don't know what you're doing, you can just ignore them. That's basically my advice. But it's good to have them. And still make sure your archive command calls fsync.
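As a thought experiment, the race-tolerant shared archive command described here might be sketched like this. The hard-link trick for an atomic "publish or detect collision" and everything else below is my own untested guess at an implementation, exactly the kind of thing that, as noted above, still needs real work:

```shell
#!/bin/bash
# Sketch: one archive shared by master and standby (archive_mode=always
# on both). Duplicate and concurrent attempts succeed only if the
# contents are identical. ln(1) fails atomically if the target exists.
shared_archive_one() {    # usage: shared_archive_one <source-path> <archive-dir>
    local src=$1 dest="$2/$(basename "$1")" tmp rc
    tmp="$dest.tmp.$$.${HOSTNAME:-local}"   # unique per archiving server
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    sync                                    # flush before publishing (crude)
    if ln "$tmp" "$dest" 2>/dev/null; then
        rm -f "$tmp"                        # published first: we won
        return 0
    fi
    # Lost the race, or re-archiving: succeed only on identical content.
    rc=0; cmp -s "$tmp" "$dest" || rc=$?
    rm -f "$tmp"
    return $rc
}

# Example: two attempts at the same file; the duplicate succeeds quietly.
src=$(mktemp); archive=$(mktemp -d)
echo "payload" > "$src"
shared_archive_one "$src" "$archive"     # first archiver wins
shared_archive_one "$src" "$archive"     # duplicate attempt: identical, OK
```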
25:05
More questions on this part? Yes. On NFS? I don't know. I don't. Yeah, that's one solution. Don't use NFS. I don't know. Actually, another way to set this thing up is,
25:23
it's not as performant, but you can just set up like two directories in the same archive and just always archive everything twice. It wastes space, but, and it wastes network bandwidth to move the files, but it might be acceptable, and it's a lot easier to set up that way.
26:19
Yeah, so the question was,
26:20
what is the point of having a shared archive in the first place, if I get that right? Yeah, so one motivation is to use it, use the restore command and be able to restore the files to the standby from there. The other one is just to have a backup. Yeah, the backup of the logs,
26:41
and you'll need a base backup to go with it, to have a full backup with the logs, which, when you have the backup and the logs, then you can do point-in-time recovery to any point and all kinds of nice things. It's, in a production system, you should always use continuous archiving, really.
27:02
I guess the other side of that question is, why do you wanna use a shared archive, and why not always use a non-shared archive? Well, obviously, a shared archive uses less space, for starters. So again, one way to do this is to set archive mode to always, and then accept the fact that everything gets archived twice,
27:21
but then if you care about the share, the disk space, you might wanna have a background process or a cron job or something that then merges all the identical files or something like that.
27:43
Yeah, there is that as well, yeah. So some systems might do deduplication for you, so you don't care. You still have the network traffic to archive twice, but again, it might not be a problem, yeah. Yes.
28:06
Yes. Yes, it does. Or it might have restored it from the archive already, which is kind of strange, because then I'm gonna try to archive it again, but, oh, whatever.
28:22
Oh, yes, there is, because there is no guarantee that the master had already archived it. It might be streamed to the standby before it's archived in the master. So the next part, I'm just gonna mention that there is actually a tool
28:42
called pg_receivexlog. You can do the same thing with pg_receivexlog, and it's probably actually better. It's different from the archive command in that it's software that you install, well, it comes with Postgres, but you have to install it on the archive server itself.
29:01
It's a piece of software that runs not on your database server, but in the archive server. So you can't just use an NFS mount or anything like that. You have to have a real server running there. The way it works is that it connects to your server, and it uses streaming replication to get all the data, and then it will store them to files,
29:22
and the end result is exactly the same that you get with archive command in that you get these files, and it's gonna just write them to a directory, your archive directory. And the end result is gonna look the same as with the archive command. It's better than setting up an archive command
29:42
in that it's actually up to date. The archive command only archives full segments, so you're always gonna lose the last eight or 16 megabytes of data, or whatever, whenever you crash, because we always wait until the segment is full and then archive.
30:01
But with pg_receivexlog, you don't have that problem. It's gonna continuously stream over the changes, which is better. And pg_receivexlog was added in 9.2. Since 9.3, it can follow timeline changes, which was pretty important for this failover scenario.
30:22
It means that if the master crashes and pg_receivexlog was connected to it, it will try to reconnect, and now it will hopefully connect to your new master, and it's gonna notice that, okay, there was a timeline change. It's gonna stream everything over, and it's just gonna work, basically. 9.4 made this reliable in that we added
30:42
these what we call replication slots. Before 9.4, it was possible with pg_receivexlog that the master server already deletes the file before it has been streamed over by pg_receivexlog. There was nothing to stop it from doing so. But in 9.4, you can set up a replication slot
31:03
to prevent that, and then it's really gonna be a reliable replacement for setting up your archive command on a file basis. This requires a little bit more monitoring, I think, because now you have to have this process running all the time in the archive server,
31:21
so make sure it's running. Or more than one of them running. Or more than one of them? In different servers, I guess. Okay, yeah, yeah, okay. Yeah, have a dozen of them running and see what happens. If you really care about your WAL.
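Putting those pieces together, a slot-backed pg_receivexlog setup might look roughly like this. The host, user, slot, and directory names are made-up placeholders, and the commands are wrapped in a function only for readability, since they need a live server to actually run:

```shell
#!/bin/bash
# Sketch: reliable WAL streaming into an archive, per the talk.
start_wal_streaming() {
    # On the primary (9.4+): reserve WAL until the archiver has fetched it.
    psql -h primary.example.com -U replicator \
         -c "SELECT pg_create_physical_replication_slot('archiver');"

    # On the archive server: stream WAL into the archive directory.
    # (9.5 adds --synchronous for the fsync-before-acknowledge behavior
    # discussed next.)
    pg_receivexlog -h primary.example.com -U replicator \
                   --slot=archiver -D /var/lib/wal-archive
}
```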
31:43
9.5 added an option called --synchronous, which means you can use synchronous replication now with pg_receivexlog, which means that it's not gonna acknowledge to the master that it has the data until it has fsynced it to disk, which means that you can set up your system
32:01
so that whenever you commit, the commit will wait until it's safely archived by pg_receivexlog. I wouldn't normally recommend using synchronous replication at all, but that's a different topic. If you do, make sure you have more than one. And if you think you need synchronous replication, you probably don't anyway,
32:21
so think again. Usually. There are scenarios, I admit that, but usually when people say they need synchronous replication, they don't. It requires manual interaction, because the last file written by pg_receivexlog
32:43
will have a .partial suffix or something. That's good, yes. So you will have to manually rename that, or have your restore command check both variants or something like that. That's a good point. So there was some precedent for these .partial files. pg_receivexlog has always used the .partial suffix,
33:01
so whenever it's streaming a file and it's not complete yet, it will use the .partial name for that segment. So when you restore from an archive created with pg_receivexlog, you will have this one .partial file at the end, and you'll want to rename that so that it gets restored, or you have your restore command
33:21
check for the .partial files as well, or something like that. But yeah. Any more questions on this? Great. Use pg_receivexlog. It's easier to set up, probably.
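The restore-command fallback discussed here could be sketched like this. The function name and the archive-directory argument are made up for illustration; a real restore_command script would be invoked by PostgreSQL with just %f and %p:

```shell
#!/bin/sh
# Sketch of a restore_command helper that falls back to the .partial
# segment left behind by pg_receivexlog at the end of the archive.
restore_wal() {
    archive=$1; fname=$2; dest=$3
    if [ -f "$archive/$fname" ]; then
        cp "$archive/$fname" "$dest"
    elif [ -f "$archive/$fname.partial" ]; then
        # The final segment streamed before the failover only exists
        # under its .partial name; restore it under the name that
        # recovery expects so it gets replayed.
        cp "$archive/$fname.partial" "$dest"
    else
        return 1   # not found; recovery stops asking the archive
    fi
}
```

This is the "check both variants" approach; the alternative is to rename the one .partial file by hand before starting recovery.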
33:41
So the final part. A lot of people were guessing that I would be talking about pg_rewind today. I am, but not very much. So another new feature in 9.5 is this tool called pg_rewind. It's really cool. It's used for failback.
34:02
So in previous versions, it's always been a problem: if you fail over to your standby server, and then your old master gets back up again, maybe you upgraded the binaries or something, how do you fail back to that server? You know, reverse the roles of the servers again.
34:21
And the old answer was that you have to throw away the old master's data directory and restore from a base backup. And that's pretty slow if you have like a 10 terabyte database. It's not really feasible in many cases. So there are tricks. You can use rsync to speed that up. Rsync is nice, but it still has to read all the data,
34:40
or you have to trust the timestamps, which you really shouldn't. So in practice, to make that safe, you have to read all the data. And again, if it's a 10 terabyte database, you might not have a lot of changes there, but even just reading through 10 terabytes of data is gonna take a while.
35:01
So in 9.5, there's this tool called pg_rewind, which solves this problem. So let's get to this first. pg_rewind is a tool that replaces rsync for this particular scenario of doing a failback. It works pretty much like rsync, except that when rsync tries to detect
35:24
what files have changed and what parts of files have changed, it calculates these rolling checksums and reads through all of the data. pg_rewind uses the WAL for that, because the write-ahead log already contains a record of every modification we make to any page.
35:41
So we can just use that. So the way it works is, it scans the WAL on the old server, the one that crashed and was failed over from. It scans the WAL there to determine which blocks have been modified, and then it's gonna copy over only the modified blocks.
36:04
And for anything other than the actual tables and the indexes, anything other than the actual data files, it's just gonna copy everything. That includes any configuration files, any random readme files you drop in your data directory, anything. The clog, multixact files, free space maps.
36:22
There are all kinds of files that it doesn't handle, which means that it just copies them over completely. Because that's always safe. It's always safe to just copy over your data directory as it is, obviously. That's the way we used to do it. We used to just take a new base backup. But for the data files, hopefully that's where most of your data is,
36:43
most of the volume comes from, except when you... Yes, I expected that. I heard someone, was it you who said that you did some testing with the multixact bug and generated four gigabytes of multixacts, but only had like 10 megabytes of data?
37:03
It was you. Yeah, there you go. So this tool is not for you. So let's dig into this a little bit. So the problem with failback is basically that you have your old master,
37:20
and you might have some commits there, some transactions there, that were not streamed over before the failover happened. And those transactions are the problem. When you do the failback, your rsync command or pg_rewind will erase those transactions and any modifications they made,
37:41
so that you can then get it in sync with the new master. There are a few points that are important to make here. Some people try to use synchronous replication to avoid this problem. It will not help, because even if you use synchronous replication, there is always gonna be some
38:01
WAL that has not yet been streamed to the standby, but has already been flushed to disk on the master. Synchronous replication does not help with that. It makes the window smaller, but it's still there. There's always gonna be, because synchronous replication only affects commits. There's always autovacuum running.
38:21
There are always checkpoints running. They all write WAL. They all modify pages. And even your transactions do stuff before they commit, and that's not synchronous. Synchronous replication doesn't wait on every WAL record; it's only the commits. So again, synchronous replication doesn't help here.
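The failback procedure being described could be sketched like this. The data directory path and connection string are examples, and this needs two live clusters, so it's a command fragment, not something runnable here:

```shell
# The old master must have been shut down cleanly before rewinding.
pg_ctl -D /var/lib/pgsql/old_master stop -m fast

# Rewind the old master's data directory against the new master,
# erasing the un-replicated transactions discussed above by copying
# back the blocks they touched.
pg_rewind --target-pgdata=/var/lib/pgsql/old_master \
          --source-server='host=new-master.example.com user=postgres dbname=postgres'

# Then point recovery.conf at the new master and start the old
# master up as a standby.
```

The point is that only the blocks recorded in the old master's WAL after the point of divergence get copied, instead of re-reading the whole 10 terabytes.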
38:41
Yeah, any questions on this? It's a really nice tool. I hope people will try to test this and put it into action. The alpha release goes out hopefully next week or very soon anyway. Please try it, see if it works for you. And we still have the time to fix it before the release. It's been around for a while now
39:00
and people are starting to use it and haven't come up with anything major, but we'll see. That is a good question. That's kind of like, yeah. I think that's equivalent to the halting problem
39:22
or something. It's, yeah. Yeah. You can compare, yeah. Yeah, so after you run pg_rewind, which you ran because you wanted to avoid reading all your data, you should then run pg_dump on both servers and compare the output manually, yes. Exactly.
39:47
That would be great, yes. There are regression tests for pg_rewind. You can run them. They don't test very much, but they try to capture all the different code paths
40:02
and the usual stuff. But I haven't tried running it on 10 terabyte databases myself. I hope someone will try that. Yes, there is a version out there on GitHub. I originally wrote this when we were still on 9.3.
40:22
There is a version on GitHub that, as far as I know, still works for 9.3 and 9.4. But there were some big changes in 9.5 that I made to the WAL format to make this more robust and more maintainable, really. If you're interested, you can look at the versions on GitHub for 9.3, 9.4.
40:42
Again, they should work, but I'm not spending any effort myself on that. I think Michael Paquier is still maintaining those. So if you report a bug, it will probably get fixed for 9.3 and 9.4 as well. But I don't really care about that anymore myself.
41:03
Any more questions? Yes, Josh. Do you know how complicated it is to get the process of archiving correct? Maybe we oughta have a PostgreSQL project tool for this that has it correct,
41:21
at least for the case of a locally-mounted directory? Yes, yeah. I think we should probably drag that patch out,
41:41
fix whatever is wrong with it. Yeah. My two cents: it feels like we might have a whole bunch of feature creep associated with that. I mean, pgBackRest has gotten stupid in terms of its complexity because of all the feature creep, because of all the different stuff we wanna do. I'm a little bit...
42:01
We can't keep shipping a database server that requires a PhD in Postgresology in order to set up a streaming replication plan, which I really will not do. My point was more that there are multiple existing solutions that do try to do it,
42:21
and mind you, pg_receivexlog exists, that works, and it's pretty easy to set up. It's a different philosophy than setting up an archive command, but it exists and it works. So perhaps we should start to de-emphasize this archive command thing and document it more as a thing, you know,
42:42
that you can plug into when you're writing something like pgBackRest. Although, even if you have a command like pg_copy that does that correctly, you're still gonna be missing like the last few megabytes of WAL, so it's not ideal even then. I don't know what the solution should be.
43:00
Something like pg_copy would be nice, but again, if you do that, then you kind of lose the flexibility. So it would have to be more like an example that you can copy-paste from, yeah.
44:19
No, I think we agree on that.
44:22
And yeah, we probably should do something like pg_copy. That's actually a good point. If you run pg_copy on the target and not on the server itself, then you can make it work even over SSH or something. But even then, it's kind of limited. And if you have to run something on the target server, then you might as well
44:41
run pg_receivexlog on the target server, maybe. Yeah, it's different.
45:23
pg_receivexlog doesn't solve the problem for cases where, for example, you are required to route these WAL files to a file server where you can't run it. Yes, that's true. You're only allowed to tap into it. Well, mind you, you could run pg_receivexlog
45:41
somewhere else then. Actually, I wonder if you could run it on the server itself. It would be kind of weird, but... Oh, really? That's cool. Yeah, you can just do it. Absolutely. Huh, okay. Yeah, so then you would just have pg_receivexlog
46:01
write it to your NFS mount or something. Kind of weird, but yeah, why not? Okay. And I guess running it on the same server doesn't help with the failover scenario
46:20
because, yeah, you'd still have a lot of the same problems there with, you know... Yeah. So the question was, what would be the right way to monitor pg_receivexlog? Look, so yeah, something like Stephen's tool
46:43
that continuously polls your archive, making sure that there are new files arriving there every now and then. I would start with that. And it would also be good to actually read through the files and make sure that you don't have any gaps, missing segments, you know, and that the CRCs match.
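The gap check could be sketched like this. The function names and directory layout are made up; it only checks segment-name continuity within one timeline, not the CRCs, and it assumes the default 16 MB segment size, i.e. 256 segments per log id:

```shell
#!/bin/sh
# Sketch: check a WAL archive directory for missing segments.
# A 24-character segment name is TTTTTTTTLLLLLLLLSSSSSSSS:
# 8 hex digits of timeline, 8 of log id, 8 of segment number.

# Convert a segment file name to a linear sequence number.
wal_seqno() {
    log=$(printf '%d' "0x$(printf '%s' "$1" | cut -c9-16)")
    seg=$(printf '%d' "0x$(printf '%s' "$1" | cut -c17-24)")
    echo $((log * 256 + seg))
}

# Walk the archive in order and report any non-consecutive jump.
check_archive() {
    prev=
    for f in $(ls "$1" | grep -E '^[0-9A-F]{24}$' | sort); do
        cur=$(wal_seqno "$f")
        if [ -n "$prev" ] && [ "$cur" -ne $((prev + 1)) ]; then
            echo "gap before $f"
        fi
        prev=$cur
    done
}
```

A cron job running something like this, plus an alert when no new segment has arrived for too long, covers the basic monitoring story.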
47:01
That would be pretty robust. There was a question? Yes, Josh. Yeah, just to put some more intelligence in it,
47:21
that is, you make the restore command actually a script that has intelligence. Yeah, that makes sense. You do end up archiving twice, but... Yeah, well, you wouldn't archive twice, because it's not the same timeline tree. So in the master's archive, that's the tree that forked off further.
47:48
And so we have each server archive to its own directory. Then the restore command consults a list of what the valid directories are. Yes, okay. And tries to build a complete stream from it.
48:00
That makes sense, but it does not solve the missing segments problem. So there's a gap there. Any more questions? I think we're over time. Thank you.