
The PostgreSQL replication protocol, tools and opportunities


Formal Metadata

Title
The PostgreSQL replication protocol, tools and opportunities
Author
Magnus Hagander
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
The new binary replication protocols and tools in PostgreSQL 9.0 and 9.1 are a popular new feature - but they can also be used for other things than just replication! PostgreSQL 9.1 includes server changes to allow standalone tools to request and respond according to the replication protocol. From these, tools like pg_basebackup allow a number of new possibilities. And the infrastructure put in place in 9.1 opens opportunities for further enhancements - some already on the drawing board and some just wild ideas so far.
Transcript: English (auto-generated)
Let's get started. Good afternoon, everyone. I'm Magnus Hagander. I'm here to talk to you some about the Postgres replication protocol. I'm a Postgres consultant, primarily. I'm from Stockholm in Sweden.
I work for a company named Redpill Linpro. That's the little red pill in the corner, right? We do open source consultancy, and my team focuses on the database side, and obviously on Postgres when it's databases, right? So I'm here to talk to you again about the Postgres replication protocol, the tools and the opportunities around this, and the things that have happened lately.
So I notice a number of people in here who have actually been in previous versions of this talk. How many people know that they've been in previous versions of this talk? Really? I was going to give you the chance to sneak out silently in the beginning, because if you've seen a previous version, that hasn't changed much. There are a few minor things.
But in general, the roadmap for what has happened has been fairly clear over the past couple of years. So the only major difference is that a number of things have moved from the area of being opportunities to now actually being tools, which is the kind of direction that we'd like to see things move. So when we talk about the Postgres replication protocol, we talk about the new stuff that we
added in Postgres 9, right? The streaming replication part. Hot standby is obviously important, but streaming replication is the protocol level and the lower layer of how we transport and how we deal with the actual replication traffic between our primaries and our slaves. It's based on streaming the transaction log,
or as we often call it in Postgres, the WAL, the write-ahead log. It's exactly the same thing with two names, kind of annoying sometimes, but that's what we're talking about. We always start from what we call a base backup. That's the terminology we use for whatever. Some databases will call the same thing a full backup, for example. We call it a base backup.
It's basically, well, it's where you start from. And then we use standard recovery code. The same one you'll use if you're restoring from backup, the same one you'll use if your system crashes, obviously. Postgres doesn't crash, right? The hardware you're running it on might crash, particularly if you're running it on virtual machines or cloudy or anything, strange things like that.
And the Postgres replication protocol is layered on top of the standard Postgres protocol. So basically all of this replication stuff is layered on top of something else. The actual replication is layered on top of the recovery and restore functionality. The protocol itself sits on top of the standard Postgres protocol, which has given us a lot of flexibility
and a lot of functionality completely for free. The current version of the Postgres protocol has been around for seven or eight years, I think, by now. It's not something we change often. We're gonna change it fairly soon to a new version, but the core of the protocol is a very stable thing to sit on top of, and it has a lot of functionality.
So looking at the different parts of the puzzle that we're gonna talk about through this talk, the connection processing and startup, how that actually works in Postgres, for those of you who are not aware of that. The Postgres protocol itself, which is again, the foundation of how we do this. The replication specific protocol that sits on top of this,
and then sort of moving into the tool space, which has been very much focused around the pg_basebackup tool that was added in Postgres 9.1, which, actually, when I started doing this talk the first time, was the part that was entirely in the opportunities part, because we didn't have any of these tools. We just had the raw protocols in 9.0
that we kept on working on. So for a normal client connection, how does Postgres work? So what happens is it's a TCP connection. You can also have a local Unix domain connection in Postgres. We're gonna ignore that. They work exactly the same in replication scenarios. You're probably gonna have a TCP connection,
because replicating locally over to the same machine doesn't make much sense, does it? Other than possibly through testing. So we get a TCP connection. You get the kernel taking care of all the TCP packets going back and forth, and we have a TCP connection running on port 5432. Now at this point, what Postgres does is fork. What you'll see in a lot of other databases
is you'll spawn a thread, or you'll associate a worker thread somewhere with the existing connection. But I mean, you've all used Postgres. You know, we use a multi-process model, so we create a completely new process that will take care of this specific connection. And once we've done this,
we do a potential SSL negotiation. This happens immediately after the fork. And this is first at the level where the Postgres protocol itself negotiates SSL, by checking: do the client and the server both support SSL? What are the settings of the client parameter sslmode?
And what does the pg_hba.conf file potentially say about SSL? And once that's done, if we determine that SSL should be used on this connection, then we hand it over to the actual SSL protocol level negotiation to set up encryption, verify certificates, and all of this. This is the first thing that happens as soon as we get in.
Once we've done this, we get a number of parameters that are sent from the client to the server unconditionally at all times. We get the database name and we get the username. These are what we're using to map to the row in pg_hba.conf to figure out
how we're gonna do authentication of this guy. And any number of other options, such as client_encoding, these settings that we need to know for every connection, are sent at this point. And once we've reached this, we have enough to find the row in pg_hba.conf so we can perform authentication.
However, that happens to be configured, whether it's username, password. We might do a Kerberos exchange to log you in. We might check an LDAP login. We might use SSL certificates to log in. That all happens during the early startup. And once we've performed authentication, assuming it's successful,
if it's not successful, we just terminate the connection and kill the process completely. But assuming it's successful, we select the database. That basically means Postgres will, well, it'll run a chdir into the data directory of that specific database, but we'll also preload the cache with a number of entries from the system tables
that are specific to this database. And once we've done this, we are ready to start processing queries, to start doing things. So we enter the query processing loop where we just stay forever in this backend: we get a query, send a response, get a new query, send a response,
and we keep doing that until we're done with the connection. Now, this is a normal Postgres client. If we change this into what it looks like when we're talking about a replication client, it's very similar. There's actually just a couple of things we change here. We do the database/username/options thing,
but we change it because there's no database name when we're creating a replication connection; it's actually a hard-coded database name that we use. There's also no point in selecting a database if we don't have a database. And obviously, we're not gonna enter a query processing loop because we're not going to process queries.
But we are going to accept the TCP connection at exactly the same place. And we're going to fork. And we're going to do SSL negotiation the same way. And we're going to perform authentication the same way in exactly the same place in the code. It's the same line in the same C file. The big advantage of this is any way
you can set up and use authentication for your regular Postgres clients, you can also use authentication for your replication clients. So if you've set up Active Directory integration with Kerberos and GSS API and whatnot for your clients, you can use this for your replication as well.
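As an example, a minimal pg_hba.conf sketch (the network and role names here are made up): the entry for replication clients looks just like the one for regular clients, except that the database column uses the special replication keyword, because the all keyword does not match replication connections. And as of 9.1, the role you connect with needs the REPLICATION attribute, or superuser.

    # regular client connections, any database (example values)
    host    all            appuser     10.0.0.0/24    md5
    # replication connections; "replication" here is a keyword, not a database
    host    replication    repluser    10.0.0.0/24    md5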
And I found this, myself, I found this to be really, really useful in many deployments. So I'm happy to say that I was initially against layering it this way on top of the authentication because I thought it would be too complex. It was definitely worth it. I withdraw that objection, or rather I did a while ago. It's shown to be very useful. So the things that we change, again,
we change the database/username/options to just get the database, which is going to be replication. We still get the username, we still get the options. And then we start something called a WAL sender instead of entering the query loop. Now, we're still inside the same process. We don't start yet another process. It's just that we call a function called WAL sender main
in the Postgres backend. So what is this WAL sender? It is basically a special-purpose Postgres backend. A regular backend is always connected to a single database. This is why you can't change database
or do cross-database queries in Postgres. The backend is locked to a single database. The WAL sender is not; it just exists. It only accepts very, very, very simple queries and commands. I think there are three commands in total that you can send to the WAL sender. It's very, very simple. These are commands, again,
we're running on top of the standard protocol, so these are regular commands. They look the same way. You can actually run these commands from psql if you're careful with exactly how you start it and you enjoy watching binary data on your screen. It returns a mix of result sets, which are formatted just like any other Postgres result set,
and streams, which are formatted exactly like your copy streams. And if we're looking basically in Postgres 9.0, again, where this is the very simple case, it only supports a single thing, which is the client connects. It tells the server, give me all the transaction logs starting at point X.
The server goes into the mode where it starts sending the transaction log and it never stops until the TCP connection goes away. There's no further commands. Once the client has said, give me all the data, that's where we are. It's just gonna keep sending that. So taking half a step back, we said this is all layered on top
of the Postgres protocol. So let's talk about the Postgres protocol. It's very simple. We can talk about this fairly quickly. It's nice when there's a simple protocol. Far too many protocols are way too complicated. So it's always gonna be TCP. Well, it's always gonna be a streaming socket and we can have a Unix domain socket, but it works the same way.
It is a message-based protocol. We generate fixed messages, pass them along. It's bi-directional. There is no special streaming protocol that is only one way. The protocol is always two way. We can optionally SSL encrypt it. And when we do this, the entire TCP stream is wrapped in SSL.
We don't try to sort of do encryption at the layer of an individual message or anything. We just use whatever SSL library we have. Currently, that's OpenSSL. It's the only one we support. The protocol doesn't restrict us to that, obviously. And we just wrap the whole stream in SSL in that case.
And the message as its simplest is really, I mean, this is the simplest way you can do it. You have a message type that is one byte. There are some 30-ish or so, I think, message type defines in total in the protocol. You have a message length. That's a 32-bit integer. So a single message will never be greater than two to the power of 32.
Usually, they'll be a lot smaller than that. And then you have the message itself, where the format of the message obviously depends on exactly what the message type is. So a standard query exchange between a client and a server just looks something like this. We have a message type called ready for query
that's sent from the server to the client. That is the code Z, 'zee' or 'zed', depending on where you're from. libpq and Postgres do understand the differences between British and American English, so they're fine with that.
I'm not sure it understands Canadian, though. You might have some issues if you're trying to connect to, yeah, you can get error messages in French, that's fine. And you get a size, and what we do these days, as of Postgres protocol version three, is, along with this, we send the transaction status information, which is just: are you currently in a transaction,
and is this transaction broken or not? The client will respond, or when it's time to run a query, it'll send a simple query message, which is the Q message. It has a length, which is gonna be the length of the SQL query that we're running. And then we got the actual SQL query
just straight in as a text string. It's sent to the parser, it's parsed, it's optimized, it's executed, we get some sort of a response. And the response to a standard query is going to start with a row description message, which tells the client the format of this specific result set.
The T message, it has a size, it lists information about each individual column, like what data type is in this column, how big is the data type if it's a variable length data type, sorry, fixed. And then we send one or more data row messages. If you get 100 rows, you get 100 data row messages. That's pretty simple.
The D, it's got the size, it's got the values. And when the whole query is done, we've returned all the rows, we return a command tag message that just tells you basically what did we just do? It's sort of the I'm done now message. So in this case, we'll send the command tag C and a select telling us exactly
that that's what we did. Now, if we apply this exchange to the streaming replication protocol, it looks surprisingly similar. Or maybe not surprisingly, but it does look similar. We got the same ready for query message being sent by this WAL sender. This is why we can talk to it with something like psql.
We can talk to it with anything that uses libpq. You can talk to it with the JDBC driver if you want to. I'm not sure it's a good idea, but you could. And we send, again, a simple query message. This is the same message that we use to run a regular query. So we send the simple query message. If we want to start a replication,
we send the start replication. The zero slash zero, that's the Postgres way of formatting the transaction log location. Zero slash zero means you just set the system up. That's the very first transaction point. Actually, I don't think we have zero slash zero anymore.
I think that was changed, so it actually starts at one now so that we can use the zero to identify an error. I believe that was changed in 9.1, or maybe 9.0 even. I'm trying to look around at the people who could confirm this for me, but they're looking confused. I believe you're correct. Thank you. And for some reason, I wasn't looking at you. But thanks for confirming anyway.
I didn't recall hearing that discussion. Yeah, no, there were some issues around, if your recovery actually crashed exactly at the segment boundary, it would come up thinking it was done. That was bad. So the difference now is that instead of sending the row description message, we're sending a copy out response.
This is the same kind of response as if your query was actually a COPY. That's the W message. We got a size, and then we got two fields in it. In streaming replication, they will be zero and zero. One of them tells you how big each row is, or sorry, how many columns are in each row.
If you do a regular copy from a table with three columns, one of those will be a three. But we're not copying a regular table, so we're setting it to zero. The other one identifies that this is a binary copy. It's not a text-mode copy. It's a binary-mode copy, because obviously our transaction log is a binary log. And then we send a copy data message,
which is just the D, the size, and then some binary transaction log data. And after that, we'll send another copy data message, and then we'll send another copy data message, and we'll just keep going, and we will never stop. The actual sizes of these XLOG data messages
will depend on how your data is being generated, but as soon as the data is written to the transaction log, it will be sent across the replication connection. And as we were in Postgres 9.0, this is really all there was to the replication protocol.
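Put together, the 9.0 exchange just described can be sketched roughly like this (message contents abbreviated, and the start position is a made-up value):

    Server -> Client:  'Z'  ready for query
    Client -> Server:  'Q'  "START_REPLICATION 0/1000000"
    Server -> Client:  'W'  copy response, zero columns, binary format
    Server -> Client:  'd'  copy data: a chunk of transaction log
    Server -> Client:  'd'  copy data: the next chunk
                       ...  and so on, until the connection goes away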
That's all we had. And then in 9.1, we started building a little bit further on this, started adding some more functionality. Well, for one thing, we added synchronous replication, right? But I'm gonna cheat and not actually talk about how that integrates in the protocol. But it's really, it's a secondary set of messages
just flowing from the replication slave back to the master, telling it how far along it is. We also added something called hot standby feedback, whereby the slave ships information about open transactions back to the master to avoid hot standby query conflicts.
But that's also really more of a hot standby feature. On the lower layer, when we're talking about the replication protocol itself and the tool set, what we actually added is sort of a micro SQL, a micro language used for the WAL sender, so that it can now accept a number of new commands. It can do more things than we could ever do before.
Basically, we now have a full grammar running in WAL sender mode. Previously, the WAL sender command handling was: if the input string compares equal to start replication, do this; else, error, I don't know what to do. Now we do have a full grammar. It has very few commands and very few options, but it does exist there.
And more is being added to it as requirements show up. It's still going to be extremely picky about how you do it. If you thought standard conforming strings made it hard to get things exactly right, this is a lot more picky than that. It's not intended for manual consumption. It's intended for tools. It's intended for tools like pg_basebackup.
It's intended for the replication slaves and nothing other than that. But it's sort of forming a foundation for future improvements where we can build more tools using this low level binary access to the Postgres system. In the future, we might be able to use it to talk
about things that aren't necessarily transaction log. For those of you in here who were in Simon's clustering briefing, as I guess we'd call it, the other night, that kind of stuff may well be layered on top of this. There are a lot of things we can do here because we have a generic protocol. So these are the commands that we actually have
available in Postgres 9.1. So it's still not all that much. We had IDENTIFY_SYSTEM in Postgres 9.0. It's a very simple command. If you connect and run IDENTIFY_SYSTEM, you get a single row back that tells you the system ID, which is used to make sure you're replicating from the master that you're supposed to be replicating from, not just some random other Postgres database.
And that's pretty much it. There is a second column in there. I can't for the life of me remember what the second column is, so it clearly can't be important. We have the START_REPLICATION command, which works exactly as before. You send START_REPLICATION, you give it a position.
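As a rough sketch, assuming a connection that's already in replication mode, those two commands look like this (the start position is just a made-up example value):

    IDENTIFY_SYSTEM
    START_REPLICATION 0/1000000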
I mean, if you want to try this, go into psql, connect to the database name replication, type in START_REPLICATION 0/0, and watch your transaction log come back and beep at your screen. It's interesting when you do that by mistake. The new command that we added in 9.1, the first of the new utility commands added to this sort of micro language, is BASE_BACKUP,
which takes a number of parameters, and well, it kind of tells you what it does, right? It will perform a base backup. That's the only new command so far in 9.1. We haven't actually added any new commands at all in 9.2 yet. We've added a new flag to BASE_BACKUP, I believe,
but not an actual new command. So what BASE_BACKUP does is it gives you a single command to take a base backup. It's starting to look like, well, I was in a talk earlier today about the wonders of SQL Server and how it's so much better than Postgres. And one of the things it had was, you can connect and you can run backup over SQL.
Well, guess what? We can do that too. You don't necessarily want to do it from your command line client, but you can run BASE_BACKUP. So you no longer need to deal with this manual dance: connect with psql, run pg_start_backup, take your backup, connect again, run pg_stop_backup. And then don't forget, make sure that everything worked fine.
And absolutely don't forget to actually run pg_stop_backup or your backups are useless. So now we have it all completely integrated. You can still control things like the backup label, if you actually care about the backup label. You can control checkpointing. You can control a lot of these things.
Now it's not a silver bullet, so to speak. This is not what everybody wants to use for backups, but it's very simple. It's very easy to use and it turns out to work pretty well. The old method is still there if you want to, but I don't know, most people don't need and don't want it. The new method is more robust.
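For comparison, the old manual method is roughly this dance (the label is just an example):

    -- in psql, on the server you're backing up:
    SELECT pg_start_backup('nightly', false);   -- false means a spread checkpoint
    -- ... copy the data directory with tar, rsync, or whatever you use ...
    SELECT pg_stop_backup();                    -- the step you must not forget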
So along with this, the protocol is still, well, you're not intended to use it for manual consumption. You can run this in psql and you will get your base backup as a copy result right on your screen, which you probably don't want. So this is why we have the tool pg_basebackup
that will let you do this. This is the tool that we added in Postgres 9.1. So it's not available in 9.0. It's in 9.1. It'll be in future versions, of course. And it's also designed to be used as an integration point for third party backup systems. If you're using one of these super expensive enterprise backup systems,
they can integrate with this. I don't actually know if anyone does that yet. So if anyone of you know of one of the big backup systems that actually speak this new protocol, that would be interesting to know. So please let me know if you do. But it's there for them to use. So it's a fully documented protocol. It's an official protocol. It's not considered an internal protocol. It's just not for end user consumption.
So the streaming, the actual backups that you end up receiving are simple tar files. Tar is a very nice standardized format that hasn't changed since, I don't know, since they started building computers or something. It's very easy to stream. Many other archive formats,
like if you use a zip file or something like that, actually have a header in the beginning. Tar doesn't have a header. It's designed to be streamed because it's designed to run on tapes. And you really can't do seeking on tapes. You have to stream your data onto tapes. So no global archive header.
All files in a tar archive are aligned at 512 byte boundaries. Again, this is also a legacy from talking to tape drives. But it turns out all Postgres data files are already aligned on an eight kilobyte boundary. So the only cases where you're losing, sort of wasting, space by using this
is for maybe your configuration file and your pg_control, not even pg_control, but say your PID file. For the majority of your database, it's already aligned at these eight kilobytes, which is obviously also aligned with 512. So you don't actually lose anything. What you're going to receive
is one tar stream per tablespace. So if you have 10 tablespaces, you will receive 10 tar files, one for each tablespace. If you are in the default scenario where you have only one tablespace, well, you will receive only one tar file. And they will all be transmitted sequentially. We don't currently have support
for backing up multiple tablespaces in parallel. We'll just do them one by one. So if we look at this again in the protocol, it looks very much like the other stuff. We start with our ready for query message. Then we send our BASE_BACKUP LABEL 'foo', saying this is the base backup I want to do.
What we end up getting here is, first, a regular libpq result set. So a regular result set with three columns: spcoid, spclocation, and size. That's the tablespace object ID, the tablespace location, the full path of where this data is supposed to go,
and how big this tablespace is. By default, the size is going to be minus one, meaning it's not being used. Actually, no, it's going to be null. And then for each tablespace, we get a row with information about it, and when we've got that, we get a command tag that says we're done with the select. Now, without waiting for any further data,
we're going to send a copy out response. This is something that, well, it'll look like you're sending multiple queries to Postgres with a semicolon between them, and we just tack the result sets onto the end of each other. So you get a copy out response.
Again, it looks exactly like the streaming replication, which is: we send the, well, number of columns zero, and yes, it's going to be binary, and then we just stream the tar data. Then when we're done, we send the copy done message, which is the copy version of the command tag, saying, okay, we're done with this file.
And then we're going to send another copy out response for your next table space. And we just repeat this until we've sent all the data that we need. And once we're done there, we go back into the loop, and you can run further commands if you want to.
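So at the protocol level, the whole thing is driven by one command, something along these lines (the label is a made-up example, and the option names are as I recall them from the 9.1 grammar):

    BASE_BACKUP LABEL 'nightly' PROGRESS FAST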
Can we do live interaction? Oh, absolutely. I mean, if you use, actually, if you use a tool like Wireshark, they know how to disassemble this protocol.
I mean, they won't disassemble the contents of your copy data, but they will know that it's a copy data message. So you can definitely disassemble and see that this is actually what's happening. And it's fairly simple. So that's the simple way of how this is done. Now, again, what we normally use for this
is we use pg_basebackup. If you're a DBA, if you're actually going to use this to take backups, you use the tool pg_basebackup that we ship as of 9.1. It's a simple command line tool, like most of the ones we have. It takes a couple of options. It takes -D, dash capital D, to tell it which directory we're going to write the files in.
It takes a -F, dash capital F, option, which works like, I mean, if you've worked with pg_dump, you'll recognize a number of these. The -F is the format. It can be plain or it can be tar. Now, this doesn't actually change the format of how we're sending the data across the wire. We're always sending it in tar format.
But if you choose format plain, pg_basebackup will unpack the tar file and regenerate the data directory. Whereas if you use format tar, the t, it'll just drop a tar file and then you can do whatever you want with this tar file. -c will control how we do the checkpoints,
whether we do a fast checkpoint or a spread checkpoint. This is the same as the second parameter if you were using the pg_start_backup function. The first parameter is the backup label. The second parameter is a boolean saying whether you want to do a fast checkpoint or a spread checkpoint. That's this parameter.
We can set the label with -l. We can enable compression with -Z, dash capital Z. If we do compression, it happens in pg_basebackup. The data stream that's coming out of Postgres will never actually be compressed. But we can write compressed files. Same as if we're using compression in pg_dump,
where the data flowing from Postgres to pg_dump will be uncompressed and then we compress it and write it to disk. Now on top of this, since we're running this thing over the standard libpq protocol, we accept all the standard libpq client options, like the same ones you'll see in pg_basebackup,
in pg_dump, sorry, like -U username, host name, port number, all of these things. If you're using Kerberos, it'll go talk to Kerberos. If you need a password, well, you have the same kind of parameters you have for the other tools for always ask for password, never ask for password.
And they look the same and they work exactly the same. We have another parameter that solves a problem that a lot of people have been complaining about, which is you can add -P, and then you will get a progress report. It will tell you how far along in your backup you are, from 0% to 100%, ish.
It's fairly expensive, because what it does is, when you start the backup, it enumerates all the files in your data directory and sums up the size of them and sends that along to pg_basebackup. And then it starts all over again by sending the files.
So it'll be enumerating all the files twice. Now if you have a large number of databases or a very large number of objects in your database, this can be quite expensive because we're traversing a lot of files. If you don't have a large number of databases or a huge number of objects, it's gonna be in the cache anyway. In that case, it's not gonna be that expensive
to just figure out how big the files are. It's also going to be inexact, right? Because it gets the size of the files when you start the backup. But the files, your database is still online. It's still making changes while we're taking the backup. So the size of the file might well be different
to the point that somebody else might drop a huge table while you're doing the backup. Then the size estimate's gonna be wrong. Or if someone else is loading a table, then the size estimate's going to be wrong. So actually in the early versions, it didn't always end at 100%. Well, it now always ends at 100% because when it ends, we just adjust the value to 100%.
Because we know we're done. But sometimes you can see it actually running up to 100% and then staying at 100% for a few minutes because the database increased in size as the backup was running. And we don't try to adapt to that in any way during the run. We just add up and when we get to 100%,
we stop adding up. I wonder if we might actually stop it at 99. I'm not entirely sure. But the point being, it's inexact, but it gives you a good hint. It gives you a hint whether you're halfway there or you got time to go get a couple of more coffees.
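Putting those options together, a typical invocation might look something like this (host, user, label, and paths are all made-up examples):

    pg_basebackup -h db1.example.com -U backup -D /backups/base \
        -Ft -z -l "nightly base" -P -c fast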
Now again, base backups combined with transaction logs give us some other things. Whenever you restore from a base backup, you use, or you require, WAL archiving, transaction log archiving. Many of you in here have set up archive_command
and remember to copy the files over there, make sure you check the exit codes and all this stuff. It's kind of complex to set up. It can be complex to monitor. Now, pg_basebackup brings us one more feature that's very easy for the simple use case, which is if you add the -x parameter to your pg_basebackup call,
your base backup will include all transaction log that's required to restore this backup. So you don't actually need a log archive anymore for a physical backup. Now, if you do that, you don't get the ability to do point-in-time recovery. That will still require a log archive,
but you'll be able to restore just this one backup, because with a base backup without any transaction log, Postgres will start up and say, sorry, can't do anything. So what happens here is that the WAL sender will just include all the required WAL files at the end of the tar file. At the very end of the backup, it will just send the transaction log that's required.
And this is basically all the transaction log that was generated between when you started the backup and when the backup's done. For those of you who've worked with the replication protocol, we have a parameter called wal_keep_segments that you will need to use in this case, which tells the server to keep enough WAL around,
because if you don't set this parameter and you have activity on your server, well, you know we rotate the transaction logs at checkpoints, we get rid of them. And if Postgres deletes the transaction log before pg_basebackup has received it, well, pg_basebackup will say, sorry, I spent five hours taking a backup,
but it didn't really work, sorry. And it'll give you an error. It's not going to silently give you a backup that doesn't work. But it needs all those transaction log files. wal_keep_segments lets you keep extra transaction log around to make this work. And that's the stuff that we have so far. That's the stuff we have in Postgres 9.1.
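As a sketch, that standalone variant is just the same call with -x added, plus something like this on the master, where the value is only an illustration and depends on how much WAL you generate while the backup runs:

    # postgresql.conf on the master (example value)
    wal_keep_segments = 128

    # then, from the backup host:
    pg_basebackup -h db1.example.com -U backup -D /backups/standalone -Ft -z -x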
And this is sort of where those of you who've seen this before, now is when I switch to the opportunity side. But we've got 9.2 hitting beta now. So a lot of these things are no longer on the opportunity side. They're now on please help with testing side. First of all, we have a tool now that lets us do streaming log archiving.
So just a show of hands, how many people in here have set up log archiving using archive_command on Postgres? A fair number of you. How many of you have not set archive_timeout? By mistake? Yeah. So the way it works in Postgres previously
is when you set an archive_command, Postgres will run this command once per 16 megabytes of transaction log. It will then copy this transaction log to the archive. This is perfectly fine if you have a high-velocity database. If you have a low-velocity database, it might take a day or two
to generate 16 megabytes of transaction log. More commonly, maybe not a day, but maybe an hour. That means that this is your recovery granularity. You wanna do point in time recovery? You generate 16 megabytes every two days. Well, your granularity is two days. It's not very good. So we have a workaround for that, which is a parameter called archive_timeout,
which says ship these files maybe every 10 minutes, even if we didn't generate 16 megabytes of transaction log. Which would be perfect if it wasn't for the fact that we still ship 16 megabytes, even if we only filled it with 100 bytes. So you couldn't really, I mean, you'd like to set this archive_timeout
to five seconds and get great granularity, but you'd fill your log archive with terabytes of data pretty quickly, because we'd send 16 megabytes every five seconds, which might be too much. Now, the thing is that the Postgres replication protocol, if you have a replication slave, already does this. The replication protocol already sends
the exact same data that would go in your log archive. So for this, we have a new tool in Postgres 9.2 called pg_receivexlog that you run somewhere that's not on your database server. It will connect to your server as a replication client, and it will reconstruct the log archive for you,
just as if you were running archive_command, except it'll receive it byte by byte, not 16 megabytes by 16 megabytes. It's basically a Postgres replication client without Postgres. We sort of took out that big bloated thing
and just ran a simple thing. It takes a replication stream, writes it directly to files. That gets you out of the trouble, or out of having to choose between recovery granularity and filling up your drives with empty segments. Now, if you actually have a replication client already, this doesn't really help you,
because you can use the replication client to get at the last 15.4 megabytes that never hit the log archive. But if you don't have a replication client, then you can use this to get rid of the granularity problem. And based on the same thing, we have a new parameter for pg_basebackup
that basically takes the same process that pg_receivexlog does and runs it in parallel during the base backup. So when you run your pg_basebackup in this mode, where you take the base backup and you want to include all the required transaction log, it'll just fire up a second connection that streams the transaction log
while it's running the backup. So you no longer need to set wal_keep_segments on the master. You no longer need to keep the WAL around on the master, because we're receiving the WAL on the backup node while we're receiving the base backup in parallel.
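A minimal sketch of the streaming archiver just described, again with made-up host, user, and directory:

    pg_receivexlog -h db1.example.com -U backup -D /archive/wal

It just sits there and keeps writing segment files into that directory as the WAL streams in.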
And it's very simple. You just add --xlog=stream instead of just -x, and I believe --xlog=copy is also the same as the old method. And it'll just work. The main difference being, you now need two replication connections to your master,
because we're running these over separate connections. We don't try to mix the streams somehow over the same connection. Another thing that we've been working on, that I find fairly useful for the base backups in 9.2, is relocatable tablespaces.
When you were taking base backups in Postgres between version 8.0 and 9.1, there is a very annoying thing, which is you can move the data directory. It's fully supported to move your data directory. So you restore it into a different location. But if you have any non-standard tablespaces,
they must be restored into the same place. It's not supported to move your tablespaces to a new location. It's not supported. It works, but it's not supported. It kind of works. The thing is that the implementation prior to Postgres 9.1, sorry, prior to Postgres 9.2, so in 9.1 and earlier,
is that there is a system table called pg_tablespace, which has a column called spclocation. It has an OID, an object ID like every Postgres object, and an spclocation that tells you this tablespace lives over here. And as we're restoring, we can't modify this because it's part of a system table.
So we have to restore first and then change it, meaning that we need to restore the data into the right location first. Now, when you create a tablespace, we also put a symlink in the pg_tblspc directory in the data directory that points to wherever this thing is. And the fact is, even if you look into earlier versions
of Postgres prior to 9.2, the spclocation field in pg_tablespace is never used. Not once. It's set when you create the tablespace. It's never used. It's only the symlinks that are used. It is only read by tools that need to find out where the tablespace lives.
But it's not used when we actually access the tables. They only use the symlink. So what we did in 9.2 is we just deleted the spclocation field. It's no longer in pg_tablespace. Instead, we have a function called pg_tablespace_location that'll return this by looking at the symlink.
That means that as of 9.2, it's fully supported to move a tablespace and restore it into a different location than it was before by just changing the symlink, which is quite useful. Now, as of 9.2, it's still going to be manual. You restore the tablespaces,
you manually modify the symlink. Still in the opportunity space for 9.3 is to just add parameters to let pg_basebackup, or a restore tool, do this for you. But it's no longer stored in the database, so we can actually access and modify it before we start the restored database.
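As a rough sketch of the manual procedure, where the OID, the paths, and the data directory location are all made-up examples, relocating a tablespace during a restore comes down to re-pointing the symlink before starting the server:

    # find the tablespace OIDs and their current locations (9.2)
    psql -c "SELECT oid, spcname, pg_tablespace_location(oid) FROM pg_tablespace;"

    # after restoring, point the symlink at the new location
    ln -sfn /new/tablespace/path /var/lib/postgresql/9.2/main/pg_tblspc/16385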
I have one other thing that's sitting on my list on the opportunities part, which used to be a very long list, we've worked on it pretty hard, which is some sort of ability to do incremental backups. I don't think anyone has actually figured out the best way of doing this. One way of doing it, what a lot of people who do incremental backups with pg_start_backup and pg_stop_backup do,
is to use rsync. So we could build something rsync-like into the Postgres protocol. The question is whether it's actually gonna be a win. It depends on what problem you're really trying to solve. I've found myself, when running this on production systems, that actually doing this with rsync
can be pretty darn destructive for performance on the server while you're taking the backup, because rsync generates a lot of random IO. pg_basebackup generates nice sequential IO, which is a lot faster and a lot lower load. So there are things to be done. There are also discussions of what we can do using the LSN.
We have an LSN on every page of every data file telling us which is the latest transaction that touched this page. Maybe we can use that to generate our own way of doing incremental backups. There are some issues around that as well. It's not quite as straightforward as it might sound, but things could be done around that.
It would also be nice to find a way to decrease the size of a log archive without taking more full backups. Right now, I mean, you have to keep every log file from your base backup and forward. So the only way to keep your log archive small is to take base backups very often,
but the base backups are very big. So we need to find a way to consolidate these things together to optimize that. Now I think we've done a pretty good job because again about a third of this talk used to be opportunities about a year ago. So a lot has happened in Postgres 9.1 and some more now in Postgres 9.2. I'm sure there's a lot more that we can build on top of this,
both from the concept of base backup and within replication obviously. But what we have now is a very solid foundation that we can build on. And that's really, I mean that's the way we usually end up doing these things in Postgres, like put in a solid foundation, let somebody else figure out how to use it. So we're looking forward to more tools in this space
and more enhancements of these tools obviously. If you have any good ideas on enhancements that you'd like to do yourself or that you'd like to just flag for people that this would be good to use, we're very interested in hearing about that. And I think I actually have, well, I do still have about one minute for questions.
So any questions? Sure, oh, you go first because you actually put up your hand. How do you use that in combination
with any of the synchronization mechanisms? Like, I assume the idea is you take a base backup along with the WAL that's happening during the base backup, you can take them, and that gets you to a state where you're the same at that moment in time. Yeah, exactly. I mean, what the -x is for is, if you run -x and in plain text mode,
or, not plain text, but plain mode, it will recreate the data directory, it'll put in the pg_xlog, and you can actually just do a pg_ctl start right in that directory, and you're up and running. That's the use case for -x. You don't need to set up the log archive, you don't need a recovery.conf or anything to get back up and running. It's a one-off backup that includes everything.
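As a sketch of that one-off use case (the paths are made up):

    pg_basebackup -h db1.example.com -U backup -D /restore/data -Fp -x
    pg_ctl -D /restore/data start

No log archive and no recovery.conf; it just starts up from what's in that directory.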
Okay, so you wouldn't necessarily use this, then, to create a second instance of the database that you then wanted to configure Slony or whichever replication system you're working with on? No, Slony will get you a logical replication system, so that's not gonna work across. Now, you use pg_basebackup to set up a replication,
a streaming replication slave, but then you don't use the -x mode, because -x is for one-off backups. This should all be, actually, it might not be,
the tablespace moving part might not be in the documentation. Everything else should be fully documented. Well, obviously, other than the thoughts about incremental backups and such that don't actually exist. Yeah? What do you mean?
Oh, you can connect PSQL to, I don't think there is a specific mention that you can do it with PSQL, but it's fully documented. It's a standard libpq connection. Yeah. No, so there, it should, all the information should be there. Some of it is under,
is actually in the protocol documentation and not in the end-user documentation, but it's all fully documented, except I actually think you may have a point in that we never actually updated the documentation about tablespaces. Yep.
You will be able to run pg_basebackup against a standby in 9.2, yes. And Josh is looking at me back there saying, I found bugs in this yesterday. So, but that's the beta version, right? There are a couple of, I think there are two open bugs on the specifics of running the base backup on the slave right now, but they will be fixed
before 9.2 is up. In 9.1, you can't run it on a slave. It can only be run on the master in 9.1. Oh, there was the waiting for the checkpoint happening on the master thingy. Oh, right, yeah. It's in that one. You found it. Somebody else found it before you, but, yep.
I believe there's a project called pg_rman that has been doing that. I have never used the incremental backup there myself. I've heard from someone who used it at an early point in time, and it looked like it worked and everything was corrupt, but I think they fixed that problem.
But yeah, looking at how pg_rman does that is obviously a good step forward in figuring it out. I don't know of anyone other than pg_rman that has done it. Are there any? Yeah.
That would be the main issue, yes. But I mean, yeah, from the core side perspective, we have this protocol now,
we have the core part of it. We're hoping to maybe get some more of pg_rman into core and get pg_rman to use these features instead, because some of the things that pg_rman has been doing have been sort of hacky, because that's the only thing, well, the only way we could do it earlier. It's what everybody's been doing earlier, sort of hacking around the limitations. We're trying to take away the limitations and then we'll get away from the hacks as well. So that's it.
So I think we are pretty much done. I may be running over. I don't have a schedule right now, but this will be the last round of talks. So then after this, we have the closing session, which is right back where we all started with the opening session in the keynote. And I don't think there's an actual break. So I think Dan would appreciate it if once you're done here,
just sort of head straight over there and get ready for the closing session. So thank you all for showing up.