Postgres on the wire

Formal Metadata

Title: Postgres on the wire
Subtitle: A look at the PostgreSQL wire protocol
Number of Parts: 31
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Ottawa, Canada

Content Metadata

Abstract
While it usually Just Works, sometimes it's useful to know exactly what happens between hitting return in psql and seeing results appear on the screen. This talk will explain how the PostgreSQL frontend/backend protocol works. We will look at the low-level blocks from which the protocol is built and try to give the audience a solid understanding of how bytes are transferred back and forth between their applications and the Postgres server. Even if you never directly deal with the PostgreSQL protocol, every application you're writing or maintaining uses it, and it's still useful to know the basics of how it works. Sometimes, knowing the fundamentals can also help you understand some behaviour that would otherwise seem mysterious or quirky. The talk is aimed at users, DBAs and system administrators interested in learning a bit about how the sausage gets sent. We'll go through:
- protocol versions and how they differ
- differences between the simple and extended protocol
- text and binary query formats
- authentication and encryption
- asynchronous features: async queries and NOTICE
- the COPY protocol
- query cancellation and how it works
- future development of the protocol
Transcript: English (auto-generated)
So I'm going to hit the lights and start. So hi, my name is Jan. I work at this small company called Ducksboard. And I'll be talking about Postgres on the wire. For those of you watching this later, or if you just want to follow along, the slides are up there.
You can download them. And you can get the source for the slides from there, if you want to fix them up and pass them off as your own. So, Postgres on the wire. I'm sorry? Yeah. Sure. This one. Are these slides on the website? All the slides will be uploaded to the schedule. Find the talk in the schedule; there should be a link from the talk. OK. So Postgres on the wire.
I'm not actually going to talk about the wire. This is the outline of the talk. So first, I'm going to talk a bit about how the Postgres protocol is structured. What are the basic concepts? Then I'm going to go into how the messages are sent back and forth. What are the types of messages?
What's the framing? Then we're going to talk about sending queries because one of the more useful things you can do with the Postgres protocol is send queries to the database and get results. And then I'm going to try to go over some less known features or some things that are outside of the typical query response cycle.
So, frame format. First of all, let's start with protocol versions. There are several versions of the Postgres protocol. And you might see on some slides that I call it FEBE, which is short for front-end/back-end. So sometimes people say FEBE when they mean the Postgres protocol. It's a term you might run into.
So version 2.0 got introduced in 1999, if I haven't made a mistake in my git log queries. And actually, the release before that only added protocol versioning. And version 3.0, which is the current version, got added in 2003.
So I'm going to focus on 3.0 because it's been around a while. But in some places, I'm going to try to highlight differences between 2.0 and 3.0 because 2.0 is still used from time to time. And I'm sure some of you poor devils are on enterprise systems that still have to speak 2.0.
And the fun fact is that the server still speaks 1.0. So if you hack your client to use protocol version 1.0, it's still going to work. Talk about backwards compat. And 3.0 added some new features, which makes it worthwhile.
The extended query protocol, which I'm going to go into in detail in the query part of the talk. Copy got better. And in general, it just feels much more structured, organized. It feels more standardized, like someone thought everything through
and just made everything work more or less the same. So first of all, the protocol starts with a client connecting to the postmaster, the main PostgreSQL process. As soon as the postmaster receives a connection, it forks off a process to handle the protocol.
So the actual protocol parsing and everything that has to do with the protocol happens in a separate backend. This means that if anything goes awry with the protocol, you won't crash the postmaster. You will crash the backend, which is bad. But still, it's better than crashing the postmaster. The postmaster does not touch bytes that come from the network.
The other thing this means is that authentication is done after the process is forked. So without any authentication, just with an open port, you can force the postmaster to fork, which is kind of an expensive operation.
But then again, you're not supposed to have your postmaster port open to the wide internet. So that's probably fine. And this also means that the protocol goes over one TCP connection and is intimately connected with the backend that you're talking to. So connection and backend are kind of the same thing.
If one goes away, the other one goes away. If you close the connection, the process that's handling it will die. But it might first have to notice that the connection has been closed, which means that just closing the connection will not necessarily immediately terminate the query that's being run.
This means that, for instance, if you run something and it kind of locked up everything and it's using all the resources, just killing the client will not necessarily stop the query from being executed. So the frame format, this is the very basic thing. Almost all frames look like that.
So they have one character, an ASCII identifier, which identifies which kind of frame this is. Then they have a 32-bit length, and then the payload. And the payload is different depending on what frame we're talking about. Sometimes there's no payload. But basically, it works like that.
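To make that concrete, here's a minimal Python sketch of reading one frame, assuming a connected socket called sock; note that the length field counts itself, but not the type byte:

    import struct

    def read_frame(sock):
        # One regular frame: a 1-byte ASCII type, then a 32-bit
        # big-endian length that includes itself but not the type
        # byte, then the payload.
        msg_type = sock.recv(1)                       # e.g. b'R', b'E', b'Z'
        (length,) = struct.unpack('!I', sock.recv(4))
        payload = b''
        while len(payload) < length - 4:
            payload += sock.recv(length - 4 - len(payload))
        return msg_type, payload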
So parsing is relatively simple, as the sketch above shows. You look at the first byte and say: oh, this is message foo. Then you get the length, and then you know how many bytes you need to read, and you know how to interpret them, because you know the frame type. There's an exception to this: the so-called startup packet starts with the length, followed by the protocol version, a 32-bit integer,
and then the payload. So the startup packet kind of doesn't have a type, because the very first piece of data that the postmaster receives is interpreted to be the startup packet. So you know it's the startup packet,
because this is the first thing you got on a connection. So let's see what the startup packet looks like. You have your length, you have your protocol version. And then in version 3.0, you have a series of name/value pairs. So you have a name, the name of the value, then a null byte,
then the value, then a null byte. Then the name, null, value, null. And it goes on until it reaches the final null. And of course, the length needs to add up to what the length header said. In protocol 2.0, it looked different. The startup packet for 2.0 is fixed size.
There's a fixed number of bytes, so you can only fit a certain number of options in there. In 3.0, there's no real limit. The startup packet can be as big as you want.
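As an illustration, a small Python sketch of building a 3.0 startup packet; the user and database values here are made up:

    import struct

    def startup_packet(**params):
        # int32 length, int32 protocol version 3.0 (3 << 16 == 196608),
        # then name\0value\0 pairs, closed off by a final null byte
        body = struct.pack('!I', 196608)
        for name, value in params.items():
            body += name.encode() + b'\0' + value.encode() + b'\0'
        body += b'\0'
        return struct.pack('!I', len(body) + 4) + body

    pkt = startup_packet(user='jan', database='test')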
Now, why key/values? Some of those are special, are important, like user or database. One of those fields is user: the backend, the process handling the connection, will know which user you are trying to connect as. You need to be connected as a user. You can't have an anonymous connection. The startup packet needs to specify a user.
If not, the connection will get closed. The database is not really... well, you have to have a database, but since it defaults to the user, you just have to send the user name. You don't really have to send the database name. And there are a number of special keys that are interpreted specially, like options: in the options field, you can send command line switches for the backend process. I'm not really going to go into that, just that some keys are a bit special, and the rest are just GUCs. So any GUC that you can set, you can send in the startup packet, and then the connection will use those GUCs once you're actually connected. So there's the startup packet. And then this is a regular data packet. Like I said, it starts with an identifier. It's one byte.
It's always seven bit clean, right? It's an ASCII identifier. And then there's the length and the payload. One consequence is that, for instance, you can't send a query that's larger than one gigabyte in one go. So if you have an insert statement with a bazillion values, you can't really send more than one gigabyte, if not for other reasons,
like sanity, then because it just won't fit in a query packet. And then the payload, how you interpret it depends on the type. So how does the startup sequence work from the initial TCP connection or the initial Unix domain socket connection?
Because as you know, the protocol works over TCP IP or over Unix domain sockets. It's exactly the same. It looks the same. When you connect, you send a startup packet, and then you wait. The back end can either ask for authentication or tell you you're free to enter.
This depends on the pg_hba configuration. So if the backend decides this connection is good because, for instance, you've configured trust for this net block, or you just configured trust for your entire database or whatnot, the server will send a frame of the type authentication OK,
and then you know you're good to go. If not, it will send an authentication request. And there are actually several types of authentication requests. It depends on the pg_hba configuration, so the backend knows: oh, for this user, and this database, and this net block,
I need this kind of authentication. And it's going to ask you: so yeah, authenticate yourself. And there are several kinds. Like I said, I guess the most commonly used ones are plain text or MD5 password. We're going to go into detail about how MD5 passwords get sent, because it's rather interesting.
It's kind of interesting. The thing is that it's up to the server to require plain text or MD5. So this means that if you psql to your server, the server might decide: hey, give me a plain text password. And psql won't really give you any warning. It will know that the password will be sent in plain text.
But you won't really know from the interface. So theoretically, an evil server or a man-in-the-middle attacker could swap the frame. And then you would unwittingly send the plain text password out to the server. It's not a big deal, I guess. It's a fairly contrived, convoluted scenario.
But you might reasonably think that psql could tell you: OK, the password you're going to enter is going to be sent as plain text, not hashed, over the network. And there are more advanced kinds of auth. So the server can ask for GSSAPI authentication or SSPI
or other Windows things that I don't really know about. But yeah. So MD5, just because this is the most used or common way of authenticating just a password, the authentication request for MD5 starts with the identifier
that says this is an auth request, then the length, then the type, meaning this is an MD5 auth request, and then the salt. And the client, what it does is it hashes the password and the username,
then hashes that hash together with the salt, prepends the string 'md5', and sends that off. And this is what the server receives. And this is how it decides whether you're authenticated or not. So why the salt? The salt is there to prevent replay attacks.
Because without the salt, if you would just send the hash of your password, then an eavesdropper on the connection would be able to reuse the same packet to log in as you the next time. In this scenario, they won't be able to later reuse
the same hash to log in as you, because they would need to have the next salt. And why the double hashing? Why does it hash here and then hash again there? This is because the server only has to store the inner hash.
This means that you don't need to keep plain text passwords on the server so that if someone steals your database or gets access to the catalogs that contain the passwords, they will only get the hash and not the plain text password. Although, if you think about it, the hash is as good as a password as a means
to connect to the server. So if someone steals your hashes, they can impersonate you, they can connect as you. But at least they don't have the actual string that you've reused on a million services.
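To make the double hashing concrete, this is roughly what the client computes, sketched in Python with a made-up user, password, and salt:

    import hashlib

    def md5_password_response(user, password, salt):
        # 'md5' + md5(md5(password + user) + salt), hex-encoded
        inner = hashlib.md5(password.encode() + user.encode()).hexdigest()
        return 'md5' + hashlib.md5(inner.encode() + salt).hexdigest()

    # salt is the 4 raw bytes from the MD5 authentication request
    print(md5_password_response('jan', 'secret', b'\x12\x34\x56\x78'))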
Parameter status. That's an interesting packet. It gets sent by the server, and it informs the client about some parameters that the server deems important to transmit to the client. It's a very simple packet, just name/value. And the parameters that get sent are, for instance,
server version, right? So the client knows what the exact version of the server is talking to. Some are important to be sent immediately on connection like client encoding, because client encoding is critical to be able to escape text strings correctly, because you can kind of line up multibyte characters
to eat the quote that quotes a text value. So if you're confused about client encoding, you might not do escaping correctly. And this is why the libpq or other client drivers
require a reference to the connection to do escaping: they need the value that got sent on startup. And some are important for the client, like date style. So the client needs to know if the dates that are coming in textual form are month/day/year or day/month/year or whatnot.
The important thing and the curious thing is that every time any of those parameters gets changed, the parameter status message gets pushed down the connection to the client. So this means if you're connected, we can try to see this happening later on when we'll see the actual protocol running. When you change any of those, on the next occasion
when the server has something to say to the client, it will also include a parameter status message, because things like date style can be changed cluster-wide, and then if you reload your server, it will change for the current session, right? So you need to have the client know that.
So this is how the initial exchange works. First the startup packet, then either an auth request or an authentication OK message from the server; then the client optionally can send a password message if it didn't get an auth OK. Then a bunch of parameter status,
then something called backend key data. I'm going to talk about it in a minute. And then a message that's called ready for query, which means I've entered query processing mode. Now we're done with the initial courting. Let's get down to business.
So this is like the basic things. There are a few more things that can happen in initial connection setup, like SSL. If you think about it, it's rather weird that Postgres has a... Well, if you configure it with SSL support to listen on SSL, it will also accept plain text connection on the same port.
So there has to be a moment where you choose: we're speaking plain text, or no, we're switching to SSL-encrypted communication. And this is done with a dummy startup packet. So you send a startup packet with a special protocol code, because if you compare, the startup packet
has a protocol version as the second field. The SSL negotiation packet looks like a startup packet, but the protocol field is repurposed to be the negotiation code. Actually, if your psql client
is negotiating SSL, it sends protocol version 1234.5679, something that you'd never see as a real protocol version. It's repurposed to mean: I want to do SSL.
And then the server writes back a single byte that says: yes, let's do SSL, I'm down for it, or: no, I'm not doing SSL. Or you can get an error message if the server is so old that it didn't even include SSL at compile time.
It doesn't mean it hasn't been linked with SSL. It means that it never even knew about SSL, because it predates the addition of the possibility of doing that. And then you'll just get an error saying, that's not a protocol version I can speak, because it will interpret the startup packet as a regular startup packet.
And then the client can decide. If the server says, no, I won't do SSL, we can say, well, I don't want to speak to you if you're not doing SSL, or you can continue. And then there are options that the client can provide for that. So this is a place, for instance, where the client can choose whether to continue talking with a server that,
for one reason or another, refuses to do SSL, or to just give up. So SSL works like this. You send an SSL negotiation packet, then you get the single byte, which is: supported, not supported, or an error message. If you get supported, you can go on to the SSL handshake.
And here OpenSSL kicks in, you do all the setup, and the connection gets SSL. And then you transmit your regular startup packet, this time encrypted. So from here on, everything is encrypted. It's kind of a simple way to have SSL and non-SSL on the same port.
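Sketched in Python, that negotiation looks something like this (the single answer byte is S for yes, N for no):

    import ssl, struct

    SSL_REQUEST_CODE = (1234 << 16) | 5679  # the fake protocol version

    def negotiate_ssl(sock, host):
        # An SSLRequest is a startup-packet lookalike: int32 length 8,
        # then the negotiation code where the version would go.
        sock.sendall(struct.pack('!II', 8, SSL_REQUEST_CODE))
        if sock.recv(1) != b'S':
            return None  # caller decides: give up, or carry on in plain text
        ctx = ssl.create_default_context()
        # the caller then sends the regular startup packet, encrypted
        return ctx.wrap_socket(sock, server_hostname=host)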
Now, there's another kind of startup packet that's called the cancel request. What happens is that there is yet another fake protocol version that means: I want to cancel my running query.
This means that in order to cancel queries, you actually have to open another connection to the postmaster. So if a client is connected and it has sent a query that's being processed, in order to cancel it, you can't just send another message on the same TCP stream. You have to open another TCP/IP connection
and send a separate cancel request. And then, hopefully, the query will terminate and you will get back an error code without having to wait for it to complete. Although, you don't really have any guarantees. You open the connection, you write the bytes, you go away, and then you wait.
What happens? Of course, you can always just close the connection, and that's also an option. So this works like that. You send a cancel request, which has this dummy protocol version that's actually a cancel code. It says what's the PID of the process running
the query you want to cancel, which supposedly is the PID of your connection process, and the secret key which prevents other people from canceling your queries. And the secret key is actually transferred in this back-end key data message. So this is how the client gets a secret key
for their connection to cancel their own queries. Now, this theoretically, and it's been discussed in the mailing list, is open to replay attacks. So if someone can eavesdrop your cancellation packet, they will be able to cancel all further queries
you do on this connection until the next connection where the key gets generated again. But in theory, replay attacks are possible. In theory, you could send a cancel request on SSL connections. So you could do this dance, negotiate SSL,
and then instead of the startup packet, send the cancellation, and then your secret code gets transmitted over SSL. But for instance, libpq doesn't do that. If you use the official C library for Postgres, it will just send it in plain text, regardless of whether the connection is SSL or not. So most applications will send their cancel key in plain, which again is probably not a big deal, but it's something to be aware of.
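A sketch of what such a cancel request looks like on the wire, assuming you've kept the PID and secret key from the backend key data message:

    import socket, struct

    CANCEL_REQUEST_CODE = (1234 << 16) | 5678  # another fake protocol version

    def cancel_query(host, port, pid, secret_key):
        # A cancel request goes over a brand new connection: int32
        # length 16, the cancel code, the backend PID, and the secret
        # key from BackendKeyData. No reply ever comes back.
        with socket.create_connection((host, port)) as s:
            s.sendall(struct.pack('!IIII', 16, CANCEL_REQUEST_CODE,
                                  pid, secret_key))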
Errors. Errors in general in Postgres are signaled with an error response message. That's a list of key/value pairs again. In version 2.0, it was just a string and then you had to parse it or do some kind of processing. In 3.0, it has structure. So there are fields for detail, hint,
the actual message, the severity. You get all those and the client can display them and process them in a more structured way. You actually also get the source file and the line in the source file where the error originated, which means that you can fingerprint Postgres versions
without even being authenticated to the server because if you just send garbage, it will tell you, oh, that's garbage. It will give you an error message saying, that's garbage, but you will know what's the source file and the source line where the message got generated so you know the exact version of Postgres. Between releases, the code gets shifted ever so slightly
so you can usually know which version you're talking to just by looking at the error message. There's a Metasploit module that does just this, allowing you to fingerprint a server without having authentication on it. These messages get used for any kind of error: authentication errors, query errors, anything.
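For reference, a small Python sketch of pulling the fields out of an error response payload; the field codes come from the protocol documentation (S severity, C SQLSTATE code, M message, D detail, H hint, F source file, L source line):

    def parse_error_response(payload):
        # (1-byte field code, null-terminated string) pairs,
        # terminated by a lone zero byte
        fields = {}
        i = 0
        while payload[i:i + 1] != b'\0':
            code = chr(payload[i])
            end = payload.index(b'\0', i + 1)
            fields[code] = payload[i + 1:end].decode()
            i = end + 1
        return fields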
So, tools. What do you use? How do you look at the protocol? tcpdump or tshark work. You can just dump the traffic on the interface and look at the bytes.
Wireshark has support for dissecting the protocol, so you can just fire up Wireshark, click on a message, and it will tell you what kind of message it is and show you all the fields. But only for 3.0; for 2.0, at least my copy of Wireshark didn't work.
There's a tool that I found when I was preparing the talk that's called pgShark. It's a Perl tool that does protocol parsing and some other nice things. Just to let you know, since I think the tool is not widely known: it's useful, it's nice. It has several binaries that do things with protocol captures.
You can either capture live protocol messages and display the decoded frames, or it can write out a summary from a capture file: you've run this many SELECT statements, or you've had this many connections.
So if you don't really have access, or you can't change the logging, you can just dump the protocol and then use pgShark to analyze it. And it does support protocol version 2, which none of the other tools do.
We can try to take a look at a dump. I want to start dumping traffic from my local instance. I am going to run pgs_debug, which just prints out the protocol. And then I'm going to connect to my local instance.
Notice that I'm using TCP explicitly, because otherwise it would go over Unix domain sockets and I wouldn't see them in the packet dump. And I'm disabling SSL because otherwise I won't be able to see anything. So I will connect and stuff appears here.
So as you can see, there's a bunch of parameter status messages like server encoding, UTF-8, server version, 9.1.13, and so on and so on. And then it ends with backend key data, with the PID of my connection and the secret key.
And then it says: ready for query, type idle. We're not doing great on time, so I'm not going to play with it more. We might just revisit this later. Querying. There are two protocols for actually querying the database,
but before we go into them, let's talk about binary versus text data. The data postgres sends can be either text or binary. Every type has a text and a binary representation. So an integer can be represented as text,
like the decimal notation of the integer, or as a network order integer. Same for floats, for everything. Every type has both routines. It can be represented both as something that's text and something that's binary.
This, for instance, becomes a problem for timestamps because timestamps can be either 64-bit ints or floating point numbers that are also 64-bit long, but the interpretation is different. This is why, in parameter status, the server tells you what option it has been compiled with.
Because if you have a working fast 64-bit type, you want datetimes to be ints, but if not, you'd prefer them to be floats. In each query, the client can choose if they want the server to give it text or binary data.
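For instance, the integer 42 both ways, as a tiny Python illustration; format code 0 is text, 1 is binary:

    import struct

    text_form = b'42'                    # format 0: the decimal string
    binary_form = struct.pack('!i', 42)  # format 1: network-order int32
    assert binary_form == b'\x00\x00\x00\x2a'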
The actual text format is rather simple, because it's what you see in the console. It's kind of easy. I haven't found a source that documents the binary format except for the actual C code that generates it. So I guess that's one way of documenting things.
You have your C routines that you can read, but I haven't found any real reference for driver authors, for instance, to know what's the binary representation of some of the more obscure types. So let's start with a simple query protocol. It is simple. It works like that.
The client sends a command. The server replies with a message that says, this is the structure that I'm going to send you. Then it sends a bunch of data or messages of that structure. Then it sends command complete, which means your command has been executed, and then ready for query, which means I'm back at the start of the processing loop.
You can send more queries; I'm now waiting. So a query is just a query. It's a string, so you just fire it off. Row description is a bit more complex because it's variable length, depending on the number of columns
that you will be getting. For each column, you get the name of the column, for instance. In theory, you could know it, right, because you wrote the query. You know the column names. But some of the column names are generated, and for one reason or another, the server will tell you the names of the columns that will be sent. It will also tell you a bunch of information
about the table that the data is coming from and what's the actual column of the table. And then it will tell you the type of the thing it's going to send. So it's going to tell you the OID of the type, which means you need to either know the OIDs or query them beforehand if you want to match
the OIDs to actual payloads. So we need to know what the OID of text is: if you see 25 here, it means it's going to be text. Then the length, and then the modifier, so char(16) will have a modifier that encodes the 16. And the last field here, a 16-bit integer,
is the format, which is 0 for text, 1 for binary. This means that in theory, every column can be either text or binary. In theory, you could mix and match. But in the simple protocol, it's always text, except if you use declare cursor binary.
Then it's binary. But other than that, it's always text in the simple protocol. And then data row, which is the actual data that's coming down, is just how many fields, and then field length, field data, field length, field data, just like that. Command complete is just a string tag, which is what you see in psql when you execute something,
that says INSERT 0 1 or SELECT 8000 or something like that. And ready for query is the identifier and the transaction status, which can be either idle, in transaction, or error.
And why is this useful? It is useful, for instance, for PSQL, because if you're still in a transaction, PSQL might change the prompt. And it knows, because the server tells it, you're still in a transaction. This connection is still in an open transaction. But it's actually very useful for tools like PgBouncer that look at the protocol, that proxy the protocol
between the client and the server. And they need to detect whether the transaction is already closed or not. Because if it's closed, depending on settings, it might want to reuse that connection for another client. So it makes it easy for middleware to know if the particular connection has finished its transaction processing and can be repurposed
for a different transaction. This only happens in 3.0. In 2.0, the ready for query message does not carry the transaction status. And libpq does some string comparisons to know what's the transaction status.
And there's this nice comment in the C code, which says: by watching for messages, we can do a passable job of tracking the status, but this does not work at all on 7.3 servers with autocommit off. Man, was that feature ever a mistake. Caveat user. So for 2.0, you can kind of fake it,
but it's a feature that got added in 3.0. Now, one of the quirks or features of the protocol is that you can send several commands in one query string. You can send SELECT 1; SELECT 2; SELECT 3; and send them off as one query
message. And the server will reply with one command complete per query. So if you got one command complete and then an error message, you need to figure out, oh, so my second query failed. If you send an empty query, you get a special kind of response. I'm not really sure why, but there's
this special concept of an empty query response message. So if you just send an empty string, you'll get a special kind of packet. And it's always text, like I said. So the simple protocol is simple. You send a query. You get the description of the rows. You get all the rows, command complete, ready for query.
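Sketched in Python, the whole exchange looks like this, reusing read_frame from the earlier frame sketch:

    import struct

    def simple_query(sock, sql):
        # Query ('Q'): type byte, int32 length, query text, null byte
        payload = sql.encode() + b'\0'
        sock.sendall(b'Q' + struct.pack('!I', len(payload) + 4) + payload)
        while True:
            msg_type, body = read_frame(sock)
            if msg_type == b'T':                    # row description
                print('row description')
            elif msg_type == b'D':                  # data row
                print('data row')
            elif msg_type == b'C':                  # command complete
                print('done:', body.rstrip(b'\0').decode())
            elif msg_type == b'Z':                  # ready for query
                print('transaction status:', body)  # b'I', b'T' or b'E'
                break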
This is the simple query protocol. On to the extended protocol. The extended protocol, apart from being much more complicated, does change some fundamental things. First of all, the execution is split into stages. It's not like you send one query,
and then you get the results. There are stages in which you build up the whole thing. Every step gets a confirmation message from the server. So we do step one, step two, step three. But you can send it all in one without splitting it into separate TCP packets or without waiting.
You can just send it all out, and then the server will confirm it all back. So you don't have to do round trips, which would be disastrous for performance. You can just send all the steps; we'll see in an example how this really works. It allows separating the parameters from the actual query, so you don't have to send a single string.
You can send parameters. Why is that? Because typically, queries like that in the simple protocol end up looking like that, which is not something you'd like to see on your server. It also disallows sending several queries. So you can't send SELECT 1; SELECT 2.
Why? Because the feature of sending multiple queries per message is usually used to transform queries like that into queries like that. So we don't want that. And in the extended protocol, they would look like that. And this would be safe, because you wouldn't have to do escaping or interpolation.
You'd just have a placeholder. So, placeholders. The extended protocol starts with a parse message. You first ask the server to parse your query text and prepare its execution for you. So we send the parse message. It has a name, and it has the query string, which can contain parameters.
And then, for each parameter, you can tell the server what type it will be. Why is this important? Why would you need to tell the server what type the parameter will be, since, as you'll see in the next message, you will specify it again?
Because if you have a function like foo, and one version takes an int and the other version takes text, and you send a query that's foo($1), the server will need to know which foo you mean. So this is where you can disambiguate: even at the parsing stage, you tell the server, okay, this first parameter will be an int.
And you can give the statement a name, so you can reuse it later. If you don't give it a name, which means sending an empty string for the statement name, you get the so-called unnamed statement, which is the default for the rest of the messages. If you never use a name, it's always going to use the last unnamed statement,
and it's going to be automatically cleaned up when you create another unnamed statement. If you create a named statement with parse, the server will keep a reference to it, so, as long as the connection lives, it's going to be there, and you will be able to reuse it in future messages. The next step, and this is the complicated step, is bind,
which is where you take what parse gave you back, and you, well actually you take what you sent to parse, the statement name, you refer to that statement by name, and you bind it to an output portal. It's called a portal. It's actually the identifier of the channel
you will be fetching data from or executing on. So it's a bit complex, right? You refer to the parse statement by name or use an empty string if it's the unnamed statement. You give the output portal a name or use an unnamed portal, and then for each parameter, you have whether it's text or binary.
Then for each parameter, you have the actual data of the parameter, and then for each result column, you tell the server: I want this column in this format. And that's it. That's bind. Once bind completes, you have the output portal ready,
and you can start executing on it, but not so fast. There's a message called describe that you can use to describe the format of what will be executed. So if you do parse and bind, you're not really sure which function got chosen,
what's the type of the columns. So clients can also send a special message that asks the server: okay, for this parsed statement that I sent you earlier, what's the output? What is it going to give me? And then it tells you. For a parsed statement, it's going to give you the parameters it expects
and what the data coming back will look like. And this is useful, and drivers actually do that, in order to then interpret the data that's going to be coming down the wire. So you've parsed the statement, you've bound it to a name that will be your output portal.
It's time to execute it. And execute is fairly simple. You just tell it execute this output portal and I want at most this many rows. Now the part where it says row limit is the complex part because if you say I want 10 rows
and the portal or the query that you're executing actually returns 20 rows, you will get back a message called, it's on the next slide, you'll get back a special message that says I'm not done yet. Otherwise you just get a series of data row messages
and then either that message that says I'm not done yet or if you're done, then the message that says I'm done. As you can see, no row description is sent which means that when you do execute you need to know what data is coming down because you're just going to get bytes. You need to be able to interpret them and that's why drivers usually do a describe before.
So yeah, if the portal has been run to completion, which means you asked for all the rows, for instance, you get a command complete message. But you don't get ready for query, because you still haven't closed
your extended protocol session but you get command complete. If not, you get a portal suspended message. The thing is, libpq doesn't even have code to handle that because libpq and functions you can find in libpq always run the portal to completion. I've tried other drivers to see if they actually
were aware of portals that can get suspended, and only JDBC did, of the drivers I tried. I tried some Python drivers. psycopg only uses the simple query protocol. py-postgresql, which is a Python 3 driver,
asks for the maximum number of rows: instead of asking for all the rows, which would be sending 0 as the row limit, it sends the maximum number that can fit in 32 bits, which doesn't make sense, but they do this. So, driver authors sometimes don't really get things right,
and it's sometimes useful to look at the actual messages that they generate in case you're wondering what's going on. In the execute message, you say the row limit. There's a 32-bit int where you can say,
I want this many rows. If you say 0, it's going to run to completion. It's going to send you all the rows. But some drivers, as I said, just send 2 to the power of 32 minus 1. And they don't handle suspension. So, if you have a table that's humongous
and it will generate too much output, it will just crash with a protocol out-of-sync error. And then the last message is sync. So, what happens is that you can send all of that in bulk, and only after you send sync will the server start sending you stuff back.
This means you can do parse, bind, execute, execute, bind, execute, execute, parse, and then sync and then the server will send all the confirmations and all the data back. This is actually useful to sync up on errors because if you're an extended query protocol client
you send a bunch of stuff and then you send sync, and then your third statement in the series fails, say with a foreign key violation, the server will know that it should give you an error message for that statement and then ignore everything until the sync,
because you weren't aware that you bombed out there, that there was an error there, so it won't really process all the other messages, because it has errored out. But it will sync up with the client on sync. And it's also a means of reducing network traffic, because you can send out a bunch of stuff and only after you send sync will you get stuff on the network back.
So, the extended query protocol works like that. You get your queries parsed at the parse stage, you get them planned at the bind stage, and then they get executed at execute. And if you have statement logging on, you might see those cryptic lines that say
this many milliseconds, parse, this many milliseconds, bind. This actually corresponds to parse time, plan time, and actual execution time. So, to sum up, the extended protocol works like that. You're a client, you send, parse, bind, describe, execute, sync
and you can do this all in one TCP packet if you choose. So this is one bunch of data that you send. And then you get back: parse OK, bind OK, the description for describe, a bunch of data rows from execute, command complete, and ready for query because you sent sync.
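Here's a rough Python sketch of the raw bytes for that sequence, with a made-up table name and the parameter type left for the server to infer:

    import struct

    def frame(msg_type, payload):
        return msg_type + struct.pack('!I', len(payload) + 4) + payload

    def cstr(s):
        return s.encode() + b'\0'

    # parse: unnamed statement, one parameter of unspecified type (OID 0)
    parse = frame(b'P', cstr('') + cstr('SELECT * FROM post WHERE id = $1')
                        + struct.pack('!hi', 1, 0))
    # bind: unnamed portal from the unnamed statement, one text-format
    # parameter, all result columns in the default text format
    param = b'42'
    bind = frame(b'B', cstr('') + cstr('')
                       + struct.pack('!h', 0)               # no param format codes: text
                       + struct.pack('!hi', 1, len(param)) + param
                       + struct.pack('!h', 0))              # no result format codes: text
    describe = frame(b'D', b'P' + cstr(''))                 # describe the unnamed portal
    execute = frame(b'E', cstr('') + struct.pack('!i', 0))  # row limit 0: no limit
    sync = frame(b'S', b'')

    # one write; the server answers everything only after the sync
    # sock.sendall(parse + bind + describe + execute + sync)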
Now, if you would like to reuse the plan and not incur the parsing overhead for the same query, like, you know, you have your simple SELECT * FROM post WHERE id = $1,
and you want to retain that parsed statement and not re-parse it every time, you can do that. You do a parse, sync immediately, and it will tell you: okay, it's parsed, your statement is there, and I'm ready for more. And then you can do a bunch of bind, execute, bind, execute, however many you want, sync up, and only then get the response.
Again, I think that only JDBC gets you the level of detail, the level of control, needed to have an execution like that. With most other drivers, even libpq, if you use the libpq functions,
you can't really have this conversation with the server, because it will inject syncs between the bind and the execute; it will sync up on every execution. You can't really do this with the C library, which is kind of annoying, but you can't.
I'm not doing great on time, so, hmm. Let's just go through copy really fast, then I'm gonna skip the less known features, and then the last slide is really fast. So copy is a special state of the protocol. Some commands, like copy from standard input or copy to standard output, put the connection in a special mode that's called copy mode.
In copy mode, the expected messages are a bit different. So what happens is that you send copy from standard in, and the server says: okay, I'm ready to start copying, and it puts the connection in copy mode. So you get a copy-in response that says: this is the number of fields,
and these are the formats I expect for them. And then you start sending copy data. Copy data is just a blob, it's just data. There's no null byte, there's nothing, it's just data because it's designed for fast bulk transfer. So the client sends all its copy data that it wants
and then it sends a special message that says copy done, which means go out of the copy processing mode. We're back to normal, regular querying mode. And then the backend says, okay, or it says no, there was an error doing the copy.
If it's the frontend that's receiving the data, you can't really stop it. The server will be pushing data at you until you disconnect or cancel the query. You don't have a copy fail or copy stop, please-don't-send-me-more-data message. It will just keep pushing data down the wire.
So what happens is that you send a normal query, it can be an extended query or a simple query, it doesn't matter, and the response is, okay, I'm now in copy processing mode. Then you send all your data, then you say, okay, I'm done, and the server says, your copy command has completed and I'm ready for more. So this is kind of a protocol inside the protocol.
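And a sketch of the copy-in side in Python: after sending, say, COPY post FROM STDIN (a made-up table) and getting the copy-in response, you stream copy data frames and finish with copy done:

    import struct

    def copy_in(sock, lines):
        # CopyData ('d') payloads are just raw bytes; here, lines of
        # text-format, tab-separated COPY input. CopyDone ('c') is empty.
        for line in lines:
            data = line.encode()
            sock.sendall(b'd' + struct.pack('!I', len(data) + 4) + data)
        sock.sendall(b'c' + struct.pack('!I', 4))

    # copy_in(sock, ['1\tfirst\n', '2\tsecond\n'])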
Let's just skip over those because they're really weird things. And since we're running out of time, let's go directly to this. The 3.0 protocol, as you've seen, has been around since 7.4, so for over 10 years.
And there have been surprisingly few gripes about how bad it is, because it's pretty good, I guess. Even though there hasn't really been a lot of push to extend it, to change it, to introduce a new version,
I've tried to gather some of the things that have been floated on mailing lists to kind of give you a glimpse of where it might go if the project decides to introduce a new protocol version that's not compatible. So compression. People were asking to be able to do compression in the protocol itself.
If you're doing bulk transfer, it could be very beneficial to be able to compress traffic and have the protocol understand you're sending gzipped bytes, for instance, or whatever. Some people argue, oh, you can just use OpenSSL with a null cipher, and it's going to compress. And the discussion just ended there.
That's not really interesting. The interesting thing is multistage authentication. If you recall the auth process, there is no provision for choosing between different authentication mechanisms. The server tells you, give me an MD5 password.
And if you'd rather use GSSAPI, you can't. It's the server that drives it. You can't say: oh, I'm sorry, I can't do that, and have the server go: OK, so I have this other method for you. You don't get any negotiation. You just get a challenge, and if you don't respond, your connection's gone. So this is something that people wanted to have,
have fallback to different auth methods. A way to actually negotiate the protocol version, because right now it's not really a negotiation. You connect to the server. You send the protocol version. And the server can say, I'm not talking with you. Or it can say, yeah, go ahead. I can understand everything that you will say. There's no provision for negotiating extensions.
It's very simple. Some people expressed the need to have in-band query cancellation, which means you don't need to open another connection to the server and send the cancel request; you can just send it in the stream. Which is problematic, because if your backend is processing data, it will not
listen for your messages. They will be left there in the kernel buffer of the receiving machine, and they will linger there until the query is done. So you actually usually need to do it out of band. But still, for some things, I guess it would be nice to be able to cancel in-band. And people would like to extend the query message to
include GUCs, which would mean: I want to execute this query with statement_timeout set to one minute. Because right now you have to do set, or begin, set local, execute, commit. This is the way you get a one-off execution of a query that
you want to have a timeout on. And people would say: I'd just like to attach my GUCs directly to the query message so I don't have to mess with transactions and send three different queries. And this is the end.
If you have questions, we're a bit out of time. So you can either do this quick, or you can just come down here and I'll try to answer. OK, thank you.