Formal Metadata
Title of Series: PGCon 2015
Number of Parts: 29
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/19132 (DOI)
Production Place: Ottawa, Canada
Transcript: English (auto-generated)
00:20
My first thought in writing this presentation was to sit here and stare at you for 35 minutes.
00:32
Sadly, that would mean that you all got to go to your computers while I went on staring at you. That doesn't really seem fair. The next thought was to rig all the slides on a slow transition.
00:43
But most jokes get stale after the first one. Here we go. So my name is Samantha, and I work with a company called TurnItIn.com. We have a couple of good-sized databases: a 600 gig one, and a 350 gig one that we split off from the main one last summer,
01:07
and a couple of others here and there that aren't quite as material as those two. But the important part is that we have a 24/7 environment, aside from a couple of maintenance windows that we always announce to people well in advance.
01:22
The other important thing is that we have required read slaves in production. Our master would fall over if we didn't. So we have two required read slaves at all times, in addition to a number of other slaves and cascading replication and whatnot. So we undertook a process in September of moving from Slony to streaming replication.
01:51
And I'm going to preface this talk with saying I really want to know if people have done the same thing before, what they've come up against, and for people who are about to do this, you get all of our pain.
02:07
So the hard part of that transition was finding out what was going wrong. In an ideal world, when you read the documentation, everything says streaming replication is great. Slony gave us no end of trouble.
02:26
How many of you have used Slony? Yeah, what's it do to your CPU? What's it do to your CPU when you have seven slaves, even on a relay?
02:47
So our answer, and what has saved us quite a bit in the meantime, was we moved to streaming replication. And in doing so, we had a lot of pain, and I hope to save you some of that pain as well.
03:05
So how do you know when your slave is lagging? Because obviously ignorance is bliss. You could just not monitor it, in which case you never know and everything's great. But then your users call you, and that's bad.
03:21
So, Monitoring 101: I'm sure you use this for everything else. Whatever your monitoring and graphing tool of choice is, use it. There's a variety of things that we monitor. We monitor bytes sent, time lag, and byte lag, just for the replication portion of it.
03:44
And it is incredibly helpful. You'll see a lot of those graphs in this presentation, and I apologize. We use Graphite. It's a little hard to read, so I've kind of annotated them. If you can't read them or you have questions on any of those, please speak up and tell me, and I'll explain them in greater detail.
04:00
This is your first go-to. Everybody should know this, and if you don't, I'm sorry I said that. It is the first thing I do when PagerDuty goes off and says, hey, your slave is lagging. Before I go to anything else, I do SELECT * FROM pg_stat_replication on the master.
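As a sketch of that first check, run on the master (column names are the 9.2-era ones this environment used; newer releases rename the *_location columns to *_lsn):

    SELECT application_name,
           client_addr,
           state,
           sent_location,
           replay_location,
           sync_state
    FROM pg_stat_replication;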
04:25
I did not know that, thank you. But then you have to modify it. See, we've had these discussions sometimes. Then you modify the query to fit your workflow; you don't have to change that. vi, emacs, this is the way I do it.
04:47
Pick your flavor. But it's always the first thing I go to. So I didn't even think about this next topic. Well, I did, but it wasn't important to me when I first wrote this presentation.
05:02
And that's because our transaction rate is high enough that we never see a stop. We never have a master stop sending data for any reason. There's always traffic from the master to the slave, no matter what. This came up when I gave this talk at a meetup previously, and it came up that that was kind of amazing.
05:23
And I'm like, what? So if you have a system where your master doesn't tend to have heavy writes, where it can go, say, a second or more without sending data, you're going to want to set up something to artificially create that movement between master and slave.
05:40
It can be totally as simple as creating a timestamp table. Insert into it every second or every half second, whatever you want to do, and read it when it comes out on the slave. Totally simple, kind of heartbeat-ish, just to create artificial traffic, so that if you get a lag alert you know it's not just a quiet master. It might be a corner case, but I had never thought of it just because it wasn't in my world.
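A minimal heartbeat sketch, with table and column names of my own choosing rather than anything from the talk: write a timestamp on the master, read it on the slave.

    -- On the master, once:
    CREATE TABLE replication_heartbeat (
        id int PRIMARY KEY DEFAULT 1,
        ts timestamptz NOT NULL DEFAULT now()
    );
    INSERT INTO replication_heartbeat DEFAULT VALUES;

    -- On the master, every second or half second (cron, or a tiny loop):
    UPDATE replication_heartbeat SET ts = now() WHERE id = 1;

    -- On the slave, to see how stale the heartbeat is:
    SELECT now() - ts AS heartbeat_lag FROM replication_heartbeat;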
06:03
We could never have that. So if you have slow writes, it's something to think about. Time versus bytes. How many of you do SELECT * FROM pg_stat_replication, look at that, see an xlog position, and go, I know exactly how far behind that is?
06:25
Totally makes sense to me. How many of you have ever done that? Because I know I haven't. So most humans think in time. So what we want to do is change what we get from those monitoring tools into time
06:41
so that we can make some type of logical decision about how far behind our slave is. I did not write this query, because I like other people to do things for me. So I Googled really early on, and you'll find this everywhere. This is one of the very easy things to find: how to monitor the lag on your slave.
07:01
So you take the last received xlog location and the last replay location. If they're equal, then obviously you're not lagging; you're totally fine. Otherwise, you take the replay timestamp and determine how long ago that was.
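The widely circulated query looks roughly like this; it runs on the slave and uses the 9.x function names (Postgres 10 and later rename these to pg_last_wal_*):

    SELECT CASE
             WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
               THEN 0
             ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
           END AS replication_lag_seconds;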
07:21
Everyone's cool with that? This is what normal looks like for us. This is a seven-day graph, time spiking upwards. It looks like we've got a couple of three-second spikes. In a Slony world, that was totally normal. In our world, that's like, hey, that's kind of weird.
07:42
So as you see, it's about ten spikes over the course of seven days, no more than three and a half seconds. Most of the time, we're in milliseconds. I mean, that baseline doesn't even register compared to the three and a halfs or the threes. So the bonus that we're getting from this is amazing.
08:02
Yes? So this is only two. This is two production read slaves. There's more on that graph, but those are the only two that I had visible when I took this screenshot. So those are our primary two read slaves. Obviously, one is doing slightly better.
08:21
There's a purple line and a blue line. The purple is kind of obscured because the blue seems to be a little bit lower. I think I have a theory, and I do this a lot when diagnosing lag. I think the theory on that is because one of those hosts is an FDW host. The other is not.
08:40
So it has a slightly different use pattern. This is also normal. Those spikes are very rhythmic. This is a seven-day snapshot. Bingo. Exactly what that is.
09:01
When you take a pg_dump, you have to pause replication. If you use pg_dump, that is; there's a variety of backup methods. We employ many of them. This is not the only one. But in the case that someone in engineering accidentally wiped out a column,
09:25
and you want to restore from two days ago, because it took them that long to build up the courage to say, hey, I fat-fingered that, the easiest way to do that for us is to restore a single table on a standby and then upload it back to the master.
09:40
So there's a reason; I am not just crazy. But this is one of those things where you have to make sure everybody who looks at these graphs understands it, because we have product owners and project managers who look at these graphs sometimes, and they freak out: what's going wrong? Why do you have that? It's totally normal, but this is what backups look like.
10:07
That's a great question. We'll cover some of that later. But if you want to have any type of use on that server while you're taking the pg_dump, you have to pause replication. Otherwise, it's going to cancel any transactions that are running.
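A sketch of what pausing replay around a dump looks like on the standby, using the 9.x function names (pg_wal_replay_* from Postgres 10 on):

    SELECT pg_xlog_replay_pause();      -- before starting the long pg_dump
    SELECT pg_is_xlog_replay_paused();  -- should now return true

    -- ...run pg_dump against the standby...

    SELECT pg_xlog_replay_resume();     -- let the standby catch back up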
10:24
This is also normal, and I wish I had labeled this better for you. But what you see is the same type of spikes over seven days, where you have, in this case, a 10K spike, and that's in seconds, followed by, if you can see down there, a secondary spike, rhythmically, every single day.
10:47
Even my senior DBA filed a ticket in JIRA and said, what's going on? That scares me. I didn't think about it a whole lot because I had a strong feeling I knew what it was. So what is that?
11:02
No. I'll talk about that in a little while, but that's not what this is. Nope. That's it.
11:22
Kind of. It's pg_dump-ing the other database. Two databases, geographically dispersed, but we dump them individually, first one and then the second one. And it's the little things. I wouldn't normally point this out because it's so minor, but the point of this is communication.
11:41
If people don't understand what's going on, even people on the same team, even people that look at these graphs every day, something like this can look like an anomaly and it can cause panic. And I spent the time to work on this JIRA ticket and prove it and write it up because of this. So communication, documentation, these types of things are important.
12:05
It's still normal. That was the explanation of the dump. So is replication paused? Exactly what you were just talking about. What happens when you turn off replication for a period of time, you pause it, say, for three hours,
12:21
and then you turn replication back on and you turn your monitoring back on? Because the entire time that you had replication paused, obviously you don't want to continue to get paged during that time. So you pause your monitoring. What happens when you turn your monitoring back on and you turn replication back on at the same time?
12:43
Bingo. And we do, and it's not painful enough, because, ironically, only the 600 gig database will fire off one page a day when it's getting caught up, and it catches up really fast.
13:00
So it's not enough of a pain to fix that race condition? Yeah, we check that it's catching up, for verification. So we acknowledge it and it just clears itself up. So that was time measurement. Now, byte measurement.
13:22
Again, I didn't write this, because someone wrote it for me. Very, very similar to the last one that we saw. What's the sent position? What's the replay location? And how far apart are they? Pretty straightforward, pretty clear, exactly like we saw.
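A sketch of the byte version, run on the master (pg_xlog_location_diff() and the *_location columns are the 9.2-era names; newer releases use pg_wal_lsn_diff() and *_lsn):

    SELECT application_name,
           client_addr,
           pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
    FROM pg_stat_replication;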
13:43
I just realized that clock has stopped. Today is not one of those. Now is not one of those. This is an example of completely normal byte lag for seven days. Maximum there is about 4 million. This is not on a cron or a backup or a relay host.
14:03
These are production read slaves. So there's one spike. It doesn't necessarily correlate to any other graphs. Pretty healthy. There's more jitter or there's more difference in the baseline than you see with time. I think it's because it's a better measurement.
14:26
So you have your monitoring in place. You have your PagerDuty on. And what's going wrong? We found that most of our problems were in the initial setup, the move from another type of logical replication into streaming.
14:42
Those came in three flavors: tuning our configuration, hardware, and human error. We think our data is important; I think most people think their data is important.
15:01
Financial institutions definitely think their data is important. We first chose to go with synchronous replication. Who's gone down the road of synchronous replication? Okay, a couple of people. Does it work for you? Is it cool?
15:23
We quickly backed out of that decision. When you're looking at the amount of lag that you get with asynchronous replication, it's not worth the extra single point of failure in this case. Because what you do is when you add a synchronous host and you're waiting on that synchronous
15:40
host, if it goes down, you've gone from the master being a single point of failure to now having two of them. When your slave can take down your master, it's an epically large problem compared to just your slave going away. You have to have more than one synchronous standby.
16:00
We backed out of that. I might come back to this point. You saw this earlier with the graphs of the backups. We not only have our two main read slaves, we have a cascading replication slave and then a full DR cluster that's geographically separate.
16:23
And there's an alternate configuration on the host that we use for cron and reporting. That's pretty key, and does anybody know why? Really?
16:41
The cron host needs to run queries that are much, much longer than you're going to be running on your normal host. Reports tend to take longer, backups, et cetera, and you don't want monitoring going off on what is normal behavior. So therefore you have alternate configuration.
17:01
So this isn't a deep, deep talk on configuration. I'm going to highlight a couple of the parameters that will make your life interesting, so to speak. So, max_standby_streaming_delay and max_standby_archive_delay. Who here actually uses read slaves?
17:23
That's awesome. Of course this would be the place to find that. More often what I hear are people who use their slaves for DR or for HA. They don't have to deal with the problem of transactions or reads on a slave that get interrupted or cause problems.
17:44
One of the first things that we saw when we switched over to using read slaves on streaming, that we didn't see with logical replication, was canceled queries. One of the primary reasons that can happen is your max_standby_archive_delay or your max_standby_streaming_delay.
18:01
The difference? They're very, very similar. When you read them, you might want to read them two or three times, if this is your first time, to get the actual difference. The archive delay applies when WAL data is being read from the archive, whereas the streaming delay applies when the WAL data is received via streaming.
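For reference, these live in postgresql.conf on the standby; the values here are purely illustrative, not our production settings:

    hot_standby = on
    max_standby_streaming_delay = 30s   # how long replay waits on conflicting queries for streamed WAL
    max_standby_archive_delay = 30s     # the same, but for WAL being read from the archive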
18:20
So very similar, kind of cousins, still very different. Replication timeout and wall status, wall receiver status interval. We'll talk about replication timeout first. This is the amount of time that your master will go without hearing from a slave before it terminates the connection.
18:41
So if your slave has gone AWOL and your master can't find it, it will not leave that connection open. The default for this is 60 seconds. The interesting one is wal_receiver_status_interval. That is the maximum amount of time that your slave will go without reporting back to your master. So every now and then your slave says, hi, I'm here, and your master goes, good, you're still there.
19:06
We pass each other on the street, everything's cool. Except for the corner case, when Phil overnight in the UK sets up a brand new cluster, and I go to check it and make sure everything is cool, and I see that the master's there and the slave's there, but they're not talking.
19:24
And I'm like, he told me he did this. What's going on? Turns out, in the odd corner case where, I don't know why, but you have set your wal_receiver_status_interval to a nonstandard value (the standard is 10 seconds). Say we set that to a minute and a half.
19:43
And our default for replication_timeout is 60 seconds. Bingo. Because your master's going, where are you? I didn't hear from you. And your slave's like, oh, sorry, here I am. So what you get is, when you do SELECT * FROM pg_stat_replication, or TABLE pg_stat_replication, your master's like, there's nobody here.
20:08
It causes a little bit of head scratching. Don't ever do that. It's totally possible. Don't ever set your wal_receiver_status_interval higher than your replication_timeout.
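Sketched out, the relationship to keep straight looks like this (parameter names and defaults are the 9.2-era ones; replication_timeout later became wal_sender_timeout):

    # On the master:
    replication_timeout = 60s            # drop a standby that stays silent this long

    # On the standby:
    wal_receiver_status_interval = 10s   # keep this well below the master's timeout
    # Setting it to 90s, as in the story above, makes the master think the slave is gone.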
20:24
I love the giggles because it tells me you guys know this pain. This is one of the reasons I wanted to write this talk. This and the slide you're going to see a lot later. So after we dealt with the configuration issues of getting canceled queries constantly, and we
20:42
tried playing with the timeout settings for archive and streaming delay, and it wasn't helping. And we ratcheted it up. We were like, this is ridiculous. We'll ratchet it to five minutes. And then that was insane, because why would you set a streaming delay of five minutes? Now you've got data that's completely stale, and there's no reason to even read it from your slave. You might as well just fall back to your master, which we do after three seconds anyway.
21:06
And this guy was at fault for that. And it was so frustrating for me. I'm going to vent here. Because there was nothing that I could find that said, hey, by the way, this might be your culprit.
21:20
And by default, it's off. So if you're going to use read slaves with streaming replication, go turn this on immediately. First step: run to your configuration and set that boolean to on. Now, why isn't it on by default already? It came later.
21:43
To be honest, that would be because it changes the behavior, and it came later. It can't be a default. Yeah, we can't default it on. Now, can't is perhaps an overly strong word because it's ridiculous to not have it on.
22:00
I don't remember what the use case was, but in the discussion on this earlier, I think John said there was some use case where this can do very bad things. Oh, absolutely, it's that. So what hot_standby_feedback actually does: it's the one way that I know of, forgive me if I'm wrong, that a slave can affect the master.
22:27
So what it does is it tells the master, I'm still reading this data. There are rows here that I'm looking at. Please don't vacuum them. Please don't take that away. So the master is like, cool, I'll wait, and then you get table bloat.
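On the standby that translates to a single setting, off by default; the comment spells out the trade-off just described:

    hot_standby_feedback = on   # the standby reports its oldest running query to the master,
                                # so vacuum won't remove rows it still needs; the cost is
                                # potential bloat on the master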
22:47
So what happens when you get these two together? When I gave this talk at the meetup, I had a couple of people in the audience give me the monster of all horror stories.
23:04
They had a 30-minute delay in replication because of this combo. For some reason, I don't know why, they had set their max_standby_streaming_delay to 30 minutes. Oh, there's more fun you can have than that. Then you'll never recover.
23:21
Well, you fill up with WAL and then the slave crashes. We had something similar with synchronous replication on ours, yeah, where we hit an issue, and actually that might have been the exact issue, which brings me to the human error portion. When we quickly moved away from synchronous replication, before we had gone to
23:41
production with it, it had originally been checked into our configuration that way. So then we switch over, and all automation updates the configuration. So we're set to use asynchronous replication in the config that CFEngine or Puppet or whatever else rolls out.
24:01
Except that when we did this change, the change was implemented and the DBA who rolled it out didn't go perform a reload on one of the masters. So, to the human error point, we now have a configuration file that says, no, I'm asynchronous, a running configuration that says I'm synchronous, and a slave that has hit a hot standby max_standby_streaming_delay loop.
24:28
That causes an outage, in case you were wondering. It causes an outage that takes quite a while of going, WTF, what just happened, because it's a variety of errors all rolled into one.
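A small sketch of catching that kind of drift between the edited config file and the running server; pg_reload_conf() and pg_settings are standard, and the parameter checked here is just the one from this story:

    SELECT pg_reload_conf();                   -- the step that got skipped
    SELECT name, setting
    FROM pg_settings
    WHERE name = 'synchronous_standby_names';  -- empty string means asynchronous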
24:46
So, hardware. This is the first thing. So most of our cluster in production is Slony at this point. And we've got this one streaming slave, because we're now running a Slony cluster with one streaming slave off to the side.
25:04
This is the first thing we saw. In a world where you're on Slony and you transfer over to streaming replication, all of a sudden the clouds will part and rainbows will come out and all your problems will go away, except they didn't. They didn't.
25:22
This happened. And when you go and try Googling, hey, I've got lag spikes, what do you think you're going to get? There's no good information on how to diagnose these things. There's just not. It's gut intuition, it's experience, it's knowing what might possibly go wrong.
25:44
That's just it. So we looked at this and we scratched our heads, and we knew this was our older hardware. We knew this is the hardware that we took out of being a production read slave because it couldn't hack it, but it wasn't receiving reads. So what was going wrong? It had zero load. All it had to do, the only purpose, it had one job.
26:03
Its one job was to apply the streamed logs. That's all it had to do. Don't do that. We partitioned the xlogs away, got them off the spindle disks, and everything was happy for that moment in time.
26:28
This slide, this one. This and hot_standby_feedback are the reasons that I wanted to give this talk at all. When something gives you grief, write a talk about it.
26:42
Because I guarantee you someone else has felt your pain, and they're hurting, and you will help them. And when it happens again, you're going to look for those problems or you're going to go, oh, what's going on? When I get lag now, I'm like, hey, what's going on? This is cool. Whereas before, I was like, ugh. So this had a very, very slight increase over time.
27:04
So what we're looking at here is those hash marks on the bottom are by hour, and this is by seconds. So we're starting out around a ten second spike in lag there. And I looked at this for about a week, off and on, because other things were going on.
27:21
But I'd go back to it and try and figure out. We'd gotten the spindle disks out of this picture, we'd partitioned our xlogs, everything was supposed to be fine. This was on a brand new piece of hardware. It was actually the hardware we were going to upgrade the entirety of all of our clusters to. So it was faster, harder, stronger, better.
27:42
You got that. It was everything that the other slaves were and more, and yet it was showing a way worse pattern than anything else. Any ideas? No.
28:02
No. No. Those are only hour hash marks. So this is happening all the time. And it's slightly getting worse. Nope, this is hardware that we owned in a data center. Nope.
28:21
Nope. So here's a zoomed out. This is what happens over three weeks. Yeah. That's not good at all. And to give you the trend line on it, there you go.
28:42
Nope. Someone got this in the meetup, and I was floored. I was absolutely floored. No, we're looking at time lag between the master and the slave.
29:04
I wasn't looking at byte lag at the time. We instituted byte lag checking afterwards. What was that? Bingo. Bingo. So what happened was, in not only buying new hardware and setting up all the config we already knew about,
29:23
underneath us, when the systems were built by SysEng, they had upgraded from CentOS 5 to CentOS 6. In that change, the configuration for NTP changed. So the old configuration they were using for CentOS 5 was not being read, and therefore we were getting clock drift.
29:45
So the thing that you're looking at, first off, don't trust it. And second off, there's a lot of factors outside of Postgres. It's not just users running queries. It's not just transactions that are going on. It's not just network latency, which, by the way, I have never run across network latency in streaming replication.
30:06
In the same data center, anyway. I thought about this when writing this talk. I was like, well, you could have the problem if your master has network latency. But on a gigabit link, you're not going to have problems. No, no, no. You don't have problems when everything works.
30:22
True. But I guarantee you that when the network goes down, people know the network is down. No, it's not down, though. Down is well-defined. When it's broken, it's easy. When it's, like, losing 4% of your packets, that's when you have problems with these kinds of things, where you're like,
30:45
why am I getting these weird things going on? I mean, big networks run into that problem, and I have a big network that's dumb, and so it might be degraded for a long time. So we have a network guy named Curtis. Colin works with me, so Colin knows Curtis.
31:03
And there's a reason he's face palming right now. If we had packet loss like that, we'd be getting e-mails in the middle of the night because, man, is he on it. We're kind of fortunate that way. If the system looks at him wrong, he knows about it. And we did have an issue once where we had a switch panic.
31:23
The switch rebooted, and it turns out the one way that this was noticed is because we momentarily lost connection to a master. So one master that happens to connect to everything else via FDWs was down for about two minutes while the switch rebooted,
31:43
which was a memory leak issue with the kernel of that switch, I believe.
32:12
So you're saying that the actual time lag didn't increase? Bingo.
32:21
So what was happening is you're asking the slave, hey, what's the time that you last reported receiving something? And then you're checking that against a server whose clock is different, and you're getting drift between the two, which slowly over time got larger and larger. I wish I could have figured this out sooner, but it
32:43
was a problem like I'd never come across, because it's insane that NTP doesn't work. Why would you monitor it? Why do you have to monitor NTP also? Well, now we do. Not only were we not monitoring NTP at the time, everything about the system had changed, and everything was supposed to be fine.
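The check we bolted on afterwards can be as simple as watching the peer offset that ntpq already reports; the alert threshold here is made up for illustration:

    ntpq -pn    # the "offset" column is the clock offset from each peer, in milliseconds
    # alerting when the offset to the selected peer exceeds, say, 500 ms would have caught this drift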
33:07
So now we have byte monitoring, which goes back to the time versus bytes. Please do both. You might think in time, and that might be the way you want to see it, but check your monitoring and use both, and that works for us.
33:27
And then every so often we get something like this. In the before time, right before November, we didn't have HAProxy.
33:41
And this would have caused a world of hurt. But strides, things are happening. We have load-balanced slaves now, and something like this tends to, well, it'll set off an alarm, but when you look at it, that is between about 12:25 and 12:45, and it seemed to resolve itself decently enough.
34:06
So what causes that? It's not worth it. I mean, it's really not. It resolved itself.
34:21
There's no apparent cause, and the bottom line is it's an anomaly. Somebody could have. With that particular one, I didn't find anything easily in the logs, and it just honestly wasn't worth it.
34:42
You can spend your time trying to track these down, but I don't have three spare DBAs to do that. And they happen irregularly enough that it's not worth the manpower. So sometimes, even if somebody notices, hey, your graph went a little wonky there, that's what we've got load balancing for, and that one just doesn't matter.
35:03
So these are the things that gave us pain, with a couple of additions from the meetup, which was kind of awesome. I would like to know what other people have felt, if anything. Oh, yes, yes, yes. You, down in front. So this guy here, I'll tell you all about this wonderful time.
35:23
When all the database servers just weren't working. We were running a PaaS company on EC2. Well, it turned out the underlying disk stores got remounted as read-only. This wasn't AWS? Yeah, it was.
35:41
Oh, it happened with AWS? Oh, so it wasn't necessarily you. It took us seven, eight, nine hours before somebody decided to check dmesg in real life. It took that long?
36:01
Well, imagine you have hundreds of customers, and they're all trying to call you, and you're basically trying to figure out what's going on and be an answer to all of them at the same time. It was not a pretty day. So that had to be seen in more than just the database servers, if it was an AWS issue.
36:21
The application servers, most of them didn't really care. Yeah, they don't care, because they weren't doing much with the disk. Most of their stuff is cached, so they're usually in memory; they're not using the disk. They weren't checking for read-only status at that point. So that speaks to hardware. Did this happen again on the other systems?
36:41
Oh, a long time ago. Oh, right. File system changes. Yes? Well, there's your problem right there.
37:03
Okay, who's got the modem sound? So they were in Antarctica? Maybe the moon.
37:23
So they actually managed. If you have enough slaves, you can generate anything, right? Oh, yeah. Oh, yeah. No, no. I've killed a switch doing this. Do tell. Yeah, okay. So we have eight servers, 32 shards across these eight servers.
37:43
All of them very beefy, a quarter terabyte of RAM. When you say shards, what type of sharding? It's sharding: there are 32 independent instances of Postgres, each with a subset of the data. We had this grand idea that we were going to have replication work
38:01
by having all of them replicate to each other. So I wrote all this nice stuff. It worked just fine until we kept upping the load. And eventually, we basically got to a point where we saturated the entire switch backplane. You know Slony does a full mesh network?
38:21
Yeah, yeah, yeah. Okay. This was streaming replication; it wasn't Slony. You can get there with that. The good thing is streaming does scale well. I had like 15 of our biggest boxes in RDS at like 35,000 TPS. How many terabytes of WAL were we generating per day with that system?
38:41
Nine terabytes of WAL per day. It was in a rather tight time window. It was like 15 slaves with less than 100 milliseconds of lag, generating nine terabytes of WAL. Most of that would come in about half an hour to 45 minutes.
39:01
So that was the period where it was generating most of that nine terabytes of WAL and trying to ship it to everybody else, all at the same time. That was fun. Just don't do that. You asked for experiences. That's a good experience. It was cool when it worked, because you had green slaves and you could have any
39:23
node go down and automatically bring up the shards from that slave on the other boxes. And it just worked. The good thing about the cluster is it wasn't super time-critical. And also we grew to the point where we didn't have space for replicas anymore.
39:40
I did also fix the WAL problem by using unlogged tables for raw data and other things. We got the WAL down. But basically we had time to restore a backup. We had an extra server, and we had time to restore backups if we had to. So it wasn't a transactional system where people were hitting it constantly. It was a big batch statistical processing system.
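A sketch of the unlogged-table trick mentioned there, with a made-up table; it is a trade-off, not a free win:

    CREATE UNLOGGED TABLE raw_staging (
        id      bigint,
        payload text
    );
    -- Writes to an unlogged table generate no WAL, so they don't flood the standbys,
    -- but the table is truncated after a crash and never appears on replicas at all.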
40:01
Wait, what type of backups? What were you using for your backups? pgBackRest. Of course. What else would you use? Not pg_dump, I hope. Or Barman. We looked at Barman. There's no way Barman would work, because it wasn't parallel.
40:21
Forget it. That's why we wrote pgBackRest. Also compression. Having two uncompressed copies wasn't practical. We needed compression, and compression at rest, and all these sorts of things. And parallelism. So have you met Phil?
40:41
Phil lives in the UK. He handles our backups. He will find you and talk to you at some point. Were you at the pgBackRest talk? No. You should have been. I did that talk just last session. We can just go to the bar. I'm happy to.
41:03
So that's a really good replication issue. Is that it? Yeah. Horror stories.
41:29
Instead it went crossover through the UK. What was your latency on that? The latency went from 125. So the replication delay in one direction was bad enough.
41:49
Yeah, you're looking at my test runs running. I'm not getting any updates. Test runs are running here. So it must be a problem. Two days.
42:03
Eventually it was two days. We badgered corporate IT: are you guys seeing this problem? Oh, yeah, yeah. Oh, that would have been helpful. Why don't you follow along on how to do this? Yeah. I have a question about what people do around upgrades and streaming replicas.
42:26
Oh. Should I come up? If you had been here 15 minutes earlier, we totally covered that. Because we're in the midst of upgrading from 9.2 to 9.4. We're only on 9.2 because, when we went from 8.4 to a version that supported streaming replication,
42:45
Slony couldn't handle the transition from 8.4 up to 9.3, so we had to back down to a version that it supported, 9.2. And now we're left with a three-hour window if we need to swap in a master. So what we're trying to do, because we have to have those necessary read slaves up with the minimal downtime possible,
43:03
is use that method that we just talked about, using rsync. So, yeah, you can totally talk about it if you want. Yeah, so the plan is basically do the... This is actually in the documentation I worked through with Bruce, but what you can do is a hard-link pg_upgrade on the master,
43:20
and then, as long as you are syncing the entire... You are syncing both the old directory and the new directory with rsync --hard-links, and then point that at a common directory, one that's higher up than the current data directory, and it will simply recreate all the hard links and transfer the system tables that have been updated
43:40
that are not actual hard links in the new directory. And then that's it. That's how you can... It's essentially pg_upgrading your slaves, using rsync. Using rsync, yeah. And it works. So I'm going to try that out as soon as we're done drinking tonight. Seems to me like Saturday. Yeah, right. Sunday.
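A rough sketch of that rsync step, with made-up paths and hostname; the full procedure in the pg_upgrade documentation has more steps (stop both servers, run pg_upgrade --link on the master first, and so on), so treat this as the shape of the command, not a recipe:

    # Run on the master, from a directory above both data directories:
    rsync --archive --delete --hard-links --size-only \
          /var/lib/pgsql/9.2/data \
          /var/lib/pgsql/9.4/data \
          standby.example.com:/var/lib/pgsql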
44:03
Sunday, sorry. Maybe on the plane ride back. You want to be very sober when you do those directions. We went through them once already. So basically you're saying you're just copying over just the bits that get changed by pg_upgrade. Just the new catalog. Yeah, which is the bits that...
44:20
Yeah, right. And then pg_upgrade will also go and recreate all the hard-link tree. Is there a reason why we decided, as a community, not to try to support actually making those changes in pg_upgrade itself? It seems like a useful thing. If you ask Bruce to do it, you know what I think he would do?
44:44
Tack on to the end of the script, like the analyze whatever script, like this sequence of commands. But then it would be there. It would be there. And I wouldn't have had to come here to find out about it. Yeah, yeah. I mean, not that I don't love being here. Come on. Right. You'd probably be willing to edit for me.
45:02
I don't know. I mean, it's possible we could do it, but I don't know. It wouldn't be that hard to duplicate, to just essentially do what rsync does and recreate the hard-link tree and all of that. That wouldn't be that hard. That would be useful. I agree. It's a simple matter of coding. I mean, it wasn't... We've been looking at this process for about a month, and then we tried doing the rsync.
45:23
We couldn't finagle it. And it wasn't until I started asking here, whether people knew of a way, that I heard rumors of it. And even then, I had to track down you and Bruce to find this documentation. It's in the docs now, at least. Well, it's in the devel docs. Please do another talk about that when you've worked through it.
45:44
But he developed it. I'm going to talk about it. Sure, why not? I'll tell you how it goes. I would like the real-world experience doc, actually. It's lovely, but I want to see what happens when the rubber meets the road. I was just thinking, I don't have any more talks right now.
46:04
We're targeting this upgrade for mid-July, July 10th to 17th, if we don't push it back any farther. I am happy, totally happy, to... The expertise of not being the person who wrote it is way more important coming from me, as not the person who wrote it, than from you.
46:22
Sorry, love, you mean it. That was beautiful. So I understand why you're in this talk now: because of the pain of somebody who's gone through it. Yeah, I'm happy to share any of the experiences that we go through with that upgrade with any of you. You can come up and get cards or whatnot.
46:40
I think we're almost out of time. Oh, we are. Thank you all for coming.