...Lag

Speech transcript
My first thought in writing this presentation was [inaudible]. I work for a company called Turnitin.com, and we have a couple of good-sized databases: a 600 GB one, a 350 GB one that we split off from the main one last summer, and a couple of others here and there that are not quite as material as those two. The important part is that we have a 24x7 environment, with only a couple of maintenance windows that we always tell people about well in advance. The other important thing is that we have required read slaves in production: our master would fall over if we didn't have them. So we have to have the required read slaves up at all times, in addition to a number of other slaves, cascading replication and whatnot.

We undertook a process in September of moving from Slony to streaming replication. The purpose of this talk is that I really want to know whether people have done the same thing before and what they've come up against, and for people who are about to do this, you get all of our pain. The hard part of that transition was finding out what was going on. In an ideal world, when you read the documentation, everything says streaming replication is great. Slony gave us no end of trouble; many of you know what it does to your CPU [inaudible] when you have seven slaves. So our answer, and what has saved us quite a bit in the meantime, was to move to streaming replication. In doing so we had a lot of pain, and I hope to save you some of that pain as well.

So how do you know when your slave is lagging? Obviously ignorance is bliss: you can choose not to monitor it, in which case you never know, everything's great, and then users call you, and that's bad. So: Monitoring
101, as for everything else. It should be obvious: whatever your monitoring and graphing tool of choice is, use it. There's a variety of things that we monitor: bytes sent, time lag and byte lag, just for the replication portion of it, and it is incredibly helpful. You'll see a lot of those graphs in this presentation, and I apologize: we use Graphite, so they're a little hard to read and can't be annotated. If you can't read one, or you have questions on any of those, please speak up and tell me and I'll explain in greater detail. This is
your first go-to, and everybody should know this (and if you don't, I'm sorry I said that). It is the first thing I do when PagerDuty goes off and says, 'Hey, your slave is lagging.' Before I do anything else, I look at pg_stat_replication on the master. [Exchange with the audience, inaudible.] This is the way I do it, and it's always the first thing I go to. I didn't think
about this topic, or that it was important, when I first wrote this presentation, and that's because our transaction rate is high enough that we never see it stop: we never have a master stop sending data for any reason. There's always traffic from master to slave, no matter what. This came up when I gave this talk previously, and it came out that that was kind of amazing. So if you have a system where your master doesn't tend to have heavy writes, or has few enough writes that it can go, say, a second or more without sending data, you're going to want to set up something to artificially create that movement
between master and slave. It can be as simple as creating a timestamp table, inserting into it every second, or every few seconds, whatever you want, and reading when it comes out on the slave: a totally simple kind of heartbeat, just to create artificial traffic, so that you know that if you get a lag alert, it's not something totally normal. In my view it's a corner case, but I had never thought of it just because it wasn't in my world; we could never have that. So if you have slow writes, it's something to think about.
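The heartbeat idea above can be sketched as follows. This is a minimal illustration, not the speaker's setup: the table name `replication_heartbeat`, the column `beat`, and the thresholds are all assumptions, and the SQL strings are only shown for orientation. The Python part just computes how old the newest heartbeat row seen on the slave is.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names: the talk only describes "a timestamp table".
CREATE_SQL = "CREATE TABLE replication_heartbeat (beat timestamptz NOT NULL)"
# Run on the master on a schedule, e.g. once per second.
WRITE_SQL = "INSERT INTO replication_heartbeat (beat) VALUES (now())"
# Run on the slave to see the newest heartbeat that has been replayed.
READ_SQL = "SELECT max(beat) FROM replication_heartbeat"


def heartbeat_age_seconds(latest_beat: datetime, now: datetime) -> float:
    """Age of the newest heartbeat row visible on the slave, never negative."""
    return max(0.0, (now - latest_beat).total_seconds())


def should_alert(latest_beat: datetime, now: datetime,
                 write_interval_s: float = 1.0,
                 threshold_s: float = 5.0) -> bool:
    """Alert only when the heartbeat is older than the write interval plus a
    tolerance, so an idle master alone does not look like lag."""
    return heartbeat_age_seconds(latest_beat, now) > write_interval_s + threshold_s


if __name__ == "__main__":
    now = datetime(2015, 1, 1, 12, 0, 10, tzinfo=timezone.utc)
    print(should_alert(now - timedelta(seconds=2), now))
    print(should_alert(now - timedelta(seconds=30), now))
```

Because the heartbeat is written on a fixed schedule, a stale row on the slave points at replication rather than at a quiet master. The comparison still depends on the clocks of the hosts involved agreeing with each other.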
Time versus bytes. How many of you select from your pg_stat_replication table, look at that, see the position in a file and go, 'I know exactly how far behind that is, it totally makes sense'? Have you ever done that? Because I know I haven't. Most humans think in time, so what we want to do is change what we get from those monitoring tools into time, so that we can make some type of decision, logically, about how far behind our slave is.
I did not write this query, because I like other people to do things for me, so I googled it really early on, and you'll find this everywhere. One of the very easy things to find is how to monitor time lag on your slave. You take the last received location of your xlog and the last replayed location; if they're equal, then obviously you're not lagging and it's totally fine. Otherwise, you take the replay timestamp and determine how long ago that was. Everyone's goal with that is to learn what normal looks like.
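The query on the slide is not in the transcript, so the following is only a sketch of the widely circulated form of that check for the 9.x series, using the functions `pg_last_xlog_receive_location()`, `pg_last_xlog_replay_location()` and `pg_last_xact_replay_timestamp()`. The Python function mirrors the same logic so it can be read and tested without a database; its name and signature are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Commonly used form of the check on a 9.x standby (run on the slave).
TIME_LAG_SQL = """
SELECT CASE
         WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
           THEN 0
         ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS lag_seconds
"""


def time_lag_seconds(receive_location: str,
                     replay_location: str,
                     last_replay_ts: Optional[datetime],
                     now: datetime) -> Optional[float]:
    """Same logic as TIME_LAG_SQL, on values already fetched from the slave."""
    if receive_location == replay_location:
        # Everything received has been replayed: not lagging.
        return 0.0
    if last_replay_ts is None:
        # No transaction replayed yet; the SQL would return NULL here.
        return None
    return (now - last_replay_ts).total_seconds()


if __name__ == "__main__":
    now = datetime(2015, 1, 1, tzinfo=timezone.utc)
    print(time_lag_seconds("16/B374D848", "16/B374D848", now, now))
    print(time_lag_seconds("16/B374D848", "16/B3740000",
                           now - timedelta(seconds=3.5), now))
```

The equality check matters on a quiet system: without it, the age of the last replayed transaction keeps growing even though nothing is waiting to be replayed.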
For us, this is a seven-day graph, with time spiking upwards, and it looks like we've got a couple of three-second spikes. In a Slony world that was totally normal; in our world now, that's kind of weird. As you see, it's about 10 spikes over the course of 7 days, no more than three and a half seconds. Most of the time we're in milliseconds, and the baseline doesn't even register compared to the three-and-a-half-second spikes, so the bonus that we're getting from this is amazing. [Audience question.] Yes, this is only two; these are two production read slaves. There are more on that graph, but those are the only two that I had visible when I took the screenshot. They're our two primary read slaves. Obviously one is doing slightly better: there's the purple line and the blue line, and the purple is kind of scary, while the blue seems to be a little bit lower. I have a theory (I do this a lot when diagnosing lag): I think it's because one of those hosts is an FDW host and the other is not, so they have slightly different usage patterns. This is also normal. Those
spikes are very rhythmic. This is a seven-day snapshot. Can anyone figure out exactly what that is? We pause replication and take a pg_dump. There's a variety of backup methods, and we employ many of them; this is not the only one. But in the case that someone in engineering accidentally wiped out a column and you want to restore from two days ago, because it took them that long to build up the courage to say 'Hey, I fat-fingered that', the easiest way for us is to restore a single table from a dump and then apply it to the master. So there's a reason [inaudible]. This is one of the things you have to make sure everybody who looks at these graphs understands, because we have product owners and project managers who look at these graphs and sometimes freak out: 'What's going wrong? Why do you have that?' It's totally normal; this is what backups look like. [Audience question.] Great question, and we'll cover that later, but if you want to have any type of use on that server while you're taking the pg_dump, you have to pause replication, otherwise it's going to cancel transactions. The rest of this is also normal, and
I wish I had labeled this better for you, but what you see is the same type of spikes over 7 days, in this case about 10K, and that's in seconds, followed by, if you can see it down there, a secondary spike, rhythmically, every single day. Even my senior DBA filed a ticket in Jira and said, 'What's going on? That scares me.' I didn't think about it a whole lot, because I had a strong feeling about what it was. [Inaudible.] It's pg_dump dumping the other databases: two databases that are geographically dispersed, dumped individually, first one and then the second one. It's a little thing, and I only point it out because it's so minor, but the point of this is communication. If people don't understand what's going on, even people on the same team, even people who look at these graphs every day, something like this can look like an anomaly and it can cause panic. I spent the time to work on this Jira ticket, prove it and write it out because of this. So: communication, documentation, these types of things are important. This was still normal,
and that was the explanation of
those replication pauses, exactly what you were just talking about. What happens when you turn off replication for a period of time? You pause it for, say, three hours, and then you turn replication back on and you turn your monitoring back on, because the entire time that you have replication paused you obviously don't want to keep getting paged, so you pause your monitoring. What happens when you turn your monitoring back on and your replication back on at the same time? You can get paged, and we do. It's not painful enough to fix, because ironically only the 600 GB database will fire off one page a day while it's catching up, and it catches up really fast. So it's not enough of a pain to fix that race condition: we know what it is, we acknowledge it, and it resolves itself. So that was time
measurement. Byte measurement: again,
I didn't write this, because someone else already wrote it. It's very similar to the last one that we saw: what's the xlog position and what's the replay location. Pretty straightforward, pretty clear, exactly like we saw. [Inaudible.]
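The byte-lag query itself is not captured in the transcript. As a rough sketch of the arithmetic behind it, an xlog location such as `16/B374D848` can be read as two hexadecimal halves of a 64-bit position, and the lag is the difference between two such positions. The function names below are hypothetical. On the server side, `pg_xlog_location_diff()` (available from 9.2) performs the subtraction directly and is the safer choice on older releases, since this sketch assumes the location is a plain 64-bit value.

```python
def xlog_location_to_bytes(location: str) -> int:
    """Convert an xlog location like '16/B374D848' to an absolute byte position.

    Assumption: the part before the slash is the high 32 bits and the part
    after it is the low 32 bits of one 64-bit value.
    """
    high, low = location.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def byte_lag(ahead_location: str, behind_location: str) -> int:
    """Number of bytes between two xlog locations (e.g. received vs. replayed)."""
    return xlog_location_to_bytes(ahead_location) - xlog_location_to_bytes(behind_location)


if __name__ == "__main__":
    # Received position versus replayed position on a slave.
    print(byte_lag("0/3000000", "0/2000000"))
    print(byte_lag("1/0", "0/FFFFFFFF"))
```

A byte count is harder to reason about than seconds, but it does not depend on any clock, which becomes relevant later in the talk.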
With all that, this is an example of completely normal byte lag for 7 days. The maximum there is about 4 million. This is not on a cron, backup or relay host; these are production read slaves. There's one spike, and it doesn't necessarily correlate to any other graphs. Pretty healthy. There's more jitter, more difference from the baseline, than you see with time, [inaudible]. So you have your
monitoring in place, PagerDuty pages you: what's going wrong? We found that
most of our problems were in the initial setup, the move from another type of replication into streaming. This came in three
flavors: the first of which was configuration, then hardware, and
human error. Given that we
think our data is important (I think most people think their data is important; financial institutions definitely think their data is important), we first chose to go with synchronous replication. Who has gone down the road of synchronous replication? A couple of people. Does that work for you? [Inaudible.] We quickly backed out of that decision. When you look at the amount of lag you get with asynchronous replication, it's not worth the extra single point of failure. When you add a synchronous host and you're waiting on that synchronous host, if it goes down, you've gone from your master being a single point of failure to having two problems. When your slave can take down your master, that's an epically large problem compared to just your slave going away. [Audience comment about having more than one, partly inaudible.] We backed into that problem, and I might come back to this point
[inaudible]. As you saw earlier with the graphs, we not only have two primary read slaves; we have a cascading replication slave and then a DR cluster that is geographically separate. And there's an alternate configuration on the host that we use for cron and reporting. That's pretty key. Does anybody know why? The cron host needs to
run queries that are much, much longer than you would be running on your normal nodes. Reports tend to take longer, backups, etc., and you don't want monitoring going off on what is normal behavior, so these hosts have a separate, alternate configuration. This isn't a deep,
deep talk on configuration, but I'm going to highlight a couple of the parameters that will make your life interesting: max_standby_streaming_delay and max_standby_archive_delay. Who here actually uses read slaves that often? [Inaudible.] More often, what I hear is people who use their slaves [inaudible], and they don't have to deal with the problem of transactions or reads on a slave that interrupt or cause problems. One of the first things that we saw when we switched over to using read slaves on streaming, that we didn't see with logical replication, was canceled queries. One of the primary reasons that can happen is your max_standby_archive_delay or your max_standby_streaming_delay. The difference:
they're very, very similar. When you read them, you might want to read them two or three times, if this is your first time, to get the actual difference. The archive delay applies when WAL data is read from the WAL archive, whereas the streaming
delay applies when the WAL data is received via streaming replication. So: very similar, kind of cousins, but still very different. Replication timeout
and wal_receiver_status_interval. Let's talk about replication_timeout first. This is the amount of time that your master will go without hearing from a slave before it terminates the connection. So if your slave has gone AWOL and your master can't find it, it will not keep that connection open. The default is 60 seconds. The interesting thing is wal_receiver_status_interval: that is the maximum amount of time that your slave will go without reporting back to your master. Every now and then your slave says 'Hi, I'm here' and your master goes 'Great'. They pass each other in the street, everything's cool.

Except for the corner case. Someone in the UK sets up a brand new cluster overnight, and I go check it to make sure everything is cool, and I see that the master is there and the slaves are there, but they're not talking. What's going on? It turned out to be the corner case where you have set your wal_receiver_status_interval to a non-standard value. The standard is 10 seconds; say we set that to a minute and a half, with replication_timeout at 60 seconds. [Inaudible analogy.] So what you see when you select from pg_stat_replication is your master saying, 'There's nobody here.' It causes a little bit of head scratching. Don't ever do that. It's totally possible, but don't ever set your wal_receiver_status_interval higher than your replication_timeout.
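The rule stated above lends itself to a small sanity check on a proposed configuration. The sketch below is an illustration under assumptions: the function name is made up, both values are taken in seconds, and a `replication_timeout` of 0 is treated as "timeout disabled" (the parameter is called `replication_timeout` in 9.2 and `wal_sender_timeout` in later releases).

```python
def status_interval_is_safe(wal_receiver_status_interval_s: float,
                            replication_timeout_s: float) -> bool:
    """Return True if the slave reports back more often than the master's timeout.

    Both arguments are in seconds. A replication_timeout of 0 disables the
    timeout on the master, so any positive interval is acceptable then.
    """
    if wal_receiver_status_interval_s <= 0:
        # Disabled or invalid status interval is not modelled in this sketch.
        raise ValueError("wal_receiver_status_interval must be positive here")
    if replication_timeout_s == 0:
        return True
    return wal_receiver_status_interval_s < replication_timeout_s


if __name__ == "__main__":
    print(status_interval_is_safe(10, 60))   # the defaults quoted in the talk
    print(status_interval_is_safe(90, 60))   # the corner case from the talk
```

A check like this could run wherever configuration is generated, so the mismatch is caught before a new cluster is brought up rather than afterwards.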
[Inaudible.] This pain is one of the reasons I wanted to write this talk. [Inaudible.] So after we dealt with the configuration issues of getting canceled queries constantly, we tried playing with the delay settings, the archive and streaming delays, and it wasn't helping. We ratcheted them up; we were like, 'This is ridiculous, we'll run it up to 5 minutes.' And that was insane, because why would you set a streaming delay of 5 minutes? That's completely bad; there's no reason to even read from your slave at that point, you might as well just fall back to the master, which we do after [inaudible] seconds anyway. And this guy was the cause of that. It was so frustrating for me, because there was nothing that I could find that said, 'Hey, by the way, this might be your culprit', and by default it's off. So if you're going to use read slaves in streaming replication, turn this on immediately, first thing, when you're configuring. [Audience discussion about why it is not on by default, largely inaudible.] So what
hot_standby_feedback actually does: it's the one way that I know of (forgive me if I'm wrong) that a slave can affect the master. What it does is tell the master, 'I'm still reading this data; there are rows here that I'm looking at, please don't vacuum them away.' [Inaudible.]

So what happens when you get these two together? When I gave this talk before, a couple of people in the audience gave me the monster of all war stories: they had a 30-minute delay in replication because of this combination. For some reason, I don't know why, they had set their max streaming delay to 30 minutes, and you never recover [inaudible].

We had something similar with synchronous replication, where we had an issue, and actually that might have been the exact issue, which takes me back to the human error portion. When we quickly moved away from synchronous replication, before we had gone to production with it, it had originally been checked into our configuration that way. So we switched over, with all the automation of these configurations (Chef, Puppet or whatever else rolls it out), so we were set to use asynchronous replication in the config. Except that when the change was implemented, the DBA who rolled it out didn't go perform a reload on one of the masters. So, to the human error point, we now had a configuration that said 'I'm asynchronous', a running configuration that said 'I'm synchronous', and a slave with hot standby and a max streaming delay. That causes an outage, in case you're wondering, and it causes an outage that takes quite a while to go, 'What just happened?', because it's a variety of errors all rolled into one. And
so, hardware. This is the first thing. So
most of our clusters in production were still on Slony at this point, and we had this one streaming slave, because we were running a Slony cluster with one streaming slave off to the side. This was the first thing we saw. In a world where you're on Slony and you transfer to streaming replication, all of a sudden the clouds part, rainbows come out and all your problems go away. Except they didn't; this happened. And when you go and say, 'Hey, I've got lag spikes', there's no good information on how to diagnose it. There just isn't. It's intuition, it's experience in knowing what might possibly go wrong; that's just it. So we looked at this and we scratched our heads. We knew this was our older hardware. We knew this was the hardware that we took out of being a production read slave because it couldn't hack it. But it wasn't receiving reads, so what was going on? It had zero load. It had one job; its one job was to apply streaming logs. That's all it had to do, and it couldn't do
that. [Inaudible.] Everything was happy for that moment in time.
This slide, this one, and hot_standby_feedback are the reasons that I wanted to give this talk. When something gives you grief, write a talk about it, because I guarantee someone else has felt your pain and they're hurting, and you will help them. And when it happens again, you're going to look for those problems; you're going to go over what you did last time [inaudible]. So this had a
very, very slight increase over time. What we're looking at here: the hash marks on the bottom are by hour, and this axis is seconds. So we're starting at around a 10-second spike in lag there. I looked at this for about a week, off and on, as there were other things going on, but I'd go back to it and try to figure it out. We got spindles out of the picture, we partitioned the xlogs; everything should have been fine. And this was on a brand new piece of hardware, hardware we were going to upgrade the entirety of our clusters to. It was faster, harder, stronger, better; it was everything that the other slaves were and more, and yet it was showing way worse patterns than anything else. Any ideas? [Audience suggestions, inaudible.] No. These are hourly hash marks, so this is happening all the time, and it's slightly getting worse. [Audience question.] No, this is hardware that we own in a data center. So here's
a zoomed-out view. This is what happens over three weeks [inaudible], to give you the trend line on it. [Inaudible.] I was floored, absolutely floored. We're looking at a time lag between the master and the slave. [Inaudible.]
[Inaudible.] So what happened was that, in not only buying new hardware and setting it up with the config we already knew about, [inaudible] underneath us, when the systems were built [inaudible], they had a different release, CentOS 6, and in that release the configuration for NTP had changed. So the old configuration [inaudible] was not being read, and therefore we were getting drift. So, the thing that you're looking at: first off, don't trust it. And second, there are a lot of factors outside of Postgres. It's not just users running queries, it's not just transactions that are going on, and it's not just network latency.

By the way, I've never run across network latency problems with streaming replication in the same data center. I thought about this when writing the talk: [inaudible] you can have problems over a WAN link. But what I guarantee is that when the network goes down, people know the network is down. [Inaudible.] We have a network admin [inaudible]; if we had packet loss, we would be getting e-mails in the middle of the night, because he's on it. We're kind of fortunate that way: if the system looks at him wrong, he knows about it. We did have an issue once where we had a switch panic; the switch rebooted, and one way that this was noticed is that we momentarily lost connection to a master, one master that happens to connect to everything else by FDW. This was down for about 2 minutes. It was attributed to [inaudible], which was a memory leak issue in the kernel of that switch, I believe. [Inaudible.]

So what was happening is: you're asking the slave, 'Hey, what's the time that you last reported receiving something?', and then you're checking
that against a server that has a fine timestamp, and there was drift between the two, which slowly got larger and larger over time. [Inaudible.] It was a problem I had never come across, because that's insane: NTP not working? We were not monitoring NTP at the time, [inaudible] everything about the system had changed, and everything was supposed to be fine. So now we have byte monitoring too, which goes back to time versus bytes. You might think in time, and that might be the way you want to see it, but check your monitoring [inaudible].
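To make the effect concrete, here is a toy model of how clock drift leaks into a time-based lag figure while a byte-based figure is unaffected. It is a sketch under assumptions, not a reconstruction of the speaker's monitoring: the function name, the constant drift rate in parts per million, and the example numbers are all invented for illustration.

```python
def apparent_time_lag_seconds(true_lag_s: float,
                              drift_ppm: float,
                              seconds_since_last_sync: float) -> float:
    """Time lag as it would be reported when one clock drifts.

    true_lag_s: the real replication delay.
    drift_ppm: how fast the unsynchronised clock gains on the reference clock,
        in parts per million (hypothetical constant rate).
    seconds_since_last_sync: how long the clock has been running unsynchronised.
    """
    accumulated_offset_s = drift_ppm * seconds_since_last_sync / 1_000_000
    return true_lag_s + accumulated_offset_s


if __name__ == "__main__":
    three_weeks = 21 * 24 * 3600
    # A healthy slave (50 ms behind) whose clock gains 5 ppm for three weeks.
    print(apparent_time_lag_seconds(0.05, 5.0, three_weeks))
    # The same slave with a synchronised clock.
    print(apparent_time_lag_seconds(0.05, 0.0, three_weeks))
```

In this model the reported lag climbs steadily even though replication is healthy, which resembles the slow upward trend described above. Comparing the time-based graph with the byte-based one is one way to tell the two situations apart.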
Every so often we have something like this. [Inaudible.] Before November we didn't have a proxy, and this would have caused a world of hurt, but [inaudible] we have load-balanced slaves now. Something like this tends to set off an alarm, but when you look at it, it's between about 12:25 AM and 12:45 AM and it seemed to resolve itself. So what caused it? It's not worth it. I mean, it's
really not. It resolved itself, there is no apparent cause, and the bottom line is it's an anomaly. Somebody could have dug into that particular one; I didn't find anything easily in the logs, and honestly it wasn't worth it. You can spend your time trying to track these down, but I don't have three spare DBAs to do that, and they happen irregularly enough that it's not worth the manpower. So sometimes, even if somebody notices, 'Hey, your graph looks a little wonky there': that's what we've got load balancing for, and now one slave just doesn't matter. So these
are the things that gave us pain, with a couple of additions from [inaudible]. I would like to know what other people have felt, if anything. [Audience discussion, largely inaudible.]
[Audience discussion, largely inaudible.] We're in the process of upgrading from 9.2 to 9.4. We're only on 9.2 because, when we went from 8.4 to a version that supported streaming replication, [inaudible] couldn't handle the transition from 8.4 to 9.3, so we had to back down to a version that supported both. And now we're left with, you know, about three hours if we need to swap a master. So what we're trying to do, because we have to have the necessary read slaves up, with
minimal downtime possible, [inaudible]. [Largely inaudible passage.] We've been looking at this process for about a month, and then we tried doing the rsync [inaudible], and it wasn't until I started asking here [inaudible] that I heard rumors of it, and even then I had to track down [inaudible] to find this documentation. [Inaudible.] We're targeting this upgrade for mid-July, the 17th, if we don't push it back any. [Inaudible.] I'm happy to share the experiences that we go through with that upgrade with any of you, and
I think we're almost out of time. We are. So, thank
you all for coming.
Arithmetisches Mittel
Knotenmenge
Dienst <Informatik>
Rechter Winkel
Festspeicher
Mereologie
Hypermedia
Wort <Informatik>
Grundraum
Gerade
Bildgebendes Verfahren
Standardabweichung
Vermaschtes Netz
Resultante
Prozess <Physik>
Gewicht <Mathematik>
Quader
Gemeinsamer Speicher
Formale Sprache
Gruppenoperation
Gruppenkeim
Versionsverwaltung
Element <Mathematik>
Physikalische Theorie
Computeranimation
Demoszene <Programmierung>
Bildschirmmaske
Informationsmodellierung
Spirale
Datenreplikation
Varianz
Softwaretest
Videospiel
Nichtlinearer Operator
Güte der Anpassung
Verzweigendes Programm
Schlussregel
Bitrate
Kontextbezogenes System
Quick-Sort
Menge
Flächeninhalt
Digitalisierer
Wort <Informatik>
Normalvektor
Lesen <Datenverarbeitung>

Metadata

Formal Metadata

Title ...Lag
Subtitle What's wrong with my slave?
Alternative Title ...(Lag)
Series Title PGCon 2015
Number of Parts 29
Author Samantha Billington
Contributors Crunchy Data Solutions (Support)
License CC Attribution - ShareAlike 3.0 Unported:
You may use and modify the work or its content for any legal and non-commercial purpose, and copy, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or its content, including in modified form, only under the terms of this license.
DOI 10.5446/19132
Publisher PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Publication Year 2015
Language English
Production Place Ottawa, Canada

Technical Metadata

Duration 46:49

Content Metadata

Subject Area Computer Science
Abstract Most of the time, a streaming replication slave in the same data center is so close to the master that lag can be measured in milliseconds. However, when it's not, that lag can be baffling at best and catastrophic at worst. We will look at all things lag: strategies for monitoring, configuration options to fit application needs, diagnosing common issues, and real cases of 'what went wrong'. If you google "postgres streaming replication lag" (go ahead, I'll wait...), your result set will include much information on setup and monitoring, but very little on diagnosing and even less on correcting. This talk is an attempt to fill that gap. We will start with the basics of monitoring and trending over time, look at configuration options and 'gotchas' for making your slaves trusted read sources, diagnose hardware and system factors, and finally share the pain of elusive lag patterns that took days, if not weeks, to figure out. This talk takes a broad look at system health. Many factors contribute to making a database cluster run perfectly: disk speed, network latency, user query patterns, and so on. It can be easy to overlook, or take for granted, things that may strongly affect how closely a slave follows the master. In the fall of 2014, iParadigms converted 8 server clusters across two data centers to streaming replication, allowing us to find and document many such issues.
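
As a small illustration of the monitoring basics the abstract mentions, the sketch below shows one common way to express replication lag in bytes: by comparing two PostgreSQL WAL positions (LSNs) written in the usual `hi/lo` hexadecimal form. This is not taken from the talk. The helper names `lsn_to_bytes` and `lag_bytes` are hypothetical, and the sample LSN strings are made up. In practice the two positions would be queried from the master and the slave, for example with `pg_current_xlog_location()` and `pg_last_xlog_replay_location()` on the 9.x releases current when the talk was given (both were renamed in PostgreSQL 10).

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a WAL position such as '16/B374D848' to an absolute byte offset.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(master_lsn: str, slave_lsn: str) -> int:
    """Return how many bytes of WAL the slave is behind the master.

    Clamped at zero, because the two positions are sampled at slightly
    different moments and the slave can briefly appear to be ahead.
    """
    return max(0, lsn_to_bytes(master_lsn) - lsn_to_bytes(slave_lsn))


if __name__ == "__main__":
    # Hypothetical sample values, not measurements from the talk.
    print(lag_bytes("16/B374D848", "16/B3000000"))
```

Recording this number at regular intervals gives the kind of trend over time that the abstract describes. Byte lag alone does not say how stale the data is in seconds, so it is usually read alongside a time-based measure.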
