Autoscaling PostgreSQL: a Case Study

Hi, everybody — thanks for coming out to this last session.
We're going to talk about autoscaling PostgreSQL in AWS. Quick introductions first: I work with a company called PostgreSQL Experts; we're a PostgreSQL-based services company, and this case study comes out of that work. The universe has a sense of humor, because here I am presenting about AWS; if you want to get in touch afterwards, you can find me on Twitter.
The client came to us with a very serious scaling problem, but only some of the time: really variable traffic, where most of the time not much is happening and then suddenly load goes up a thousand times over the course of minutes. The spikes were largely unpredictable — it would be the middle of the night and suddenly something would happen that drove traffic to the site. What's the site? My apologies — they're a startup in stealth mode — but you get the idea. They had a very high chance of being slashdotted, so they needed to be prepared for that.
The good news: the read/write ratio was heavily tilted toward reads, a lot more reads than writes even by general database standards. The other thing is that the application was greenfield, so they were able to plan for this in advance — not every application has that luxury.
Images and static content changed fairly infrequently, so that was easy to push out from a CDN — more good news.
Their philosophy was that it's better, once in a while when something happens and the system gets overloaded while it's catching up with demand, for users to get a 500 page — "sorry, could you click reload, please" — than to have slow responses for everyone, with users drumming their fingers wondering "is it down for everybody or just for me."
The bad news was that the queries were generally very simple but very unpredictable, so they were very hard to cache. Looking at the top twenty queries hitting the database, some had really high CPU cost. And when the traffic came, it came really quickly: in one to ten minutes it could go from a load you could probably serve on one single node to a thousand or ten thousand times that. Interesting problem. Occasional false-positive spikes were not uncommon too — load would jump for a minute and then come right back down, which was very hard to deal with, and I'm not sure we completely solved that problem. [Audience question about the false positives.] Mostly what happens is that a link to the site hits a large aggregator at some random time, a lot of people click through, and then they all fall off the aggregator quickly. So it's not a false positive in the sense that the monitoring is wrong — that was probably not the best choice of terms — it's just not indicative of a large sustained spike. So, the brief they gave us: they wanted to scale computing resources up and down with demand. They were not interested in buying for the highest possible demand, because that would mean paying thousands upon thousands of dollars continuously — why waste that money? They wanted reasonable fault tolerance, and they were very reasonable about it. As a sidebar, this is the conversation every database consultant has learned to rattle off: you ask how much downtime they can handle, they say "none," you say "great, that'll be 25 million dollars — and you still might see an error." If you want true zero, you can't quite have it. How about a minute of downtime? It gets cheaper. There's
an exponential factor here: the closer you get to zero downtime, the faster the money goes up. It's asymptotic — you never actually reach true zero, and close to it the cost is very, very high.
Fortunately, they had realistic expectations about that. They also wanted business continuity in case of disaster — say, a meteorite destroys us-west-1.
So, given all that: why AWS? Especially at the beginning, I'd proclaim I had little love for AWS — but you could hardly come up with a better use case for AWS than this one. The APIs are very well documented, very high quality, and very easy to use.
When you say "give me a new instance," you get an instance pretty fast — usually in seconds. And probably the single nicest thing about AWS is the ability to snapshot an EBS volume into S3, which I'll show how we use later on.
The application stack: nginx on the front end, because that's what all the cool kids are using; uWSGI as the application container; Django as the application framework — they wanted to use Python. One nice thing about Django is that it has multiple-database support as of 1.3, I believe, which is really handy, and I'll show you how we used that later on.
The stack is the same on each app server: each node contains nginx with uWSGI running inside it, as a self-contained unit. High-velocity things like web sessions are kept locally rather than stored in the database, so they don't have to hit the database at all. And because this was greenfield development, they were able to build around Django's multi-database support from the start: in Django you can create multiple database connections on the settings object, and you can tell the ORM "all writes go to this database connection, all reads go to that database connection." That was huge — it solved a whole multitude of problems. Because this was a new application, we were able to sort this out early.
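The read/write split via Django's multi-database support can be sketched roughly like this. The alias names ("default" for the master, "replica" for the HAProxy-balanced pool) and the blanket send-all-reads-to-replicas rule are illustrative assumptions, not the client's actual configuration:

```python
# settings.py would define two entries in DATABASES ("default" pointing
# at the master, "replica" at the load-balanced secondaries) and list
# this router in DATABASE_ROUTERS.

class PrimaryReplicaRouter:
    """Illustrative Django database router: writes go to the master,
    reads go to the replica pool."""

    def db_for_read(self, model, **hints):
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same logical database, so
        # cross-alias relations are fine.
        return True
```

In the real application, the handful of read-after-write cases discussed below would bypass this and read from "default" explicitly.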
Requests come into the application servers through Elastic Load Balancer, which works really well for this kind of thing. They created an AMI with the whole application essentially baked into it, so it's very easy to provision a new one, and they use user data and Chef to push the current application onto a newly created host — basically pulling it straight out of the repo and launching. That works great. On the front end they're able to scale up and down based on CPU load, which is used as a proxy for how busy a node is, and that worked pretty well.
Just as a note, they built it — and this was also very wise — with a cache-first
architecture, which is: you always return the value from the cache, and if you think you need to recalculate, you invalidate the cache from the bottom up. As the joke goes, the only two hard things in computer science are naming things, cache invalidation, and off-by-one errors. Because velocity mattered, it was better to return stale data quickly than to have users staring at a progress bar waiting for the real thing. It's not like they were returning somebody's medical results here; it was OK for results to be perhaps slightly stale.
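The cache-first read path is simple enough to sketch. Everything here is illustrative — a dict stands in for the real cache (memcached or similar), and `expensive_query` stands in for the database work:

```python
cache = {}  # stand-in for the real cache client

def expensive_query(key):
    # Stand-in for the expensive database query.
    return key.upper()

def cached_read(key):
    """Always serve from the cache when a value is present, even if it
    might be slightly stale; only compute on a miss."""
    if key in cache:
        return cache[key]
    value = expensive_query(key)
    cache[key] = value
    return value

def invalidate(key):
    # Called from the bottom up when the underlying data changes; the
    # next read recomputes and repopulates.
    cache.pop(key, None)
```

The point of the design is the first branch: a hit never waits on recomputation, which is what keeps response times flat under load.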
OK, so given all that, what did we do? The design is: there's a master database that accepts all the writes, and there are two or more streaming-replica databases, and at any one time generally all the reads go to those. Sometimes reads come out of the master, in the limited cases where they had to read immediately after a write, but we tried to minimize the number of those. The application was architected around replication lag from day one, by just assuming that when you wrote data into the database, it could be a while before you could read it back; in the very limited cases where that wasn't acceptable, we directed those reads to the master. As it happens, the way you typically write Django applications, you don't tend to do a lot of read-after-write anyway. Each database machine runs on an extra-large instance — that was the right tradeoff in terms of cost versus the size of the database at this particular time; nothing magic about it, they could pick a larger one — and the data volume fits comfortably on a single EBS volume.
[Show of hands about running databases on EBS.]
[Audience question about RDS.] There are advantages to using RDS, but with this design we needed control over the whole replication topology, which RDS doesn't give you — I'll come back to that.
The topology looks like this: each app server has two connections to an HAProxy, and you can see one goes to the master and one is load-balanced across all the secondaries. One particular secondary is called the heir, which I'll talk about in a moment; as far as the load balancing is concerned, it's just another secondary. The pool could be larger or smaller, but there are always at least two secondaries.
Obviously, we're using streaming replication. The one on the bottom with the arrow is the heir: it runs using synchronous replication. [Audience question.] Yes — it talks to the database over local sockets rather than over the network; the performance hit of that is non-zero, but we didn't notice anything. So, it runs using synchronous replication, and the reason for that is that you always know it is at least as far ahead as any other secondary. Because of that, you know you can always promote it to be the master and attach the other secondaries to it — you don't have to re-image them. Now, I'm sure a lot of you are thinking, "but that has performance problems," and the answer is: you're absolutely right, it has performance problems, and I'll talk about that. The other secondaries use asynchronous replication, so they can be farther behind at any one moment. [Audience question about the heir failing.] Yes — we promote another one to be the heir, and there is a risk of race conditions there; I probably have more slides on problems with this architecture than slides presenting the architecture. But yes, that's exactly the issue: if the heir goes down, the whole system is still up, but there's a window where a second failure could cascade — you'd have to just pick one secondary and re-image the others. If the master fails, the heir is the designated successor.
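On the Postgres side, designating the heir comes down to a couple of settings on the master. This is a sketch; the `application_name` of "heir" is an assumption about how the standby identifies itself in its `primary_conninfo`:

```
# postgresql.conf on the master (PostgreSQL 9.1+ syntax; illustrative)
synchronous_standby_names = 'heir'   # only the heir replicates synchronously
synchronous_commit = on              # commits wait for the heir's confirmation
```

This pairing is what produces both properties described above: the guarantee that the heir is never behind any asynchronous secondary, and the commit-latency cost of waiting for it.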
The HAProxy instance's job is basically to point at the current master database, so that the applications don't have to be repointed if the master changes.
It also provides load balancing across all the secondaries.
"OK, now you've just introduced a single point of failure," you say. All of the HAProxy boxes have warm standbys: each one is shadowed by a box with an identical configuration that isn't actively in use. The problem is that we're not running in a VPC, because they didn't want to pay for that, so we don't have virtual IPs — which means that if one of these fails, we have to repoint to the warm standby, which means we have to do an application push. And everyone groans and thinks, "oh my god, that sounds horrible." It does
sound horrible, but in reality it's about the same downtime as a master failure, so it's not disastrous — it's just uncomfortable.
We use pgbouncer. Each pgbouncer runs on the database server itself, so it talks to Postgres over local sockets. Right now, Django opens a new connection for each web request — that's actually changing in 1.6, which builds in a variety of persistent-connection handling — but today every web request opens and closes a connection. So pgbouncer helps reduce the connection overhead on the database; that's basically what it's there for. It also helps with load, because each database has a relatively small max_connections setting — which we'll talk about in a bit — and we allow more than that number of client connections to come in and queue up inside pgbouncer. We're not doing anything fancy with it.
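A minimal sketch of what that pgbouncer looks like; the database name, port, auth settings, and pool sizes here are invented for illustration, not the client's actual values:

```ini
; pgbouncer.ini, colocated with Postgres -- illustrative values
[databases]
appdb = host=/var/run/postgresql dbname=appdb

[pgbouncer]
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session          ; nothing fancy
default_pool_size = 40       ; stays under Postgres's small max_connections
max_client_conn = 1000       ; excess clients queue here instead of erroring
```

The key relationship is `default_pool_size` staying below the server's max_connections while `max_client_conn` absorbs the burst, which is exactly the queue-instead-of-refuse behavior described here.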
Here are things that didn't work. RDS, because we didn't have sufficient control over secondary promotion — we need to promote the heir and not anybody else.
Elastic Load Balancer for scaling the database tier: there are a lot of moving parts in bringing up a new Postgres database, and we just couldn't get that working right in an efficient way — although maybe we just didn't study it thoroughly enough. Interestingly, provisioned-IOPS EBS is, over the grand sweep of history, slower than the non-provisioned product. There's this little asterisk on provisioned IOPS, and when you read about it, what's guaranteed is that your I/O queue has sufficient depth to deliver the provisioned rate — and, by the way, you never go faster than the provisioned rate either. So if you're unlucky and the EBS server next to yours is having a party, provisioned IOPS is great, because you're guaranteed not to get horrible performance; but in general, standard EBS runs faster than that. And the maximum provisioned rate is a thousand operations a second — that's something like five megabits per second — and in 2013, five megabits per second does not blow me away as a storage system.
[Audience question: is that documented anywhere?] OK — I was just in the documentation this morning confirming this, and I didn't see anything in the English-language documentation that spells it out. It's possibly in there somewhere, but it's very hard to find. Sure, I believe you — it's just very hard to find. So: we set a relatively low max_connections — I believe it's at 50 right now. The reason for this: a big max_connections is an interesting number, because
you go into a database where max_connections is set to a thousand and ask "why is that?" and the answer is "well, we didn't want any refused connections." OK — but can this database really handle a thousand running queries at the same time? No. So why would you want those connections not to be refused? Philosophically, I think it's always better to get a hard, crisp error than to have the whole thing slowly melt down over time. That was the philosophy here, especially with pgbouncer in front, where connections can queue up rather than just being refused. Let's see, what else. SSL between each application server and HAProxy/pgbouncer, using stunnel — HAProxy does support SSL now, but at the time this was being specced out it didn't, or it was still in a branch rather than a release branch. These instances are not running in a VPC — they're basically visible to anybody inside Amazon's cloud — so we needed the traffic to
be protected. There are actually two hops, because HAProxy is a separate box, so we have to use stunnel in two places, so that the Django web servers can connect over SSL end to end — basically, two stunnels. As for database tuning: the good news is that the working set is more or less in memory, and the tuning generally reflects that — we have fairly aggressive settings for things like the page-cost parameters.
Replication, part two: the WAL segments for standard streaming replication are shipped to a central server. We don't ship them directly to the secondaries; we just push them all into a central archive server, and the secondaries pull from there as required, also using rsync.
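The central-archive shipping described here is standard WAL archiving; a sketch, with the hostname and paths invented for illustration:

```
# postgresql.conf on the master (illustrative)
wal_level = hot_standby
archive_mode = on
# %p = path of the completed segment, %f = its file name
archive_command = 'rsync -a %p walarchive.internal:/wal/%f'
```

Each secondary's restore_command is the mirror image, pulling `%f` back down from the same archive host.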
Of course, WAL segments keep piling up on the central server, and we delete them via a heuristic: we take a wild guess at how long we need them — like ten days — and then we drop them. You can imagine — I caught myself concocting this scenario where the archive cleanup command, instead of actually just deleting, would notify the central server "I've gotten this far" and all that, and I realized I really needed to sit down for a moment, because I was obviously going crazy. Sometimes you just want to run find with -mtime and have it done.
Now, the controller. There is a server — which also has a warm standby — called the controller, otherwise known as "the server that does too much." It has its own little Postgres database, and it monitors the databases using a rolling average of active connections — connections actually doing work — as a proxy for server load; it's basically polling the activity on a regular basis, and when it exceeds a threshold, it decides, "OK, time to stand up another database server." We also use off-the-shelf tools for long-term stats graphing, and New Relic — New Relic is great, but it's more for general service-health stuff. The controller, as discussed, is doing the monitoring; it also kills idle-in-transaction connections that it sees: if a transaction has been idle for more than a couple of minutes since its last query, it just kills it, because those are unfortunately notorious in Django — modern Django hasn't really had a problem with that, but once in a while you'll see one. OK, so it's decided it needs more capacity — it's watching this load creeping up, and it's fairly aggressive, because it wants to anticipate the load spikes. So we request new instances from Amazon. We try to balance across availability zones, and new secondaries never go in the same availability zone as the master, because that's the only way of absolutely guaranteeing that a single hardware failure won't take both down at the same time.
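The controller's scale-up rule, reduced to its essentials, is a rolling average of active connections checked against a threshold. The window and threshold values below are made up; in the real system the sample would come from something like `SELECT count(*) FROM pg_stat_activity` filtered to active backends:

```python
from collections import deque

class ScaleUpMonitor:
    """Toy version of the controller's trigger: keep a short rolling
    average of active connections and fire when it crosses a threshold.
    A short window makes it deliberately aggressive, anticipating spikes."""

    def __init__(self, window=6, threshold=30.0):
        self.samples = deque(maxlen=window)  # most recent samples only
        self.threshold = threshold

    def observe(self, active_connections):
        self.samples.append(active_connections)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold  # True -> stand up a new secondary
```

The spin-down path described later uses the same idea with a much longer window, so a brief lull doesn't tear servers down.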
How do we stand up the new secondary? Right now we use a combination of user data and Chef to provision the various components. We really should be building an AMI for this; we haven't yet, but that will happen. After a short bootstrap, Chef installs everything; it's a combination of shell, Chef recipes, and Python, and it's all very ad hoc.
And then the magic — this is why it's called the cloud. On Amazon, you just snapshot the master's EBS volume; you now have a copy of it in S3. You mount that on the secondary as its database volume and you fire up the database, and assuming everything is set up correctly, it just works: it starts up, starts pulling the WAL segments from the archive server, enters recovery mode, replays WAL until it's caught up, connects to the master, and it works. [Audience question.] So far we haven't had trouble with that, but right now, if this doesn't work, we throw away the whole thing and do it again — we'll talk about what happens when things don't work. And then we attach the new secondary to the HAProxy, and we're off to the races.
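The "it just works" recovery on the cloned volume is driven by a recovery.conf along these lines (PostgreSQL 9.x conventions; the hostnames, user, and application name are assumptions for illustration):

```
# recovery.conf on the freshly cloned secondary (illustrative)
standby_mode = 'on'
# catch up from the central WAL archive first...
restore_command = 'rsync -a walarchive.internal:/wal/%f %p'
# ...then connect to the master for streaming replication
primary_conninfo = 'host=10.0.0.10 user=replicator application_name=node7'
```

This is the sequence described above: replay archived segments until caught up, then switch seamlessly to streaming from the master.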
We do have to touch the master's pg_hba.conf to allow replication connections from the new secondary. A better way of doing this would probably be certificates, and we should implement that at some point, so we don't have to go in and edit a text file of addresses. We keep the snapshot around, because why not — S3 is cheap — and of course after a couple of weeks we clean up the stale ones.
[Audience question: how long does this take?] It depends — it's more like a decay curve than a step. Generally we have a new server within two to four minutes, and once we connect it, the relief is really quite dramatic: it helps a lot right away, which is good.
We're generally paranoid about load estimation: as soon as a spike appears and the alarm bells start ringing, we start creating database servers. That's essentially worked out pretty well — I was worried we'd be spending too much money, but it worked out all right.
So, the AWS APIs. They're really well documented and really easy to use, and they frequently don't work. For example, they will claim success when the operation did not complete: I get back a volume ID, I then say "great, mount this volume ID," and they say "are you crazy? there's no such volume" — although it's right there in the console, with that volume ID. I tell it to mount that volume ID; it says "what volume ID?" I look in the console — oh, it's gone. Thanks. The most common one is the API being stuck reporting "creating" on an EBS volume, when if you go into the console it's ready to attach — "why aren't you attaching this, you're being stupid" — and you try to attach it and you can't. Now, to be clear: this is not fifty percent of calls; it's more like one percent. But if you're automating something, you really have to remember that one percent of the cases will do this. And the most frustrating one: you'll spin up an instance, it will come up, and it will just not be right — the volume you attach to it won't work properly; you go away for a while, then come back, and everything's fine. That's the most frustrating one, because you don't know until you start doing things with the instance that it's kind of wrong in the head.
[Audience question about the failure rate.] I really don't know the exact number, but it's under one percent of calls that behave in some really odd way. To be crisp: I have no problem when I say "do this" and it says "error" — that's fine; I don't consider that a failure. The failure is when it's completely wrong about what actually happened; as long as the returned data is correct, fine. So I'd say it's under one percent of calls where I ask it to do something and its answer seems to be out of sync with reality. What do we do about it? The zero-tolerance solution: if anything weird happens, we just trash what we're doing and start over from scratch. That's the answer. If something doesn't work, we destroy the instance, go back to the very beginning, and try again. You could write five thousand lines of Python to try to recover to exactly the right state — I'm not sure I could even if I wanted to. Basically: go back to the last known-good state and start over. It's also much easier to test that way. And yes, that's where the ten-minute tail comes from — the end of the bell curve — because of those retries. The good news is we've never gotten stuck in a continuous loop of "doesn't work, doesn't work, doesn't work"; that's only happened when the region itself was having a problem, and if the region is having a problem, you have bigger issues. OK, scaling down: we decide to spin one of these down based on a much longer rolling average than for spin-up — partly so we're not flapping servers up and down, and partly so we're not fooled by a blip. We use a first-in, first-removed policy, and the reason is that the instinct of "this instance was added a long time ago and it's still running, so we trust it, we like it, let's not get rid of it" is exactly the instinct you want to avoid. So we drain it from the HAProxy; we don't wait too long — at some point we decide you've had your time, finish up — and then we collect it.
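The zero-tolerance loop is easy to state in code. Here `create`, `verify`, and `destroy` stand in for the real provisioning steps (snapshot, mount, start Postgres, check replication); this is a sketch of the policy, not the actual tooling:

```python
def provision(create, verify, destroy, attempts=5):
    """Zero-tolerance provisioning: any weirdness means throw the
    instance away and restart from the last known-good state, rather
    than trying to repair it in place."""
    for _ in range(attempts):
        node = create()        # the API said OK...
        if verify(node):       # ...but trust nothing until it's checked
            return node        # healthy: hand it to HAProxy
        destroy(node)          # wrong in the head: trash it, no forensics
    raise RuntimeError("still failing; the region may be having a bad day")
```

The bounded attempt count reflects the observation above: a node that keeps failing to come up usually means a region-level problem, which retries won't fix.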
When a secondary is being retired, it's removed from the read side of the HAProxy. This is another reason long-running processes tend to be routed to the master: the master is never destroyed unless it fails. OK, yes —
[Audience comment; laughter.] I need more coffee — everyone taking notes, correct that, thank you. OK: the master fails. The first thing we do is destroy the master, or at least make absolutely sure it's dead — the classic shoot-the-other-node-in-the-head thing. One thing about instances: don't get attached to them; they're not your friends. It's a server farm, not a pet store, so you have to be careful about that sentiment. We remove the failed master from the master-side HAProxy; at this point writes are down. Then we do the promotion: the heir becomes the new master, and the controller starts running around madly reattaching the rest of the system. We point the master HAProxy at the new master's IP address, we reconnect all the other secondaries to the new master's IP address, and they catch up. Sometimes a secondary will fail to reconnect — I'm not exactly sure why that happens — in which case we just destroy it; I'm sure it's not its fault, nothing personal. So yes, it's a tedious process, but despite the number of parts, it works pretty well. What do we get in real life? On the order of 75 seconds for writes to come back after we detect the master failure — and the "detect" part is important because, as we'll talk about later, it's rarely as clean as somebody suddenly cutting a cable. So far we have not had one of these not work. It doesn't happen very often, but even in testing — we've pulled the plug, overloaded machines, and force-disconnected EBS volumes as ways of forcing a failure, and it turns out force-detaching the volume is a nasty way to kill a database — it seems to
work so far.
A secondary fails: we just destroy it; there's no big thinking around it. What we do right now is, if the controller either can't connect to it at all, or a simple query takes longer than a set threshold — which is right now three seconds for SELECT 1 — we just decide we don't know what's going on, it's too weird, kill it. That's not very sophisticated, but it seems to work out OK. As we'll talk about, it's rare that a box just explodes completely; more often they get sick, and we'll talk a little bit about that. Then we spin up a new secondary to replace the one we killed. On false positives: I'd rather destroy a secondary that was actually healthy and working than keep around a secondary that I can't trust.
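The controller's health check really is that blunt; a sketch, where `run_query` stands in for the real database call (psycopg2 or similar) and the three-second figure comes from the talk:

```python
import time

def is_healthy(run_query, threshold=3.0):
    """A secondary is trusted only if SELECT 1 both succeeds and returns
    within the threshold; anything else gets the node destroyed."""
    start = time.monotonic()
    try:
        run_query("SELECT 1")
    except Exception:
        return False  # can't connect at all: definitely sick
    return (time.monotonic() - start) <= threshold
```

Erring toward False here is deliberate, matching the destroy-on-doubt policy: a false positive costs one healthy secondary, while a false negative leaves an untrustworthy node in the pool.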
It's a bad world out there. OK: if the heir fails, a new secondary is created and one of the existing secondaries is promoted to be the new heir. There is a race condition right now: if you've promoted the heir to master, but the new heir isn't online yet, and the one you just promoted fails, that's something a human being has to handle — the scripting right now is not smart enough to deal with that situation. Fortunately, we're getting into cosmic-ray territory at that point; of course, if that starts happening on a regular basis, I'll completely change my tune, but at the moment that seems OK.
A bit more about this controller: it's a Python application with a local Postgres database to keep track of what's going on in the system, and it has its own warm standby in case of failure. Right now, if that fails too, something else — a human being — has to step in. The good news is that if the controller fails, the system itself keeps running; nothing bad happens as far as serving traffic goes. It just means failures go undetected until we bring the controller back. The stuff of nightmares is one of these rolling failures, machine after machine dying, where the controller happens to be the first one. That hasn't happened yet, but it's something we'd have to start planning for.
Disaster recovery — meaning loss of a region: a meteor the size of a rather large store crashes into the Eastern Seaboard. The snapshots we're taking are pushed to S3 buckets in different regions, and alongside them we gather, compress, and archive the WAL segments that are chronologically related to each snapshot, also into S3. The retention strategy for those is "we'll worry about it when the bill gets big" — for now we're keeping them. Ultimately, what I want to do is ship them over so there's a cold standby in another region that we can bring up at the push of a button. This also gives us point-in-time recovery, for when somebody pushes out a bad migration, drops a table they shouldn't have — all that stuff that streaming replication emphatically does not protect you against.
This design has side benefits. We can resize the database machines really easily, just by changing the parameter for what instance size we want and letting the system recreate the node. It's also easy to scale up manually: if we know a spike is coming, we can say "your minimum number of secondaries was two; make it six, because I have a bad feeling about this week." And the backup strategy is pretty well handled by this, including point-in-time recovery, which is very important for business continuity: all the streaming replication in the world does not protect you against a failure that the database considers correct but a human being doesn't, like dropping an important table. We were able to run up to about ten secondaries, trading them in and out without really noticing anything; around ten, the load on the master does start to get noticeable.
Above about fifteen secondaries, the master stops being useful. That was a destruction test, so it's an interesting number, but in real life we don't expect the secondary count to go anywhere near that far. So that seems to be about where we are: at that point we can pretty much handle whatever comes.

So why don't I like my architecture? There are a lot of moving parts, and that makes me nervous. There's this controller thing, there are the proxies, there's all this stuff, and everything is something that can fail, so everything has to have a secondary and a failover strategy. After a while you start getting paranoid about it. Still, too many failure modes require manual intervention, especially multiple node failures. If two secondaries go down, the system handles that fine; but if the controller and one of the HAProxies go down together, those kinds of combination modes, there are too many of them where the system can get wedged in a way that requires me to log in and fix things by hand. There are also heuristics in there, educated guesses about the way things work, and that makes me uncomfortable. It works, but "working" is a relative term.

A whole availability zone going down: I have no idea what happens. We haven't tested that case. I'm using it as a proxy for a large number of nodes dying all at once, or for one of those rolling failures where nodes die in exactly the wrong order for the architecture.

Some challenges remain. First, the AWS APIs. I love them; I just wish they would always work the way they say they work, or at least fail cleanly. As it is, I have to put retry logic into my code, and I'm personally deeply offended that in 2013 I have to do a wait-and-retry loop to get a web API call to work. Come on, we're all grownups here. So far there is no good solution except to assume the worst: assume everything is in some horrible state, go back as far as you can without losing data, and try again.
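The "assume the worst and try again" approach boils down to wrapping every cloud API call in a retry loop with backoff. A minimal sketch; `flaky_call` is a stand-in for an API call, not a real AWS function:

```python
import time

def with_retries(fn, attempts=5, base_delay=0.01):
    """Call fn(), retrying on any exception with exponential backoff.

    On the last attempt the exception propagates, so a genuinely
    wedged resource still surfaces as a failure instead of being
    silently swallowed.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Stand-in for an AWS call that fails a couple of times, then works
# (mimicking "the API said OK but the resource isn't there yet").
state = {"calls": 0}

def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("resource not ready")
    return "attached"
```

`with_retries(flaky_call)` succeeds on the third attempt; a call that never succeeds raises after the final attempt, which is the "fail cleanly" behavior the talk asks the platform itself to provide.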
Then there's some infrastructure stuff I'd love to see solved. Above all: come on, just build a TCP proxy that fully supports SSL; that would solve a lot of this. There's also the question of placement: are too many of our instances landing on the same underlying hardware? That makes me nervous.
I'm worried about network congestion, with all the replication traffic flying back and forth to all the secondaries. So far we haven't run into it, but I'm nervous about it, given that they're not running inside a VPC or on cluster instances.
One of the first things people say is: you're monitoring the machine, so if the machine dies you'll know. That's a little bit like saying we'll be able to tell a person is sick because we can see the skull in front of us. That's not the way machines die. Machines don't die suddenly any more than people do; they start getting weird. They go away for a while and come back, and you think, I wonder what that was about. Or the EBS volume unmounts itself, and you go, well, that was weird, and you remount it and everything seems fine. The problem is that monitoring requires sometimes-impossible-to-determine heuristics for whether or not a machine is really dead, as opposed to merely sick. It's really hard to write a perfect test that says which side of the grave a machine is on. So right now we basically adopt a zero-tolerance policy: it's easier to get a new machine, throw the old one away, and try again than to do the triage of getting really super detailed, taking its pulse, putting in fine-grained monitoring of a sick node.
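The zero-tolerance policy reduces to a very simple rule: any failed probe means replace, never diagnose. A sketch with hypothetical probe names:

```python
def verdict(probes):
    """probes: dict mapping health-check name -> bool (True = passed).

    Zero tolerance: a machine that fails any probe is scheduled for
    replacement rather than diagnosis. It is easier to get a new
    machine than to decide which side of the grave the old one is on.
    The probe names used below ("ssh", "pg", "ebs") are illustrative.
    """
    return "keep" if all(probes.values()) else "replace"
```

The point of the sketch is what is absent: there is no branch that tries to nurse a partially sick machine back to health.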
The controller works, and I kind of like it, but it just doesn't feel right. It's somewhat of a single point of failure. It has a backup and all that, but it's still one single process, and it always makes me nervous when I have one single thing doing high availability; that doesn't feel right. And it has access to everything: it has all these keys on it, it has to go around and rewrite configuration files all the time, and that's just messy. I would love this to be fully distributed, but I haven't found a practical way of doing that which meets all the requirements.

Right now we're using almost exclusively on-demand instances, so pricing is kind of a non-issue. We basically create and destroy them on the fly; the client assures me this is the right model for them, and it's not my checkbook. If it were, I would probably reserve the core instances, the stuff I know is going to be up all the time: a certain number of web servers, the controller and its shadow, the HAProxies and their shadows, the database master. I'd run those on reserved instances, and then use on-demand for the surge capacity.
[In answer to a question:] We keep a pretty close eye on that, and I think we're OK. As for why we don't just keep an eye on the AWS console: the console is great if you're a human being. If you're not a human being, groping around inside it programmatically is a really tedious way of handling things; it's not great for automated decision-making.
Also, right now we have a cyclic dependency between the master and the secondary: we need to have both IP addresses before we can do the installation, because of pg_hba.conf, since the master has to allow the secondary's address, and things like that. We could get around that problem. We also don't want things to just boot straight into service: the Debian packaging starts the server the moment the package is installed, and we have to tear that out, because we have to move configuration around first.
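The cyclic dependency exists because the master's pg_hba.conf needs every secondary's address before replication can start (and the secondary, in turn, needs the master's address for its recovery configuration). A hypothetical sketch of rendering those entries once the IPs are known; the role name `repluser` is an assumption, not from the talk:

```python
def replication_hba_lines(secondary_ips):
    """Render pg_hba.conf replication entries for the master.

    Format per line: TYPE DATABASE USER ADDRESS METHOD, where the
    special database name "replication" matches streaming-replication
    connections. The master cannot finish its configuration until
    every secondary's IP is known, which is the dependency the talk
    complains about.
    """
    return [
        "host replication repluser %s/32 md5" % ip
        for ip in secondary_ips
    ]
```

With VPC-assigned, pre-known addresses (mentioned later in the talk) this file could be written once, before either node boots, breaking the cycle.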
Upgrades: let's say 9.3 comes out and I have to roll this thing forward, and I know I'm going to be the one doing it. The problem is that streaming replication requires everything to be on the same major version. Doesn't that sound fun? So maybe I'll have to do a full rebuild; I'm trying not to think about it, it hurts. That's really the big open issue. We haven't hit it yet, but I suspect we will before too long.

Write performance on the master: there are actually two points rolled into one here. There's the write load on the master itself, and there's also the synchronous WAL archiving, and that really gets you. Before launch, a lot of the high-write operations were moved into worker-driven queues, so the user doesn't have to sit there and wait for them. The application does a simple thing and comes back and says, here's your slightly stale data, or it constructs a full page that wasn't actually produced by a database query. All very clever; I'm kind of happy with it. One possibility is to move to a cascading replication model, where the secondaries hang off a first-level cascade; however, the rewiring in the case of a promotion is going to be even more challenging, so I need to see how that goes.
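Moving high-write operations into worker-driven queues, as described, looks roughly like this in miniature. This is an in-process stand-in using the standard library, not the real job system:

```python
import queue
import threading

jobs = queue.Queue()
written = []  # stand-in for the database table the worker writes to

def worker():
    # Drain the queue forever; each item is a deferred write.
    while True:
        item = jobs.get()
        written.append(item)  # the slow database write happens here
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(payload):
    """Enqueue the write and return immediately.

    The user gets an instant response with possibly stale data;
    the worker applies the write to the master a moment later.
    """
    jobs.put(payload)
    return "accepted (data may be stale)"
```

`handle_request` returns before the write lands, which is exactly the trade the talk describes: the user sees slightly stale data instead of waiting on the master's write path.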
General scaling: ultimately there's a much higher limit, and writes are where it always falls apart. On Amazon it falls apart earlier than on physical systems, because all of this machinery helps the read side; it doesn't do you much good on the write side. And by the way, there's a one-terabyte limit on an EBS volume, so we can watch that creeping up on us. Sharding the database had better be done before we hit any of that, because that's a problem.
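When the volume-size ceiling eventually forces a split, the first building block is a deterministic shard router. A minimal hash-based sketch; the shard names and count are hypothetical, not from the talk:

```python
import zlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(user_id):
    """Map a key to a shard with a stable hash, so the same user
    always lands on the same database.

    zlib.crc32 is used because it is stable across runs and Python
    versions, unlike the built-in hash().
    """
    return SHARDS[zlib.crc32(str(user_id).encode()) % len(SHARDS)]
```

Each shard would then be its own master-plus-secondaries group, so the write load divides across masters while the existing read-scaling machinery keeps working per shard.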
On the to-do list: we want to create the database machines inside a VPC, and pre-provision instances to cut spin-up time. With the VPC we get security, and we might even be able to drop some of the tunneling stuff if we wanted to, and we get our own IPs to assign, which will be much nicer because it eliminates the cyclic-dependency weirdness we're having to put up with. On cluster instances we'd get a private network between them, which would be much nicer, and the fast channels would be very pleasant. And ultimately we'll have to shard, or something like that. For now the data set is relatively small, so we have time, but eventually something will have to give.
So, my wish list based on all this. I would really love not to have to go around editing text files to do this stuff. Come on, it's 2013; we have web services. What I want is for the replication and failover setup to be driveable through an API instead of configuration files. And, with my pet feature set, because it's me asking: PgBouncer with the streaming-replication failover handling built in, but otherwise PgBouncer exactly as it is, event-based, single-process, very fast. That's the tool I want. And while I'm at it: come on, I don't care about the politics, just compile with OpenSSL already. And please give me reliable results, even if it means telling me there was a failure: here is your failure, reported reliably. That would be really nice.

So now that we know all this, what do we actually know?
I'd say the big mistake people make with Amazon is to think of it as a machine rental service, because it's a really expensive machine rental service; you pay a lot per instance for what you get out of Amazon. But if you think of it as a compute resource service, as in, I need some computation for a while, could you give me a machine for as long as I need it, then it's actually very cost-effective, and Amazon is the best so far at handing out these dynamic resources and taking them back.
And we can do this. It required a lot of infrastructure, but we did it, on the read side. That's the problem: when you hit the write limit, the architecture balloons in terms of complexity.
Great. Any questions?

[On EBS volume failures:] We just assume it's a failure, and that does happen. I'd say that's probably the most frequent failure we see: the EBS volume just disappears for some reason. The really annoying one is where they migrate the instance but don't migrate it quite right, so the volume won't reattach, things like that. For those we just assume destruction. We could do heroics, like trying to find the EBS volume and reattach it, but I want to go home at some point, so I decided not to try for that. It is absolutely unquestionable that somebody younger and stronger than I am could probably handle a lot of these edge cases, but overall it was easier for me to treat it transactionally: roll back to the last known-good state and try again. It does mean that occasionally a human being has to go and delete some stuff out of the console, things we've lost track of, usually EBS volumes whose instance is gone. That's something we could add to the controller at some point: go through and garbage-collect volumes that aren't attached to instances anymore.

[On data volume:] We're at about 20 megabits per second sustained, so we've broken through the point where carrying a USB flash key across town would be competitive.

[On why not use Amazon's load balancer:] I considered the Elastic Load Balancer. As I was preparing these slides I was trying to remember exactly why we aren't using it, and at this point I have to be a little bit lame and say it seemed like a good idea at the time; there was a compelling reason that I no longer remember. We didn't use the full auto-scaling machinery because there were just way too many moving parts in bringing up and attaching Postgres servers; that part is pretty self-explanatory as to why we use HAProxy instead. But the ELB decision is one that's a little bit lost to my memory.

[On whether we've seen EBS corruption:] No; if we have, it has escaped my attention. I have not personally seen corruption, although I have certainly heard of it, and that's a big reason 9.3 has checksums. The usual EBS failure mode for us is an inexplicable refusal to connect. The classic example: the instance is on a rack that's being retired, so you bring up a bigger instance and reattach; sometimes the volume will be there, sometimes it won't, and we treat that as an instance failure. If the volume itself fails, I can certainly imagine corruption; there are lots of moving parts inside EBS. As far as I'm concerned it's a disk, and not a RAID set either; it's a bare spinning disk, and we treat it as that. The bad case would be undetected corruption that got into a snapshot; that would be pretty bad. I haven't encountered one of those, and it's a good reason 9.3 has checksums.

[On the WAL archive:] It's a single point of failure only nominally. The archive is reliable, but if it fails, it's not simply a clean failure: system operation will be compromised, and a human being has to go put on the waders and repair the situation. That's one of those things we haven't gotten around to yet. Generally the secondaries only actually need the archive when they're doing a recovery, or when they're first starting up, to fetch the WAL segments covering the period from the base backup up to now. That's one of the parts of the cascading failure story that makes me nervous, the first machine coming up and going looking for segments, that kind of thing, but it hasn't been a problem so far. One thing we deliberately do not do is mount the archive on the secondaries; they fetch from it instead.

We do actually have a way of gathering information from the secondaries, which is the archive_cleanup_command that each one runs to say, I'm done with everything before this WAL segment. What's supposed to happen is that you run pg_archivecleanup, but it can be any command, so in theory it could tell the archive server that the secondaries are done from this point onward, and the archive could integrate that data and then delete accordingly. But that sounds like real work compared to a find -mtime +10 -delete. Thank you very much.
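The smarter cleanup sketched in that last answer, integrating each secondary's "done before this segment" report and deleting only what nobody still needs, could look something like this. The segment names follow the real 24-hex-digit WAL naming scheme; the reporting mechanism itself is hypothetical:

```python
def removable_segments(archive, done_before):
    """Decide which archived WAL segments are safe to delete.

    archive: list of WAL segment file names present in the archive.
    done_before: dict mapping secondary name -> oldest segment it
    still needs, as it might report via archive_cleanup_command.

    WAL segment names sort lexicographically in log order, so a
    segment is removable once it is strictly older than the minimum
    position reported across all secondaries.
    """
    if not done_before:
        # No reports: delete nothing. Contrast with the blunt
        # time-based "find -mtime +10 -delete" mentioned in the talk.
        return []
    floor = min(done_before.values())
    return [seg for seg in sorted(archive) if seg < floor]
```

Unlike the ten-day `find` rule, this never deletes a segment a lagging secondary still needs, and never retains one that every secondary has passed.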
Formal Metadata

Title Autoscaling PostgreSQL: a Case Study
Alternative Title Automated PostgreSQL Scaling on AWS
Series Title PGCon 2013
Number of Parts 25
Author Pettus, Christophe
Contributors Heroku (Sponsor)
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make this work publicly available for any legal, non-commercial purpose, in unchanged or changed form, provided that you credit the author/rights holder in the manner specified, and that you pass on the work, including in changed form, only under the terms of this license.
DOI 10.5446/19042
Publisher PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Publication Year 2013
Language English
Production Place Ottawa, Canada

Content Metadata

Subject Area Computer Science
Abstract Managing Your Thundering Herd Amazon Web Services provides tremendous tools and techniques for scaling services up and down in response to planned or experienced load. However, too many systems are configured to use AWS as an equipment-rental facility, which wastes money and does not take advantage of AWS' unique properties. We'll talk about how to build systems that flex-scale using AWS tools. Among the topics we'll cover are: -- Designing your application and database for sharding and scaling. -- Planning for load spikes. -- Detecting load fluctuations. -- Scripting your scale-up/scale-down functionality. -- Scaling the database vs scaling the application front-end. -- Monitoring and fault-recovery. The demonstrations will be specifically about AWS, but the techniques can also be applied to other cloud environments.