HowTo DR planning for the worst


Formal Metadata

HowTo DR planning for the worst
Title of Series
Number of Parts
Berkus, Josh
Crunchy Data Solutions (Support)
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
PGCon - PostgreSQL Conference for Users and Developers, Andrea Ross
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
There's a lot more to disaster recovery than making backups. Most of DR, in fact, is planning instead of code: knowing what you need to do when disaster strikes, how to do it, and who does it. Further complicating things, management and admins are fond of preparing for unlikely events while failing to prepare for probable outages at all. This talk will outline how to make a disaster recovery plan, and some basic dos and don'ts of DR. Included: the three most common downtime causes; determining acceptable losses (and getting management to agree); backup vs. replication; planning for the unexpected; against improvising (always have a Plan B); public cloud DR; other dos and don'ts. When disaster strikes, it's too late to start planning. Do it now.
So, welcome everybody. This is a fairly basic talk about creating a disaster recovery plan for your database servers or other services, so if you thought it was something else, now is the time to go see the talk on how you can store JSON; otherwise, stay in place. People use the terms DR and disaster recovery in a whole variety of ways.
Wikipedia has a rather wordy definition of what disaster recovery is, but I think it's actually a lot
simpler than that: restoring services after the unexpected happens. What are we going to do to get back up and running after the unexpected happens? This actually breaks down into an attempt to limit or control two things: downtime and data loss. We're trying to minimize our downtime and our data loss in the event of an unexpected event. So, show of hands: how many people here have an actual DR plan for your enterprise, or for your biggest customer if you're a consultant? OK. And how many of you would say that DR plan is relatively complete? OK. And of the remaining three people, how many of you have actually tested that DR plan? OK, you're well ahead of everybody else, so you don't need to stay for the rest of the talk. Everybody
else is more or less in the same boat, which is that their DR plan is to panic and circulate their résumés. I can't tell you the number of customers I've gone into where I ask, "OK, so how do we get back up and running if your [inaudible] is unavailable?" and nothing comes back. So that's generally the situation for a lot of people. One of the other things I've found is that even when people do have a DR plan, it's often wrong, because they spend a lot of time planning how to recover from disasters that are extremely unlikely to happen, and no time planning how to recover from disasters that are commonplace. Like a security policy,
a DR plan should start with what's essentially a threat model: what are the things that can cause a disaster to happen? In general, since you are after all planning for the unexpected, you plan for classes of events, not for individual events, because if you can plan for a very specific event you can prevent it from happening in the first place. When I start talking to management about this, they usually want to spend all their time talking about three kinds of events: server failure; getting hacked, which is very popular with a lot of people; and natural disaster, you know, a fire in the data center or whatever. This is what management spends its time thinking about in terms of disaster recovery. The thing is, they omit a whole bunch of other things that frequently cause disasters: storage failure, unexpected traffic spikes, administrator error, network failure, bad updates, that sort of thing. When I was preparing for this talk, one of the things I did was go through our e-mail archive for the company. We have an emergency service at the company, so I searched for the clients who contacted the emergency service because of a downtime emergency, and for why they were having that emergency. Three of these items were the most common underlying causes of unexpected downtime or data loss. Any guesses as to which three? [Audience guesses.] Yes, administrator errors were actually very common. The three most common were administrator error, network failure, and bad updates, particularly firmware updates for the SAN; that was a popular way to lose all of your data. So you can see there's a real discrepancy here, because this is what management thinks they're planning for in disaster recovery, and this is what's actually likely to happen. By the way, this one right here, network failure, was out ahead of the
others by about two to one, and usually it was network failure compounded by administrator error after the network came back up. Now, one of the other
things that management has a problem with is this idea of accepting loss. This is understandable, because everybody has a problem with accepting loss. But there's the whole grieving process: you have to take them through denial, and then anger, and then all those other steps of accepting loss, which is that in the event a completely unexpected thing happens, you are going to lose something. You will be down for some period and you are going to lose some data, and you just have to decide what's an acceptable level to lose. Otherwise you can't make any realistic plans. One of the things that really hasn't helped the situation is this whole
"nines" thing. This is very popular among C-level executives: we have three nines of uptime, or four nines, or five nines, and this is the amount of downtime per year you're allowed for each level of nines. Keep in mind that if you're really trying to hit something like five nines of uptime, this is all the downtime you're allowed for any purpose at all, not just unexpected events but also expected events that require downtime. In terms of real disaster recovery planning, this nines model is actually not useful. For one thing, it pretty much assumes that all types of unexpected events are more or less equivalent, so that losing a single hard drive is pretty much equivalent to losing the whole data center in a forest fire. It also completely fails to address data loss; it's strictly about service availability. If I'm allowed 100% data loss, then I can give you any number of nines you want. So five nines is really not about disaster recovery, it's about business continuity: what is it that we're promising our customers, such that we'll have to pay them something if we don't achieve it? Also, the higher levels of nines tend to be kind of unrealistic for anybody with a budget smaller than Google's.
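The arithmetic behind a nines table is easy to sketch. The snippet below is only an illustration and is not taken from the talk's slides; the function name `allowed_downtime_minutes` and the 365-day year are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for an availability of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** (-nines)  # e.g. three nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes per year")
```

Note that the budget covers planned and unplanned downtime alike, and says nothing about how much data may be lost, which is the limitation described above.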
So what model do I actually like to use? All disaster recovery planning work should come from the disaster recovery planning worksheet. Basically, down the left-hand side we have all these general classes of disaster, of unexpected events that you can have. There are more than this, but this is what I had room for on the slide. So those are all the different classes of event you can have. Then we have this first column:
downtime allowed. This is how much downtime we're allowed to have in the event that this event happens. So if there's a server failure or a storage failure, for example, how long is the service allowed to be down? What's our target for being back up? The second column is data loss. If we have a bad update or a hack, how much data are we allowed to lose? Since we're talking about the database server here, how much data, usually measured in minutes, are we allowed to lose if the disaster happens? And the last column, which people tend to leave out of their DR worksheets, is detection time. This applies only to certain kinds of disasters, like getting hacked or administrator error: how big a window of time do you have to preserve enough information for, so that if you detect the problem three weeks later you can still recover from it? For example, if the web server was hacked three weeks ago and they've been messing with stuff since then, do we have to have detected it by now? If it happened last month, do we have to be able to recover from it? That's an important question, because it dramatically affects the amount of backup storage you need, among other things. So what I do is go to a client who says, "We want to pay you for disaster recovery planning," and we figure all this out. Sometimes we go to the client and say, "Hey, you really need a disaster recovery plan, because we would like you to pay our last invoice, and under current circumstances you won't be able to." So I give them this worksheet and say, "I want you to fill out what the allowable thresholds are." After that first meeting it pretty much inevitably comes back with zeros everywhere, because senior management never wants to admit that they might have data loss or downtime under any circumstances. The way I deal with this is that I give them an
estimate of how much that would cost to achieve. It's not actually possible for these
things to all be zero, but it is possible for them to be near zero. It's just very,
very, very expensive, and certainly your 14-employee venture-funded startup does not have the budget for it. So then we have the come-to-Jesus meeting, out of which we can generally get a more
realistic set of estimates out of them for each case. So if it's just an ordinary server failure, you're allowed 5 minutes of downtime and 1 minute of data loss, whereas if the whole network is out you really need to allow at least 3 hours of service outage, for example because of DNS: if you actually have to recover from the network failure by moving the addresses of the servers, you're not going to recover faster than your DNS refresh interval. That's an example of the kind of thing you have to explain to people when they're listing a threshold. Once you have all that, it gives you the building blocks you need to make a disaster recovery plan that achieves these targets. It usually takes several iterations, because you come up with a plan to achieve these targets and an associated set of costs, and that also turns out to be too high, so we adjust the targets again until we come up with something they can actually live with.
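One way to picture the worksheet is as a small table of thresholds per class of event. The sketch below is a hypothetical representation, not the speaker's actual worksheet: the `Threshold` class, its field names, and `needs_discussion` are invented for illustration. Only the server-failure and network-outage numbers come from the talk.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Threshold:
    """One worksheet row: allowable limits for one class of event."""
    event_class: str
    max_downtime: Optional[timedelta] = None      # None = not yet agreed
    max_data_loss: Optional[timedelta] = None
    detection_window: Optional[timedelta] = None  # only for some classes


def needs_discussion(row: Threshold) -> list:
    """Return the reasons a row has to go back to management."""
    reasons = []
    for name in ("max_downtime", "max_data_loss"):
        value = getattr(row, name)
        if value is None:
            reasons.append(f"{name} not filled in")
        elif value == timedelta(0):
            reasons.append(f"{name} is zero, which is not achievable")
    return reasons


worksheet = [
    # The two rows with numbers quoted in the talk.
    Threshold("server failure", timedelta(minutes=5), timedelta(minutes=1)),
    Threshold("network outage", max_downtime=timedelta(hours=3)),
    # What tends to come back after the first meeting: zeros.
    Threshold("administrator error", timedelta(0), timedelta(0)),
]

for row in worksheet:
    print(row.event_class, needs_discussion(row))
```

Rows that come back with reasons attached are the ones that need another iteration with a cost estimate in hand.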
Let me talk a little bit about cost estimation here. There's a lot that's technical in disaster recovery, but when you're doing the planning, cost estimation is important, because you're really balancing two costs: how much does downtime cost me, versus how much does not having downtime cost me? If having downtime costs you a thousand dollars a minute and not having downtime costs you a million dollars, then you've got an imbalance in your planning, and vice versa. The cost generally breaks down into four areas: implementation, maintenance, storage, and other infrastructure. I'm taking implementation and maintenance together, since implementation is generally a one-time cost and maintenance is an ongoing cost. Here we're talking about things like setting up backups and replicas, setting up monitoring, doing the monitoring, troubleshooting, responding to monitoring, training personnel in the data recovery procedures and everything else, and doing recovery tests. The implementation and maintenance costs tend to be quite substantial; they actually often eclipse the other costs, particularly if you're using consultants for all this work. The
next big cost for any sort of serious data recovery for a database service is storage, because you have to figure out how much storage you're going to need in order to achieve your downtime and data loss targets. For example, say we're planning on preventing data loss through a comprehensive set of point-in-time recovery backups. Our calculation for a backup scheme for preventing data loss is the size of the backup times the number of data retention periods plus one. That plus one is there because an unexpected event can happen in the middle of a backup, so if you're required to keep 3 months of backups, you'd better have 3 months and 1 week. For example, we had an interesting calculation with a client who had a small data warehouse, where we were doing point-in-time recovery backups uncompressed, for a variety of technical reasons. In order to meet their downtime targets, their data retention targets, and their data loss targets, we had to have a year of weekly backup snapshots, which meant we had to have a year plus one week just to make sure we always had a year. This ended up at 43 terabytes of storage, which is not insignificant, and once again this was a calculation that resulted in a change to the backup targets. That was an 800-gigabyte database snapshot plus about 20 gigabytes of write-ahead logs during the week. The last thing is your other infrastructure besides storage. Here you're talking about the extra servers and extra hosting you might need for backup and recovery resources and for failover servers, and networking. Particularly if you're doing inter-data-center disaster recovery, where you are doing backups or replication to another data center in a separate location, bandwidth costs are not insignificant and you'd better estimate them ahead of time.
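The storage rule described here, backup size times the number of retention periods plus one, can be written out as a short calculation. This is a sketch under stated assumptions: the function name is invented, and the per-backup sizes (about 800 GB of snapshot plus about 20 GB of write-ahead log per week) are assumed values chosen because they land close to the 43 terabytes mentioned for a year of weekly snapshots.

```python
def backup_storage_gb(retention_periods: int, snapshot_gb: float, wal_gb: float) -> float:
    """Storage needed to always hold `retention_periods` complete backups.

    One extra period is kept because a disaster can strike while the
    newest backup is still being written.
    """
    if retention_periods < 1:
        raise ValueError("retention_periods must be at least 1")
    return (retention_periods + 1) * (snapshot_gb + wal_gb)


# Assumed inputs: 52 weekly snapshots, ~800 GB each, ~20 GB of WAL per week.
total_gb = backup_storage_gb(52, 800, 20)
print(f"{total_gb / 1000:.1f} TB")
```

Running the numbers before committing to a retention target is the point: a result this large is what sends the targets back for another round of negotiation.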
Now, all of that presupposes that we actually have a DR plan, so let's talk about some of the elements of a DR plan. As I see it,
the first element is to figure out backups and replicas, the second is replacements, the third is procedures, and the fourth is people. Now, PostgreSQL provides a couple of different backup facilities. One is, obviously, pg_dump, which is our logical backup facility. Then we have our binary backup facilities: pg_basebackup, or rsync, or whatever you use to take a binary backup. The advantages of taking backups are several. Backups are conceptually simple: you take them at whatever period, once a week, once a month, or whatever. Depending on the format, and I'll talk about that a little bit, they can be more or less portable, and it's certainly pretty easy to move a backup to another data center. And depending on how you set your backups up, you can recover to last week's good database server, or last hour's database server, or whatever. The number one disadvantage of a data recovery plan that depends entirely on backups is the amount of time required to restore. Again, if you have that 800-gigabyte database and you're only doing pg_dump of that database, that is your only recovery mechanism, and you lose the server because the RAID card fried, then you are not making a 5-minute downtime target, I guarantee it. The other problem is that if you're not doing continuous backup, you also have a fairly large data loss interval. If you have a very small database you can do a pg_dump every hour, or twice an hour, and have a small data loss interval. Most people can't do that; most people are not running a small enough database for that to be practical. So if you're just doing a snapshot or a pg_dump, you can have a data loss interval of hours to days, which again is not an acceptable target for a lot of enterprises. And there is sort of a
trade-off between pg_dump and binary backups. For clients who have really stringent data retention and recovery requirements, we actually end up doing both pg_dump and point-in-time recovery, because of the trade-off between the two. pg_dump is extremely portable, including across versions and platforms, to a degree; even downwards to a degree sometimes, because one of the things we've used pg_dump for is clients who upgraded to 9.3, discovered a bug in 9.3, and now need to downgrade to 9.2. Second, pg_dumps are a compressed format, often up to 20 times smaller than the original database, which means that for the same storage cost you can keep a lot more of them, and if you have some kind of automatically deduplicating storage you may be able to get even more out of that. The disadvantage is that they can take a very long time to take in the first place if you have a large database, and they take an even longer time to restore, which means they often don't meet your downtime window.
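Whether a dump-only strategy can meet a downtime target is a back-of-the-envelope division. The sketch below is illustrative only: the function names are invented, and both the 800 GB database size and the 2 GB per minute restore throughput are hypothetical. A real figure should come from timing an actual test restore.

```python
def estimated_restore_minutes(db_size_gb: float, restore_gb_per_minute: float) -> float:
    """Rough restore time, assuming a constant restore throughput."""
    if restore_gb_per_minute <= 0:
        raise ValueError("restore throughput must be positive")
    return db_size_gb / restore_gb_per_minute


def meets_downtime_target(db_size_gb: float, restore_gb_per_minute: float,
                          target_minutes: float) -> bool:
    """True if the estimated restore fits inside the allowed downtime."""
    return estimated_restore_minutes(db_size_gb, restore_gb_per_minute) <= target_minutes


# Hypothetical inputs: 800 GB database, 2 GB/minute restore, 5-minute target.
minutes = estimated_restore_minutes(800, 2.0)
print(f"estimated restore: {minutes:.0f} minutes")
print("meets 5-minute target:", meets_downtime_target(800, 2.0, 5))
```

The estimate ignores time spent provisioning a replacement server and rebuilding indexes, so it is a lower bound at best.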
Then there's the base backup. With a base backup you have large file sizes, meaning the same size as your database. They're compressible to some degree, but not as much as the dump. They're not nearly as portable: you pretty much have to restore them to the same major version of Postgres, and in some cases you have to restore to the same patch version, since we've actually had Postgres update releases that fixed something about taking base backups. The advantage is that if you have really fast storage, you can run a binary base backup at almost the full speed of the storage, which is not true of pg_dump. And you can use this with point-in-time recovery, continuous backup, so they're generally a safer
backup. Let's get back to the threat model, because the whole point of this is that we are planning for certain known classes of unexpected events. Backups are good for natural disasters, because their portability means it's easy to ship them to another data center, or even to your office, and get back up and running if you lose the data center. They're really good for administrator error and bad updates, things that might cause direct and permanent deletion of data, and for software bugs too, for that matter, if you have to downgrade Postgres because you're one of the unlucky users who hit a bug in 9.3. They're good for getting hacked for the same reason: they protect you against data loss. They're not so good for other classes of downtime. Now, as for
replication for disaster recovery: binary replication is our built-in replication, and despite a lot of the verbal abuse it receives, it's still useful for a lot of people, and there are reasons why it makes sense. With binary replication you have two options: full streaming, versus just WAL archiving without streaming. The reason you would do non-streaming is, again, inter-data-center stuff, where it's often actually cheaper to ship the WAL files as files than to maintain a continuous streaming connection. The advantages of replication for disaster recovery are that it's continuous, so you are only losing a few seconds to a few minutes of data, and that it's fast to fail over, which minimizes your downtime window. So you're minimizing both data loss and downtime for certain kinds of disaster. The disadvantages: the hardware requirements tend to be much higher than those for purely static backups, because for things like pg_dumps and base backups you can use whatever your cheapest, slowest cold storage is. You don't have to have a capable database server with 32 cores and so on, spending all your money on a standby. It's a lot more complex, with a lot more monitoring and maintenance involved to make sure it's working. There is some burden placed on the master under some circumstances, so there can be a performance cost, which matters particularly if you're only doing replication for disaster recovery purposes and not using the replicas for any other reason. And then, most importantly, certain kinds of failures get replicated: your DROP TABLE users will be faithfully replicated over streaming replication, generally faster than you can react to it and shut the standby down.
So, for that reason, replication in general... By the way, there is a new feature coming in 9.4, which is time-delayed standbys, and that can give you a little bit more of a window to react to that kind of problem; that's the main reason to have time-delayed standbys. The drawback, as I was saying to somebody earlier, is that if the standby is delayed by an hour, it can take more than an hour for your junior DBA to admit that he just dropped the users table. Anyway: replication is good for server, storage, and hardware failures, your basic hardware failures. It's terrific for that: you set up automated failover and you can be down for a couple of seconds with almost zero data loss. It's bad for administrator error, getting hacked, and software bugs, including software bugs in the applications themselves.
In a lot of ways, continuous backup is actually a nice compromise between replication and snapshot backups. If you set up point-in-time recovery, continuous backup, that gives you the advantage that it's continuous, like replication, but you can recover to, say, before you dropped the users table, or before somebody hacked into your database and reset all the passwords, like a regular backup. So in a lot of ways, if you have a limited budget for disaster recovery and you want a compromise solution, it makes sense to set up continuous backup with point-in-time recovery. The one way in which it is not the best of both is that it can have a fairly long recovery period, because you have to restore the snapshot and then replay all of the intermediate transaction logs, which is going to take some significant fraction of the time it took to generate that traffic in the first place. So it doesn't really help your downtime window in a lot of ways, but it is sort of a compromise among things overall.
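The comparison of backups, replication, and continuous backup against the threat model can be summarized as a small coverage check. The table below is one reading of what this part of the talk says each mechanism is good for, not an authoritative matrix; the names `COVERS` and `uncovered` and the example threat list are invented for illustration.

```python
# Which classes of event each mechanism helps with, as described above.
# This mapping is an interpretation of the talk, not a definitive list.
COVERS = {
    "pg_dump backups": {
        "natural disaster", "administrator error", "bad update",
        "software bug", "getting hacked",
    },
    "replication": {"server failure", "storage failure"},
    "continuous backup (PITR)": {"administrator error", "getting hacked"},
}


def uncovered(threats, mechanisms):
    """Return the threat classes that none of the chosen mechanisms address."""
    covered = set()
    for mechanism in mechanisms:
        covered |= COVERS[mechanism]
    return sorted(set(threats) - covered)


threat_model = [
    "server failure", "storage failure", "administrator error",
    "network failure", "bad update",
]

print(uncovered(threat_model, ["replication"]))
print(uncovered(threat_model, ["replication", "pg_dump backups"]))
```

Network failure stays uncovered in both runs simply because this segment does not say which mechanism helps with it; a real plan would need its own row for that.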
So let's talk about replacement, which is the second part here. Now, this is something that even our clients who already have backup and replication set up often haven't devoted a lot of thought to. Which is: great, you've lost your primary database server, and you have continuous point-in-time recovery backups. So where are you going to restore those backups to? If your disaster recovery strategy involves getting on the phone with Supermicro, you're probably not going to make your downtime window.
So you need to have a plan for how you're going to restore services from your backup or from your replica in the event that disaster strikes, and this involves having, you know, servers, network, storage, OS images, and all kinds of other things.
Now this leads us into part number three, which is procedures, which is where you write all this stuff down.
And by procedures I really mean written procedures. Unless you are a company of one person, in which case you can maybe get away without having written procedures, for anybody else in the world you need to have written procedures, because 3 in the morning is not when you want to be making shit up, and unexpected disasters tend to happen at 3 in the morning.
And by procedures I mean writing down not only every step you need to take in order to restore services, but how to decide which step to take. This is particularly crucial if the disaster recovery procedure is going to be carried out by somebody who is not a DBA, because if the person carrying it out is the on-call network administrator, they don't know what Postgres is or what it does. Really: it's a database, it's got stuff in it, it's on port 5432, that's what they know. And so they will be perfectly willing to follow a destructive "restore from backup" procedure if that's the first procedure they come across, even if the only problem was that replication failed and just needed to be restarted. So this is a sort of truncated example of how you would write out the set of instructions. The first thing is the symptom: the database server does not respond. So first, find out whether the physical server is down, and write instructions that someone who is not a Postgres DBA can follow to determine whether or not the physical server is down and whether it can be restarted. If in the course of this you determine that it is the network that is down, then it's time to defer to the network recovery procedures and stop trying to recover the database server, which is probably perfectly fine; you just can't reach it. Then, if the physical server is up, try to restart the database using these commands: here's how to restart the database and how to determine whether it successfully restarted or not. Then, if it's still down, fail over to the replica using this procedure, and this is how to check whether or not the replica failover succeeded. And if the replica failover failed, then here's where to find the backups and how to restore them. That's the sort of thing you want to have. Obviously the real one is much longer than that; the real procedure ends up being ten pages, because you have to do a lot of defensive sort of planning.
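The truncated procedure above is essentially a decision tree, and writing it out as one makes the "how to decide which step to take" part explicit. The sketch below is only an illustration of that ordering: the `Observations` fields and `Action` names are hypothetical, and a real runbook would attach the actual check and command to each branch.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    NETWORK_PROCEDURE = "defer to the network recovery procedure"
    RESTART_SERVER = "restart the physical server"
    RESTART_DATABASE = "restart the database"
    FAIL_OVER = "fail over to the replica"
    RESTORE_BACKUP = "restore from backup"

@dataclass
class Observations:
    """What the responder has established so far (hypothetical fields)."""
    network_up: bool
    server_up: bool
    server_restart_failed: bool = False
    db_restart_failed: bool = False
    failover_failed: bool = False

def next_action(obs: Observations) -> Action:
    """Return the least destructive step that has not yet been ruled out."""
    if not obs.network_up:
        return Action.NETWORK_PROCEDURE   # the database is probably fine
    if not obs.server_up and not obs.server_restart_failed:
        return Action.RESTART_SERVER
    if obs.server_up and not obs.db_restart_failed:
        return Action.RESTART_DATABASE
    if not obs.failover_failed:
        return Action.FAIL_OVER
    return Action.RESTORE_BACKUP          # last resort, most destructive

print(next_action(Observations(network_up=True, server_up=True)).value)
```

The point of the ordering is that "restore from backup" is only reachable after every cheaper step has been tried and recorded as failed.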
So the sort of good standard is detailed written procedures, like on the wiki, right? A better one is written procedures that actually have copy-and-paste of all the commands in them, because you can compound the disaster by typoing the restore command rather dramatically. Which means that really the best is: forget about pasting commands, you have a set of well-tested shell or Perl or Python or whatever scripts that do the various restore steps, and you just say, OK, if you determine that the server is up and the network is up and Postgres can't be restarted, here is the script that does a replica failover and tells you whether or not the failover succeeded. That's your best solution.
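A script of the kind described, one that performs the failover and then reports whether it worked, might look roughly like this. It is a sketch under assumptions, not a tested production tool: it assumes it runs on the replica host as a user who can run `pg_ctl promote` and connect with `psql` using default connection settings, and the `runner` parameter is a hypothetical hook added so the logic can be exercised without a real server.

```python
import subprocess
import time

def run(cmd):
    """Run a command and return (exit_code, stripped_stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()

def fail_over(data_dir, runner=run, attempts=5, wait_seconds=2.0):
    """Promote the local replica, then verify that it has left recovery.

    Returns True only if the server itself reports it is no longer a standby.
    """
    code, _ = runner(["pg_ctl", "promote", "-D", data_dir])
    if code != 0:
        return False
    for _ in range(attempts):
        # pg_is_in_recovery() returns false ("f") once promotion is complete.
        code, out = runner(["psql", "-At", "-c", "SELECT pg_is_in_recovery()"])
        if code == 0 and out == "f":
            return True
        time.sleep(wait_seconds)
    return False
```

A responder would call something like `fail_over("/path/to/replica/data")` and get a plain yes or no back, which is the property the talk asks for: the script tells you whether the failover succeeded instead of leaving that judgment to whoever is at the keyboard.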
Now, the other thing we actually need to have here is a fallback procedure in case the first procedure didn't work, because again, you're trying to avoid improvisation. What will happen at 3 in the morning is that the failover to the replica doesn't succeed, at which point the person at the keyboard starts trying to hack up a solution. So you want to have a fallback procedure for when the first recovery procedure does not succeed: if restore from backup gives you an error, then do X. Sometimes "do X" is just "call this phone number". Now, don't get carried away with this. You probably don't want a fallback-fallback procedure. There are some enterprises where it makes sense; you know, if you actually work with the FAA, you probably want a fallback-fallback procedure. But for most people, the fallback-fallback is: OK, it's time to get the home phone numbers of the manager and the senior DBA and everybody else and have an impromptu meeting about what we are going to do.
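The "primary procedure, then fallback, then pick up the phone" structure can be expressed as a small driver that never improvises. This is a hedged sketch: `run_with_fallback` and its arguments are hypothetical names, and each procedure is assumed to be a callable that returns True on success.

```python
def run_with_fallback(procedures, escalate):
    """Try each (name, procedure) in order and stop at the first success.

    If every procedure fails, hand the log to `escalate` (for example,
    "call this phone number") instead of trying anything unplanned.
    """
    log = []
    for name, procedure in procedures:
        try:
            ok = bool(procedure())
        except Exception as exc:  # a crashed step counts as a failed step
            ok = False
            log.append(f"{name}: error: {exc}")
        else:
            log.append(f"{name}: {'ok' if ok else 'failed'}")
        if ok:
            return name, log
    escalate(log)
    return None, log

# Illustrative use with stand-in procedures.
winner, log = run_with_fallback(
    [("replica failover", lambda: False), ("restore from backup", lambda: True)],
    escalate=print,
)
print(winner)  # restore from backup
```

Keeping the list short matches the advice above: one primary, one fallback, and then escalation to people rather than a third layer of automation.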
The key thing in all of this, again, is that you do not want to be improvising. We do a lot of corrupt database recoveries for clients, and I'll tell you, in about half of those it was not the original disaster that caused the database to become corrupt; it was the actions that the administrator took after the original disaster. Postgres is pretty good about recovering itself after a crash. It is not good about recovering if somebody starts copying files around and deleting files after the crash. So: no improvisation.
The last part, and one that's often completely neglected in various planning, is people planning. Which is: you, or whoever is going to be responding to the unexpected event, needs to know who to contact for various kinds of disasters and problems, because it's going to be effectively some kind of random person who initially notices that something's wrong, rather than the first responder being the person who is the most skilled in the area where the disaster actually occurred. So you want to have the basic sort of contact list that says: here's the on-call staff, here's how to contact the on-call sysadmins, here's how to contact the DBAs, here are the database consultants, the network consultants, the storage consultants, and here's how you contact them: the emergency lines, the daytime service contact numbers, etc. And if you have vendors that you have 24-hour support from, how to contact them. And if stuff has to be authorized because it might cost money, how you can get those authorizations in an emergency. We have had cases where we have been contacted by a client on a Sunday because the database services were down, and we're like: well, we would love to help you, but our contract with you says that we're not allowed any more hours without the explicit authorization of your boss, so we can't help you until you can get your boss to send a message saying it's OK. And, you know, that's a horrible situation for you to be in, and a horrible situation for your consultant to be in. So have that authorization information available. Now, most companies have all this information, but what they don't have is having it in one place and easily accessible. In a really well-organized shop that I had been in, we had it both on the company wiki and printed out on paper on a clipboard in the IT office, or next to the server room, because you know what can happen: in the event of a network outage, the company wiki goes down. We actually had that happen with a client. We asked: you had the disaster recovery number, you have an unlimited contract with us, why didn't you just call us when this happened? "Well, we couldn't find your contact information, because the first thing that went down was the wiki."
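One way to keep the contact list in a single place, and still get a paper copy for the clipboard, is to hold it as data and render it to plain text. The sketch below is only an illustration: the `Contact` fields, names, and numbers are all invented placeholders, not a recommended schema.

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    role: str                       # e.g. "On-call sysadmin", "Database consultant"
    name: str
    methods: list = field(default_factory=list)   # (label, value) pairs
    can_authorize_spend: bool = False              # may approve paid emergency work

def printable_sheet(contacts):
    """Render the list as plain text suitable for printing and posting."""
    lines = []
    for c in contacts:
        flag = " [can authorize paid work]" if c.can_authorize_spend else ""
        lines.append(f"{c.role}: {c.name}{flag}")
        for label, value in c.methods:
            lines.append(f"    {label}: {value}")
    return "\n".join(lines)

def authorizers(contacts):
    """Who can say yes when emergency help costs money."""
    return [c.name for c in contacts if c.can_authorize_spend]

# Placeholder entries for illustration.
contacts = [
    Contact("Database consultant", "Example Consulting",
            [("emergency line", "555-0100"), ("daytime", "555-0199")]),
    Contact("IT manager", "Pat Example", [("home phone", "555-0101")],
            can_authorize_spend=True),
]
print(printable_sheet(contacts))
```

The same source file can feed both the wiki page and the printout, so the paper copy next to the server room does not drift from the online one.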
So you want to include as much contact information as possible in the list, since it's very hard to keep this information up to date. You know, you put it in a version-control system or whatever; you'd like it to be up to date, but stuff will still be out of date. If you actually have a website and a couple of phone numbers for a consultant, for example, one of those is more likely to work than if you only have one of them. But you do have to collect them all in one place, and do something to keep it up to date. So the final thing is that you really need to test your disaster recovery procedures. The minimum is, when you create the procedure, do a run-through and test it. Better than that is to have some sort of regular schedule on which you actually do disaster recovery test drills or whatever. Again, for most enterprises and shops, four times a year is perfectly adequate. There are places where doing it monthly or weekly makes sense. Some organizations, for example, do weekly disaster recovery drills, and there's a separate team who comes up with random stuff to throw at the operations people. What I find actually works best is to make disaster recovery part of another process that you have to do all the time. The most common way we do this is that we use the disaster recovery mechanism to provision staging servers. That way we are doing a recovery every week, and if we don't do the recovery or the recovery fails, the QA staff complain immediately because they don't have a fresh snapshot to test on. So that's actually the best, because we always know that that part of disaster recovery is working.
Most of all, your general rule is: if you haven't tested any aspect of the disaster recovery, assume it doesn't work.
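That rule is easy to enforce mechanically if the date of the last successful test is recorded for each procedure. A minimal sketch, assuming a hypothetical mapping of procedure names to last-tested dates and a 90-day limit chosen to roughly match testing four times a year:

```python
from datetime import date, timedelta

def untrusted_procedures(last_tested, today, max_age=timedelta(days=90)):
    """Return procedures that were never tested or were tested too long ago.

    Anything on this list should be assumed not to work.
    """
    return sorted(
        name for name, tested in last_tested.items()
        if tested is None or today - tested > max_age
    )

# Illustrative records only.
records = {
    "replica failover": date(2014, 5, 1),
    "PITR restore": date(2013, 11, 1),
    "pg_dump restore": None,          # never tested
}
print(untrusted_procedures(records, today=date(2014, 6, 1)))
# ['PITR restore', 'pg_dump restore']
```

Run from a scheduled job, a check like this turns "we meant to run a drill" into a visible list of procedures nobody should rely on.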
Just to finish up, a couple of notes on the cloud. What about disaster recovery in the cloud? Does it really change things?
A lot of people seem to have this weird assumption that the cloud is somehow magically super redundant, right? It's a cloud, it's automatically redundant, I don't have to do anything, right?
Well, not exactly. It's the cloud, and I will tell you that, barring an international catastrophic failure of the global electrical grid, AWS will be up. But that doesn't mean that your server instance will be up. As a matter of fact, with most cloud services, one of the reasons why they're cheaper by the hour than buying a server farm is that they don't have a lot of guarantees of reliability.
So when you go with cloud hosting, you come up with some additional issues that you don't really have on standard, you know, your own servers. Instance failure becomes actually a lot more frequent an occurrence than physical server failure would be for physical servers, because you're basically combining the kinds of failures you can have with hardware with the kinds you can have with virtualization systems at scale. The second thing is resource overcommit, where it's not that your instance is down, but that the physical platform is overwhelmed by traffic from someone else, and that is making your system unavailable. Then zone failures, which people on Amazon have experienced. And the last one here is that when people start using cloud services they tend to do a lot of DevOps automation, and any time you have administration and maintenance at scale, you also have administrative mistakes at scale. There are also some tools that the cloud brings to you for disaster recovery. One is that there are a lot of redundant services you can take advantage of that can help you achieve redundancy. A second one is the ability to do rapid server deployment in order to assist in the replacement section of your recovery plan. And replicas for doing replication for disaster recovery tend to be a lot cheaper than they would be for standard servers.
But other than that, disaster recovery planning for the cloud is very much like disaster recovery for other kinds of hosting.
A few notes on backups on AWS. This is the one I'm going to be specific about, because we do a lot of consulting on Amazon. What you generally want to do is back up to both an EBS volume and S3. The advantage of EBS is that you can do a very fast failover and provisioning of a new server from it, as in minutes, whereas S3 takes a lot longer to restore from. But S3 is redundant and guaranteed durable. The S3 storage even survived the famous Amazon outage, although you couldn't access it for quite a while.
One of the things I recommend here is to plan some disaster recovery around rapid deployment. You do continuous backup to S3, you have AMIs and Puppet or whatever scripts to do immediate standing up of new database servers, and then you can recover from a disaster, even to another zone, without necessarily having to keep a bunch of extra instances running all the time.
So, other dos and don'ts: have multiple copies of your plan in multiple locations. And I've said it before and I'll say it again: a SAN is not a disaster recovery solution. As a matter of fact, the SAN is generally the cause of a lot of disaster recovery exercises. So, questions, anyone?
I covered absolutely everything, right? Yes? [Audience question and answer, inaudible.] So, any other questions? [Audience question, inaudible.] Another thing is that the tools for automating procedures have gotten a lot better. From my perspective that's a good thing, right? If you have a single command that does provisioning of a new database server and restoring from backup, that's a good thing. But you still need the procedure for how you determine that the original database server is down, because if you stand up a brand new database server when there was nothing wrong with the original database server, and a bunch of the old services are still connected to the old database server, then that becomes a bad thing. So one of the dangers of automation is that you want to avoid having your disaster recovery procedures become a source of disaster on their own, and the more complex the tooling is, the more likely that is to happen. In a lot of cases that points to the need to have more drills that say: OK, we have a single-command script now, let's test that it actually does what we think it does. Which is even more critical when that single-command script is actually doing a lot of complicated stuff. But on the whole that's a good thing, because otherwise what you're faced with is a multi-page written procedure for how to stand up a new database server, and it is much nicer to say: here's how to stand up a new database server using Ansible, you know, this is the command, and then here's how to determine whether it's working, and if it didn't work then see page 34. Other questions? Back here. [Audience question, inaudible.] That's another topic; I did a talk on failover in New York. If you have a stateful service, which a database server inherently is, then you want to make sure that the old master is down and inaccessible before you stand up a new database server. There are a few ways to do that. One is to actually take extra steps to physically shut it down, to kill it. The problem is that a lot of disasters that would prevent you from reaching the server would also prevent you from doing anything meaningful about shutting it down. The second thing is to have a way to isolate that server on the network, that is, to make sure that even if it comes back up, nothing can connect to it. So it's terminate or isolate. There was a question back here; someone raised their hand. [Audience question, inaudible.] So one of the things that I like doing, like I said, is figuring out a way to roll that disaster recovery into another procedure. The way that we end up testing failover for clients is that we use the failover as part of an upgrade. We need to apply kernel updates to the servers, which happens at least a couple of times a year, right? So what we do is we apply the kernel update to the replica, we restart the replica, and if the replica seems OK, then we fail over and make the replica the new master. And that way we've also tested failover.
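The "terminate or isolate" rule from the failover answer above can be made a hard precondition of promotion. The following is a sketch only: `promote_when_fenced`, `NotFenced`, and the three callables are hypothetical names, and how you actually confirm a shutdown or network isolation depends entirely on your environment.

```python
class NotFenced(Exception):
    """Raised when the old master might still be reachable by applications."""

def promote_when_fenced(confirm_shut_down, confirm_isolated, promote):
    """Promote the replica only if the old master is confirmed down or isolated.

    Each argument is a zero-argument callable supplied by the caller; the
    two confirm_* callables return True only on positive confirmation.
    """
    if not (confirm_shut_down() or confirm_isolated()):
        raise NotFenced("old master is neither shut down nor isolated")
    return promote()

# Illustrative use with stand-ins for the real checks.
result = promote_when_fenced(
    confirm_shut_down=lambda: False,   # could not reach the host to power it off
    confirm_isolated=lambda: True,     # but it has been cut off from the network
    promote=lambda: "promoted",
)
print(result)  # promoted
```

Refusing to promote without confirmation is deliberate: an unreachable old master that later comes back with clients still attached is exactly the self-inflicted disaster the talk warns about.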
[Audience question, inaudible.] Yes, so you could do it more often than that, but, you know, the interruption has to be acceptable, and we're generally trying to avoid doing it more often. The problem is: can you test the full procedure without having some kind of service interruption? Well, no, because the full procedure involves things like failing over the application connections and that sort of thing. But if you want to test just basic database failover in isolation, one of the things you can do is stand up a staging application, do the replica promotion, and point it at the replica. But really, if you want to test the full procedure, then what you have to do is say: OK, every Thursday night at 11 o'clock we're going to have this 30-second service outage where we do a failover. And that's only useful if you do the failover and the former replica becomes the new master, and then you reverse that the next time. That's assuming the failover procedure. It becomes much harder if the procedure you want to test involves a substantial amount of downtime, like recovery from point-in-time recovery. In that case you're not going to test the full procedure. What you need to do is recover from the point-in-time recovery archives to the new deployment server, and then point, say, the staging application at the new deployment server to make sure that it's working, but not take it any further than that, because taking it any further would require a substantial, hours-long downtime on the real services. [Audience comment, inaudible.] Yeah. I haven't thought a lot about doing more frequent disaster recovery drills, because generally I'm in the position of trying to get people to test their disaster recovery at all, and so doing it twice a year would be a big step forward for most of our clients.
One of the other things you can do, if you're testing that recovery a lot, is to also have the same disaster recovery procedure available for your staging setup, and then you can run drills on staging, which allows you to at least catch things that would be misconfigured and that sort of thing. It won't catch things like the RAID card battery being dead, but it does allow you to catch, you know, "oh, somebody introduced a bug in the Python failover script". [Audience comment, inaudible.] Yeah, it would be nice if we actually had failback that didn't require a full resync of files. There are technical issues with that; I won't get into them right now. But the other thing is that if you're going to do a real failover, then you need to shut down the primary because of the risk of split brain. [Remainder inaudible.]

