Datacenter Fires and Other "Minor" Disasters

Video in TIB AV-Portal: Datacenter Fires and Other "Minor" Disasters

Formal Metadata

Datacenter Fires and Other "Minor" Disasters
Title of Series
Number of Parts
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
Confreaks, LLC
Production Place

Content Metadata

Subject Area
Most of us have a "that day I broke the internet" story. Some are amusing and some are disastrous but all of these stories change how we operate going forward. I'll share the amusing stories behind why I always take a database backup, why feature flags are important, the importance of automation, and how having a team with varied backgrounds can save the day. Along the way I'll talk about a data center fire, deleting a production database, and accidentally setting up a DDOS attack against our own site. I hope that by learning from my mistakes you won't have to make them yourself.
Datacenter Fires and Other Minor Disasters. I like it when people tweet at me during my talks; I am thagomizer on the Twitters. I submitted this talk to a track entitled War Stories, and that track ended up not happening, but I
know we've all had that day where we broke the internet or everything went wrong. Lots of folks seem to have a big red button story. I was at a DevOps meetup in Seattle and I was like, "Tell me all your war stories," and the number of stories that started, "So this one time the CEO came to the data center and asked what that big red button was for..." And then the story progresses.
Eventually the entire data center goes up in a pile of flames. So my favorite war stories start,
"So, do you know how the fire suppression system is supposed to work? Because I do." I'm going to share with you today some of the boneheaded things I've done and some of the really odd situations that I've been in, and how they've changed how I write software and how I build the teams that I work on. So let's all gather around the virtual campfire, because it's story time. So, once upon a time, at midnight,
I was doing a release for a startup I worked at. We did our releases at midnight because we couldn't do migrations and keep the site online, so we'd take the site down, do the release, and then bring it back up. This happened to be the first time that I was doing the release completely solo, because my secondary on call was on the Trans-Siberian Railway somewhere in Russia or Mongolia, and I'm not even kidding about that. So he was completely unavailable to me, and I was pretty
junior at this point; I think I'd only been in the industry for about three or four years. But luckily my backup, the person who led our team, was a military man, former military, and he believed in being highly organized. So he had left me with a thirty-plus-item checklist that was basically the pre-flight checklist for a release of our product. Every single step had to be done in order. You'd print out the
checklist and check things off as you did them. But every single step had to be done manually, and it turns out that takes a long time. Because the first thing you do is tell the team that you're going to start the release. Then you put up the maintenance pages, and you notify the team that the site is now down. Then you put the new code on the server, then you run the database migrations, then you restart the servers, then you start manually testing. Then you tell the team you're going to bring the site up, and you bring the site up, and then you manually test again. Now you tell the team that the release has been completed, and then you watch it for fifteen minutes to make sure it doesn't blow up. So I got through that process, and I get to the step where you do the initial set of manual testing, and I go onto the page and I see this. So something was wrong. It wasn't all pages: some were throwing the standard Rails 500 page, some assets
were missing, some of them were rendering in really odd ways. It took me a couple of minutes, and I looked at the logs, and I realized: oh, oh no, I pushed master. I didn't push the release branch. Which, you know, would have been fine, because I could have just pushed the release branch on top of that; it would have taken five extra minutes and wouldn't have been a big deal. Except I had run the database migrations. So I had effectively corrupted our production database at 1 a.m., on my very first solo release, while my backup was in Mongolia.
So at this point I started freaking out. I started messaging some of my friends from the Seattle area: "Oh my God, oh my God, oh my God, oh my God, what do I do?" And they're like, "You know this. You have the skills. You've done this before. You know how to fix this. So take a deep breath." And I remembered that one of those thirty-plus pre-flight checklist steps was to take a database backup, so I had taken a database backup. And luckily, the other thing that happened was, because I was working as a QA engineer at the time, one of the things I did on a regular basis as part of my day-to-day job
was restoring a backup to our staging server. So I knew exactly how to restore the backup; I could type those commands in my sleep. So I'm like, well, if it works on staging, it'll work on production. I started the restore, and that took about forty-five minutes, so I had a good period of time to get my blood pressure down and my pulse rate down, and to pace the circle that was my apartment at that time: the living room, the kitchen, the living room, the kitchen, while I hoped that everything would work out. Then I pushed the correct branch and ran the correct set of migrations, brought everything back up, it was fine, and I emailed the team saying the release was successful.
And I got an email back from my boss. He's like, "We were down a lot longer than we should have been. What happened?" And I'm like, "So here's what happened, here's what I did," and he's like, "We're going to talk about that in the morning." But I was actually feeling pretty good. As part of my pacing, I remembered that while this was the only time I had made that particular error, I had worked at several other companies where someone else had made that particular error, including my very first release at a relatively large software company at my very first job out of college. We were down for two and a half hours longer than we intended to be, or some period of time like that, because someone else had made almost exactly the same error on that release.
So my lesson learned: automate everything. People make stupid mistakes at midnight. I make stupid mistakes at midnight. And I wouldn't have made that error if that lovely checklist had just been an automated Bash script, and it would have been easy to do that.
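The lesson above (turn the manual checklist into a script) can be sketched roughly as follows. This is a minimal illustration in Python, not the speaker's actual tooling: the `Release` class, the step names, and the `release` branch name are all hypothetical, and each step body is a placeholder where real commands would go. It shows two ideas from the story: the steps always run in the same order, and the script refuses to start if the wrong branch is about to be deployed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Release:
    """Runs release steps in a fixed order and records what happened."""
    branch: str
    expected_branch: str = "release"  # hypothetical branch name
    log: List[str] = field(default_factory=list)

    def run(self, steps: List[Callable[["Release"], None]]) -> bool:
        """Run each step in order; stop at the first failure."""
        for step in steps:
            try:
                step(self)
            except Exception as exc:
                self.log.append(f"FAILED {step.__name__}: {exc}")
                return False
            self.log.append(f"ok {step.__name__}")
        return True


def check_branch(release: Release) -> None:
    # Guard against the "I pushed master" mistake before touching anything.
    if release.branch != release.expected_branch:
        raise ValueError(f"refusing to deploy branch {release.branch!r}")


# Placeholders: a real script would run the actual commands here.
def maintenance_page_up(release: Release) -> None: pass
def backup_database(release: Release) -> None: pass
def deploy_code(release: Release) -> None: pass
def run_migrations(release: Release) -> None: pass
def restart_servers(release: Release) -> None: pass
def maintenance_page_down(release: Release) -> None: pass


# The backup is taken before the migrations, as on the original checklist.
STEPS = [
    check_branch,
    maintenance_page_up,
    backup_database,
    deploy_code,
    run_migrations,
    restart_servers,
    maintenance_page_down,
]
```

With this shape, `Release(branch="master").run(STEPS)` stops at the first step and records the failure before the site is taken down or any migration runs.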
And the other thing is, if you're going to automate your releases, you should automate your rollback as well. That probably means you need to write your rollback migrations, or down migrations in Rails, because we didn't do that; why would we ever roll back? The other thing is: always have a backup. If I hadn't had a backup that night, I would have been having to write code on the fly, and technically I was a QA at that point, to undo the migrations that we had just done. So having a backup is totally useful.
So the title of this talk is about data center fires, so here's a data center fire. Once upon a time, someone sent a message to our pager: customers were having a hard time going through the checkout process.
We tried it out and we got the error that said, "We're sorry, we have been unable to charge your credit card. Please contact your credit card provider." And we happened to know enough to know that that was the error we showed no matter what error we got back from the credit card processor. So we dig into the logs, and we see that our site is getting a timeout trying to process cards. So our assumption is that we broke something. Someone says, "Well, why don't we try processing cards manually?" Our provider had a website where we could deal with cards if someone called on the phone and we took a phone order. So we did that, and that's the page that showed up.
A timeout. Well, that's not good. So someone's like, "Let's call the provider."
And we were put on hold. And a couple of minutes later the call disconnects. It was like, well, that's really not good. And it happens to be, through a series of odd events, that the provider we were using was across the street from the company I was working at. They were on approximately the same floor of two high-rises in Bellevue, Washington, and if we squinted and tilted our heads just ever so slightly, we could see into their network operations center.
And so we kind of smashed our faces against the window and looked, and we see a lot of red, and we're like, that's really not good. Someone has the brilliant idea to check the news, and we find out that a relatively large data center and news facility in Seattle, Washington, had had a fire, and nineteen fire vehicles had
responded. When there was the fire, the power had gone off and the generators kicked in, as they're supposed to do. Then the fire department shows up and is like, "It's an electrical fire; you don't get the generators, sorry," and turned the generators back off, and the entire facility had gone offline, taking with it two radio stations, a television station, and four or five colos in the building, including the one that our credit card processor was in. So here's an actual picture of the damage from the fire that I was able to find from the news sources in Seattle. And luckily for us, once we figured out what the issue was, it was pretty easy for us to fix it, because due to some experiences I had had at a previous job, I had insisted that we have a way to turn off the store, so that we could turn off
all credit card processing, all renewing of subscriptions, all free trials with credit cards, everything, and the site would keep running and everyone would keep having a good experience. We basically put all accounts into free-play mode while the store was off. Which was great, because this particular fire happened over a holiday weekend; it happened on July 3rd, I believe. And while the fire department got the fire out and they started bringing back some of the facilities, the fire inspector had to check every single colo, every single duct, every single connection before given sections of the facility could be brought back online, and our credit card processor had to bring all of their infrastructure back up, as they were only in one data center. So we ended up not having credit card processing for four days. And if we hadn't built the system in such a way that renewals were going to be fine, free trials were going to be fine, everything was going to be fine without the store running, we would have been in a really big world of hurt.
So this cemented my strong belief that you should make sure that you can gracefully fall back if any of your external dependencies fail. You should definitely have a way to activate that fallback without having to redeploy your system completely. We happened to have a console page along with an admin account.
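A store kill switch of the kind described here could look something like the sketch below. It is an illustration under stated assumptions, not the system from the talk: `FlagStore`, `Account`, `renew`, the `store_enabled` flag name, and the 30-day renewal period are all hypothetical. The point it shows is that the flag is read at runtime, so flipping it needs no redeploy, and that renewals fall back to a free-play state instead of failing when the store is off.

```python
from dataclasses import dataclass


class FlagStore:
    """In-memory stand-in for wherever flags live (a database row, a config service)."""

    def __init__(self):
        self._flags = {"store_enabled": True}

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def set(self, name: str, value: bool) -> None:
        self._flags[name] = bool(value)


@dataclass
class Account:
    name: str
    paid_through_day: int
    free_play: bool = False


def renew(account: Account, today: int, flags: FlagStore, charge) -> str:
    """Renew an account, falling back to free play when the store is off."""
    if account.paid_through_day >= today:
        return "active"
    if not flags.is_enabled("store_enabled"):
        # Store is switched off: never call the payment processor.
        account.free_play = True
        return "free_play"
    if charge(account):
        account.paid_through_day = today + 30  # assumed renewal period
        account.free_play = False
        return "renewed"
    return "payment_failed"
```

An admin page would only need to call `flags.set("store_enabled", False)`; every later call to `renew` then skips the processor entirely.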
You went to a specific URL, there's a checkbox, "turn off store," hit submit, everything was fine, everything just picked it up. That was good, and it meant that solving the problem was fast. But it also meant that a couple of weeks later someone accidentally clicked that box, because they thought they were on staging and they were testing something, and we ran without credit cards for a couple of days. Due to that particular incident, we added some really obnoxious colors to the admin console on production, so that you could not miss the fact that you were on production.
And so, the title of this talk is data center fires, plural, so here's another fire story I have. I'm talking about the same company. When we had the first fire, with the credit card processor, we took it as a chance to check and see if our systems were hardened appropriately, if we were protected correctly against eventualities like this. We decided we should upgrade. We were in a great facility, but we wanted one that had a little bit better of a track record. So we moved to this really nice mom-and-pop colo. Very friendly people, and the best Christmas tree I have ever seen: it had RAM sticks on it, and peppermint sticks. It was great.
And they sent us an email saying that about a month in the future they were going to have to do some mandatory preventative maintenance on the UPSes and the power conditioners, basically all the equipment that takes line power and puts it into some form that will actually work for all the electronics that are running in the colo.
And we're like, great, we trust you guys, it's all good. And the appointed time came, and they sent us an email announcing that they were on generator power. We had not noticed any significant blip, and we had made sure that we did have battery backups in our rack in case there was a momentary issue. And things ran great for about two hours, and then suddenly we went down hard. And the colo sent us an email saying that there had been an incident during maintenance, all power to our section of the facility had been cut off, and all personnel had been evacuated,
but they would try to get our rack and the rest of the racks online within the next hour. And if you know anything about how colocation facilities or data centers work, if the words "incident" and "all personnel have been evacuated" are in the same sentence, it means something caught fire and the halon was activated. Because when you use a halon fire suppression system, everyone has to leave for like fifteen minutes while the gas dissipates. When we had done maintenance there, moving racks, we had been told, "If you hear this alarm, you leave. We're going to have someone standing next to you the entire time you're in the facility who will drag you out if you choose not to leave." So we knew what was going on. So I look at the other person I worked with on the infrastructure side, and we're like, well, we've got to go fix something. So he grabbed a go bag and headed down to the data center; he's going to bring things back up as quickly as possible once we're let back in. And I'm like, "Hey, I can show up and help," and he's like, "No, I actually need you back here at headquarters to deal with everyone else in the company and
also to check things from this side." And I'm like, "Look, I can just go down, and you can do that, and I can help turn switches on." And he's like, "No, no, actually, stuff has to be brought back up in a specific order. You can't just turn all the switches back on and have everything magically start working again." And I'm like, "Is that order written down anywhere? Is this process documented?" And he's like, "No." And so we realized that we had a knowledge silo, a pretty significant one, because if he had been on the Trans-Siberian Railway when this particular incident happened, stuff would not have gone very well.
So we got everything back up; we were only down for about an hour and a half. But the colo wasn't nearly as fortunate. They sent us a couple of pictures about three hours later of the parts that had caught fire. This is actually a picture of a fried circuit board; it's kind of impressive. I don't have the original picture
here, but the comment attached to the picture was along the lines of "we've never seen anything like this before." And they ended up having to run on diesel for eleven days before they were able to get the parts that had fried in stock and come up with a process, a procedure, for replacing this part, because this was not a part that you were supposed to have to replace; it wasn't supposed to happen this way. And we used this as a wake-up call. We rearchitected our rack, because we had all the databases and all the switches running on one battery and all the servers running on another, and we realized that it made more sense to split the functionality across the batteries, so we could bring up half the rack and have the site up, and then deal with the other half of the rack, as opposed to having to bring everything up to have a valid site. And we also decided that we shouldn't have any silos. We talk about pair programming, and making sure everyone knows stuff; we talk about doing code reviews. But what we need to do
is focus on things like infrastructure pairing and deployment pairing. Everyone needs to know how to do everything. And maybe not everyone needs to know everything, but you need to have at least n plus one; you need to get your bus number greater than one. And so we started pairing on the hardware, we started pairing on the infrastructure. I got to come down and help do that rewiring of the cabinet and move things around, so that I knew how stuff worked. And I got the second set of keys to our rack, so that there were two of us who could get in in case something went wrong.
And the other important lesson I learned was that you need to have a disaster recovery plan in place, and you need to practice it. We were down longer than we should have been because we hadn't practiced bringing the site back up from completely down. We had done it before we moved colos, but the site had gotten more complicated and the hardware had gotten more complicated since then.
And finally: I use the cloud now. I want things catching on fire to be someone else's problem. And I like to work with cloud providers who will proactively move your workload out of a section of the data center that's going to have maintenance. And I happen to work for one of those; the way we handled the most recent hurricane scare on the East Coast was really awesome, in that we took care to make sure that no one was going to have problems. Also, multiple regions: have your stuff in more than one place. That's great.
So this story isn't one of mine, but it starts with the phrase "once upon a time, in Japan." So I work on a team with a lot of the developer advocates at Google, and we have this interesting demo we like to take places. It's called Cloud Spin, and this is what it looks like. It kind of uses a bunch of phones in a big metal pile of tubing and some stuff to
take pictures and then stitch them together, Matrix-style. It's really popular; we take it to all sorts of events, and we were taking it to an event in Japan. But if you've ever tried to take significant numbers of electronics across an international border, it raises eyebrows, and many times it's just not worth it. So what we choose to do instead is, as much as possible,
buy the gear locally. So we said to some folks in the Japanese office, "Go buy thirty of this particular phone. Here's the model number, here's the name. We'll bring the parts that are custom, the big rack, but you need to go buy the phones for us." And that was great. And so we got there the night before, and the team setting this up goes to plug stuff in, and they realize that the phone with this model number and this model name in Japan has an entirely different set of connectors than it has in the US,
and the stuff they brought doesn't work with the phones they now have. So we're in Tokyo.
You can buy electronics in Tokyo; this is not actually a surprise. So they head out, going store to store, like, "We'll just find an adapter between the US version and the Japanese version, we'll buy thirty of them, it'll be fine." But three hours later they haven't found any. So, team meeting: do we just cancel the demo? And someone's like, "No, I know how
to solder." So the team ended up staying up late that night soldering, trying to get the connectors all connected, and they got it done, and the demo went really well. And I learned a lot of lessons from the teammate who told me this story, one of which is: details matter. Don't make any assumptions. They made
the assumption that the same phone with the same name would have the same type of connector in different countries. It seems like a safe assumption, but it wasn't, and that detail was important. They should have had someone take a picture and send it to them before they got all the gear over to Japan. Also, have people on your team with diverse interests and hobbies, so that you have someone who can save the day. My team is really cool: lots of very odd and interesting people, including several people who are very into quadcopters, electronics, and electronic music. And so the question wasn't "Who can solder?" It was "Who's the best at soldering?" And they set up an assembly line with two people on soldering irons, which they could totally buy in Tokyo, and three more people setting up things with the wires all laid out exactly how they needed to be laid out for the people who were soldering.
So my last story is that sometimes you're your own worst enemy. So once upon a time I was working on a client-server web application that was specifically using WebSockets. And I'm going to fudge some details; like many of my stories, most of this story is true, but some details have been obscured to protect the innocent. But it's important for this one that you know that this particular WebSocket application needed between thirty and sixty frames a second, or messages a second; otherwise it would try to reconnect, or the experience would significantly degrade. And that was because we were doing animation with WebSockets. And there's a whole pile of lessons learned here that basically start with the phrase "don't," but we were.
And it was all working; we were in beta, we had some customers, it was great. But one day we're having a weekly retrospective, talking about what went well and what went poorly the week before, and we get the Spidey sense that something's going on. We check the site and stuff's not looking right. So two of us take our beers, go back to our desks, and try to dig into it. And we see that traffic and memory usage were spinning wildly out of control, and the servers were starting to shut down and restart. We're like, well, that's not good.
So we spent about thirty minutes debugging, and eventually we gave in and hit the big red button: shut everything down, bring everything back up. And things are back to normal. We spent the rest of that week adding some additional logging, so that if it ever happened again we could figure out what it was. It was a product where we didn't actually own all the client software, and stuff like this happens occasionally; it wasn't a big deal.
And so I went on vacation, and while I was on vacation the same thing happened, except this time, after restarting, everything fell over again, and again, and again. And the system did end up coming back up safely, but only toward the end of the day, when traffic would have lowered anyway.
And I came back, and the team had spent about three days debugging. And I brought in a different set of experiences. They had been trying to load the logs, several gigabytes of logs, into a log viewer to do this, and I started trying to text-process them and make a timeline of what happened, to draw a visualization: we got this request, then we got these requests, then we got these requests. And between me having fresh eyes, not having dealt with the emergency, and them having had some time to eliminate all the other possible things that could have gone wrong, we realized that what had happened was that a malformed socket message had been saved to the database and caused the server to go into a bad state. As a result of the server going into a bad state, the clients didn't get the frame rate or the number of calls they expected, and they got disconnected. So they tried to reconnect, but they couldn't reconnect to the server they'd just been connected to, because it was in a bad state. So they kept trying to reconnect, harder and harder and harder, and eventually they reconnected to a different server, and the server would be like, "Oh great, you're
connected. Here are all the messages you missed while you were offline." And it would resend the bad message, which would take that server down, which would then take all the clients that were connected to that server down with it. And then they would try to reconnect, and reconnect, and reconnect harder. The short version is: we DDoSed ourselves by trying to keep the connection to the server alive. And my notes originally said, "I'm sure others have done similar things." But I know that others have done similar things. Who has actually read the public postmortem from the DNS outage a few weeks ago? [unclear] So if you read that, you'll notice that one of the things that made it worse than it already was was the way DNS works: if you can't reach the DNS server you're trying to connect to, you go ask your friends, "Hey, can you reach this guy?", which exponentially increases traffic. So in addition to all the malicious traffic, there was an exponential increase in valid, legitimate traffic that they were also dealing with. So the very way the DNS protocol is written actually made the situation worse, and effectively made the DDoS that was being perpetrated against them even worse. And the moral of that story is that you're often your own worst enemy.
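One common mitigation for the kind of reconnect storm described above is for clients to wait longer between attempts and to randomize the wait so they do not all retry at once. The sketch below shows capped exponential backoff with full jitter in Python. It is a generic illustration, not the client code from the talk: the function names, the `base` and `cap` values, and the attempt count are assumptions chosen for the example.

```python
import random


def backoff_delays(base=0.5, cap=60.0, attempts=8, rng=None):
    """Yield reconnect delays using capped exponential backoff with full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay between zero and the ceiling.
        yield rng.uniform(0.0, ceiling)


def reconnect(connect, sleep, **kwargs):
    """Call connect() until it succeeds or attempts run out.

    Returns True on success, False if every attempt failed.
    """
    for delay in backoff_delays(**kwargs):
        if connect():
            return True
        sleep(delay)
    return False
```

In real use, `sleep` would be `time.sleep` and `connect` would open the socket; passing them in keeps the retry policy testable without waiting or touching the network. Giving up after a bounded number of attempts also means a broken server is not hammered forever.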
So make sure, when you're designing your system, that you think about all the ways you can break things. Think about all the ways your own code can take down your system, and then harden against them. Also, incremental backoff is a fantastic thing.
So I've got a couple of minutes left, and what I really wanted to emphasize in doing this talk is that we've all messed up. I'm sure if I asked people in this room to raise their hand if they've ever broken their site, or taken down the internet, or let the blue smoke out of their computer, they could raise their hands. I know that my claim to fame is that I destroyed a Mac [unclear] after it was signed, by the way. But what saves us? Lots of things save us when things go wrong. One of them is trust. I couldn't have gotten through these situations if I didn't trust my coworkers and trust my tools. And if you can't do those things, you need to fix it yourself, maybe by learning your tools, or you need to fix your situation so you have different coworkers.
What also saves us is learning from our bad experiences. The reason I insisted we be able to turn the store off was that I had seen an external dependency going sideways cause an issue at a previous company I worked at. The reason that we had a really great checklist for the release that included taking that backup was that my coworker knew from his military training that you needed to think through all the possibilities and write everything out. We learn from our experiences, and we learn from other people's experiences. So I hope that you all take something out of this talk, whether that's incremental backoff, or having a backup, or automating everything you can.
You also need to be able to communicate with the people you work with, and to communicate clearly and honestly. You need to be able to say, "I messed up. Something went wrong here. This is what I'm seeing," and know that they're not going to freak out, not going to blame you. And you need to have group ownership.
We don't want silos. So if you have a new person on the team, drag them along. Bring them along on the field trip to the data center. Have them sit over your shoulder when you do the release. The next time, they do the release and you sit over their shoulder, so that everyone knows how to do these things, so everyone can help out, so that you have a high bus number. And everything I'm talking about is stuff that comes up in postmortems. A show of hands: who's been involved in a postmortem at their job? So, a couple. I work at Google now, and part of my job working in DevRel, working in developer advocacy at Google, is that I get to hang out with the SRE team. And they're fun. They have some great stories that I can't share with you, which is sad for me. But one of the things that a lot of the SREs, especially the really senior ones that I've talked to, believe strongly in is the idea of blameless postmortems. And here's a quote from the SRE book that came out six months or a year ago: "A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had."
And that's that trust piece. "If a culture of finger pointing and shaming individuals or teams for doing the wrong thing prevails, people will not bring issues to light for fear of punishment." And that's the communication part. And I've had the joy of seeing this in action. There have been a couple of incidents that I've been, not involved in, but have had visibility into, where someone clearly did something boneheaded. But they didn't do it because they were stupid. They didn't do it because they were dumb. They did it because the system was set up in such a way that it let them do that. And so in the postmortem, which happens after the incident is done and everyone's had a chance to step back and feel better about things, the question isn't "Why are you such an idiot?" The discussion is "How can we make it so no one else does that same boneheaded thing? We can totally understand why you did that. We can totally understand that it made perfect sense in the situation you were in." And that's great. And that culture, on the couple of teams where I've been involved, of "we're just going to assume that everyone had the best intentions, we're not going to look for reasons to fire people," means that we learn, and we get better, and our systems get better, and everyone benefits, both at the company we work at and the places we go in the future.
This wasn't actually where I intended this talk to go, but it is important, because as I was reading through the stories and trying to decide, I started with a list of stories and asked, how do I tie them together? And one of the other things that came up for me is that in all of these stories, the diversity of experience and diversity of skills of the folks that I worked with were what saved the day. Whether that was having a co-worker who really liked soldering, soldering electronics together for fun and making crazy things with Raspberry Pis, so they could save the day with soldering; or whether that was having a co-worker who had been a submariner, and so therefore nothing that we were dealing with ever freaked him out, because it was not actually that important; or it was having co-workers who had a lot of experience debugging notifications that went haywire. Having that diversity of experience, having that diversity of knowledge, is what made our team stronger. And having it isn't sufficient. I know a lot of teams only go, "Yes, we have diversity, check." We must value it and we must cultivate it, which means that in a crazy meltdown situation, maybe you should listen to that person who has only been on the team for six months. Maybe they have an experience from their internship that will exactly help the situation you're in. And also, if you're the person who has an internship experience that seems to fit exactly the situation you're in, you might be wrong, and you perhaps should listen to the folks who know the system inside and out. Everyone should be listening to everybody and be open to the idea that everyone's experience is what's going to get you through the current crisis. So with that, thank you. Here's all my contact info again. I'm the thagomizer on Twitter, I'm on GitHub as thagomizer, and I blog at thagomizer.com about a mix of "here's a cool thing with technology" and culture stuff, and I work on Google Cloud Platform [unintelligible].
My primary focus is DevOps and Ruby, so if you want to run a Ruby site or figure out a way to do the DevOps thing [unintelligible] containers, I can probably help you out. And I have stickers and tiny plastic dinosaurs, because I always have stickers and tiny plastic dinosaurs. And because this talk is right before lunch and we're still a little bit early, I'll take some
questions, but I invite anyone who wants to tell me their "this one time the CEO visited the data center" story, or that [unintelligible] story, to join me and others at my table at lunch and tell me stories, because this is one of my favorite parts of conferences: hearing about all the ways that people have messed up. So, yeah. Some questions? Anyone have questions? The question was: how do you convince a team that is being [unintelligible] to adopt better engineering practices, like gradual backoff, or perhaps the ability to isolate external dependencies? So I can say
with complete confidence that I have worked on teams that did not do several of these things. A great piece of advice Ryan Davis gave me is that you get to have one complete temper tantrum at work a year, but you should pick it wisely and schedule it ahead of time, which is a fantastic way of saying pick your battles. And so I picked my battles on some of these things. I lost the fight on having a graceful fallback from an external dependency, but I did win the fight on gradual backoff. And the best way I found is to tell horror stories, and to also just be a pain, [unintelligible], just be: "No, we need to do this. No, we need to do this. I really don't want to take the page if you're not going to do this." But to do that you have to be fairly senior to really get away with it. So sometimes it's letting them feel their pain. In one of the situations where they didn't, we were working on a site and someone else had taken [unintelligible], and we sat there and watched it melt down while they didn't know how to deal with stuff, because they hadn't let us do the right thing, for a couple hours before we fixed the system. We didn't save the day right away because we wanted them to realize their pain. Sometimes you just have to let them feel their pain. [unintelligible] So, yeah. Any other questions? Okay, I didn't figure there would be many, because stories. So thank you all. Go have lunch.