Managing unmanageable complexity

Video in TIB AV-Portal: Managing unmanageable complexity

Formal Metadata

Managing unmanageable complexity
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

As systems get more complex they inevitably fail. Many of those failures are preventable. We’re not lazy, stupid, or careless. The complexity of our systems simply exceeds our cognitive abilities. Thankfully, we’re not alone. People have successfully managed complex systems long before software came along. In this session, we’ll see how surgeons, pilots, and builders have developed techniques to safely manage increasingly complex systems in life and death situations. We will learn how simple checklists improve communication, reduce preventable errors, and drive faster recovery time.
This talk starts with one of the worst days of my professional career. About five years ago I was sitting at my desk and I was not feeling particularly well. I was sick, and I had actually decided it was time to surrender, admit defeat, go home, and get some much-needed rest, when all of a sudden I saw a bunch of Campfire messages firing off (this was five years ago). Chat windows started popping up, a product manager came rushing over, and everybody was asking the same questions: what's wrong with the site, and why am I logged in as Aaron?
Aaron was the cofounder of the company that I worked at, and people definitely should not have been logged in as him. We looked and saw that we had just deployed, so we quickly rolled back, and six minutes later everything was working the way it should again. But we were a pretty high-volume site, so in the 10 or 12 minutes that this was live, hundreds of purchases were made and incorrectly applied to Aaron's account, and dozens of people added credit cards to Aaron's account. It was not good. So I walked downstairs and had a quick conversation with the chief legal counsel, I had a quick conversation with our security team, and then I walked back upstairs and got to work on figuring out how to prevent this.
So I think it's helpful to start with what actually happened. The feature that we were trying to roll out was integration with Passbook. This was right before the launch of iOS 6, and Apple had strongly implied that if we were ready with Passbook support on day one, they would feature us on the home page of the App Store. From prior experience we knew this was worth tens of thousands of installs that we wouldn't have to pay for, and lots of new users. It was a big opportunity, but also a short turnaround, and it had come about when people were already working on other things. So implementing this feature fell to a junior iOS developer, not six months removed from college. This was actually his first Rails feature.
It helps to have some understanding of how authentication worked. This is not exactly how it worked, but it is a simplified version of what authentication looked like. In the application controller there was a method that would check your auth cookie, verify that it was signed properly, and confirm that you were logged in. The problem that our junior iOS developer ran into was that he needed an account that had purchases on it so that he could test the Passbook integration. In development we would use a slimmed-down version of the production database that filtered out everyone except for employees. He looked through it and saw that Aaron had a lot of purchases, so Aaron's was a good test account, and he put this line in.
Now, I'm sure there are people in the room who are saying: that is just bad code, you should never have done that, it was dangerous. And I would actually agree. Even in development, this has the risk of going to production, and you should never write code like this. But I understand why it happened: it immediately solved the problem that he had, which was being able to log in as somebody and have a sufficient test account pretty easily. I also want to say that the team wasn't lazy, they weren't reckless, and they weren't indifferent to quality.
So you might think: well, we have tests, this would never happen to us. So did we. We actually had good unit test coverage around the authentication system. But if you look at what was added, it was a find_by_email, which returns nil when nothing matches. So unless an account with that email address existed in the test database, it would just fall through and work like it always did. That email address was not in our fixture data, so it just fell through and worked like it always did. We also had continuous integration: the test suite was fully running on a machine that was not the developer's. That actually gave us a false sense of security in this case.
Well, if you reviewed the code, if you had people doing code reviews, you would have noticed that it was there. We did. We did have somebody do a code review. We had an extremely talented, very conscientious developer actually look over the code and do a review. But this was the person's first Rails code, it was on a tight deadline, and it ended up being a thousand-line diff spread across 14 work-in-progress commits, and that one line in the thousand was missed. Stuff happens; people make mistakes.
Well, if you actually ran it, then you definitely would have seen that it was there; if you did any sort of manual testing whatsoever, you would have caught that. Well, the developer doing the review actually did run it. But again, Aaron's was actually a good test account, and, in a less dangerous way, that developer would often use Aaron's account when testing things. So he saw Aaron's name as he was testing and didn't think anything of it.
So after we had done all of the incident communication, after we had gone through our logs and were actually able to figure out where all those hundreds of purchases were supposed to go and had cleaned up all the data, I went home much later that night.
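The way this slipped past the test suite can be illustrated with a small sketch. It is written in Python rather than the Rails code from the talk, and every name in it (`User`, `find_by_email`, `current_user`, the example email addresses) is a hypothetical stand-in, not the original code. It only assumes what the talk states: a lookup that returns nothing instead of raising, and fixture data that did not contain the forced account.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class User:
    email: str


def find_by_email(users: Dict[str, User], email: str) -> Optional[User]:
    """Return the matching user, or None when nothing matches (no error raised)."""
    return users.get(email)


def current_user(users: Dict[str, User], cookie_email: str) -> Optional[User]:
    # Hypothetical leftover debugging shortcut: force one specific account.
    forced = find_by_email(users, "cofounder@example.com")
    if forced is not None:
        return forced
    # Normal path: stands in for verifying the signed cookie.
    return find_by_email(users, cookie_email)


# Test fixtures: the forced account does not exist, so the shortcut is invisible.
fixture_users = {"customer@example.com": User("customer@example.com")}

# Production-like data: the forced account exists, so everyone becomes that user.
production_users = {
    "customer@example.com": User("customer@example.com"),
    "cofounder@example.com": User("cofounder@example.com"),
}

in_tests = current_user(fixture_users, "customer@example.com")
in_production = current_user(production_users, "customer@example.com")

print(in_tests.email)       # customer@example.com
print(in_production.email)  # cofounder@example.com
```

Against the fixtures the shortcut silently falls through, so every existing authentication test keeps passing; only data that contains the forced account exposes the problem.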
And I was really struggling with how we could prevent this. I remembered a New Yorker article from a year or two before by this guy, Atul Gawande. He is a surgeon, and he had written about how surgeons handle complexity, and then he took that article and extended it into an entire book. So I bought that book and read it cover to cover that night. It's only 150 pages; it's not that long. A lot of what I want to talk about today are the things that I learned from it.
So let's start with another field that has had to deal with increasing complexity and the limits of human memory and fallibility: aviation.
This is the B-10. In 1935 this was the state of the art in the American military arsenal. It was the first all-metal, single-wing plane ever produced, and it completely revolutionized the design of large aircraft. But these were the early days of aviation and things were developing quickly, so a year after this plane was introduced, the Army Air Corps announced a competition for its successor. They wanted it to have a longer range, to be faster, and to be able to carry a larger load.
This was the hands-down favorite. That's the Boeing Model 299. It's the first plane that ever had four engines, and it was the largest land plane ever produced in the United States. The Model 299 was head and shoulders above the other competitors for this contract: it had twice the range, could carry twice the load, and was 30% faster. The Army was so excited that after the first test flight they had already entered into discussions with Boeing to purchase 65, before the competition had even been completed. So in late October, two very senior test pilots, an Army major and a senior engineer from Boeing, got in the plane at Wright airfield in Dayton, went down the runway, took off, climbed, turned sharply to the right, and crashed.
Both test pilots were killed. Boeing could not complete the evaluation and therefore legally could not win the contract. Upon investigation, what was realized was that the cause of the crash was that the pilots had failed to disengage the gust locks. Gust locks keep the flaps from swinging around when you're sitting on the runway, so that they don't get damaged. To release them, all you do is flip one switch: a pretty simple task. Now, this wasn't dropped because of a lack of expertise; again, two of the most experienced pilots in the world were flying it. It wasn't dropped because of carelessness; if there's ever a time that you'd be dialed in on the things that you need to do, it's when you're about to take off in the largest experimental aircraft that has ever been produced and your life is literally at risk. It was a simple step, but it was one of dozens of simple steps. This was the most complex plane ever produced.
This is the A-3. Just five years before the Model 299's fateful crash, this was the most advanced plane in the American arsenal. There's a lot going on here, but I can see how an expert who was trained could look at this, keep it in their head, and understand it. In contrast, this is the cockpit of the Model 299. What we're dealing with here is not a difference of degree, it's a difference of kind. The level of complexity involved in flying this plane is fundamentally different from the planes that came before.
After the crash there was a real concern that this plane was simply too difficult for people to fly. The Army was still intrigued by the capabilities of the plane, by its range, its capacity, its speed, and they figured out a way through a contracting loophole to place an order for 19, which gave Boeing time to figure out how to successfully fly it. One thing they could have tried would be to reduce the complexity and make it easier to fly. But given the state of technology at the time, all the controls were necessary. This was necessary, essential complexity, not accidental complexity; you can't remove it. Instead, what Boeing and its test pilots did was figure out that they were running into the limits of human cognition, and they produced a checklist of all the steps that need to be done before each common operation. Before you start the plane, these are the things that you need to do. This is what you need to do before you take off. This is what you need to do when you're coming in to land. With this checklist, this plane that was too complicated for two of the most expert pilots in the world to fly became manageable.
Which is a good thing, because when World War II broke out, the B-17, a long-range bomber, proved essential to the Allied campaign in Europe. Over 13,000 B-17s were produced, and they dropped 40 percent of the bombs that the US dropped in World War II. It's not a stretch to say that the B-17, and the capability to safely fly it, was instrumental in defeating Hitler.
Since the B-17, checklists have been a key part of aviation safety culture. When a US Airways flight took off from New York, flew through a flock of Canada geese that destroyed both engines, and managed to safely make an emergency landing in the Hudson River without losing a passenger, the crew did that armed with checklists on what to do when you lose engines, how to safely ditch fuel, and how to safely make a water landing and evacuate passengers. So checklists were a huge improvement in aviation security, or aviation safety.
What about another high-stakes field? Our friend Gawande is a doctor, so let's talk about medicine. This is a central line. A central line is a tube that's inserted into a large vein, often around here, so that doctors can administer medicine and fluids directly into the bloodstream. It's an extremely common procedure: in US ICUs alone, patients spend over 15 million days a year with central lines inserted. But it's also a leading cause of blood infections, which are incredibly serious. Thousands of people a year die from these infections, and they cause billions of dollars of additional costs. And those infections are preventable.
In 2001 there was a doctor at Johns Hopkins, in the ICU, who decided to try to solve this problem. Generally, he wanted to improve the level of care in the ICU, and specifically he wanted to reduce the rate of central line infections. Like I said, these are preventable, and we know how to prevent them. So he created a simple checklist, just five things. Every time a central line is inserted, doctors will first wash their hands with soap; they will clean the patient's skin with an antiseptic; they will put sterile drapes over the entire patient; they will wear a mask, hat, gown, and gloves; and they will put a sterile dressing over the insertion site. Those are pretty simple things, and you would think that at Johns Hopkins, one of the best hospitals in the world, they were already being done.
But before rolling out the checklist and making sure that these things were done, he first had nurses in the ICU spend a month observing central lines being inserted and reporting on the results. What they found was that at Hopkins, one of the best hospitals in the world, in the ICU, where the most critical patients are being cared for, in over a third of patients one of these steps was skipped. So he got together with the hospital administration, and together they empowered nurses to stop a doctor if they saw that the doctor was skipping one of these steps. They also had nurses check with doctors every day whether there were any central lines that could be removed.
In the year before this checklist was introduced, the ten-day line infection rate was around 11%. In the year after the checklist was introduced, the infection rate was 0%. The results were so good that they didn't entirely believe them, so they ended up monitoring for another 15 months. In that entire time there were only two infections. So on an annual basis, in this one ICU, they calculated that it had prevented 43 infections and 8 deaths, and that represented 2 million dollars in costs.
So we've seen two different fields where there's been a pretty massive impact from introducing checklists. But I'd be willing to bet that some people here are skeptical that this would translate to software development. I think one objection, which I somewhat agree with, is that the two examples I've given so far are largely about making sure that repetitive, rote tasks are completed, and we have a solution for that: we automate things. If deploying requires restarting Resque workers, we make that part of the deploy script, so that we're not relying on somebody to remember to do that every time. But checklists can help with more complex problems than just making sure the simple things are done. So I want to talk about one more example from medicine, which is surgery.
Now, I have a healthy pride in, and fear of, the complexity that we deal with in what we build. When you think of all the systems that need to work, all the machines, the networks, all the stuff around doing something relatively simple like placing an order from an e-commerce site, it's sort of a miracle that anything ever works at all. But I think we have to acknowledge that there is nothing that we do that is anywhere near as complex as cutting open a living, breathing human being, going inside, and fixing something. Surgery makes what we do look trivial. And it's incredibly varied: there are thousands of commonly performed surgical procedures, every patient is different, and every team is different. Tiny errors can have a massive impact. A scalpel half a centimeter off course, an anesthetic drug administered five minutes too early or five minutes too late, or one of hundreds of surgical sponges misplaced and left in a body cavity can lead to literal life-and-death consequences.
So in 2006, the World Health Organization came to Dr. Gawande and asked him for help. They had found that the rate of surgery had skyrocketed across the world: there were over 230 million major surgical operations performed in 2003. But the rate of safety hadn't increased along with it. We don't have perfect statistics, but the best estimates say that somewhere around 17 percent of those 230 million surgical operations had some sort of major complication. Gawande was tasked with leading a WHO working group to generate recommendations for an intervention: what is an intervention that we could do that would improve the standard of surgical care globally? That is an incredibly difficult task. Again, there are thousands of different procedures being performed, and the conditions that they're performed in are wildly different. What they eventually landed on was coming up with a general surgery checklist.
Here's the checklist that they produced. For all of this, the document is simple: it's 19 steps that fit on one piece of paper, and it takes about two minutes to run through. But it has the potential to improve safety in all of those 230 million surgeries. The first thing it does is create three pause points where key actions will be checked and important conversations will be prompted. Before anesthesia is administered, before the first incision, and before the patient leaves the operating room, the surgical team will come together and make sure they've taken care of the simple stuff and had a conversation about the things that need to be talked about.
So what this structure does is give highly competent professionals the space to do their job, but it also pulls them back together to make sure that the simple things don't get missed, that the conversations that need to happen do happen, and that they've thought through how they're going to deal with likely complications. So, the simple stuff: there is one step that makes sure that an antibiotic is administered not more than 60 minutes before the first cut is made, so that it's in the bloodstream beforehand. That's been shown to have a huge impact on potential infection rates. There's communication that it works on improving. A simple thing is that before the first incision is made, you make sure that the entire surgical team has introduced themselves, and that they know who is working together and what their roles are. That's not something that would always happen before. And it helps with planning: one of the steps is that the surgeon reviews what the risks in the surgery are and what possible complications can be anticipated. By talking through those things in advance, if any of them come up, the team is going to be more likely to be able to respond effectively. This is a lot to ask of one page and 19 simple steps, and the scope of the problem is massive.
So would it work? After doing a few trial runs in a single operating room to iron out any issues, the WHO did a pilot program in hospitals around the world. They were varied: there were hospitals in the US, in Canada, in the UK, and hospitals in remote Tanzania, in the Philippines, and in Jordan. Before introducing the checklist, they sent observers to monitor the operating rooms, to get better statistics and to be able to measure the eventual impact of introducing the checklist. They spent three months observing over 4,000 operations across the hospitals. In those 4,000 operations, 400 people developed serious complications and 56 died. They then introduced the checklist and monitored for another three months after that. In those three months, the rate of major complications was reduced by 36 percent and the rate of deaths was reduced by 46 percent. All from a single page: 19 steps that can be done in about two minutes.
How did that work? Checklists work because they make sure that the simple but critical things aren't missed, and they also make sure that the right conversations are happening, while also empowering experts to make decisions. It's not about reducing the job of the surgeon to ticking things off on a list. It's making sure that the right people are talking and planning so that they can respond when things inevitably change. So what makes a good checklist? The first thing is to know your goal: is it a task checklist, a communication checklist, or a combination of both? The aviation checklists that we looked at were primarily task checklists, to make sure that these things happen. The surgery checklist was a bit of a hybrid: there were some tasks in there, such as making sure that the antibiotic is administered at the right time, but there was also a communication aspect to it. Then there is how you structure it. There are two main forms that you can use. If you think of a rocket liftoff, people are calling out "controls", "guidance", "check", and so on.
That's reading and then doing an action: a read-do checklist. Or you have a do-confirm checklist, which is what the surgery checklist is: let people do their jobs, but make sure that the critical things actually happened before it becomes too late. Specify who is going to do each step. Traditionally, in an operating room the doctor is God; surgeons have a well-known God complex. But the doctor's hands are busy and surgically scrubbed, so the responsibility for the checklist ended up being given to the circulating nurse, the nurse who is not scrubbed in. That makes sure that there is someone whose primary task is making sure that the steps are happening. Specify when to do each step: think of the pause points where there is a natural opportunity to validate that the critical things have happened and that the conversations have taken place. Don't try to be comprehensive. There are a hell of a lot more than 19 things that happen during a surgery, and if you try to spell them all out, the checklist will be arduous to use, it will take too long, and people aren't going to use it. And it's not going to be right the first time; you need to be able to adjust it to your specific circumstances, so take the time to get it right.
So let's go back to a couple of years ago, after I had had this epiphany and seen examples of how successful this could be in other fields. How can checklists apply to software? I wanted to think through what the pause points are where we have the opportunity to catch incidents like this, and I thought there were three natural ones: before submitting a pull request, when you're reviewing a pull request, and before deploying.
Before submitting a pull request: this is individual, this is working with yourself. I ask that developers ask themselves the following questions. Have I actually looked at every line of the diff? Am I sure that everything that is here is intended to be here? Is there anything in this patch that's not related to the overall change? Am I conflating refactoring with feature changes, and should I break those out into two different units of work? Have I actually structured the commits to make the reviewer's job easy? Somebody's going to read this; what you're doing is communicating with somebody else, and 14 work-in-progress commits make it a little hard to figure that out. Have I run it locally? You'd be shocked. Sometimes, and I've done it myself, you're making a really tiny change and it's so obvious that it's going to work that you haven't actually run it before submitting the pull request. You owe it to the people that you're asking for a review to actually run and test it yourself. If you have a formal QA team, is this something that merits formal QA, somebody else taking a more thorough look at it? And the pull request itself: does it explain what you're trying to accomplish and how to verify that the feature works?
Then, when a reviewer picks up that pull request, there are some questions they should be asking themselves. Do I understand the goal of this change? If you don't understand what the pull request is trying to do, you have no chance of being able to effectively review it; that's the first thing you should be asking. Have I looked at every line of the diff? Again, a thousand-line diff caused us an incredible amount of pain. If you're reviewing something and signing off on it, you probably want to make sure that you've actually looked at the entire thing. Have I run it myself to make sure that it actually works? Do I think this merits additional QA, whether it has had it already or not? Are there sufficient tests? And then, how will we know this change works? How will we know whether it accomplishes the goal that it set out for?
After the individual steps of the developer submitting a pull request and another developer reviewing it, we come back together, and we now have the team working on something, which is: it's time to deploy. So I ask that before deploying, the submitter and the reviewer have a quick conversation and ask: what are we worried about going wrong? Are there performance concerns that we're worried about? What we're really looking for is whether anything is different about this change in production versus in development or on staging. For instance, are we using a third-party service for the first time that needs different production credentials? Is now the right time to deploy a change like this? Is it the right time on the clock? If you're deploying something that's going to mess with your company's major purchase flow, you might not want to do that at the hour when you have the most purchases. On the other hand, you want to have someone online: you don't want to deploy something when nobody's around if things go wrong, so you don't push everything to off hours. Is it possible or desirable to roll this out to a subset of users? Could you put it behind a feature flag, or roll it out as an A/B test? Is there another way you should be pushing this out? What specific steps should we take once it's deployed so that we can verify that it's working, so that we can see that the change is working the way it was intended to? And finally, if something goes wrong, what will we do? Is this a change that's safe to just immediately roll back, or is this one of the changes where you've changed database schemas and you can't just roll back to a previous version, so you have to be extra careful?
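As a rough illustration of how lightweight such a checklist can be, here is a minimal Python sketch of a do-confirm checklist attached to one pause point and one owner. The `Checklist` class, its method names, and the exact question wording are assumptions made for this example; the talk describes a conversation between people, not a tool, so this is only one possible way to write the questions down.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checklist:
    """A do-confirm checklist tied to one pause point and one owner."""
    pause_point: str
    owner: str
    questions: List[str]
    answers: Dict[str, str] = field(default_factory=dict)

    def confirm(self, question: str, note: str) -> None:
        """Record the outcome of talking through one question."""
        if question not in self.questions:
            raise ValueError(f"not on this checklist: {question}")
        self.answers[question] = note

    def unanswered(self) -> List[str]:
        """Questions that have not been talked through yet."""
        return [q for q in self.questions if q not in self.answers]

    def complete(self) -> bool:
        return not self.unanswered()


# Hypothetical wording, paraphrasing the deploy questions from the talk.
before_deploy = Checklist(
    pause_point="before deploying",
    owner="submitter and reviewer together",
    questions=[
        "What are we worried about going wrong?",
        "Is anything different in production versus development or staging?",
        "Is now the right time to deploy this change?",
        "Should this roll out to a subset of users first?",
        "What steps will verify that it works once deployed?",
        "If something goes wrong, is it safe to roll back?",
    ],
)

before_deploy.confirm("Is now the right time to deploy this change?",
                      "Yes: low traffic, and the team is online.")
print(before_deploy.complete())         # False
print(len(before_deploy.unanswered()))  # 5
```

The list is deliberately short and names who owns it and when it runs, which mirrors the advice above: specify who, specify when, and do not try to be comprehensive.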
Now, I did not do a controlled study with a valid randomized control. We didn't even monitor the exact number of issues that happened before and after. But what I can tell you is that we absolutely caught things that we wouldn't have otherwise. We found times where we were about to push out a new third-party service and we didn't have the production credentials in the credential management system already, and that would have caused a problem. We also had issues that did get deployed to production but that we were able to respond to and recover from much more quickly, because we had talked about those risks before the deploy. So I am confident that checklists can be a great help in being able to deploy higher-quality software faster and with much less stress. But I will leave you with one more thing. One thing I love about checklists, and this very lightweight idea of process (there's no sign-off, there's no paper trail being created when you follow a checklist), is that anyone in this room can begin to introduce this practice into their own development without having to ask permission. If you're a developer working completely by yourself, you can still think: what are the things that I need to make sure I think through every time before I deploy something? And then start to get into the discipline of doing it. If you're an individual developer working within a larger team, you can do that, but you can also start to ask some of those questions of your teammates as you're getting ready to deploy things or at certain other critical points, and by modeling the behavior make it become part of your culture. And if you lead a team, you can obviously introduce your team to some of this, to these examples and some of the research around it, and that can help introduce the practice. So thank you very much for all your time. I am Patrick.
My personal blog, which I very occasionally write at, is at pragmati.st. I am the Director of Engineering at Stitch Fix, where we are hiring, so please come talk to me or any of the Stitch Fix people who are here. I'm on Twitter as KeeperPat, and our team writes interesting stuff at multithreaded.stitchfix.com.