
Your Software is Broken — Pay Attention


Automated Media Analysis

[inaudible] So yes, I'm going to talk a little bit about a way to rethink production monitoring. It's quite a broad topic, and I'll dive into a lot of it. But first, a little about myself. As you can probably tell, I'm not from around here: I'm originally from the UK, and I've been working in the Bay Area for the past seven years. I've also been building production monitoring systems for about ten years, originally in finance when I was working at Bloomberg, then at a startup, and now at Bugsnag, helping engineering teams do better production monitoring.
Just a quick overview of Bugsnag; I think a lot of people may know about us. We help you understand what's going wrong in your software in production. Rather than getting a slew of emails coming in from the exception notifier gem, or digging into your log files, or using an older tool like Airbrake, for example, we give you the tools and workflow you need to figure out what the most important problems are in your software. The highlight is that, rather than drowning in data, you figure out which are the most harmful bugs or errors in your application. I'll be talking about a lot of the philosophies and techniques that we use when building Bugsnag, but this is a more general talk about setting up good-quality monitoring.
So this is the scary reality. This is the truth in most companies, and I don't think people are that honest with themselves. Most people say: let's write the code, let's write tests (and I hope everyone is writing tests these days), let's send it out to production, and then, "I guess it's OK now, it's probably fine." But what you really want is confidence. You want to know that when your code is live and customers are using your product for real, it works.
So I will be talking a little bit about how you can get from the left to the right here, and get some confidence in your software.
So, rethinking production monitoring. It breaks down into three core areas. There are some sub-areas, but these are the three main ones that I think most people think about and care about. Stability monitoring: this is the kind of thing that Bugsnag does. It detects if your software is broken; if there are crashes happening, it lets you know. Performance monitoring: tools like New Relic will tell you if your app is slow. And availability monitoring: that's basically uptime monitoring, detecting whether your app is even responding to requests.
But really, all of these come down to one thing: delivering an awesome experience to your customers. That's the point of doing production monitoring.
Why do we care? It's easy to say "let's deliver an awesome experience to our customers", but what's the actual reasoning behind this?
Well, this is an undeniably true statement: it has never been easier to make software. You can attest to that, especially in the Rails community. It's such a beginner-friendly space, there are tutorials everywhere, and you can build apps in record time. It's never been easier to build software, so you live or die based on its quality. A classic example: say you have a few e-commerce applications on your phone, maybe the Amazon app, and you're trying to buy a TV. If the app crashes when you're about to buy the TV, you'll probably go and buy it somewhere else. In that situation it doesn't matter where you get the TV from; it matters that you get it and you solve the problem you set out to solve. If your app is broken, slow or unavailable, your customers don't care why. They leave, they don't come back, and you have fewer customers.
This is why. This is a retention-based study that was done about a year ago. It's from the mobile space, but we looked at this on the web as well and it's almost identical: 84% of customers, if they have a choice of different software, will abandon your software after seeing just two crashes. So you eventually churn your customers. You can spend all this time building a valuable piece of software, and your customers are suddenly not using it any more. And the worst thing is, not only do customers have a choice, customers have a voice. People will complain. People will go on Twitter and rant about your software. On mobile, people will go to the app store and leave a one-star review. That is damage to your brand, permanent damage to what you're trying to deliver to your customers.
On the flip side, this is a study that we've reproduced a couple of times, and it depends on location and on the honesty of the responses. I think in the Bay Area people tend to report 40%, but I also think people tend to think they're better than they actually are. Try this experiment with your engineering teams: when you're building something, measure how much time you spend finding and fixing bugs. I'm talking about the time from the point you receive either a customer report or a notification of a bug, to actually getting a fix or a patch delivered. Figure out how much time you're spending on this, because it is higher than you think. In this study we found that 49% of engineering time is spent finding and fixing bugs, and I bet most of you are pretty close to that. But that is awful. Nobody wants to be finding and fixing bugs. You should be building value, building features, building things your customers care about, not doing schlep work, diving into log files and digging into things.
So on one hand you've got customers, the customers you spent time building up, abandoning your software if you have stability, performance or quality problems, and on the other hand you're wasting your time if you're a software engineer or an engineering manager.
So I've put together what I call the deadly sins of production monitoring. Most people are committing at least one of these. I tried one time, when I did this talk, to get a show of hands, and nobody wanted to admit anything, so I'll just look around for guilty faces instead. Let's see who's committing these sins.
Sin number one: pretending nothing is wrong. This is the worst one. Many teams think shipping to production is the final step of the process. I think this comes from an old-school mentality of release cycles: you build it, you ship it, and then you're done and you walk away. The world is changing, and I think everyone in the Rails community knows this already: software isn't waterfall any more. You build, you send it to customers as soon as you can, you see if it works, you see if they like it, you test, and you see if this is something that is going to stick around. If you believe that nothing's wrong, then none of this matters; if you think everything's fine, you don't even need to know what production monitoring is. But I see this time and time and time again. We were on a call with a customer the other day and I asked, "How do you do production monitoring right now?" and they said, "Oh, we wait for customers to complain." That's another sin in itself, but it was just horrendous, terrifying.
These are some of the symptoms, and presumably people have said all of these in the office or on projects they've worked on. "But I've written tests, of course." Does that mean you've written every test that could cover every particular piece of data in every scenario? No way. I don't believe it. "But the QA team will check that." I had this said at a company that didn't have a QA team. I'm not kidding: someone said "QA will look after that" and we didn't have QA in the company. "It worked for me" is a classic, as is "it worked in development". But you know, testing is only part of the process. You can only test the things you can predict, a given set of things that you can think about, and in reality most of the production problems that come about are the things you couldn't think of or didn't expect.
I mentioned this one already, and the two are quite related: waiting for customers to complain. This one is the most unforgivable. If you're delivering something to your customers, you want to have some criteria, you want to have some faith in it. And the reality is that most customers will not take the time to complain when there's a problem in your software. So if you wait until that first customer complains, you've probably lost 20, 30, 40 or 50 customers already, because they were mad and left. So this one I personally think is unforgivable. The point of building products is to give value to people: to customers, to the open source community, or whoever. So if you wait until customers complain, you've failed them.
This is why I put it in a quote: I've heard something very similar to this in real life.
Lack of visibility is a huge one. You say, "We need to monitor this, so let's put in log statements," and you end up with log files on production. This is what it looks like. And nobody goes to the log files and says, "Let's just check everything's fine." Nobody does that. It's a black hole: you shove stuff in and nobody ever looks. There's no point in having production data unless you're actually going to look at it, and also surface it in a way that makes sense and is actionable.
"We'll just check the logs." I remember being on a team about four or five years ago where someone said, "Oh, it'll be in the logs." They were so confident it would have been logged, and when we looked in the logs there was nothing. They weren't logging anything, which is also a sin.
This one is a really, really difficult one to solve: lack of ownership. So you're no longer pretending nothing is wrong, you've set up production monitoring, you're being proactive. Whose job is it to actually look at these problems? Whose job is it to spearhead the fixes? This is a very difficult one to solve in large companies, but I've seen it work, and I'm going to go into some recommendations on that later on.
It manifests in these kinds of things, and this is one where, when I look around the room, I see a couple of guilty faces. "I've built it, I need to move on to the next thing." But who owns the problem? "That component belongs to someone else." "I need to ship this in time": time pressure, I've seen it a hundred times. "Not my problem." You don't want people like that.
So we've gone through the sins. Every company, every team, every person has at least one of these, I guarantee. But how do we actually get to a better place?
So now I've got a set of rules. I think of it as a framework for building production monitoring, choosing a tool, or basically getting better in your own company and your own teams. All of these principles can be applied across those three areas of monitoring, not just error monitoring and not just performance monitoring, but any area.
The first one is: accept that your software will break after you ship. Once you do this it's very freeing. It makes you feel a lot better about things, because you'll ship faster, but it will also make you less arrogant about your abilities as a programmer. It's going to break, but that's OK. Once you accept that some bugs will get through to production, you're in a great position to continuously improve your app based on what's happening in the wild, based on what customers are seeing, based on what's breaking in production.
So you've accepted that things will break. How do you actually find out about it? If you're wasting 49% of your time finding and fixing bugs, a lot of that time is spent digging into things, looking in log files, realizing you didn't have diagnostics, and trying to reproduce the problem yourself. If you use a tool like Bugsnag, or you build something yourself that automatically detects problem situations for you, you're going to be in a much better position. Most programming languages and frameworks provide some kind of hook for "an exception was thrown", "an error happened" or "something is really bad". For example, in Rails applications you can write a Rack middleware and say: wrap my web requests, and if an exception bubbles up, capture it, tell me about it, and then let it bubble up again further through the Rack stack. In Java you can hook into the thread's uncaught exception handler, and the JVM will tell you every time an unhandled exception happens. Even in performance monitoring you can set up triggers: you can have a background thread that monitors each web request, and if any request takes longer than some fixed amount of time, tell me about it, because this is a problem situation we care about. So generally, most platforms either have a built-in way of doing this, or you can come up with an automated hook yourself that is relatively easy to implement. If you want to look at some examples of this, all of Bugsnag's crash detection, the SDKs and notifiers, is available open source for free on GitHub. So if you're going to build something yourself, we've taken the time to find these hooks and put them in our notifiers, if you want to dig into the kind of things I'm talking about.
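
To make the middleware idea concrete, here is a minimal sketch of a Rack-style wrapper that records an exception and then lets it keep bubbling up. It assumes nothing beyond plain Ruby: `ErrorReporter` and `ExceptionCatcher` are hypothetical names invented for this illustration, not part of Rack, Rails or any Bugsnag library, and the in-memory array stands in for whatever delivery mechanism you would really use.

```ruby
# Hypothetical reporter for this sketch: it just collects reports in memory.
class ErrorReporter
  def self.reports
    @reports ||= []
  end

  def self.notify(exception, context = {})
    reports << {
      error_class: exception.class.name,
      message: exception.message,
      context: context
    }
  end
end

# Rack-style middleware: any object that wraps an app and responds to call(env).
class ExceptionCatcher
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue StandardError => e
    # Tell someone about the crash, then re-raise so the rest of the
    # stack (error pages, other middleware) still sees the exception.
    ErrorReporter.notify(e, { path: env["PATH_INFO"] })
    raise
  end
end
```

In a Rails application a middleware like this would typically be registered with `config.middleware.use ExceptionCatcher`; the sketch itself only needs an object that responds to `call`.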
So you know things are going to go wrong, and you've put something in place to detect these problems. What tends to happen once you've done that is you end up with noise: a lot of data coming in that doesn't necessarily mean anything, or where you don't know how bad things are. One of the most important things in production monitoring is not drowning in noisy data. You want to figure out what the highest-priority thing to fix is, what the most harmful bug in your software is. You're never going to look through raw streams of data, whether that's metrics or anything else; you need to aggregate them together in some meaningful way. Here is an example from exception monitoring in a Ruby application. You can say: hang on a second, all of these exceptions are coming from the same line of code, or all of these exceptions happen to have the same exception class. You can take heuristics like this, even combine them, and use them to group like events together. That allows you to prioritize: you can say there was one of these, one of these, and ten thousand of this. This applies in the other areas as well. In performance monitoring, rather than looking at how long each individual page load took, you group by page and say "this page takes this long to load". In uptime and availability monitoring, you can say "this URL isn't responding" and group those together. This helps to avoid data blindness, which is something I struggle with a lot when looking at dashboards.
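
The grouping heuristic described here (same exception class, same line of code) can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool groups errors: the event hashes, their `:error_class` and `:backtrace` keys, and the `grouping_key` and `group_events` helpers are all hypothetical.

```ruby
# Builds a grouping key from the exception class plus the file and line of
# the top stack frame. Backtrace lines are assumed to look like
# "path/to/file.rb:LINE:in 'method'", as Ruby backtraces do.
def grouping_key(event)
  file, line = event[:backtrace].first.split(":").first(2)
  [event[:error_class], file, line].join("|")
end

# Collapses a stream of events into groups, most frequent first.
def group_events(events)
  events.group_by { |event| grouping_key(event) }
        .map { |key, group| { key: key, count: group.size } }
        .sort_by { |group| -group[:count] }
end
```

Sorting by count is only a first pass; the number of affected users and the severity are often better signals than raw volume.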
One of the sins I mentioned earlier was lack of ownership, and sometimes lack of ownership comes from a lack of visibility. So how can you get visibility into these problems? There's still a ton of people using the exception notifier gem, which sends crashes into your inbox, into your Gmail inbox. Last time I was at a Rails conference, I asked someone, "How do you deal with crashes in production?" and they said, "When we get a notification from Gmail that our inbox is being rate limited, then we know we have a problem and we go and fix it." But these days teams use chat. How many people in this room right now are using Slack or HipChat? Everyone, pretty much, right? So why not have those detected problems, aggregated together, coming into the channels you're already working in, where you're already discussing things with your team? For example, say we have a frontend team and a backend team, and exceptions go to the appropriate room. In Bugsnag's notification settings you can say "tell me about every individual exception". That's what I call a firehose. It works when you're first launching an application, or maybe when you're on staging or in beta, but in production it's infinitely noisy. What you should do instead is say things like: let me know when more than a thousand people see this crash; let me know when this spikes; let me know when a new error that we've never seen before comes in. So you can take this stuff and put it into the channels you already communicate in.
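
As a rough sketch of notification rules that avoid the firehose, the class below alerts only the first time an error group is seen and once more when the number of distinct affected users passes a threshold. `NotificationRules`, its method names and the default threshold are assumptions made for this example; sending the message to a chat room is deliberately left out.

```ruby
require "set"

class NotificationRules
  def initialize(user_threshold: 1000)
    @user_threshold = user_threshold
    @seen_keys = Set.new
    @alerted_keys = Set.new
    @users_by_key = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Records one occurrence and returns a reason to notify, or nil to stay quiet.
  def check(group_key, user_id)
    first_time = @seen_keys.add?(group_key) # nil if the key was already present
    @users_by_key[group_key] << user_id
    return :new_error if first_time

    many_users = @users_by_key[group_key].size > @user_threshold
    # add? returns nil once we have alerted, so the threshold alert fires once.
    return :many_users_affected if many_users && @alerted_keys.add?(group_key)

    nil
  end
end
```

A spike rule (many occurrences in a short window) would follow the same shape, with timestamps in place of user IDs.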
Prioritization. Once you've done the aggregation we talked about earlier, you can actually start to prioritize. So how would you prioritize things once you have them aggregated? One way is to say: this error has happened 10,000 times, that's pretty bad. But what if that error happened 10,000 times because one person had some bad data in the database and got stuck in a loop, and it just kept on spinning? So another way to prioritize is by how many users were impacted. If it's a user-facing product, you can tag every exception that comes in with the user ID, and then you can say: this error happened 100,000 times to one user ID. You can learn a lot just from that ratio of occurrences to users. Another way to prioritize is by severity. Again taking error monitoring as the example: was this a handled exception or an unhandled exception? You want to focus and prioritize on things that are customer-impacting, and in Rails, for example, an unhandled exception will cause a 500 page, something like "Something went wrong", which is embarrassing. Taking these heuristics together (number of events, number of users, severity) helps you work out what to work on first.
Now we're all the way down to number six. So far I haven't talked about fixing bugs; we've just been talking about being aware of them, being aware of production issues, having visibility and surfacing the worst problems. But doing that is a complete waste of time unless you have the tools to diagnose these problems. Most production monitoring tools do this, and if you're building one yourself you need to do it too: at the point in time the production problem happens, capture relevant diagnostic data to help you actually solve the problem. As an example from error monitoring, on the Bugsnag side again, you can look in our GitHub and see the diagnostics we capture automatically.
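
The three heuristics (number of events, number of users, severity) can be combined into a simple ordering. The sketch below is one possible ordering, chosen for illustration: unhandled errors first, then more affected users, then more events. The event fields (`:group_key`, `:user_id`, `:unhandled`) and the `summarize` and `prioritize` helpers are hypothetical.

```ruby
# Summarizes one group of like events.
def summarize(group_key, events)
  {
    key: group_key,
    events: events.size,
    users: events.map { |event| event[:user_id] }.uniq.size,
    unhandled: events.any? { |event| event[:unhandled] }
  }
end

# Orders groups: unhandled before handled, then by users, then by events.
def prioritize(events)
  events.group_by { |event| event[:group_key] }
        .map { |key, group| summarize(key, group) }
        .sort_by { |s| [s[:unhandled] ? 0 : 1, -s[:users], -s[:events]] }
end
```

With this ordering, an error that fired thousands of times for a single user stuck in a loop ranks below a handled error of the same kind that touched many users.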
In a Rails application, if an exception happens, we capture the stack trace and the line of code where the crash happened. But what else is useful? You want to know things like: what was the URL this happened on, and what were the parameters, the GET and POST parameters, whatever was available in that request? Then you can go a little bit deeper and start sending things from your application that will help you, like how many times this person has logged in recently. Was this on a particular version of Rails? Maybe we forgot to upgrade one of our services to Rails 5 and it was still on Rails 4. I've seen crazy things. I remember at my previous company we had one Rails server that was running Ruby 1.8 when all the other ones had been updated, and it was, "What's going on? Why is just this one acting differently?" If you have that diagnostic data, captured at the point in time the problem happens, you can surface what the actual cause of the error is.
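
A minimal sketch of capturing diagnostics at the moment of failure might look like the following. The `build_report` helper and the `app_metadata` hash are hypothetical; the `env` keys follow the usual Rack naming, and `Rails.version` is only read when Rails is actually loaded. Real request parameters often contain passwords or tokens, so a production version would need to filter them before sending anything.

```ruby
# Builds a diagnostic snapshot for an exception caught during a request.
def build_report(exception, env, app_metadata = {})
  {
    error_class: exception.class.name,
    message: exception.message,
    backtrace: (exception.backtrace || []).first(20), # keep the report small
    method: env["REQUEST_METHOD"],
    url: env["PATH_INFO"],
    query: env["QUERY_STRING"],
    ruby_version: RUBY_VERSION,
    rails_version: (Rails.version if defined?(Rails)),
    app: app_metadata, # anything app-specific, e.g. recent login count
    captured_at: Time.now.utc
  }
end
```

Recording the Ruby and Rails versions on every report is what makes a stray server on an old runtime, like the Ruby 1.8 box in the story above, stand out.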
If there's one thing, the most important thing, that I want you to take away and implement in your teams, it's this one. All the other stuff I'm talking about is technology choice or technology implementation; this is a fundamental organizational change. So you've got production monitoring, and you know about production issues. What's the point in having all of that in place if no one is responsible for actually going and fixing them? Again, this is a really tough one to solve. I'm going to go through a couple of ideas and techniques I've seen in companies we work with. But unless people care about these problems, there's no point in detecting them.
So how do you actually do these things in real life? On the tooling side: use tools that assess impact, assess severity and capture diagnostic data. Workflow is where things get interesting. Use team chat, like I was saying, Slack or HipChat. Embrace collaboration: rather than having a culture of blame, pick a tool where you can look at a comment history, so someone can say, "Hey, I think I know why this happened," and you can collaborate around a particular issue. A good example of this is commenting in Bugsnag. You can comment on an error, so anyone on your team can say, "I think I know why this is happening." One of our customers had a thread with over 200 comments on a particular error. It started with "this looks bad", then someone else came in with "this is related to the deployment we just pushed out", and it evolved into this huge conversation around the potential causes. And then there's a history: once you fix the problem, you can go back and see all the steps that were taken and all the discussion. This works even if you don't have a tool that does it: if you put your notifications into team chat, for example, you can interleave problem information with discussion among humans about why you think it's happening.
Tracking progress is an important one, and it's clear why. Once we've detected a problem, how do we prove we fixed it? How do we track which ones have been fixed and which haven't? Again, this is a tool choice thing. In Bugsnag you can mark errors as fixed, or as ignored if you think it's something like a Google crawler, or as snoozed, which says, "This isn't bad right now, but if it gets a lot worse we really need to know about it." By tracking progress you can see where things stand.
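
The fixed, ignored and snoozed states can be modelled as a small state machine. The sketch below is an assumption about how such tracking could work, not a description of Bugsnag's implementation: `TrackedError` and its methods are hypothetical, and "gets a lot worse" is simplified to an occurrence count chosen when snoozing.

```ruby
class TrackedError
  attr_reader :status, :count

  def initialize
    @status = :open
    @count = 0
    @wake_at_count = nil
  end

  def mark_fixed
    @status = :fixed
  end

  def ignore
    @status = :ignored
  end

  # Hide the error until its total occurrence count reaches wake_at_count.
  def snooze(wake_at_count:)
    @status = :snoozed
    @wake_at_count = wake_at_count
  end

  # Records one occurrence; returns true when the error needs attention again.
  def record_occurrence
    @count += 1
    regression = @status == :fixed
    woke_up = @status == :snoozed && @count >= @wake_at_count
    return false unless regression || woke_up

    @status = :open
    true
  end
end
```

Reopening a fixed error when it recurs is what lets the team check that a fix actually worked.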
Now, I've mentioned this a couple of times: there's no point in any of this unless you've got a team in place, and the organizational setup in place, to actually do something about it.
So if you're accepting that bugs happen in production, and you embrace rapid iteration, how do you deal with the issues? There are three ways I've seen teams do this, and it's a real issue; people struggle with this.
The first one is a dedicated team. One team is responsible for checking the production monitoring tools to see if there is something wrong and what the worst problem is right now. The pros: the team can amass knowledge over time, and there's clear responsibility for this stuff. There's also a cost: it's harder for individual engineers to learn from their mistakes. If someone else is cleaning up your messes all the time, you don't improve as an engineer. The team must also communicate common issues back to the rest of engineering. This is the most common setup in big organizations. I used to work in finance, and in finance they would have a bug team, and the bug team would say, "Hey everyone, this is broken, should we fix it?" and they'd put a hold on a patch or whatever. It's the tried and tested, historical approach.
The second is a more modern way of doing things, and it scales pretty well too. Rather than having a dedicated team, you have a rotation system. Where I previously worked we had a role called, gloriously, the Bug Warrior, on a weekly rotation. You'd come in and you would learn about the entire system very quickly, because you'd be looking at bugs all around it. Your role wasn't necessarily to fix things; your job was to understand how bad something is and then be the champion for getting it fixed. In some situations you'd come in and say, "Well, that's trivial, I can fix it myself." In other situations it was difficult and complicated. As long as you have someone in that role who is the stakeholder for the customer, change can happen. So the bug warrior system works really well. The pros: the entire team gets to see and feel the pain of the customer, on rotation, which is fantastic, because if you don't have visibility into this you think everything's just fine, which is one of the sins. It also avoids the "not my problem" mentality. The main con is that it can take a little bit longer to fix individual issues, but I think the pros outweigh the cons.
The third is a controversial one, and it only works in certain organizations, because it could potentially create a blame culture. In theory, the person who last touched the code is best placed to understand what caused the problem. This is one of the situations where I get dubious faces. "What about the person who just changed tabs to spaces?" Yeah, OK, that's got other problems. "What about the person who refactored everything and shifted it down by a couple of lines? What about the person who extracted something into its own class to make it more testable?" Well, you know what, they were still in there most recently. If they were messing around in that code for some valid reason (maybe tabs to spaces is a valid reason, or vice versa), they should still have context on what's going on. I'm not saying anyone who touches the code gets an email and gets blamed for this; I'm saying that they have the knowledge and therefore might be the best first contact. So the pros are that they have knowledge of the affected code, they're probably the best person to fix it, and they can learn from the mistake. The big problem, and it really depends on the organization, is that it can create a finger-pointing culture, a blame culture.
You can probably tell which one is my favorite: the rotation. It works, and it scales pretty well. But if you have an environment with a positive attitude and no blame involved, the last-touched approach can work pretty well too.
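
For the weekly rotation, working out whose turn it is can be a tiny calculation. This sketch assumes a fixed team list and a rotation that advances every seven days from a chosen start date; the `bug_warrior_for` helper and the names in the usage note are made up for illustration.

```ruby
require "date"

# Returns the team member on rotation for the given date.
# The rotation advances by one person every 7 days from rotation_start.
def bug_warrior_for(date, team:, rotation_start:)
  weeks_elapsed = (date - rotation_start).to_i / 7
  team[weeks_elapsed % team.size]
end
```

For example, with `team: %w[ana ben chi]` and a rotation starting on Monday 2 May 2016, `bug_warrior_for(Date.new(2016, 5, 9), ...)` picks the second person on the list.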
So what would I like you to take away from this? Avoid the sins. Don't pretend nothing's wrong, don't wait for customers to complain, don't have a lack of visibility, and don't have a lack of ownership.
Embrace the core principles, the fundamental principles of production monitoring. Accept that your software will break after you ship, and don't be scared of that. Automatically detect crashes, errors and issues in production. Don't just have a stream of events: group like events together and aggregate. Notify your team where they already communicate: team chat, or email if you want, but chat is much, much better. Prioritize which bugs to fix. You can't fix every bug; that's the daily reality at high-volume places, so prioritize which ones to fix. Make sure you have the diagnostics available to actually fix these things. And my simple one at the end: make sure someone cares about this.
And then take action. Once you've avoided the sins and you've adopted the core principles, how do you actually take action? Select or build your production monitoring tools based on these principles. Obviously I'm biased: use Bugsnag for error monitoring and crash monitoring, because these are the core principles we built it around. But whatever you choose, make sure it fits these criteria. Get smart about the workflow: make sure you have a workflow. And make an organizational change. If there's no one responsible for fixing bugs and caring about your customers, then it doesn't matter what tooling you have; figure out a way to make someone care. That's it.
So we've got a few minutes for questions. Any questions?
[inaudible] That's a great question. So the question is: when you've got a backlog of bugs that may all be at the same level of priority, with similar numbers of occurrences, how do you get on top of them? That's a really tough one, and it's something we're actively working on in our monitoring tools. There are two ways I've seen this handled, and one is more brutal than the other. The brutal way is declaring bankruptcy: coming in, selecting everything, deleting everything, and starting from scratch. It works really well, but it's scary. The other technique that works pretty well is to set aside a hack week, or a couple of weeks, where you say: instead of just deleting and pretending we don't have these problems, let's actually go through and fix them. Some other stuff we've been building around this is snoozing, so you can come in and say, "I don't care about this right now, but let me know if it gets worse." We're trying to build tooling so you can stop hiding these from yourself. But one of the ways is just to declare bankruptcy. One of our biggest customers did this, and they were so pleased; after they did that, they actually started using the workflow. It's tough.
Yes? [inaudible] The question is: what were our customers doing for production monitoring before they came to us? One interesting thing to point out in response is that it depends on the community. I go to a number of conferences, and one of them is Laracon, which is for Laravel, a PHP framework with a great community, and there about 80% of the people we talk to are just looking at logs, which is terrifying. But in Ruby, for Bugsnag, we have about a 50/50 split. I think probably most of our initial customers were using a tool called Airbrake before, and we kind of took in refugees from Airbrake, so there was an understanding that there was a problem and that this was a better way of solving it. The others were still using something like the one I mentioned, exception notifier. So many people still use exception notifier. It's great when you're in development, but as soon as you ship to production it's not a reasonable way to solve this problem. There are specialist tools out there, like Bugsnag, and there are free open source projects too. So it's about a 50/50 split, but it does depend on the community. The good news is the Rails community is pretty savvy about this. [inaudible]


Formal Metadata

Titel Your Software is Broken — Pay Attention
Series title RailsConf 2016
Part 49
Number of parts 89
Author Smith, James
License CC Attribution - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make publicly available the work or its content, in unchanged or adapted form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or this content, including in adapted form, only under the terms of this license.
DOI 10.5446/31575
Publisher Confreaks, LLC
Release year 2016
Language English

Content Metadata

Subject area Computer Science
Abstract Your team has been tasked with releasing new and better versions of your product at record speed. But the risk of moving quickly is things break in production and users abandon your buggy app. To stay competitive, you can't just ship fast - you also have to solve for quality. We'll rethink what it means to actively monitor your application in production so your team can ship fast with confidence. With the right tooling, workflow, and organizational structures, you don't have to sacrifice release times or stability. When things break, you'll be able to fix errors before they impact your users.
