Building For Gracious Failure
Formal Metadata

Title: Building For Gracious Failure
Title of Series: Ruby Conference 2018 (Part 61 of 66)
Number of Parts: 66
Author: James Thompson
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46575 (DOI)
Language: English
Transcript: English (auto-generated)
00:01
All right, welcome. I am James Thompson.
00:20
I'm a principal software engineer at Nav. We are working to reduce the death rate of small businesses in the United States. If that sounds like something you would be interested in, please come and talk to me. We are hiring and we are looking for remote engineers. Now, I am here to talk to you today about building for gracious failure.
00:40
How can we make failure something that doesn't ruin our days, that doesn't ruin our nights, and that doesn't ruin our weekends? I am not a fan of overtime. I have a personal rule that I will not work overtime unless I absolutely have to, and I will never take someone else's word
01:00
for whether or not I absolutely have to. But the reality is failure happens. It's unavoidable. We will have infrastructure go down. We will have people delete production databases.
01:21
We will have people deploy services that they should not. And so we have to plan for failure. We have to find ways to manage failure. That's the best we can hope for. We can never eliminate failure. None of us will ever write a perfect system.
01:41
And so we have to plan for our failures. We need to identify techniques, processes, and ways that we can make failure manageable. That's the goal. Now, I'm gonna share a few stories about failures that I've dealt with, and actually all of these are from the not very distant past.
02:03
All of them are failures that I've had to deal with over the last year. And the first one I wanna talk about is probably the one that bugs me the most. And that's the reality that we can't fix what we can't see. If we don't know something has gone wrong,
02:23
it's incredibly challenging, if not impossible, to actually resolve that thing. And if your users are your notification system for when something has gone down, unless you are an incredibly small startup, you're probably doing something wrong.
02:43
Visibility is the first step to aiding us in managing failure. If we don't know that our systems are failing, we're not gonna be prepared to respond to that failure. And instrumentation is one of the best ways to get that information that we need
03:00
to be able to act on and prioritize and deal with the failures that happen in our system. I recently changed teams at Nav. I took over what is now being called our data sourcing team. We are responsible for the ingestion of data from credit bureaus.
03:20
Experian, Equifax, TransUnion, Dun & Bradstreet, as well as a number of other data sources. And we have to deal with a lot of garbage. And in particular, we have to deal with this garbage asynchronously. We have to deploy systems, jobs, workers
03:41
that are able to go through and update credit reports on a regular basis, that are able to fetch alerts from these various bureaus and bring them together in a sane way. And so we have a job processor that was written in house. It is very similar to Sidekiq or Resque or Delayed Job
04:03
or any of the typical worker systems that you might be familiar with, but it was written in house. And whenever I picked up this project, I noticed that the only visibility we had into what was going on were the logs. And we were running this system in Kubernetes,
04:20
so we don't have a static number of environments that is running, we don't have a static number of systems, we have in the production environment, I believe at the moment about 30 instances of this application running. Collating 30 systems worth of logs and figuring out if something is going awry is not what I ever want to spend my time doing.
04:43
I don't know about any of you, but I do not fancy the idea of sitting down in a comfy armchair with a cup of coffee and scrolling through 30 services of logs. That sounds like a horrible way to spend any amount of time. And so I needed to figure out a way
05:01
to stop having to deal with logs. And I figured it out within the first day. I decided to use Bugsnag, because I didn't know how many errors we were having. I knew we were having errors, but I didn't know if we were having an unusual volume of errors, or if we were having any kind of,
05:20
anything that I really needed to care about. And so by using Bugsnag, I was able to go from the picture on the left, except much, much longer and ever-growing, to the picture on the right, where I can at least say, okay, I know how many errors I have. And I have a little bit of insight into how regular they are.
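For illustration, here is a minimal sketch of that kind of error-reporting setup in a Ruby service, assuming the standard bugsnag gem; the environment variable names and the job wrapper are placeholders rather than the actual in-house processor:

```ruby
# Report every job failure to Bugsnag instead of letting it vanish into the
# logs of 30 separate instances.
require "bugsnag"

Bugsnag.configure do |config|
  config.api_key       = ENV.fetch("BUGSNAG_API_KEY")    # placeholder
  config.release_stage = ENV.fetch("APP_ENV", "staging")  # placeholder
end

def run_job(job)
  job.perform
rescue StandardError => e
  Bugsnag.notify(e) do |report|
    # Attach enough context to tell which job and arguments triggered it.
    report.add_tab(:job, class: job.class.name, arguments: job.arguments)
  end
  raise # keep the processor's own retry/failure handling intact
end
```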
05:42
And the reality is, I don't know if this is normal. I only have a week's worth of data here. I can see that there's a little bit of variability in that one where we have 231,000 errors. I can see that there's a lot more variability in this one that we only have 2,000 errors for.
06:01
And we've got this one that's just shy of 10,000 that is super stable. It's lots of errors all the time. But I have absolutely no idea if any of this is normal. Now thankfully, this is our staging environment.
06:20
So I'm not really worried that we're not delivering to our customers. But when I look at this and see all of these errors, not really knowing if this is normal or abnormal, I can't trust this code. I don't feel comfortable deploying this service into production anymore, because if that 231,000 errors happens in production,
06:44
that's going to be a horrible day for me and my entire team. And so having this kind of visibility is the first step to being able to manage failure. I need to know that it's happening. And so tools like Bugsnag, Airbrake, Rollbar,
07:00
they give you that first level of visibility. But I still don't know if these are worth working on. I can go and I can talk to my product owner and I can try to ask, hey, this is the error that's happening. This is what I suspect is causing it. Is this affecting other teams? Is this something we need to prioritize and address?
07:21
But I don't have enough information to be able to say, yeah, this is definitely something we need to address, and I'm not trying to convince my product owner, I'm actually going to them trying to have them convince me that it's worth working on. And I don't like that when it comes to errors, especially errors of this kind of volume. And so there's another step in terms of visibility
07:43
that I think is really, really important, and that is metrics. And this was something that we just got deployed at the end of last week. This is an actual graph from our staging environment, and the numbers on the left, I don't know what the heck SignalFx is doing there. It's supposed to just be counting,
08:01
and I don't know how we have fractional numbers of jobs. So I'm not sure what's going on there. This is where we're still trying to get our instrumentation right. But something that this did reveal to me when I know that we have thousands upon thousands of errors happening is that the blue line, which is jobs started,
08:22
and the red line, which is jobs failed, they're following each other. Almost every job that starts in our staging environment fails. Now I know I can't trust the code before I ship it into production. Because if we can't run it in a staging environment,
08:42
how in the hell am I supposed to know that this is safe to run in production? And so visibility is the very first thing that you need to do in order to manage failure. This is kind of the table stakes of managing failure. Before you can deal with anything else, you need to be able to visualize and track
09:01
and investigate your errors and not through logs. Because logs don't provide you enough information to be able to act on reliably. And so that's where we have to start. We have to start by making errors visible, by making the process by which we discover
09:23
that failures have happened, by discovering whether or not they're meaningful, whether or not the rate of failure is significant. The first step is visibility. And so there's tooling for this. If you're working in Ruby, you've got lots of options here.
09:40
New Relic provides a good bit of this in one package. You have systems like SignalFx and Keen that provide just metric tracking. But this is something that you need to be doing. If your systems don't have a good way for you to know when errors happen, when failures occur in your code, and to be able to tell whether or not those errors are actually at an anomalous rate,
10:03
you're already behind. You need to catch up. And this is stuff that is very easy to implement. Now the service I'm talking about is actually written in Go. And God bless Go. They are not a friendly environment to implement this kind of instrumentation in.
10:21
Because there's no way, especially if you're running a concurrent system, to be able to catch everything that's happening across the concurrent goroutines. But Ruby is stupid simple. And so please instrument your code. Track metrics, not just the errors, but track how many jobs are starting in your system,
10:41
how many jobs are succeeding, how many are failing, how many HTTP requests you're getting, how many are returning different classes of error codes, whether they're 400 or 500, and how many are successful. You'll be able to then establish a baseline for what is normal, what is typical, and then you can do anomaly detection on top of that.
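A sketch of what that kind of counting might look like in Ruby, assuming a StatsD-style metrics client (the statsd-ruby gem here); the metric names, the job wrapper, and the Rack middleware are illustrative:

```ruby
require "statsd" # from the statsd-ruby gem

STATSD = Statsd.new("localhost", 8125)

# Count every job outcome so a baseline exists to compare failures against.
def run_job(job)
  STATSD.increment("jobs.started")
  job.perform
  STATSD.increment("jobs.succeeded")
rescue StandardError
  STATSD.increment("jobs.failed")
  raise
end

# Rack middleware counting HTTP responses by status class (2xx/4xx/5xx).
class StatusClassMetrics
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    STATSD.increment("http.responses.#{status.to_i / 100}xx")
    [status, headers, body]
  end
end
```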
11:00
But you can't hope to do that kind of anomaly detection until you have a baseline, and until you have visibility into your system. And so that alone, if you do nothing else but leave here and implement Bugsnag or SignalFx or New Relic or any solution that gives you this kind of visibility, if you do nothing else, you will have benefited your team greatly,
11:21
and you will have likely saved yourself at least several hours in your job from having to deal with a failure that just arises out of nowhere because you didn't see it coming. Now, to move on from this, I want to talk a bit about some techniques for making your systems more gracious
11:41
in the face of failure, how the services that we build can be made to be more forgiving, how they can be more forgiving not only in terms of how they respond to different circumstances but also what they afford for other systems that depend on them.
12:00
So the first one of these affordances that I think we need to make is we need to get into the habit of returning what we can. And I have another story here, and this is one of an unexpected error that happened. Whenever I started at Nav, I had the task of figuring out how to deal with a service
12:21
or how to build a service that we were calling business profile. We keep records on lots of small businesses, and with those small businesses, we have to track a whole bunch of different data points, when they were founded, whether or not they're incorporated, what their annual revenue is, do they accept credit cards, all kinds of different facets.
12:42
We have about a dozen or so of these fields that we track. And the business profile service is responsible for maintaining a record of those fields over the course of time. Now there was a service that existed prior to the work I did that was a prototype that was shipped into production.
13:03
It then got abandoned, like all good prototypes that get shipped into production. And so, in the process of coming on and looking at this, I had to assess, okay, are we gonna keep this service and try to make it work, or are we going to just start fresh?
13:20
And I made the decision to start fresh, I'm still not sure whether that was a mistake or not, but a year later, having worked on the same service for a whole year, we have made the transition to this new system. And in the process of doing that, we had to bring over all of the historical records from that prototype system.
13:42
We needed to bring over about nine million independent data points that were all in a single table, and we needed to migrate those over so that we can maintain history. And we were able to do that successfully. We were able to do an ETL on that and bring all of that data over. But then as we started to transition
14:01
other services to rely on business profiles rather than the old service, we started seeing 500 errors. There were some folks who were asking for data from business profiles, and business profiles was returning a 500. We were able, because one of the first things I did in this project was install Bugsnag,
14:21
we were able to figure out what was causing those 500 errors, and we were able to identify that there was corrupt data in the database. Now, it wasn't corrupt as far as the database was concerned, it was a string. It thought this looked fine. That's why the migration worked. But whenever the application tried to read this string out of the database, it's like,
14:42
I don't know what that means. It was trying to understand the format that string was in, and it just choked. And so we had a situation where the data that had come over from that legacy system, for some reason it got handled okay in that system, but when we brought it over into the new system, we weren't able to parse it. And upon doing some more investigation,
15:02
we realized, oh, this data's never been valid, it's just the other system was way more tolerant of reading out garbage from the database. And so I was able to add a rescue clause to catch this specific parsing error and to get the system to where it was no longer returning a 500 error.
15:23
And what we then decided is that we were comfortable returning an empty value, returning null, rather than returning a 500. Because we had other data, even with this corrupt field, all of the fields in the system
15:41
are independent of each other. They are related, but only loosely so. And so returning some of that data was still meaningful and still valuable. And so when the system encounters errors where it can't parse or can't deal with a particular value, it'll return null for that value. And it'll return everything else just fine.
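A sketch of what that ended up looking like, with the class name and the date format made up for illustration: rescue the specific parse error and return nil for just that field, so the rest of the record still comes back.

```ruby
require "date"

# Illustrative reader for one stored field; BusinessProfile and the ISO 8601
# format are assumptions, not the actual service's schema.
class BusinessProfile
  def initialize(row)
    @row = row # raw values as read from the database
  end

  def founded_on
    raw = @row[:founded_on]
    Date.iso8601(raw) if raw
  rescue ArgumentError
    nil # corrupt value: drop this field, keep serving everything else
  end

  def annual_revenue
    @row[:annual_revenue] # independent fields are unaffected
  end
end
```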
16:02
And so that was an example of us being able to make this service, or me being able to make this service, return what we could. And it now means that collaborators with this service don't have to worry about whether or not I'm gonna 500 if a piece of data is corrupt. It's one less use case under which they have to worry about my system
16:21
doing something that they don't expect it to. And this is something that we should think about in our systems. There are lots of occasions where returning some data is far better than returning no data, or even worse, returning an error. And so can we think about values in our system
16:40
that are separable from each other? Can we think about the values that have to go together, say like a currency and then a value for some money amount, and values that move completely independently of each other, like the date that a business was founded and its annual revenue? Those two values have nothing to do with each other. If one's blank and one's corrupt, that's okay.
17:03
We can still return something. And so we should think about the values we have in our system and how we can be tolerant of those kinds of cases. Another case that we have is accepting what we can.
17:22
Now, that business profile service I was referring to, it has, like I said, about a dozen data points that it can accept. But again, because they all move independently of each other and they are separable, we don't have to get all of them at once. And in fact, collaborators don't have to even send
17:41
any value or even an acknowledgement that that value exists when they submit an update. They can just send a JSON payload with just the fields they want to update. And we'll accept that and it's fine. But we discovered that sometimes we'll be sent strings instead of numbers.
18:00
And our service doesn't like that. And so we made a decision that we still wanted to accept as much data as we could. And so if we were sent four fields and one was not what we expected, we wanted to go ahead and record the values of the three that were fine and then let the user know
18:21
that that fourth value had something wrong with it. And so we decided to build this service in a way, or to adapt this service so that it could accept whatever it can. And it will still notify the user that hey, something was wrong but I still accepted the updates for the things that I could.
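A sketch of that partial-acceptance behavior, with the field names and validation rules invented for illustration: keep the fields that validate, report the ones that do not.

```ruby
# Per-field checks; anything not listed here is unknown to the service.
FIELD_RULES = {
  "industry"             => ->(v) { v.is_a?(String) && !v.empty? },
  "annual_revenue"       => ->(v) { v.is_a?(Numeric) },
  "accepts_credit_cards" => ->(v) { [true, false].include?(v) }
}.freeze

def apply_update(profile, params)
  accepted = {}
  errors   = []

  params.each do |field, value|
    rule = FIELD_RULES[field]
    if rule.nil?
      errors << { field: field, message: "unknown field" }
    elsif rule.call(value)
      accepted[field] = value
    else
      errors << { field: field, message: "invalid value" }
    end
  end

  profile.update(accepted) unless accepted.empty?
  { accepted: accepted.keys, errors: errors } # callers must check errors
end
```

The same shape also covers what comes up later in the Q&A: return a 200, but with an errors array the consumer is expected to check.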
18:40
And so we need to be forgiving with what other systems send to our services. We need to be able to accept what we can. Partial acceptance is often much better than total rejection. And so we again have to think about what values must go together and what values can we reasonably separate
19:02
and allow them to be accepted independently. All of this gets to the point of trying to make our systems tolerant and tolerable. Being able to tolerate what some folks that we may not, we may not know whether or not they're just testing,
19:24
we may not know whether they're expecting certain behavior, but if we can tolerate and be tolerable to other systems, it'll make our entire environment, our entire system more resilient. Now another detail that we have,
19:41
or another approach that I think is very important is that we need to trust carefully. And this is one that can be applied both to third party services and to services within an existing service ecosystem. The reality is that depending on others and depending heavily on others and other services
20:03
can make their failures your failures. And this again was another case in which business profiles ended up being a problem for us. And it wasn't so much business profiles fault except that business profiles was at the bottom of the stack and it was the thing returning the 500 error
20:21
when it couldn't read values out of its database. But what went wrong was when the service that was collaborating with business profiles saw that 500 error and it said, I'm just gonna forward this on. I'm not going to intercept that 500 error, I'm not gonna do anything about it,
20:42
I'm just gonna pretend like yeah, whoever's upstream from me, they'll know what to do with nothing. And sure enough, the service that was up a layer had no idea what to do with nothing. It had no idea what to do with a 500 error. And so it also returned a 500 error
21:00
until eventually we got all the way to the user interface, and by the time we got there, we had an outage for an entire feature of our site, all because, out of the not quite half dozen services involved, none of them had been built to tolerate any of the services they trusted down the stack
21:24
not responding appropriately. Now, we had this situation because of a 500 error. It could have just as easily been caused by a network partition or the service actually going offline and being unreachable. The impact would have been the same. We would have had an outage for an entire feature
21:42
and a really nasty error message for our users all because nowhere along the way could we intercept and deal with this in a gracious way. And so trust carefully. You need to be careful who you trust and how you trust them.
22:00
This is most prevalent in a microservice or a service-based environment where a lot of times we assume trust between services. That's wrong. Pivotal actually had an illustration they just tweeted out a little while ago on the eight fallacies of distributed systems.
22:22
And of course there are things like the network has zero latency, unlimited bandwidth, all kinds of things that are absolutely not true. But in our service systems we tend to assume that all the services we interact with within our boundaries are trustworthy.
22:42
That's not true. Sometimes prototypes get shipped into production. Sometimes you're having to talk to a legacy Java app that no one wants to talk to, but they have to because there's no choice. Sometimes you have systems that are completely untested, but as long as we don't look at them or touch them,
23:03
it'll be fine. You can't trust the other systems that are running in your ecosystem. And so you need to build with that in mind. Whenever possible, don't return 500 errors if you're dealing with a service that you have control over.
23:21
But expect that other services are going to return 500 errors or something far, far worse. We need to assume that failure is going to be a reality because it is. We have to get into the mindset, we have to get to the place where we expect failure.
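A sketch of what intercepting a downstream failure can look like, using Ruby's standard library; the internal URL and the empty-hash fallback are assumptions, and a real caller might fall back to cached data instead:

```ruby
require "net/http"
require "json"

def fetch_business_profile(business_id)
  uri = URI("http://business-profiles.internal/profiles/#{business_id}")
  response = Net::HTTP.start(uri.host, uri.port,
                             open_timeout: 1, read_timeout: 2) do |http|
    http.get(uri.request_uri)
  end
  return JSON.parse(response.body) if response.is_a?(Net::HTTPSuccess)

  {} # downstream 4xx/5xx: degrade gracefully instead of forwarding the failure
rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError, JSON::ParserError
  {} # timeout, partition, refused connection, or garbage payload: same fallback
end
```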
23:40
That's the big takeaway. We have this mindset that arises out of chaos engineering and the notion of Chaos Monkey and the Simian Army and the ability to simulate infrastructure failures. But the reality is, for most of us, our infrastructure isn't the most likely thing to go down; it's the crappy code that I and y'all write.
24:07
We're much more fallible than an automated script. Unless that automated script requires user input which can then take down an entire AWS region.
24:20
And so we need to get to the place where we expect failure. We need to get to the point where not only in regards to our infrastructure, but also in regards to the systems that we build that we anticipate the ways they can fail and that we build in mechanisms and processes to be able to raise the visibility of failures, to be able to tell whether those failures
24:40
are meaningful within our constraints. We must prepare for failures, otherwise we'll always be stuck suffering from them. If we're not prepared for failures, they will always take us by surprise, or worse, they won't take us by surprise, but they'll still mess up our day, our night, our weekend, our month.
25:02
Because we have nothing else to do other than just firefight. And the first step is to raise that visibility. Once we've gotten to the point where we have our failures visible to us and we can analyze them, then we can start figuring out how can we make our systems more forgiving,
25:22
more tolerant, and more tolerable within our environments. Now, we have some time for questions. I wanted to make sure that we left some time here. And so there is not a mic to run around, so you will need to be in a position or comfortable yelling at me,
25:42
and then I will repeat as best I can what you asked, and then provide you something resembling an answer. Right, yeah, so the question is, how do we balance being tolerant, particularly with data ingestion, versus being strict and making sure that we don't end up with garbage in our system?
26:02
Is that summarized? Okay, so that's going to come down to a business case. In the case of business profiles, the example that I've used, we made the determination that accepting partial data, in particular, was valuable, but we also made the determination that we did not want to accept garbage data. So if someone tries to send us a string
26:21
where we want a number, we're going to tell them that's not acceptable. And so that's something that's gonna come down to a service-by-service basis, where we have to make the assessment, what's acceptable to us? Now, I don't think it's a good idea to build services that just accept whatever gets sent to them and stores it,
26:41
because then you end up in the situation where you just have garbage data, and your BI and data science folks will absolutely hate you for that. So don't do that. Don't just take whatever is sent to you and store it. Make sure you're performing some basic validations on it to ensure that even if you're doing partial acceptance, that you're still rejecting outright garbage.
27:02
I think that's actually a really good place to start, but figuring out whether you can deal with partial acceptance or not is entirely a case-by-case, service-by-service, and that needs to be vetted by the business side of things. Yeah, absolutely. And that is a situation where,
27:21
and when I say partial data, I don't mean accepting some data that's valid and some data that's invalid. If the data can be ruled out and said, this is invalid, always reject that. Yes, so he was making note about stronger params as being a system in Rails that allows you to do type checking, but then also raising the concern over accepting,
27:42
again, invalid data versus valid data, and that is a point that is worth clarifying. Don't accept invalid data. If you can look at it and say, this is absolutely not acceptable, by all means, reject it. The point that I wanna make when we're talking about partial acceptance is, if you're in a situation
28:01
where some of the values in a record don't necessarily have to be accepted with other values in the record, take what you can. In many cases, and this is something that actually comes out of the notion of event sourcing, where the present state of a system is discoverable by replaying all of history
28:21
and seeing how changes over time have then altered the state to get to where you are now, and in that kind of a system, a partial update, because you still have valid data that can be used to reconstruct other parts of a record, partial data can still be accepted and still keep that record in a valid state.
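A minimal sketch of that idea, with made-up event shapes: the current state is just the fold of every field-level change recorded over time, so a partial update is as legitimate an event as a full one.

```ruby
# Replay field-level change events, oldest first, to rebuild current state.
events = [
  { at: Time.utc(2018, 1, 5), changes: { "industry" => "retail" } },
  { at: Time.utc(2018, 3, 2), changes: { "annual_revenue" => 120_000 } },
  { at: Time.utc(2018, 6, 9), changes: { "industry" => "e-commerce" } }
]

current_state = events
                .sort_by { |event| event[:at] }
                .reduce({}) { |state, event| state.merge(event[:changes]) }
# => { "industry" => "e-commerce", "annual_revenue" => 120000 }
```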
28:42
And so the business profiles that I'm using as an example is one of those where all the pieces can be thought of, all the values can be thought of independently, and so updating one is just as meaningful as updating all of them. And so being able to accept what you can, when you can, still delivers value in certain cases,
29:01
and it's absolutely something that needs to be vetted. In the far corner. Yeah, okay, so in the case of returning a partial response, do we still return a 200 status code? In the case of this particular system, we do still return a 200,
29:21
but along with that, we have notified all of the consumers of this service that they need to always check the errors parameter that we return to them, which will always contain an array of any values that were not acceptable. There are other ways you can handle that,
29:40
depending on, again, the constraints of your environment, and so because we did accept the data and it was okay, we do return a 200 code, but that may not always be appropriate depending on the circumstance. Yeah, so the question is, in some cases the partial data may not be ideal,
30:00
and so do we provide opportunities to undo that? In our particular use case, we don't have a situation where partial data needs to be rolled back. The most common way that data enters this particular system is either from bureaus, where we're scraping specific pieces of information
30:21
out of a report and using that to populate, and then in that case, we know what fields are available in those reports, and if there are values, we bring them in, or where we're accepting user input. And in those situations, this was actually something that came up recently: we discovered that if a user went in
30:40
and updated their business's industry, their annual revenue, and a couple other pieces of data, then, at least from the front end's perspective, they were not expecting it to accept partial data and to leave the value that was incorrectly formatted unchanged, and that was a failure in communication
31:01
for us to let front end know that this is how this service behaves, and so that was one case where they were not expecting it to do a partial update, they were expecting a hard error, and in talking with them and communicating with them, we were able to say, no, you just need to make sure that you're always checking that errors array
31:21
to know if there were any fields that weren't handled, but we've not yet had a use case where we need to roll back, just because of the nature of this particular service that I've been working closely with. Yeah, so how do we distinguish valid use of this system, since we do have partial responses, from folks who are trying to exploit the system? In this particular case,
31:43
business profiles is buried deep in our infrastructure, it sits behind at least three other systems that handle access control, and so it is actually an insanely trusting service from the access control standpoint, which is why it is not publicly routable. It's only exposed through other services
32:01
that provide that access control scenario, but that would absolutely be something that you would need to mitigate against if you were gonna have a service that would return partial data if it was publicly accessible. You would wanna make sure that it has appropriate access controls in place to make sure you don't leak data.
32:22
Yeah, so the question is, do we keep metrics on these situations where reading data out of the database is not possible, when we have those error cases? And yes, we do. We don't have SignalFx metrics for it, we're not tracking it that way, but we do still have a Bugsnag notify call
32:41
that will record the context and the value: we will actually fetch the raw value from the database, put that into the payload, and then actually notify on Bugsnag for that so that we can see what values are causing this and hopefully discover where they're coming from. Up to this point, we've been able to identify that all of those corrupt values
33:00
came from the migration where the other system was just much more accepting of input than the new system happens to be. Yeah, so how do we prioritize, once we have visibility on errors and failures that are happening in our system, how do we prioritize which ones we wanna work on? And that's something where you need a good product owner.
33:20
You need someone who understands where the business value is, what the impact is or the potential impact is on users, whether there is any, and of course, you need to help them by providing as much detail as you can in terms of what you know. But that's ultimately, in my opinion, a product decision, and it's something where we as engineers
33:40
need to collaborate with the product owners to determine which fires can we let burn and which ones do we need to put out. And that's, of course, with the examples that I gave earlier in bug snag, all of those, as soon as we got the instrumentation in place, I turned them into tickets, added as much detail as I could, and then I notified my product owner and said, hey, here's what we've got,
34:01
here's what I think is the problem, and then can you do the legwork to figure out, are other teams impacted by this? Are any customers impacted by this? And how much do we care about these? It has to come down to, will fixing this deliver business value, or will it
34:20
not so much provide new business value as restore business value we're currently missing? And so, until you can answer that question, it's difficult to prioritize them from a technical standpoint, other than the fact that, looking at this one that's 230-something thousand, I hate getting the every-new-10,000th-occurrence email from Bugsnag, that's really annoying,
34:41
but that's my only metric right now. So eventually, I'll probably fix that one, just because I don't want those emails anymore. Yeah, so how do you keep, particularly errors like the stuff you see in bug snag, because all it does is just provide you with what's going wrong. How do you distinguish that, the background noise, from the things you actually need to work on?
35:02
And that's where I think Bugsnag and systems like it are not enough. That's why I really like SignalFx or Keen, or something like that, where you're able to actually see not only that an error happened, which you have to have separate metric tracking for, but how significant that error rate is in comparison to the total volume
35:22
of traffic coming through your system. And so whenever I showed that slide earlier, where the failure rate is perfectly tracking, just on a slight lag, the number of jobs started, that's a huge red flag, because there's essentially no gap between failures and starts.
35:41
But you will need to look at what that gap is, and that's where other visualization tools like SignalFx, actual pure metrics libraries, will allow you to get an idea of how big the problem is. And of course, Bugsnag can help you there, because in some cases, depending on the way your application is structured, it will actually tell you the number of users affected,
36:03
but because of where business profiles and the worker system that I showed earlier sit in the stack, there's no way to identify the user that triggered certain errors, so we have no idea what the impact is until we start asking people. And so that's where the separate metrics
36:21
to track jobs started, jobs finished, jobs failed, HTTP requests, and the different error and status codes become very important, allowing you to sanity check whether or not the errors you're dealing with are actually affecting enough of your user base to be worthy of inspection and further follow-up.
36:45
Yeah, and so some more input there on how to prioritize bugs is taking into account severity, taking into account frequency, and being able to, again, provide more detail on how impactful a particular bug is, and so that's, of course,
37:00
the more information you have, the easier it is to figure out how impactful a given failure actually is in your environment. Right, well, I think we are out of time now. Thank you all for coming.