Don't Forget the Network: Your App is Slower Than You Think
Formal Metadata

Title: Don't Forget the Network: Your App is Slower Than You Think
Series: RailsConf 2016 (Part 24 of 89)
Author: Andre Arko
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI: 10.5446/31508
Transcript: English (auto-generated)
00:03
Thank you for coming to my talk. That's very kind and generous of you to listen to me talk at you about things. My talk is called Don't Forget the Network,
00:21
Your App is Slower Than You Think. I'm gonna talk about, I guess, things that you probably haven't thought about yet, about how people use your application and about ways that people using your application are having a worse time than you think that they are.
00:43
I'm sorry. I don't really know of any good way to talk about this except by probably making you feel bad for your users. So brace yourselves and you'll be fine. Before I get to that, introduce myself. My name's Andre Arko. I'm indirect on almost all the things.
01:02
That is an avatar of me that, now that I'm looking at it, is one avatar old. I'm sorry, I'll get it fixed by the time I post the slides on Speaker Deck. I co-authored the third edition of a book called The Ruby Way. It's actually pretty great. I learned Ruby from the very first edition of The Ruby Way and it was my favorite book
01:22
except that I couldn't tell anyone to use it because it was about Ruby 1.8. And so I updated it and it covers Ruby 2.2 and 2.3 and if you buy it in a couple years, you can use it to prop up your monitor and make it higher like I do with my copy of The Ruby Way second edition.
01:43
I work at Cloud City Development. We do mobile and web application development from scratch but mostly what I do is join teams that need someone really senior to help with their Rails app or their front-end app. I've done a lot of Ember stuff.
02:01
And I guess if listening to this talk makes you feel like you could use someone to help you feel less bad, talk to me later. That is literally my job. I work on something else you may have heard of called Bundler. I mean, I worked on Bundler for a really long time but it's been a really great experience
02:21
to work on open source and to kind of interact with every aspect of the Ruby community. People do things with Bundler that I would never in a million years have imagined that people do with Ruby. And then I get to help them try to solve their problems. And we've put a lot of effort into making it,
02:41
I guess, easier, I don't know about easy, but easier to get started contributing to open source through Bundler than a lot of other open source projects. And if you're interested in contributing to open source, definitely talk to me later or tweet at me and I would love to help you start contributing to open source. The last thing that I spend time doing
03:01
is called Ruby Together. Oh, I'm even wearing a shirt. And Ruby Together is a non-profit trade association for Ruby people and companies that pays developers to work on Bundler and on Ruby gems so that you all can run bundle install and it actually works.
03:21
And without companies and people giving us money, Rubygems.org just wouldn't stay up and you wouldn't be able to bundle install, because we have to work on it every week to keep it up; its servers, its software, it all breaks all the time. And the only reason that we're able to keep it working now that there are so many people using Ruby
03:42
and using Ruby gems is because companies like Stripe and Basecamp and New Relic and Airbnb are willing to give us money so that we can pay developers to make sure that it all works. We haven't let Rubygems.org go down in the last year which is super great
04:00
but at the rate usage is going up, we need more people to give us money. If you are a manager or if you can talk to your manager about Ruby Together, that would be awesome. So the network and how your app is slower than you think. I guess routing is a thing that your app has
04:28
even if you didn't think that it does. I guess at one point there was a very widely shared article on Rap Genius' blog about how Heroku's router was a sham and everything was awful.
04:41
I guess unfortunately whether you're on Heroku or not, your app has a router and it's probably making things worse than you think they are. So let's talk about how that is and why that is and what you can do about it. So routing, what I mean is the part of your application's infrastructure
05:02
that takes the request from the outside world and load balances it or forwards it or somehow gets it through your infrastructure until it finally reaches your Rails app server and then your Rails app server does some stuff and tells me, hey, this took 45 milliseconds and then it has to go back through NGINX or HAProxy
05:24
or NGINX and HAProxy or whatever it is that you use, back to the outside internet and then across the entire outside internet back to the user who was trying to find that thing out in the first place. So how exactly does this work?
05:40
Maybe you haven't thought about this. I totally don't blame you on your laptop. This is a non-issue, right? In development, this is routing, you. You talk to your app. It's great actually. Unfortunately, in production, you need more than one app server and people are coming from a lot of different places
06:00
and so this is just like a generic Rails app. Not every Rails app will look like this but almost every Rails app looks like this. You have some outside level load balancer. You have some inside level, here's how we split requests up across all of the unicorns or all of the pumas or all of the whatevers.
06:21
And every single one of those lines adds time to what your users see that you never saw while you were working on the program on your laptop. So question time. Raise your hand if you know how long your routing layer takes.
06:43
That's what I thought. I've asked this question in various talks about eight times. I totally expected no one to raise their hands. I've literally had one person ever raise their hand. Eight talks, that's probably like, I don't know,
07:02
closing in on a thousand people now. I once gave, I once asked this question at a DevOps conference and zero people raised their hand. Like I don't expect you to know the answer to this question. But it's actually a really important question to ask because your end user's experience
07:21
is 100% directly impacted by this. Like someone who goes to your production app and tries to use it experiences 100% of your routing layer twice for every request that they make. And like, is it a long time? Who knows, none of us.
07:41
And then on top of that, so not only is there this question of how long does it take in the perfect case from the time they make the request to the time your app is processing it and then from the time your app stops processing it to the time they get the response. None of that time shows up in your nice New Relic graph
08:03
that's like how long this took. Like zero of those milliseconds are included in that number. So you can look at the number and be like, yeah, we answer all our requests in, I don't know, what's a good Rails-y number? 250 milliseconds. I feel like that's a pretty common one. But like, how much time do you need to add to that
08:22
before you know how much your users are actually experiencing? How do you even find out? And then once you find out, what if too many requests come in at exactly the same time? Just having that routing layer where all of your requests come to one point
08:40
and then they fan out across other points, this was the main point of that Rap Genius article: Heroku, and honestly there's nothing else that you can really do that makes sense, just kind of randomly assigns them. Like, well, here's one for you, here's one for you, here's one for you, here's one for you. And the problem is, almost all Rails apps
09:02
have some requests that take 10 milliseconds and some requests that take like a second and a half. And when you're just throwing them out at random to every server that can possibly service them, unfortunately, statistically, it is very likely that you will end up with two horribly slow requests
09:20
stacked up behind each other, and then the really fast requests start to stack up behind those, and it isn't very long before you see a 30-second timeout and you're like, that makes no sense. That request, New Relic says that request takes 10 milliseconds, why would it hit a 30-second Heroku timeout? And so, it's not perfect,
09:42
but you can at least start to get a little bit of visibility into this using a New Relic feature called queue tracking, where you have your load balancer set a header that says, I got this request at this exact time, and then your app server says, well, I didn't get this request until this much later time,
10:01
and then New Relic can add a thing to your graph that says, well, your requests are spending about this much time just sitting around waiting for a server to have availability to answer them. And that can be a completely separate thing that nobody measures, and I've seen it add 50% to the total time users spend waiting on a request.
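As a concrete sketch of that queue-tracking handoff: the load balancer stamps each request with the time it arrived, and something on the app side compares that stamp to the current time. The X-Request-Start header name and t=<timestamp> format below follow New Relic's convention, but the nginx line and the Rack middleware are illustrative only, not New Relic's actual agent code.

    # nginx (or your load balancer) stamps the arrival time on the way in:
    #   proxy_set_header X-Request-Start "t=${msec}";
    #
    # A minimal Rack middleware that reads the stamp and logs the queue time.
    class QueueTimeLogger
      def initialize(app)
        @app = app
      end

      def call(env)
        raw = env["HTTP_X_REQUEST_START"] || env["HTTP_X_QUEUE_START"]
        if raw
          stamp = raw.delete_prefix("t=").to_f    # e.g. "t=1461943200.123"
          stamp /= 1000.0 if stamp > 1e11         # normalize ms-since-epoch to seconds
          queue_ms = ((Time.now.to_f - stamp) * 1000).round(1)
          env["rack.logger"]&.info("spent #{queue_ms}ms in routing/queueing before the app")
        end
        @app.call(env)
      end
    end

As the talk notes a little later, this only works if the load balancer's clock and the app server's clock agree to within a few milliseconds.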
10:23
And it wasn't even measured, no one knew it was happening. Everyone was just like, that's weird. It seems to take a lot longer to get a response than New Relic says it takes to make the response. I wonder, hmm, hmm, you know. So, ultimately, what I'm trying to impress
10:43
on all of you all is that the overall request time is not the number that Skylight, New Relic, pick a service, I don't really care, tells you your request takes. That's a good number, measure that number,
11:00
pay attention to that number. If that number changes a lot, you wanna know why that number changed a lot, because that's really important. But don't think that that number means that that's how long people are taking to get the results of your app running. Don't, right, exactly.
11:21
It's not the time that you measure that your app takes to run. And honestly, even that queue tracking that I was talking about with New Relic requires that the clocks on the load balancer server and the Ruby app server be synchronized so precisely that they can measure milliseconds accurately, and it's very easy
11:43
to end up with clocks that are milliseconds off and then your measurements are off. And so, what you want instead is a holistic measure of how long does it actually take to be a person on the internet, say, hey, Rails app, I want to know a thing,
12:00
and then for the Rails app to say, okay, here's your thing, and then it arrives back. So the strategy that I've actually found really successful here is to deliberately create a Rails controller that returns an empty string, and then set up a service like Runscope or ThousandEyes or even Pingdom,
12:23
like, there are services whose entire reason for existence is so that you can make requests to your own stuff from all over the world and find out how much delay your overall infrastructure adds to your application.
12:41
And if you have a Rails app that returns an empty string, I guess, honestly, you could even do like a Rack middleware that returns an empty string because New Relic measures the Rails framework overhead. So you just want to know about all of the time up to the time it hits your Ruby app and all of the time after it comes out of your Ruby app.
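A minimal sketch of that empty-string endpoint, done as a Rack middleware mounted ahead of the Rails stack so framework overhead stays out of the measurement; the /network-ping path and the NetworkPing class name are made up for illustration.

    # Responds to a fixed path with an empty body before the request ever
    # reaches Rails, so an external prober (Runscope, ThousandEyes, Pingdom, ...)
    # measures network + routing + Rack time and nothing else.
    class NetworkPing
      def initialize(app)
        @app = app
      end

      def call(env)
        if env["PATH_INFO"] == "/network-ping"
          [200, { "Content-Type" => "text/plain", "Content-Length" => "0" }, [""]]
        else
          @app.call(env)
        end
      end
    end

    # config/application.rb (assumed placement):
    #   config.middleware.insert_before 0, NetworkPing

Point one of those external monitoring services at this path from several regions and you get exactly the weather report described next.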
13:01
And so you can use one of these monitoring services to say this is the weather report for our users around the world, and honestly, I've worked at companies where 60% of their traffic was from the US, but for no particularly apparent reason, 35% of their traffic was from Brazil. And then you really care a lot
13:22
about network conditions changing and meaning that traffic to Brazil got a lot slower today, you should figure out why that happened and maybe think about setting up a CDN in Brazil. Because if your traffic numbers are relevant to your business making money, they almost always are, this matters a huge amount,
13:42
and right now, chances are good, nobody has any idea what they are. Are they bad? We don't know, are they great? We don't know that either. Maybe they're great, like honestly, if all of you go home today and start monitoring these numbers and they're fantastic numbers, I will be extremely happy for you.
14:03
Based on past experience, unfortunately they're probably not gonna be that great. But knowing what they are is way better than having no idea that they exist. So, very closely related to things taking longer than you think they do,
14:22
let's talk about servers. So, I'm assuming that if you have things deployed, you have servers, this seems like a good bet. Let's talk about what's happening on your servers. Stuff, right? Like, you buy them and you rack them
14:42
or you rent a fraction of one or, I don't know, you rent a fraction of a fraction of a virtual machine that is a fraction of a physical machine, it happens, right? Like you end up with a piece of a computer and some stuff is happening on that computer.
15:00
And even if you bought the computer yourself and racked it yourself, it's still running a ton of stuff and you have no idea what that stuff is. And I'm not gonna tell you that you need to know what all of that stuff is, but I am gonna tell you that it's really important to know how that stuff is impacting the thing that you do care about,
15:20
which is your users and experience. And so, a big thing that impacts this, whether you use Ruby or Python or Node or Go, you have a runtime for your application, right? Like, even Go has a garbage collector and a framework that all Go programs run inside,
15:41
and what that means is your application sometimes isn't running while your program is running. And when that happens, your code isn't running, your instrumentation isn't running, and you have no idea how long that took.
16:02
So, if the garbage collector runs and your entire application just stops for a while, how do you know that that happened? How do you know how long it took? Like, you can't write, it's really hard to write code that measures time that your code wasn't allowed to run.
16:26
So, based on real-world usage, definitely Go and Java and Ruby all have garbage collection pauses where execution of your code just, nope, hang on, wait.
16:40
Gotta collect some garbage. Okay, that's good, you can keep going. And Ruby, I guess, recently has added a thing called GC::Profiler that at least reports after the fact how long garbage collection took, which is awesome. But there are more reasons than just garbage collection that your code could end up paused.
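For the garbage-collection piece specifically, the profiler mentioned here appears to be Ruby's built-in GC::Profiler; a minimal sketch of using it, with made-up work standing in for real requests:

    # Report, after the fact, how much wall-clock time GC consumed.
    GC::Profiler.enable

    1_000.times { Array.new(10_000) { rand.to_s } }  # placeholder workload

    puts "GC time so far: #{(GC::Profiler.total_time * 1000).round(1)}ms"
    GC::Profiler.report   # per-collection breakdown on stdout
    GC::Profiler.clear    # reset, e.g. at the end of each request window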
17:02
And so, what you actually want is some way to say, I can tell that my code stopped working, stopped running, right? It was still working, but it stopped running for a second, and then it started running again, and how long was that? And there's this, I learned this trick from somebody who works at Paper Trail, Larry Marburger,
17:21
and I think he got it from some of his colleagues at Paper Trail, it's super clever. What you do is you start a new thread with thread.new, this is like Ruby-specific, but you can do this in any language. You start a new thread, and then you say, what is the time, sleep one, what is the time, and then you subtract them,
17:40
and you send the difference off as a metric. And if your code stops running, sleep one will take longer than one second. Little known fact. And so, by monitoring how much wall clock time passes while a thread in your application is calling sleep one,
18:01
you can accurately graph how much overhead the surrounding interpreter is adding to your overall execution time. And I have definitely seen this happen where you're running a Ruby program, and you're like, that's weird, it seems kind of slow, and then you check the how long does it take
18:21
to sleep for a second graph, and you're like, oh my god, we're spending 150 milliseconds every second doing something that's not running my program. Sometimes that means you have a memory leak, sometimes that means that machine just got into a really bad, weird state. But at least then you know, and at least then you know that it's exactly that one app server that's having the problem and all the other app servers are fine. Super useful.
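Here is a minimal sketch of that sleep-one trick as it might look in Ruby; report_metric is a stand-in for whatever metrics client you actually use (statsd, New Relic custom metrics, and so on), and tagging with the hostname is what later makes per-machine breakout graphs possible.

    require "socket"

    # Placeholder reporter; swap in your real metrics client.
    def report_metric(name, value, tags = {})
      puts "#{name}=#{value} #{tags.inspect}"
    end

    Thread.new do
      host = Socket.gethostname
      loop do
        before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
        sleep 1
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - before
        # Anything above zero is time the interpreter/VM spent not running your code.
        lag_ms = ((elapsed - 1.0) * 1000).round(2)
        report_metric("interpreter.lag_ms", lag_ms, host: host)
      end
    end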
18:42
Very closely related to this, now that we're talking about interpreter lag, is the virtual machine you're running on: you're probably also running that interpreter inside a virtual computer. And on Amazon, DigitalOcean, Heroku, Engine Yard, OpenStack,
19:04
you're either running on a VM or you're running on a VM inside of a VM, or maybe even if you use Docker, a VM inside of a VM inside of a VM. Hooray. And as you can imagine,
19:20
this is yet another way to have weird times where your code doesn't run and you don't actually know it because your code literally couldn't run. And it's even worse than that because sometimes you'll end up with resource specific contention, right? Like in a VM, what if you're on a VM and one of your co-tenants suddenly, like your co-tenant is running a memcached server
19:43
and so it's just like all the memory IO is going to your co-tenant. How do you even know if that's a problem? What if they're doing something really storage heavy and that means you can't get IO anymore? And so there's like, at a minimum,
20:01
the resources that you may care about are CPU and memory and disk, right? And network I/O, I guess, a second flavor of I/O. And you don't know when you get a shiny new empty VM, like maybe everything is great
20:21
and maybe that machine has basically no network IO available or has basically no memory IO available because of co-tenants that you don't know exist. And so Netflix has a really clever way to check for this, I guess. They've written about it kind of at length and what they end up doing is
20:40
Netflix spins up a new EC2 instance and then before deploying to it, they shove a giant pile of benchmark suites onto it and they run the benchmark suites and then they compare it to what they've decided is acceptable performance for that price point on EC2 and if it's below their acceptable benchmarks, they throw away that VM and get a new VM
21:01
and then try the benchmark suite again and then throw that one away and then get a new one and eventually, they hit an instance that meets their criteria and they said in their paper that they have observed almost an order of magnitude in difference in performance at the same price point because Amazon sells both the newest generation of hardware
21:23
and two or three or sometimes even four generations old as the same VM, very large air quotes around same, and then you have to deal with co-tenancy issues where you may have a VM on a very old, very heavily contended physical machine,
21:41
you may have a VM on a brand new uncontended machine and so for Netflix, they said that doing this and I'm probably gonna get these exact numbers wrong, I'm sorry, it's been a while since I looked at that paper, they saved something like a third of their overall server costs by doing this benchmarking and only accepting VMs
22:00
that met their minimum criteria because of the amount, like they have a static amount of traffic that they need to serve but they got machines that were more capable to do it at the same price point and so they just needed to spin up less machines and pay Amazon less money. I guess, so you're probably not Netflix,
22:22
this probably doesn't matter to you that much but it is at least something that you can be aware of when you're like, man, 10 servers seem to be enough to serve this traffic last week, right? And then specifically, do you know what it is that your app cares about?
22:41
It is entirely possible that your application is completely CPU bound and you honestly don't even care if your co-tenants are doing tons of IO, but you care a lot if your co-tenants are doing video encoding. Maybe it's a memcached server and you're just memory IO bound. Maybe it's Postgres and you're everything bound; Postgres just wants everything.
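A rough sketch of that benchmark-and-reject idea; the benchmark bodies and thresholds below are invented for illustration, and in practice you would benchmark whichever resource your app is actually bound on.

    require "benchmark"

    # Quick micro-benchmarks for a freshly provisioned instance.
    def cpu_seconds
      Benchmark.realtime { 200_000.times { |i| Math.sqrt(i) * Math.log(i + 1) } }
    end

    def disk_seconds(path = "/tmp/bench.tmp")
      data = "x" * (8 * 1024 * 1024) # write and read back 8 MB
      Benchmark.realtime do
        File.write(path, data)
        File.read(path)
      end
    ensure
      File.delete(path) if File.exist?(path)
    end

    MAX_CPU_SECONDS  = 0.5   # assumed ceilings; tune per instance type and price point
    MAX_DISK_SECONDS = 0.3

    def instance_acceptable?
      cpu_seconds < MAX_CPU_SECONDS && disk_seconds < MAX_DISK_SECONDS
    end

    # In a provisioning script: reject and re-request until one passes.
    # terminate_and_replace! unless instance_acceptable?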
23:02
But this is the kind of thing that actually matters, and knowing this difference can be a really big difference in how many servers you need and how much your servers cost, and as you get bigger and bigger, one third more performance for the same cost
23:22
becomes a larger and larger number that is worth putting more and more effort into getting. So now that I've convinced you that you need to measure all of these things that you weren't measuring before, let's talk about metrics. I guess, good point about the Ruby community,
23:41
they're pretty good at measuring metrics. That's great. New Relic makes it really easy. You just gem install New Relic and, hooray, I have metrics. Metrics are really important. Tracking things, as we were just discussing, is really the only way to know what's happening. Without metrics, your production is kind of a black box
24:00
and you're like, oh, things aren't as good as they were before. I don't know why or probably even how exactly, because I wasn't able to measure, didn't know how to measure, the things that matter. So really, the first time the importance of metrics hit home for me
24:20
was in 2009 at GitHub's first CodeConf. I saw a talk by Coda Hale called Metrics, Metrics, Everywhere. And kind of the underlying point of his talk was that the reason that all of us have jobs and the reason that all of us write software is to deliver business value, whether that's to our bosses or to customers or to clients.
24:44
Most of the software exists for the purpose of delivering business value, especially if you're getting paid to write it, right? And if you can't measure it, you can't tell if that is what you are doing.
25:01
So having said that, you're probably not super impressed by me telling you that metrics are important, right? So you do need to know what's going on. There's a catch. Once you have metrics, you have a tendency to become convinced that you now understand what is happening.
25:22
And I don't blame you, I do this too, right? It's like a human thing. You're like, oh, I'm measuring a thing, now I understand it. Just like being able to see the speedometer does not tell you how the car's transmission and engine work, being able to see a metric on your application does not tell you about how and why it is working.
25:42
It just tells you something is very different than it was before, and now you need to figure out what it is and why it is different. And a very common problem is that having metrics, having some visibility makes people think that they have total visibility, and that just isn't how things work, unfortunately.
26:04
So at the end of this little bit about metrics, this is probably gonna be you instead. I'm gonna talk about some ways that metrics actively mislead you, and the biggest thing that causes this kind of misunderstanding driven by metrics is averages.
26:24
When you have a lot of metric information, especially if you have a bunch of app servers, the easiest way to distill that down into something that you can quickly communicate is to take the average. A super good example of this is the way that New Relic's dashboard, when you first open it,
26:41
it's like, here's a giant number, this is the average of all requests across all app servers. So you see those graphs, you see the numbers going up and down, you're like, great, now I know what's happening with my app, right? Unfortunately, no. Brains are really highly developed, carefully tuned pattern matchers.
27:01
This is how humans can see Jesus in toast. This is how you can see an average and think, I know what that means. So your brain's immediate extrapolation from an average is probably what's called a normal distribution.
27:20
There we go, normal distribution. You think, oh, the average, it's gonna be right at the top of that. This is often called a bell curve and it's what happens when all of the inputs into the graph are generated by a random function. Tell me if you think that your app is a random function.
27:42
I mean, maybe it feels like it's a random function, but your app is not actually a random function and the practical upshot of that is that it doesn't look like this at all, right? Like, this is a more realistic graph of what might be producing an average that's right at the zero point on that graph.
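A toy illustration of that, with made-up numbers: when most requests are fast and a few are very slow, the average lands at a value almost no real request ever had.

    # 95 fast requests around 10ms plus 5 slow ones around 1.5s.
    timings_ms = Array.new(95) { 10 + rand(5) } + Array.new(5) { 1500 + rand(200) }

    average = timings_ms.sum.to_f / timings_ms.size
    puts "average: #{average.round}ms"                    # roughly 90ms
    puts timings_ms.count { |t| (t - average).abs < 20 }  # requests actually near the average: 0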
28:05
To kind of drive home how wildly misleading averages can be, let's look at a bunch of real-life graphs at the same time. This is a whole bunch of different measured metrics from a real-life thing. It was a MySQL benchmark. It doesn't really matter what it is.
28:21
From left to right, it's collecting the number over time. Near the left, it's the things that were fast and then as you go to the right, it's things that were slower and slower and slower during the benchmark. And so, the small black vertical lines that you maybe can't see very well
28:40
represent the average for that particular line. So, to make it easier to see, I'm gonna line up all of the averages of this same graph. So, none of these look like a bell graph, right? A bell curve; not a single one of these looks like a bell curve.
29:01
Worse than that, most of them have zero actual data points at the average line. It's really characteristic to have a very large number of points either clustered together in the fast zone or spread out over the long tail of slow things
29:20
but if you look down near the bottom, some of these lines don't even have a single result that's near the average line. And so, if you're looking at New Relic, you might not even have any requests that take the amount of time that is the number of milliseconds that you're seeing in giant font on your dashboard.
29:43
This is the problem of averages, right? Unless your metrics are being generated by a random function, the average is going to actively mislead anyone who sees it. There's a great quote about this from a tweet by a friend of mine, SF Eric.
30:01
Problem with averages, right? On average, everyone's app is awesome. And so, again, averages, I guess the single good thing that is really great about averages is that they can tell you that something changed.
30:23
You can say, oh, my average was this before but my average is this now. That's weird. The problem with averages is that they can't tell you what changed or how it changed. And it's actually possible to get that information out and so I'm gonna show you how to do that.
30:41
So, while averages can tip you off that something changed, here's a graph of an average. And as you can see, things are taking somewhere under 100 milliseconds. But that effectively means there could be tons of things
31:01
happening that take about 100 milliseconds or there could be tons of things happening that take 10 milliseconds and tons of things happening that take three seconds. It's an average, so there's literally no way to know. One way to get around this is to graph the median rather than the average. The median is the number that was bigger
31:22
than half of the numbers and smaller than half of the numbers. The great thing about the median is that you are sure that it actually happened, right? The average may or may not have ever actually happened but the median definitely happened. And so, if we add the median to this graph,
31:41
you can see on the purple line, we now actually know more information than we did before. Half of the values are actually very, very fast. It looks like around the 10 millisecond range, maybe 20 milliseconds. Even though the average jumped all the way up to 150 milliseconds at one point,
32:02
at least half of the requests were happening still equally quickly. They didn't slow down. That tells us that since most of the requests didn't slow down, this wasn't like an application-wide change, right? We didn't suddenly get a really slow load balancer. This wasn't a network switch problem
32:22
where all of the traffic was impacted. The next thing you can do is graph other percentiles. The median is the 50th percentile, right? Half is below, half is above. Start graphing the 95th percentile. 95% was below, 5% was above.
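A minimal sketch of pulling those numbers out of raw request timings instead of relying on the average; timings_ms would come from wherever you collect measurements, and the sample data is invented.

    def percentile(sorted, pct)
      sorted[((pct / 100.0) * (sorted.length - 1)).round]
    end

    def summarize(timings_ms)
      sorted = timings_ms.sort
      {
        average: (timings_ms.sum.to_f / timings_ms.size).round(1),
        p50:     percentile(sorted, 50),
        p95:     percentile(sorted, 95),
        p99:     percentile(sorted, 99)
      }
    end

    # 98 fast requests plus two very slow ones drag the average far above the median.
    timings = Array.new(98) { 10 + rand(10) } + [1800, 2400]
    p summarize(timings)
    # => e.g. {:average=>57.0, :p50=>14, :p95=>19, :p99=>1800}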
32:40
Here you can see that the slowest 5% of requests got dramatically slower. More than 10 times slower than the median. And that's what dragged the average up. Oftentimes even better than the 95th percentile
33:01
is the 99th percentile. It's one out of 100, right? So this is actually a pretty good indicator of what the occasional slow request looks like. Well, I had to rescale the graph.
33:20
And the slowest 1% of requests are now clearly the entire reason why the average tripled. The median stayed exactly the same. The median is now a flat line. And that slowest 1% is probably some single specific controller action that you now need to go find and figure out what exactly happened
33:40
to that specific single thing. And so just by graphing the percentiles rather than the average, we can immediately rule out about half of the possible problems that made our average slower. And it works the other way around too. If you look at the graph of the 99th percentile and it isn't dramatically different even though your average is higher,
34:02
then you know not to look for a single controller action. You know to look for a systemic problem. Aggregate graphs, this is another really common thing where aggregation is a fancy way to say I got the many versions of this metric from many servers and then I averaged them.
34:23
So here again, here's an average graph. This one happens to be taken from the actual Bundler API. And it is a graph of the trick that I mentioned where you call sleep one and then you see how long it took. So the number that we're tracking here is milliseconds, and it went from taking
34:42
one second plus two milliseconds to taking one second plus five milliseconds. And that means that garbage collection pressure must have been twice as bad, question mark. We don't know, this is an average. And you can improve this with breakout graphs.
35:01
If you are collecting a number from 25 machines, put 25 lines on your graph instead of one that will mislead you about what all of the different machines are doing. Here's a breakout graph of the same data. Kind of like I was mentioning before, with the breakout graph, we can see holy crap, this had to be rescaled.
35:20
One of the machines started taking 35 milliseconds per second to sleep. But all the other machines were basically fine. And so we wound up resolving this issue by just killing the one dyno that was having trouble and restarting it as a fresh dyno but we didn't have to nuke all of our dynos, we didn't have to, right? This narrowed down the problem immediately
35:42
just from having a breakout graph. So, do it, visualize your data. Here is an example of why visualizing your data is so, so, so important. These are some different data sets. Each orange dot is a single entry on that data set.
36:03
Can anyone guess what the blue line is? That's the average: average, average, average, average. It's actually even worse than that.
36:22
The average of Y is exactly the same on every chart. It's actually even worse than that. All four data sets have the same average of X, average of Y, variance of X, variance of Y,
36:42
the same correlation between X and Y, and the same linear regression. Actually graph your data and then look at it, because the numbers, the averages and the variances and the correlations and the linear regressions, don't contain any of the information about what is different in those graphs.
37:03
One final note, a lot of people that even talk to me about how awful averages are, I then am like, oh, hey, so how do your alerts work? And a lot of people have alerts that are set up to only talk to them after the average is bad.
37:23
And as you can maybe guess, by the time the average is bad, it is too late. Definitely break out your alerts as well as your graphs, right? You wanna know when the first server went down, not when the average of the servers is a down server.
37:45
Right, so, ultimately, I really just wanted to let you guys know that the network is a part of your application. Most people don't think about it because they don't have to interact with it in their day-to-day development on their own local machine.
38:02
And after you have deployed your application, it is really user experience that matters, not how many milliseconds your Ruby app spends running code. That's it.
38:21
So the question was, if you don't alert on averages, how do you prevent continuously alerting, getting alert fatigue, and then not noticing that something actually bad happened? And the question included a note that there is no silver bullet for this, and unfortunately the answer is there is no silver bullet for this.
38:42
So the best plan that I have ever seen from the best operations people that I have worked with is figure out what the baseline of your system when it's functioning is, and alert when your system is not that. That means figuring out how many requests
39:02
you're successfully serving per minute, and alerting when it deviates from that more than 50%. Figuring out when it's normal to have that deviate, and then not alerting on that. And it's actually a ton of work because every single application has a completely different norm. Some Rails applications, they serve like 50 requests a minute
39:23
and that runs their entire profitable business. Some Rails applications serve hundreds of thousands of requests a minute, and they're not profitable yet. You need to figure out how it is that your metrics look when your company is functioning,
39:41
both software-wise and company-wise. And you need to alert when it's not the thing that gives you the indicator that things are okay. That's really the best advice that I have for you. Five seconds. Any more questions?
40:04
All right, I'm happy to talk about this stuff later.