Get Instrumented!
Formal Metadata

Part Number: 144
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/21150 (DOI)
Transcript: English (auto-generated)
00:00
Our next presenter is Hynek Schlawack. He's a developer at Variomedia. Thank you. Hello, hello. So first of all I would like to thank everyone who actually did what I asked and came to the front rows. Thank you so much for moving up.
00:22
I also have to apologize. I'm a bit physically inconvenienced because I managed to get a blockage in my neck yesterday while doing my hair. So if you ever wonder how it is to get old, it's not gray hair, it's not being able to do your morning stuff without hurting yourself.
00:42
If you're local and you know some physiotherapist who could de-block my neck, please come and talk to me after this talk. So other than that, some may or may not know me. I work at a smallish hosting company called Variomedia. And I want to teach you how numbers and colorful graphs can improve your life and
01:03
the life of those that are impacted if you get paged out of bed at 4 am. And my clicker is not attached. So, concretely, I want you by the end of the talk to be able...
01:22
I'm not good at multitasking, please. One minute. So, I want you by the end of the talk to be able to predict performance problems. Because if you can prevent them, or if you can just stomp out a little blaze, it's so much better than having to fight a huge operations fire.
01:42
If something happens anyway, I want you to be alerted by your system with useful data and not by your hysterical boss on Slack or by your very angry customers on the phone. And if a fire is burning, I don't want you to stare at a useless top output
02:02
hoping to come up with some inspiration while your boss is still poking their finger into your back. I want you to have data about your systems right at hand. And, of course, in a meaningful format. And that also includes historic development, because once you run into a fire, you want to know how you got there.
02:23
Especially because these cycles should feed back. Once something happens, you want to be alerted the next time, and ideally you want to prevent it. So, if you want to reason about situations like this is good, this is bad, this is really bad, you need an objective representation for these situations.
02:47
And for that you can express the quality level of your service. So you may have already seen these words, usually with a third different word. So first you need an indicator, which can be the request latency or the uptime, something like that.
03:04
Once you have an indicator, so you have something to talk about, you can formulate service level objectives. Which can be latency must be always under 100 milliseconds, you have to have five nines uptime, things like these. And then finally, probably the most famous one, once you have objectives, you can formulate contracts and agreements on top of those numbers.
03:26
Because what will happen if those objectives are missed? Are we going to sue you? Are we going to cancel our contract? Now, agreements are not part of this talk, but SLIs and SLOs are. Because SLIs are just metrics, and SLOs are conditions you want to fulfill.
03:44
In other words, you want to get alerted if you are not fulfilling them. So, one step back, what are metrics? Metrics are numbers or samples in a database. They are time-stamped, which makes them a time series. And you're going to have to have a lot of time series, which means that you can correlate them.
04:04
For example, your request latency versus your load, a very typical use case. Now you get those time series by adding instruments to your system. And a system can mean anything, it can be your app or it can be a server. It's just like on a car or on a plane, except that these instruments now get hooked up to a time series database that will store them.
04:27
And then depending on what time series database you're using, it will allow you to do queries and operations on top of them. Now, you have to obviously instrument your app. Next up is probably some dependencies, like your database, your web server, your load balancer.
04:44
They all carry very important and useful information for you to correlate with your application data. Then, of course, your environment, your server load, your memory, your IO activity. And finally, and that's kind of underappreciated, I think, you can also instrument your business.
05:02
Like the number of your customers, the number of your paying customers. Maybe you're not in San Francisco, so the numbers are more similar here, but still. Or your daily revenue. So seeing a graph that correlates your front-end latency with your sign-up rates or your revenue can be enlightening sometimes.
05:20
Especially if you are arguing with your boss about whether or not you need that SSD. Now, nothing of this is new. People have been doing this for years. I've been doing this for years. Actually, I've been talking about this last year, just here. But in the past, you had to choose multiple components with various trade-offs.
05:42
And most notably, they aren't integrated. And this is a bad situation to be in if you wake up one morning and say, okay, I want to have metrics, what do I do? And now, you have to learn basically everything and then choose what you want to use. And others, like StatsD, have some really, really bad properties, but you don't realize that until the fire is burning.
06:02
So, I find that Prometheus is different. And that's because it gives you a well-rounded and opinionated metrics and monitoring system, which is integrated. It's absolutely flexible, but it has a proven and well-documented starting point. So, opinions are a dime a dozen, obviously.
06:22
But in this case, it's okay to listen, because it's more or less a reimplementation of Google's internal monitoring system, implemented by ex-Googlers working, in this case, at SoundCloud, who were just missing their pet monitoring system.
06:40
So, to give you an idea how it works, let's have an architecture walk. Core feature, of course, is the storage of time series. Now, a time series is really just a named stream of float samples with timestamps, that's all. But Prometheus wants you to think in terms of four types that are built on top of these streams.
07:07
So, first there are counters, which are for counting events or counting anything. The important property is that counters can only increase, but they can increase by anything. So, you can use them to measure your network traffic or to count your errors, whatever.
07:23
If you need to set arbitrary numbers, a gauge is for you. A gauge is for exposing numbers, and it can be set to anything, so it's used for things like server load, temperatures, or the number of active requests right now. And these two are pretty obvious how they map on a timestamped float stream, but the others are more interesting.
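To make the two simple types concrete, here is a toy sketch of counter and gauge semantics in plain Python. This is illustrative only, not the real Prometheus client API:

```python
class Counter:
    """Monotonic: may only ever go up, but by any amount."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """May be set to any value at any time."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


errors = Counter()
errors.inc()      # count one error
errors.inc(3)     # counters may jump by any amount

load = Gauge()
load.set(1.37)    # e.g. the current server load
```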
07:44
So, a summary takes measurements, so it observes measurements and allows you to compute the rate they come in, like requests per second, and the average measurement, like the average request time. Now, some clients, and Python is explicitly not one of them, also allow you to define percentiles which are then computed within the app.
08:05
The reason why it's not in there is that it's not really useful, because you cannot meaningfully aggregate percentiles. It's just not how math works. So, instead, you should use histograms, and that's like the workhorse of metrics in this case.
08:27
It's also about observing values, and you keep track of averages, but additionally, you define buckets. And these buckets should have the typical sizes of the values we are measuring, and then Prometheus can estimate percentiles server-side from these buckets.
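The bucket mechanics can be sketched in a few lines of plain Python. The bucket bounds here are made up, and the real client exposes these as cumulative `le` buckets; note that keeping track is just cheap float additions:

```python
import math

# Hypothetical bucket upper bounds; the last one catches everything.
BUCKETS = [0.1, 0.5, 1.0, 5.0, math.inf]


class Histogram:
    def __init__(self, bounds=BUCKETS):
        self.bounds = bounds
        self.bucket_counts = [0] * len(bounds)
        self.count = 0       # number of observations
        self.sum = 0.0       # sum of all observed values

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Buckets are cumulative: a value lands in every bucket
        # whose upper bound it fits under.
        for i, le in enumerate(self.bounds):
            if value <= le:
                self.bucket_counts[i] += 1


h = Histogram()
for latency in (0.05, 0.3, 0.3, 2.0):
    h.observe(latency)
```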
08:44
Which also means that you are not deriving numbers in your application while it's serving some important requests. It's a very nice property. Now, I've said percentiles twice now, which is because they are very important, so I'll give you a quick rundown, so just you're on the same page.
09:01
And it starts with the premise that averages are probably less useful than you might think. And to have something concrete to talk about, let's assume we measure request latencies. And I think it's fair to say that request latencies are a good indicator of the quality of a service. Fast requests are good, slow requests are bad, it doesn't matter if it's a web page or an API.
09:23
In any case, you want it to be fast. Now, the average time is not the average user experience in this case. Let's look at this example. No user is experiencing a latency of 2.8 in this example. So, not only is it not the correct answer, it's also muddling all numbers together.
09:48
And you don't see that one request is really, really bad, while the others are just fine. And the problem here is that there are no bell curves in production. All production data you will encounter in your life is skewed in some way.
10:07
So, yeah, and that means basically you may be wasting your time on optimizing a perfectly good average case, while there is just some outlier for some reason and you will not ever find it if you do not know that it's an outlier.
10:23
So, what is the average experience, or what does the average user experience here? It's one. And if you remember high school, there is a function that would have told you that. It's the median, which takes a sorted data set and gives you the middle value, or the average of the two middle values if it's an even-sized set.
10:44
Now, the median's strength in representing the average user in this case is also its biggest weakness, because this still returns one, and I think we can all agree that this is not useful information to receive.
11:00
Fortunately, where the median comes from, there is more, and this brings us back to the percentiles. They also partition a sorted data set, but this time into 100 parts. And then you look at the nth value for the nth percentile, or the nth percentile p is the upper limit of n percent of data set values, which sounds super confusing, but it just means the following.
11:22
If the 50th percentile is one millisecond, then it means that by one millisecond 50% of your requests are done. That's all that it means. And if you think about it, that's actually our median again, which again is not useful by itself, but we can go further.
11:41
We now have a parameter that we can tweak. So, let's look how long the 95% of our fastest requests took. And we see we have a problem. Something is very, very wrong. And something between 50 and 5% of our users are affected by this. So, at this point you can drill deeper, because as said before, Prometheus is computing its percentiles server-side,
12:08
so they are not fixed. You can always try to find others. And, yeah, with an average you wouldn't have gotten anything useful either. You would just think that all take forever.
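The whole argument fits into a few lines of plain Python with made-up latencies: nine fast requests and one pathological outlier. The mean describes nobody, the median hides the outlier, and only a high percentile surfaces it (using the nearest-rank method, one of several percentile definitions):

```python
import statistics

# Hypothetical request latencies: nine fast ones, one outlier.
latencies = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]

mean = statistics.mean(latencies)      # 2.9 -- no user experienced this
median = statistics.median(latencies)  # 1   -- hides the outlier entirely


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value covering p% of the data."""
    k = max(1, round(p / 100 * len(sorted_values)))
    return sorted_values[k - 1]


p50 = percentile(sorted(latencies), 50)  # the median again
p95 = percentile(sorted(latencies), 95)  # the outlier shows up
```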
12:20
Now, the problem with percentiles, and not a lot of people talk about that, is that they throw away most of the data. And that's a problem if you want a representation of your service health or your service quality. So, in the end, you still need the average to have a number that distills everything
12:41
and doesn't just look at certain values. So, now that we have the math out of the way, let's talk about naming. Anyone who ever used Graphite, alone or together with StatsD, will have seen something like that. They put the metadata into the metric name, which is kind of annoying. So, any modern TSDB, and Prometheus is one of them, switched to bare names.
13:08
So, the best practice here is to prepend an app name, which here is not a good app name, just a short one, so I can use a big font on my slides. And to append a unit. A "total" suffix means it's a counter.
13:22
If you are measuring times, you would have seconds or something like that, so it's a bit self-explaining. Now, this metadata is added using so-called labels, which looks like this. And each new label combination still adds a new time series, or, as they call it, a dimension.
13:40
So, that means that you do not get less time series, but it's much more readable, it's structured, so you can do aggregations on it in a much nicer way, like formulate queries on label values, it's really nice. Now, how do you get those values?
14:00
And that's where it gets kind of interesting, because contrary to most metric systems, Prometheus is pull-based. Which means that each instrumented system exports its metrics via HTTP, and Prometheus scrapes them for you. So, if I'm using the metaphor from before, you add instruments to a system, and Prometheus looks at it regularly,
14:24
writes them down with a timestamp, and is done. This means a lot of things. So, first of all, you can adjust the resolution of each single target by configuring how often the metrics are scraped. So, if you want more frequent scrapes, you get more precise data, but it uses more disk space, it's always a trade-off.
14:46
It also means that if scrapes fail for some reason, like, say, a high load, you don't lose data or meaning, you just lose resolution. Which is kind of important, because your average rates still make sense.
15:00
Compare that to a push-based approach, where lost samples actually mean that your rate is sinking. So, it looks like things are going down, although it's rising beyond the capacity of your system to report metrics. And this makes Prometheus really, really great for monitoring. But it's a bad fit for things like accounting, that's a common question on the mailing list.
15:25
You do not get the single values, like the single request times, you just get averages out of it and can do useful things with them. But it's not an accounting system, so then you have to go for something like Postgres or InfluxDB, if you need each single number.
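The claim that a missed scrape only loses resolution, not meaning, can be sketched in plain Python. This is a simplified version of the idea behind Prometheus's rate calculation; timestamps and values are made up:

```python
# Hypothetical scrapes of a monotonic counter as (timestamp_seconds, value).
# The scrape that was due at t=120 failed under load and is simply missing.
scrapes = [(0, 0.0), (60, 600.0), (180, 1800.0)]


def rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# 1800 requests over 180 seconds is still exactly 10 requests per second,
# despite the missing sample in the middle. A pushed rate would have
# dipped instead, which is exactly the misleading signal you don't want.
per_second = rate(scrapes)
```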
15:41
Now, there's a few problems too, of course. So, one is short-lived jobs, like your backup script. You're not going to convert your cron scripts into web services, just so someone can scrape metrics. And there's an official solution for that, it's called the Pushgateway, which will receive the data from your short-lived script and retains it for Prometheus to scrape.
16:04
Problem solved. Then there's, of course, the problem of target discovery. If you want to scrape something, you have to know that it actually exists. Some people consider this a problem, but it's actually just moved the problem of knowing what your production systems are from monitoring into your metric system.
16:24
Because Nagios also needs to know about all your systems, so you're not getting around telling some system about your systems. And you can do it either by configuration. This will tell Prometheus to scrape itself, so it gives you the number of time series and your buffer usage.
16:41
This is an exporter, a target, or an instance. It all means the same. And a group of those together is a so-called job. So, for example, if you have multiple Prometheus servers, you could scrape them all there. Or if you have multiple backends of the same web service, they are one job, but multiple instances.
17:02
And now, these two values you get automatically for each scraped metric as labels. So you can do filtering and aggregation on top of this. So, in practice, of course, you're not going to do static configuration. You will use some kind of service discovery. We personally use Consul, it works great, but people have been using it with other systems very successfully too.
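A minimal static configuration might look roughly like this (job names, hosts, and ports are made up; in practice you would plug in service discovery such as Consul instead of `static_configs`):

```yaml
scrape_configs:
  - job_name: prometheus            # Prometheus scraping itself
    static_configs:
      - targets: ['localhost:9090']

  - job_name: webapp                # one job, multiple instances
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'backend-1.example.com:8000'
          - 'backend-2.example.com:8000'
```

Each scraped series automatically gets `job` and `instance` labels from this, which is what the filtering and aggregation mentioned above operate on.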
17:26
Now, there's one final problem, and this is actually a problem. And that is closed or NATed or load-balanced systems like Heroku, or end-user appliances that run in the local network of a customer.
17:43
Because you cannot expose things, really, and if you do, people may get really mad at you. So, in the case of Heroku, there have been talks about an official plug-in. As far as I know, there's nothing concrete yet. And other than that, there's no really good solution, Prometheus is not a good fit for this.
18:00
Generally speaking, Prometheus is intended to run in the same network as its targets. If you cannot do that, you probably have to look elsewhere. So, but there's a lot of advantages too. So first, high availability is super easy. You just run multiple Prometheus servers and point them at the same exporters. Done.
18:20
And this also means that you can have production data in your test environment. So, for example, we had an intern, and we wanted to make him work on our metric system. But we never had him touch our production Prometheus. He had a Prometheus on his notebook, and he got access to the metrics endpoints of the systems that were relevant to his work. And he could do everything he needed. That's a very nice property.
18:44
Then, outage detection is really easy. If a scrape fails, you know something's fishy. Reasoning about how long you didn't hear from a system, so it's probably dead, is possible, but more complicated. What I personally like is the predictable effect on the infrastructure.
19:03
Because more traffic does not mean more metric traffic. It's always the same. You said once how often you want to scrape your data, and that's it. Which also means that it does not congest an already busy network if something is going on in your system.
19:23
And finally, it means that instrumenting third parties is pretty easy, actually. Because any production-ready system has some kind of instruments that it exposes to its users. So any database has a special table of performance metrics. Web servers have their status pages. Java has its JMX.
19:42
Now we just have to take these metrics and transform them into something that Prometheus understands. And it turns out, what Prometheus understands is pretty easy for you to understand, too. So let's look at how it looks like. This is what an exporter exports. There's always at least the option of the human readable format.
20:03
And in this case, it is the first part of a histogram about request latencies. Again, a very bad metric name, a short metric name for a big font.
20:24
And this time series is the number of measurements that have been observed. So how many requests did we observe? And the second one is the sum of the measured times, like the total time observed. So in this case, we had 390 requests that altogether took 177 point something seconds.
20:46
And this is super cheap to keep track of. We're just adding float numbers. And these are also literally the samples that Prometheus stores if you're using the summary type in Python.
21:00
This is all you get. So, to get percentiles, as I've said, you also need buckets. And they look like this. In this case, we have six buckets. The LE label gives you the upper limit that the sample has to fit under. It trickles down, so something that fits into 0.5 also fits into 2.0.
21:25
This is the number of samples that fit into this bucket. Now, Prometheus can interpolate percentiles from this. And that's good enough in practice. And you can always increase the precision of your percentiles by adding more buckets.
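Roughly what that server-side interpolation does, sketched in plain Python: find the bucket the target rank falls into and interpolate linearly inside it. This is a simplification of Prometheus's `histogram_quantile` (which works on rates and handles more edge cases); the bucket data is hypothetical:

```python
def estimate_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cum_count in buckets:
        if rank <= cum_count:
            # Assume values are spread evenly within this bucket.
            fraction = (rank - lower_count) / (cum_count - lower_count)
            return lower_bound + fraction * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, cum_count


# 90th percentile estimated from made-up cumulative buckets:
q90 = estimate_quantile(0.9, [(0.1, 100), (0.5, 350), (1.0, 390)])
```

You can see directly why more buckets mean more precision: the interpolation only ever guesses inside a single bucket.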
21:41
But you have to make sure that your values distribute evenly over your buckets, or distribute at all. Because if all values are just in one bucket, Prometheus cannot compute anything meaningful out of it. So please define your buckets based on the latencies you have, not the latencies you would like to have. Because that's not useful.
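Putting the pieces from the last few slides together, a histogram in the exposition format looks roughly like this (metric name, numbers, and bucket bounds are hypothetical):

```
# HELP myapp_request_seconds Request latencies.
# TYPE myapp_request_seconds histogram
myapp_request_seconds_count 390
myapp_request_seconds_sum 177.3
myapp_request_seconds_bucket{le="0.1"} 100
myapp_request_seconds_bucket{le="0.5"} 350
myapp_request_seconds_bucket{le="1.0"} 380
myapp_request_seconds_bucket{le="+Inf"} 390
```

The `_count` and `_sum` series are what a summary gives you on its own; the cumulative `le` buckets are what lets Prometheus estimate percentiles server-side.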
22:02
So, we have metrics in a database. What do we do with them? We query them. And for that, we use the Prometheus query language called PromQL. And I don't have enough time to give you a proper intro. And there's really amazing stuff going on. You can implement a game of life in it. Well, I'll give you a few examples.
22:21
So, you usually have a lot of related time series that you want to aggregate to one or to a few. So, for example, say you have many backends in multiple data centers. And you want a total request rate over all backends. So, we will work ourselves from the inside out.
22:43
Here's the counter again, which you saw on the slide before. And to compute the rate, the function needs a so-called range vector. So, this means this returns a vector or an array of values of the past one minute. How many that are depends on the granularity of the data.
23:02
So, how often you scrape your targets in the one minute. And the rate function will compute the rate. How fast is this counter rising? And at that point, you have the request rate for every single backend in every single data center.
23:24
And now you just sum them up and you have one value. You know, the total request rate over everything. Now, what if you want to know the rate of the backends in one data center? Then you just add a filter, which looks like this. And here you can see how nice it works if you have labels that are structured instead of having to work with dot-separated names.
23:52
The rest is all the same. Now, if you want to have the request rate broken down by data center,
24:01
you drop the filter again and you tell the sum function to retain the DC label. So, in this case you get as many rates as you have DCs. Simple. Now, what else is interesting? Percentiles, of course. And Prometheus uses so-called φ-quantiles, which, completely oversimplified, are percentiles divided by 100.
24:23
So, this is the 90th percentile. And we take the rate of the buckets we just saw before, and histogram_quantile will do the rest. So, of course this gives us as many percentiles as you have label combinations,
24:46
so you may want to aggregate it. But other than that, we have our percentiles that we always wanted to have. Nice. So, I hope you have somewhat of a taste of how powerful PromQL is. And it is used by all its consumers, which most notably are visualizations.
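The queries walked through above can be collected as follows (metric name and label values are hypothetical stand-ins for the ones on the slides):

```
# Per-backend request rate over the past minute:
rate(myapp_requests_total[1m])

# One total rate across all backends and data centers:
sum(rate(myapp_requests_total[1m]))

# Only the backends in one data center:
sum(rate(myapp_requests_total{dc="ams1"}[1m]))

# One rate per data center (the sum retains the dc label):
sum by (dc) (rate(myapp_requests_total[1m]))

# Server-side 90th percentile from histogram buckets:
histogram_quantile(0.9, rate(myapp_request_seconds_bucket[1m]))
```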
25:07
So, there's the internal one, which is not pretty, but at least it's not XJS. It's nice for playing around, drilling in, so something's going on, let's quickly look what could it be. And then you use the query elsewhere.
25:21
It's a bit limited because it has only one expression per graph, so you cannot do any correlations whatsoever. But you can build dashboards with Go templates if that's your thing, but it's not mine. So, PromDash still has the best integration, because it used to be the official visualization thingy,
25:40
but it's deprecated now because Grafana has merged official Prometheus support. So it's deprecated, don't bother, go for the real thing. I think everyone here has seen Grafana, and I think a good number of people are in this room just to find out how to use Grafana, because it's the best and best-looking dashboard software right now.
26:00
It has many many integrations, you can build dashboards from different sources, so you can introduce Prometheus and still keep your InfluxDB or Graphite and integrate them in one dashboard, which is really nice. Especially because it gives you a step-by-step introduction, so yeah, use this.
26:21
There's no reason to use anything else. The final piece of the puzzle is alerting. You can use PromQL to formulate alert conditions, and Prometheus then will push them into a separate daemon called the Alertmanager. So again, example time. Let's talk about monitoring for full disks, because once a disk is full, it's too late.
26:44
But alerting on some random threshold can lead to noise, which leads to alarm fatigue. So let's use a crystal ball to be notified in time without noise, and for that we want an alert that fires when the disk is going to be full in four hours. And this is our crystal ball.
27:03
It's more high school mathematics, and it's called linear regression. So in this case, if given the samples of the past one hour, the disk will have less than zero capacity in four hours,
27:24
and the condition is true for five minutes, so a small spike doesn't just fire off some alerts, then we want to be alerted. How do we want to be alerted? Again, it's completely pluggable, it integrates with a lot of notification backends, of course email, PagerDuty, web hooks.
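In the rule syntax of that era, the crystal ball described above might look like this (`node_filesystem_free` is the node exporter metric of the time; current Prometheus versions use a YAML-based alerting-rule format instead):

```
ALERT DiskWillFillIn4Hours
  IF   predict_linear(node_filesystem_free[1h], 4 * 3600) < 0
  FOR  5m
```

`predict_linear` fits a line through the samples of the past hour and extrapolates; the `FOR 5m` clause is what keeps a short spike from paging anyone.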
So yes, you can have Slack. So, how do you make this web-scale, which is, I promise, the really final part? The answer is federation: Prometheus servers can pull data from other Prometheus servers. The typical use cases are aggregation, which can mean that you have one Prometheus server per data center, or one per team, or one per type, and then you aggregate all the data from these Prometheus servers into one big one. Or downsampling: say you have one really fast server with SSDs which is scraping all your targets, giving you high-resolution data, which you want for monitoring. But you also want to keep some history of your data, of how your servers behaved over the years. In this case, you would just sample it down to a lower resolution for long-term storage on a second server, which has slow but big disks. And yeah, that's all there is to it.
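That downsampling setup runs through Prometheus's /federate endpoint. A sketch of the scrape configuration on the slow, long-term server; the hostname, interval, and match expression here are my own illustrative choices:

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    # Scrape far less often than the fast server: lower resolution on purpose.
    scrape_interval: 5m
    params:
      "match[]":
        - '{job=~".+"}'   # pull everything; narrow this down in practice
    static_configs:
      - targets: ["fast-prometheus.example.com:9090"]
```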
So, you should have a general idea of how Prometheus works now, so let's look at how to get data into it. There's a lot we can do without touching your code, so let's start without breaking anything. Prometheus has been public for over a year and has a very active ecosystem. The 1.0, by the way, was released, I think, this week. And I've already pointed out that it's easy to write exporters for third-party things, which is the reason there are so many already. That includes bridges, which are really cool, because they mean you can point your existing instrumentation at these exporters, and they will transform whatever you are doing right now into the Prometheus format, and Prometheus can give you the nice alerting and graphing and whatnot. Native is better, though, so let's start with platforms. First, fully featured servers: there's the official node exporter. It will instrument your servers from the inside, whether bare metal, KVM, or LXC. Now, you know what picture comes next: one-process containers, like Docker, of course. They are instrumented from the outside using container APIs, and the tool is called cAdvisor. It's not Prometheus-specific, and I believe it's from Google.
So, depending on how you run your systems, decide. Installing such a daemon gives you full system insight: you get statistics about CPU, memory, network, I/O, and much, much more. This is super useful if you want to put your own metrics into context. So installation of these should be an automatic part of provisioning new servers, not something you have to remember or only do when you think about it. Then, another non-intrusive method is mtail. mtail will follow any log file, and it will compute metrics on the fly based on regular expressions. It's very powerful, and in some cases, like the Apache web server, you even get better metrics if you set a custom log format and use certain regular expressions to extract them. It's better than the status page Apache is serving, so you should definitely check it out. Now, no matter whether it's a status page or a log file, you should always instrument the outer edge of your infrastructure,
which usually is some web server or something like HAProxy, a load balancer. Now, if you look from the outside, there are also blackbox exporters, so think Pingdom, just for free. They will probe your systems using HTTP, TCP, or even ICMP, a.k.a. ping. But they add additional load, which nothing we talked about before really does. Again, databases: every database, even Mongo, has some way to get data out of it; use it. And if you run your own infrastructure, there's also an SNMP exporter. So at this point, we already have detailed information about our platform, we know how to look at our app from the outside by analyzing logs or even probing it, and we know that, and how, we can instrument third-party dependencies. So assuming you instrument your web server, you can already correlate request times with platform metrics like the server load, and with dependency metrics like what the hell is going on in your Postgres. This is good, but we need to drill deeper. Oh, sorry, I forgot clicking, I'm so excited. So we have to touch your code. There it is.
And to make things interesting, we'll use an example, and since this is a computer conference, the example involves cats. So, let's assume you've built a groundbreaking product: software that determines whether a photo contains a cat. Now you need to deploy it as an HTTP service, where the user posts a picture, and you reply with a meow or a nope, depending on what the picture contains. How hard can it be? Let's build a Flask web service. And you don't really need to know Flask to understand this. You just check authentication, which, because your colleagues read Hacker News, is a microservice written in Go and deployed on Docker. And you have an expensive computation that does the actual business logic, the important fact being: is it a cat.
Now, I bet a lot of you have already written APIs like this. It's really fast, it's really cheap; now let's instrument it. For that we use the official Prometheus client package. And even before we change any code, we do the least we can do: we just start the HTTP metrics endpoint, which runs in a separate thread. Why? Because on Linux you immediately get process statistics for free. That includes your memory usage, the timestamp of when your process started, your CPU time, the number of open files, and the maximum number of open files. So without really changing a line of code, you can already detect memory and file-handle leaks, which happen and are really painful when your server just stops accepting connections and you don't know why. And you can monitor whether you approach the file descriptor limit. Nice. But let's start instrumenting.
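That zero-touch step, starting the metrics endpoint in a background thread, is a single call in the official Python client; port 8000 is an arbitrary choice:

```python
from prometheus_client import start_http_server

# Serves /metrics from a daemon thread. On Linux, the default process
# collector adds memory, CPU time, start time, and open-fd statistics
# without any further code changes.
start_http_server(8000)
```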
And for that, we define some metrics. First, a histogram that measures our request latency. Then, a histogram that measures how long the actual analysis takes. And finally, a gauge that tells us how many requests are active right now. Now we add them to the app: we just add two decorators that do exactly what they sound like. One tracks how many function calls, which means how many views, are in progress. The other measures the time spent in the function.
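With the official client, the three metrics and the two decorators could be sketched like this; the metric names and the toy view body are mine, not from the talk's slides:

```python
from prometheus_client import Gauge, Histogram

REQUEST_TIME = Histogram(
    "request_processing_seconds", "Time spent serving a request."
)
ANALYZE_TIME = Histogram(
    "analyze_seconds", "Time spent deciding whether the picture is a cat."
)
IN_PROGRESS = Gauge(
    "requests_in_progress", "Requests currently being served."
)

@IN_PROGRESS.track_inprogress()   # increments on entry, decrements on exit
@REQUEST_TIME.time()              # observes the elapsed time of each call
def classify_view(image_bytes):
    # Hypothetical view body; the real one would check authentication
    # first, then run the expensive analysis inside ANALYZE_TIME.
    with ANALYZE_TIME.time():
        return b"meow" if b"cat" in image_bytes else b"nope"
```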
Now, you might be saying that middleware would be much better, because then you can have labels with the view name and the status code. And you'd be completely right: please do that, I do that. But the exact middleware is a bit out of scope here. So, additionally, we measure the time.
We measure the time to analyze, because for all we know, all the time sinks into authentication, which in turn is not instrumented. At least ostensibly it isn't; it actually is, because I've decided to make it a shared package, and you should instrument the package itself.
Because if you use a package ten times, why should you instrument it ten times? So again, we define a metric for the time spent, and one for errors. The latter especially because, as I've said, it's a microservice, which makes this a distributed system, which makes it fail in the most inconvenient ways at the most inconvenient times. You have to look out for that. So whenever we fail, we increment the error counter and try again; and yes, I'm aware that this is not how you retry in a distributed system. So, if the error rate goes up, you have a problem. And a big problem. But we also count invalid login attempts, because they are a red flag, too: either you may be under attack, or you have some subtle failure in your authentication server which manifests as wrong credentials but actually just means that someone changed the data format or something.
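A minimal sketch of that error counting; call_with_retry, the metric names, and the naive immediate retry are all my own simplifications, not the talk's actual code:

```python
from prometheus_client import Counter, Histogram

AUTH_TIME = Histogram("auth_seconds", "Time spent talking to the auth service.")
AUTH_ERRORS = Counter("auth_errors_total", "Failed calls to the auth service.")

def call_with_retry(fn, tries=3):
    """Call fn(), counting failures; retries immediately for brevity,
    which is exactly what you should NOT do in a real distributed system."""
    for _ in range(tries):
        try:
            with AUTH_TIME.time():
                return fn()
        except ConnectionError:
            AUTH_ERRORS.inc()
    raise ConnectionError("auth service unreachable after %d tries" % tries)
```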
Now, these metrics have the same name in every app that uses them, and you differentiate them using the job label. So, if done properly, which means you instrument your shared libraries and you put web-related metrics into middleware, or even into your WSGI container, because both Gunicorn and especially uWSGI offer a lot of possibilities to hook into them, you're left with one extra line in your view. That is tolerable, and I really think we should not feel ashamed about instrumentation.
I'm kind of allergic to instrumentation that repeats itself in your code and pollutes everything, and you should totally try to pull things out into decorators and middleware. But in the end, any serious production software has instrumentation, as does anything you connect to from your web apps, or whatever you are writing.
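Pulling the timing out of the views and into middleware could be sketched as a generic WSGI wrapper; the metric name and the status label are my own choices, not the speaker's:

```python
import time

from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_latency_seconds",
    "Request latency, labeled by response status.",
    ["status"],
)

class MetricsMiddleware:
    """Wraps any WSGI app and times every request, one line per app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        seen = {"status": "500"}  # assume the worst until told otherwise

        def timing_start_response(status, headers, exc_info=None):
            seen["status"] = status.split()[0]  # "200 OK" -> "200"
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, timing_start_response)
        finally:
            REQUEST_LATENCY.labels(status=seen["status"]).observe(
                time.perf_counter() - start
            )
```

With Flask, for instance, you would wrap the app once: `app.wsgi_app = MetricsMiddleware(app.wsgi_app)`.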
So, do it too; nobody ever regretted having too much information when things go sideways. Now, you may be asking: what about async? Well, you may not, but I do, and that's why I've
written prometheus_async, which supports asyncio and Twisted and does the right thing with Deferreds and coroutines. And because I'm bad at math, I did not re-implement the metric logic; instead I simply wrapped the metrics from the official client, and that's all there is to it. This allows you to use the official client in asyncio applications. It also comes with a few goodies, so let me call them out. It has an aiohttp-based metrics exporter that is much more flexible and configurable than the one that comes with the official client, and you can start it in a separate thread, which means it's useful with any Python 3 application out there.
You do not have to use it with asyncio applications; I personally use it with my Pyramid apps, I just need the configurability. It also includes auto-registration with a Consul agent, because we use Consul, but the service discovery is kept completely generic, so whatever you use, you just have to write two functions to integrate it with your favorite one. Basically, in your own code you just start the metrics endpoint and register it, and as soon as your metrics are up, Consul will know about them. And Consul is very well integrated with Prometheus, so it's very little overhead for you to get this working once you've put the pieces in place. So, time is running out, but everything is instrumented, so let's wrap up really fast. And what did I promise? I promised prediction.
If you have good dashboards, and if you use predict_linear, or even better holt_winters, which lets you apply smoothing factors that favor older or newer values depending on how you set them, you're just fine. Alerting: there's the Alertmanager, a very powerful way to interact with it, and it integrates with almost everything. And then there's the holistic overview. If you instrument widely, you will have the data: you can build dashboards, you can play with PromQL, you have everything you need when the feces hits the fan.
And this is not theoretical. Last week, we had a really big operational emergency in our company, which was not our fault; we ran into a very obscure bug that previously only happened on obscure platforms. So while the operations staff, I'm more on the developer side, was busy trying to contain the fire, I built a dashboard for them. So we could immediately see: we try this, what happens? Oh, load is still rising, let's try something different. This is very useful if you don't want to just keep running uptime or staring at top. I believe I've covered everything, so I hope you're eager to measure all the things. Please check the talk page, as always; it contains all the links to all the projects. Follow me on Twitter, get your domains from Varo Media, and I'm not taking questions because I'm really bad at understanding questions on stage. But if you have any questions, I'm here until Sunday, just come and chat me up. Thank you.