Observability with Prometheus and beyond

Formal Metadata

Title
Observability with Prometheus and beyond
Title of Series
Number of Parts
69
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
If you're running a system at scale, you need tools to maintain it. This talk gives a high level overview of what observability and monitoring mean, and how to use Prometheus, Loki, Cortex, and Tempo to monitor your stack.
Transcript: English (auto-generated)
Good evening, and good morning to those of you not in Germany. This is going to be a beginner-level talk about observability, Prometheus, and some of the other tools that we also have available for this.
We're at Berlin Buzzwords, so I figured I'd start off with a slightly unrelated topic: what is a buzzword? A buzzword is a phrase that has become fashionable. Sometimes buzzwords lose their meaning because they became fashionable. I don't think I know what "strategy" means anymore, nor do I have any clue what "synergy" means. They've been put in so many places that people don't really know what they mean anymore when you say them.
But the best buzzwords got to be fashionable because they're about something that's important, and that's kind of where the word "observability" has gotten to nowadays. That's not useful for beginners, so this talk is meant to build up some of that knowledge, to help you understand why exactly this is a buzzword and to start applying it in ways that are actually useful to you.
First, you may be wondering who I am. I am a senior software engineering manager at Grafana Labs.
My name is Myrle Krantz. I've worked for 18 years as a C++ and Java programmer and as a software architect, all of that time as an individual contributor. I'm also very active at the Apache Software Foundation. I've made contributions to the Incubator, to conferences, to the community committee, and to diversity. I'm also very active in and around financial management of the Apache Software Foundation, and in a little open-source project called Apache Fineract, where I contributed to the microservices architecture.
I got into software development because I think that software can make the world a better place. These are incredibly powerful tools, and software developers are using them to solve, or help solve, some of the hardest, the most beautiful, and the most ugly problems that humanity has. I went to work for Grafana because I wanted to enable software developers to do this more effectively.
Now, in today's world there are a lot of different kinds of systems that people put together to try to understand what's going on in their software. They look at databases, they try to apply tribal knowledge, they look at Prometheus, they look at logs.
But when you're dealing with a problem that just occurred and that needs to be solved quickly, this can be especially frustrating because it's so fragmented. This talk is about tools that help us solve this in a slightly less fragmented, more holistic way. First, we actually need to understand: what are we trying to solve?
What is observability? We need to get some definitions of some of these buzzwords in place. So, monitoring versus observability. Monitoring is something that people do. People examine system behavior and look for explanations for that behavior. They're looking at it because they're responding to an alert, or because they made a fix and want to make sure it works, so they're examining the response of the system under these new conditions.
In order to do that, people need to be enabled via efficient and relevant data collection. They need to be able to store the data in a way that enables fast querying. They need alerting. And they need that not within simple systems, because monitoring a simple system is simple and you don't need a big system for that. They need it within complex systems.
But the meaning of some of these words has been diluted over time. There's a fair bit of cargo culting going on. The kernel of the idea of observability is about changing behavior; it's actually about changing the companies in which these systems are applied. In some places, monitoring has taken on just the meaning of collecting the data and not using it, so you've got a data lake or full-text indexing. These are all cool technologies, but what are they for?
Observability should be about enabling humans to understand complex systems. It's not just about finding out that something's broken, although that's pretty useful too. It's about digging in and understanding why it's broken, and understanding that as quickly as possible.
I mentioned earlier that this isn't particularly useful for simple systems, because simple systems can be understood without extra tra-la-la.
This is about complex systems, and I'm not talking about just any kind of complexity. There are kinds of complexity that can be removed or reduced, for example by fixing a bad design or removing code that you don't need anymore. But some complexity is inherent, and you'll hear this a lot when you hear teams talking about having moved something from a monolith to microservices. They say, "All we did is get different problems, or the same problems in a different context." Well, that's because the system was complex, and moving it from one place to another isn't going to change that.
Moving can, however, make it easier to compartmentalize the complexity, to place boundaries around it so that it can be understood in a smaller context. It should be possible to distill certain aspects of a complex system meaningfully in order to be able to observe it. This brings us to the next question: who is observing it? Well, the SREs are observing it, your site reliability engineers. Another buzzword, right?
So what is a site reliability engineer? Well, this is sort of what software is about: any task that's repeated often enough is a potential software problem. Google does a lot of operations, and because they do a lot of operations, it's a software problem. So how do they distill that software problem out? One of the tools they use, and that has spread to the rest of the software world, is something called SLIs, SLOs, and SLAs. What is an SLI? An SLI is a service level indicator: a carefully defined quantitative, not qualitative, measure of some aspect of the level of service that is provided.
Once you have a measure, an indicator, you can also set objectives: a target that you want that indicator to reach. And once you have a target, you can also start to make agreements, externally or internally, saying that if we don't achieve this target, then we will pay a fine, take an action, whatever. For a lot of organizations, the service level objective is as far as you need to go. Sometimes you need a service level agreement, but you don't always see the service level agreement in this context.
SREs are trying to align incentives across the organization. They're trying to do this across services, each of which may have different owners and different teams but have contracts that define their interfaces. And they're doing this across organizations that include developers, operational people, and product managers, each of whom may be focused on different aspects of the performance of their system. Putting out these common indicators helps to align these incentives, if you get the right indicators.
Once you have an indicator, you can get everybody looking through the same window. They need a shared view so that they can all be seeing the same thing and reacting to the same information. So what should you actually be measuring in your services? You need to be careful to pick something to measure that relates as directly as possible to what your users care about.
One good example is latency: users care about the speed with which your website responds. It's also true, though, that measures affect each other. For example, if you improve latency by making your website respond faster, you might do that by failing out more quickly if a service doesn't respond, so you could actually increase your error rate. That might be an acceptable trade-off for you; it depends on what your business is.
So when you're defining your service level objectives, you need to avoid absolutes. If you were to try, for example, to set an error rate of 0%, then you're probably going to end up paying a heavy price in other areas. If instead you think more carefully about what your error rate is, then you can exchange a slight increase in the number of errors for something else that might also be important to you. This is what error budgets and other SLOs are about: making it so that you can think about all of these pieces together and the way they affect each other, and achieve more than one objective. Customers do care about their services being up. They don't care about the individual components.
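To make the SLI, SLO, and error budget relationship concrete, here is a minimal sketch. It assumes an availability SLI defined as the fraction of successful requests; the function names, the 99.9% target, and the request counts are hypothetical and are not taken from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all: any failure breaks it.
        return 0.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


# Hypothetical month: 1,000,000 requests, 400 of them failed, SLO target 99.9%.
total, good = 1_000_000, 999_600
sli = availability_sli(good, total)
remaining = error_budget_remaining(0.999, good, total)
print(f"SLI = {sli:.4%}, error budget remaining = {remaining:.0%}")
```

With a 99.9% target, 1,000 failures are allowed in this window, so 400 failures leave roughly 60% of the budget to spend on other goals such as lower latency. The special case for a 100% target shows why absolutes are costly: there is no budget left to trade.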
Customers also aren't necessarily concerned if, once in a week, there's an error. Once you have a measure, you also have to ask yourself what to alert on in your services.
This one's both complex and simple at the same time. The simple part is: you only alert on things that are impacting your customer service, either now or very soon. Don't alert on anything else, because you're going to wear people out and people will start ignoring your alerts.
And here's another aside: what is black box monitoring versus white box monitoring? Rather than elevate one above the other, just consider that both have advantages, but we will be focusing more on white box monitoring in this talk.
White box monitoring looks at the parts: it actually looks into a component and captures metrics or logs. Black box monitoring looks at it from the outside, based on behavior: does it respond? Typically, some aspects of black box monitoring are covered by white box monitoring: if you can get the metrics from a service, then it's at least responding to your metrics requests. So let's cover briefly what Prometheus does in this.
Prometheus was inspired by Google's Borgmon. It's a time series database that basically saves a timestamp and a float64 value to a set of labels that allow you to locate it. So maybe you have a service, you have a whole bunch of instances of that service, and you have it in a region; you can query based on those labels. It's very common to do dashboarding of Prometheus via Grafana.
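The data model described above can be sketched in a few lines. This is a toy in-memory model, not how Prometheus stores data on disk; the `Series` class, the `select` helper, and the label values are hypothetical. The one detail borrowed from Prometheus itself is that the metric name is treated as just another label, `__name__`.

```python
from dataclasses import dataclass, field


@dataclass
class Series:
    """One time series: a label set plus its (timestamp_ms, value) samples."""
    labels: dict
    samples: list = field(default_factory=list)

    def append(self, timestamp_ms: int, value: float) -> None:
        self.samples.append((timestamp_ms, float(value)))


def select(all_series, matchers):
    """Return the series whose labels equal every name/value pair in matchers."""
    return [
        s for s in all_series
        if all(s.labels.get(name) == value for name, value in matchers.items())
    ]


# Hypothetical fleet: one service, three instances, two regions.
fleet = [
    Series({"__name__": "http_requests_total", "service": "checkout", "instance": "a", "region": "eu"}),
    Series({"__name__": "http_requests_total", "service": "checkout", "instance": "b", "region": "eu"}),
    Series({"__name__": "http_requests_total", "service": "checkout", "instance": "c", "region": "us"}),
]
for i, s in enumerate(fleet):
    s.append(1_700_000_000_000, 100.0 * (i + 1))

eu = select(fleet, {"service": "checkout", "region": "eu"})
print([s.labels["instance"] for s in eu])
```

Any combination of labels can be used to locate series, which is what makes querying by service, instance, or region equally easy.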
Prometheus is not for event logging; we'll be covering event logging in just a little bit. So what is Prometheus saving? Prometheus is persisting time series, that is, sets of recorded values that change over time, for each of the services, or each of the components of the services, that you are observing. Individual events can be merged into counters within the service, or they can be captured and aggregated outside of the service.
In Prometheus, the counter, the gauge, and the histogram are the ways in which data is typically saved. A counter is something that continually rises: you just add to a counter. You can derive rates from counters, for example by saying it increased by this much in this amount of time.
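Deriving a rate from a counter, as just described, might look like the sketch below. It is a simplification: PromQL's `rate()` also extrapolates to the edges of the query window, which this version does not attempt. The function name and sample values are hypothetical.

```python
def rate_per_second(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_seconds, counter_value) pairs, oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter went down, so the process restarted and began again at zero.
            increase += cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 seconds, with a restart before the fourth scrape.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(rate_per_second(scrapes))
```

Because a counter only ever rises, a drop can be read as a restart rather than as negative traffic, which is one reason counters are preferred over gauges for counting events.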
A gauge is something that can change over time. The disadvantage of gauges is that there might be an event between two capture times that you miss. Let's say you have a temperature in a data center and it looks fairly cool; if the rate at which you're capturing it is slow enough, there might be a spike in between that you just didn't see. Nonetheless, gauges can serve important purposes.
Another important case is the histogram. Maybe you want to see the data bucketed. Service latency is a really good example, because a very small percentage of your customers are going to see very slow response times, and they're going to remember it more than the customers who see your average response time. Just because it's a very small percentage of your customers does not mean that it's not important.
And you can visualize all of this using Grafana, which is fun. Because querying via PromQL, the Prometheus query language, is so flexible, using PromQL within Grafana makes it possible to do all kinds of interesting things with your data after it's been captured.
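The latency example above can be illustrated with a small sketch of cumulative histogram buckets and a quantile estimate. It follows the general idea behind Prometheus histograms (cumulative buckets with inclusive upper bounds, linear interpolation inside a bucket), but the bucket bounds, function names, and latencies are hypothetical.

```python
import bisect

BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5]  # hypothetical upper bounds, in seconds


def bucket_counts(latencies, bounds=BOUNDS):
    """Cumulative counts: entry i counts observations <= bounds[i]; the last entry is +Inf."""
    counts = [0] * (len(bounds) + 1)
    for value in latencies:
        first = bisect.bisect_left(bounds, value)  # first bucket whose bound is >= value
        for i in range(first, len(counts)):
            counts[i] += 1
    return counts


def estimate_quantile(q, counts, bounds=BOUNDS):
    """Estimate the q-quantile by interpolating linearly inside the matching bucket."""
    total = counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    for i, upper in enumerate(bounds):
        if counts[i] >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            below = counts[i - 1] if i > 0 else 0
            in_bucket = counts[i] - below
            if in_bucket == 0:
                return lower
            return lower + (upper - lower) * (rank - below) / in_bucket
    return bounds[-1]  # the rank falls in the +Inf bucket


# Hypothetical traffic: most requests are fast, a few are very slow.
latencies = [0.08] * 90 + [0.4] * 8 + [2.0] * 2
counts = bucket_counts(latencies)
mean = sum(latencies) / len(latencies)
p99 = estimate_quantile(0.99, counts)
print(f"mean = {mean:.3f}s, estimated p99 = {p99:.2f}s")
```

Here the mean is about 0.14 seconds while the estimated 99th percentile is about 1.75 seconds, which is the gap between the average experience and the slow responses that customers remember.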
So what are the main selling points of Prometheus? One, it's highly dynamic. You have built-in service discovery, which means that you can add components into your architecture, into your landscape, without having to manually register them with your Prometheus server. They automatically get added via the same service discovery mechanism that you also use for answering customer requests. There's no hierarchical model; it's just an n-dimensional label set. I mentioned PromQL a second ago: you can use PromQL for processing, for graphing, for alerting, and for exporting.
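One consequence of a flat, n-dimensional label set is that the same data can be grouped along any dimension after the fact. The sketch below imitates in plain Python what a PromQL aggregation such as `sum by (method) (http_requests_total)` does; the `sum_by` helper, the label names, and the values are hypothetical.

```python
from collections import defaultdict


def sum_by(samples, *label_names):
    """Sum sample values, grouped by the given label names."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in label_names)
        totals[key] += value
    return dict(totals)


# Hypothetical current values of a request counter, one entry per label combination.
samples = [
    ({"env": "production", "method": "get", "code": "200"}, 1200),
    ({"env": "production", "method": "post", "code": "200"}, 300),
    ({"env": "production", "method": "post", "code": "500"}, 7),
    ({"env": "test", "method": "get", "code": "200"}, 40),
]

print(sum_by(samples, "method"))
print(sum_by(samples, "env", "code"))
```

No dimension is privileged: grouping by method, by environment and status code, or by nothing at all uses the same operation, which a fixed hierarchy would not allow.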
You're using the same query language for everything, and it's very simple to operate: you basically just start it up. It's also really fast. Part of the reason it's really fast is that it's a pull-based system; it's not event-based. And it's primarily white box monitoring. However, there is a black box monitoring aspect: if Prometheus polls your metrics endpoint and doesn't get a response, then that's a simple form of black box monitoring. In addition, Prometheus makes hard API commitments within major versions, so it remains compatible within major versions.
Here are some simple examples of the measures that you can capture. You can look, for example, at HTTP requests. Then you can look at different environments, production or a test environment; you can look at different methods, POST or GET; you can look at it by status code; and then you can see the number of requests within each of these categories. So, is it scalable? Well,
Kubernetes is Borg, and Prometheus is basically Borgmon, within this context. Google couldn't have run Borg, that is, their Kubernetes clusters, without Borgmon. Kubernetes and Prometheus are designed and written with each other in mind. If Google can run them at scale, then you probably can too.
One Prometheus instance has been seen to have as many as 125 million active time series at once, so it can take on a lot of data. Now, Prometheus is less optimized for long-term data storage, so there are a couple of projects working to pick up the back end of that. You can connect Prometheus with Thanos, which is historically easier to run but slower, and which scales storage horizontally, or with Cortex, which is catching up on how easy it is to run. With Cortex you can scale storage, the ingester, and the querier horizontally as well.
So that's just metrics, but observability typically has three pillars: metrics, logs, and traces. So let's think about logs.
Let's move on to Loki. Loki is an open-source project under the AGPL license at Grafana Labs which follows the same label-based system as Prometheus, so you can query your logs on any of the same labels that you're querying your metrics on in Prometheus. If the logs come from the same service, they will have the same labels. This makes the information cross-referenceable, which can be very useful in the middle of an incident.
Let's say you see a spike and now you want to look for all of the logs within the time frame around that spike. You can do that by using the same query that you used to look at the spike. It's also very efficient, because it is not creating a full-text index. It's only indexing on the timestamp and on the labels.
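A rough sketch of that indexing choice follows. It is a toy in-memory store, not Loki's implementation: only the label set and timestamp are used to narrow the search, and the log text itself is scanned rather than indexed. The class name, method names, and log lines are hypothetical.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store that indexes label sets only; line content is never indexed."""

    def __init__(self):
        self._streams = defaultdict(list)  # frozen label set -> [(timestamp, line)]

    def push(self, labels, timestamp, line):
        self._streams[frozenset(labels.items())].append((timestamp, line))

    def query(self, matchers, start, end, contains=""):
        wanted = set(matchers.items())
        hits = []
        for stream_labels, entries in self._streams.items():
            if not wanted <= stream_labels:
                continue  # cheap step: skip whole streams by label
            for timestamp, line in entries:
                # Expensive step: brute-force scan, but only inside matching streams.
                if start <= timestamp <= end and contains in line:
                    hits.append((timestamp, line))
        return sorted(hits)


store = TinyLogStore()
store.push({"app": "api", "env": "prod"}, 10, "GET /cart 200")
store.push({"app": "api", "env": "prod"}, 12, "error: upstream timeout")
store.push({"app": "api", "env": "test"}, 13, "error: fixture missing")
store.push({"app": "web", "env": "prod"}, 14, "error: asset not found")

print(store.query({"app": "api", "env": "prod"}, start=0, end=60, contains="error"))
```

The index stays small because it grows with the number of label combinations rather than with the number of distinct words in the logs.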
Because there's no full-text index, you can work with logs at scale without the huge cost of a very large index. You can turn the logs into metrics to make it easier to work with them; I'll show you a screenshot of an example in a little bit. And because you're pulling the data out of the system basically via Promtail, it's very simple to set up.
This is an example of pulling metrics out of your logs. If you look closely at the query, you'll see that it's querying for errors, so you can actually look at the number of log messages that contain an error over time. Then you'll see a progression: the tendencies and the changes over time.
Let's take a closer look at what I was mentioning before about what a log entry looks like. Remember, I said you have indexed data and unindexed data, and part of the trick here is that you're not indexing all of it. You're only indexing the timestamp and the specific labels, the Prometheus-style labels. You're not indexing the rest of the log line, which does not mean you can't search on the rest of the log line; you can. In order to scale out search, Loki first searches on the labels and returns those parts of the logs, and then it just runs a full-text search, massively parallelized, across the rest.
So that gets us logs. Remember, I said we have three pillars of observability: we have metrics, we have logs, and we have traces. And for traces,
there's the newest child of Grafana Labs, also an open-source project, also AGPL licensed: Tempo. Tempo is there for traces. It's object store only, and it's 100% compatible with OpenTelemetry tracing. It's not sampling: all of your traces are getting stored. However, it is exemplar-based. For those of you who don't know what exemplar means: if you have a progression of data, but you have too much data to save all of it, you can save individual points over time, and then, if your data shows that there's a problem, you can pick out one of the examples that was saved from that time. That's what an exemplar is.
So this is exemplar-based, but again, it's not sampling, because 100% of your traces are saved. And because it's based on the same label set, you can move easily from Prometheus to Loki to Tempo and back again.
So, bringing that together: as I mentioned just now, you can move from your logs to your traces.
You can move from your metrics to your traces, and you can move from your traces to your logs. Any which way you want to go, you can go, because it all lives within sort of the same query language. And because this is all open source, you can also run it yourself. Of course, I work for Grafana Labs, and I would love it if you paid us to run it; we would be happy to run it for you. For small installations, we'll even run it for you for free, forever. You get a 14-day Grafana Cloud Pro trial, and after that you have limitations on the number of active series, but again, you can run it for free with us in our cloud.
And if you're interested in setting up the stack yourself: as I said, it's open source. If there's something that you want to change, you can look at it and figure out little improvements that you want to make, or big improvements. That's one of the advantages of open source. So with that, let me say thank you. Are there any questions?
Thanks for the talk. I think it's a great introduction to all things monitoring. I haven't seen any questions yet, so if there's not a question, I'm curious: does Grafana Labs provide a solution for all three, metrics, logs, and traces, or just for metrics? Traditionally I sort of think of Grafana Labs as Grafana, but I guess I'm learning something.
Yes, we offer hosted Prometheus, hosted Loki, and hosted Tempo, which is traces. And of course we have a beautiful UI that you can put on top of all of that, so you can access all of it via our UI.
All right, there's a question that just came in now; let me read it out to you. The fact that there's no sampling concerns somebody in terms of cost. I guess the question comes down to: do you provide sampling? If not, the costs could get expensive.
It is possible to do sampling, but we charge by the series. There are gigabyte limits on some of the services, but most of those services we provide by the number of series that you're persisting. If you're using the free tier, you're not going to get charged anyway; you don't even have to put in your credit card, so you're not going to get accidentally charged like Amazon does sometimes.
So if you're concerned about that, then try it out and see what happens. Experiment with it under the free tier.
I guess the second question coming in is: would you recommend this for small setups, or is there a minimum size making it worthwhile?
There are some small setups you might even want to do it on. I've seen some really interesting write-ups where people monitored an aquarium with Prometheus. Now, you probably don't need log files for that one, and you probably don't need traces for that one either. But Prometheus, at the very least, is easy to set up, and it's fun too. Grafana is also very easy, so by just putting that on top of Prometheus you can look at the data. It's not the problem I was talking about in my talk here, but I really think that Prometheus is easy enough to use that you can even use it for small problems.
And then I guess there's one more question that just came in. It says: I'm new to Prometheus. Are there any clients out there that work well with Prometheus, other than Grafana?
Prometheus does come with a very simple default web interface, but I think Grafana really is the best thing out there for it. Up until very recently, Prometheus was actually delivering Grafana with their releases, so the Prometheus team also clearly sees Grafana as the best way to examine their data.