Get Instrumented!

Video in TIB AV-Portal: Get Instrumented!

Formal Metadata

Get Instrumented!
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Hynek Schlawack - Get Instrumented!

To get real-time insight into your running applications you need to instrument them and collect metrics: count events, measure times, expose numbers. Sadly this important aspect of development was a patchwork of half-integrated solutions for years. Prometheus changed that and this talk will walk you through instrumenting your apps and servers, building dashboards, and monitoring using metrics.

-----

Metrics are highly superior to logging when it comes to understanding the past, present, and future of your applications and systems. They are cheap to gather (just increment a number!) but setting up a metrics system to collect and store them is a major task.

You may have heard of statsd, Riemann, Graphite, InfluxDB, or OpenTSDB. They all look promising but on a closer look it's apparent that some of those solutions are straight-out flawed and others are hard to integrate with each other or even to get up and running.

Then came Prometheus and gave us independence of UDP, no complex math in your application, multi-dimensional data by adding labels to values (no more server names in your metric names!), baked-in monitoring capabilities, integration with many common systems, and official clients for all major programming languages. In short: a *unified* way to gather, process, and present metrics.

This talk will:

1. explain why you want to collect metrics,
2. give an overview of the problems with existing solutions,
3. try to convince you that Prometheus may be what you've been waiting for,
4. teach how to impress your co-workers with beautiful graphs and intelligent monitoring by putting a fully instrumented Python application into production,
5. and finally give you pointers on how to migrate an existing metrics infrastructure to Prometheus *or* how to integrate Prometheus therein.
Our next presenter is Hynek Schlawack, a developer at Variomedia.

So, first of all, I would like to thank everyone who actually did what I asked and came to the front rows. Thank you so much for cheering me on. I also have to apologize: I'm a bit physically inconvenienced, because I managed to get a blockade in my neck while drying my hair. So if you ever wonder how it is to get old: it's not great. It's not being able to do mundane stuff without hurting yourself. If you're local and you know a physiotherapist who could unblock my neck, come and find me after the talk.

Other than that, you may or may not know me. I work at a small web hosting company called Variomedia, and I want to teach you how numbers and graphs can improve your life, and the lives of those who are affected if you get paged out of bed at 4 AM. And my clicker wasn't attached. So I want you, by the
end of the talk, to be able to move to the next slide. OK, that's sorted. So what I want you to be able to do by the end of the talk is to predict performance problems so that you can prevent them. If you can just put out a small blaze, that's so much better than having to fight a huge fire. If something happens anyway, I want you to be alerted by your system with useful data, and not by a furious boss on Slack or by very angry customers on the phone. And if
a fire is burning, I don't want you to stare at useless top output, hoping to come up with some inspiration while your boss is poking at you. I want you to have the vital signs of your systems right at hand, and of course they have to be meaningful. That also includes their historic development, because once you run into a fire you want to know how you got there. This is a cycle that should feed back: once something happens, you want to be alerted the next time, and ideally you want to prevent it.

If you want to reason about situations like "this is good, this is bad, this is really bad", you need an objective representation of the situation, and for that you can express the quality of service in service levels. You may have already seen those terms, so first let's sort them out. First you need an indicator (SLI), which can be the request latency, uptime, something like that. Once you have an indicator, so you have something to talk about, you can formulate service level objectives (SLOs), which can be "latency must always be under 100 ms" or "we have to have five nines of uptime", things like this. And finally, probably the most famous one: with the objectives you can formulate contracts and agreements (SLAs) on top of those numbers, which say what will happen if those objectives are missed. Agreements are not part of this talk, but SLIs and SLOs are, because SLIs are just metrics and SLOs are conditions you want to fulfill. In other words, you want to get alerted if you are not fulfilling them.

So what are metrics? Metrics are numbers, or samples, in a database, and they have timestamps, which makes them time series. You're going to have a lot of time series, which means that you can correlate them, for example with the request latency, which is a very typical use case. You get most of those time series by
instrumenting your system, and "system" can mean anything: it can be your app or it can be a server. It's just like in a car or on a plane, except that these instruments are hooked up to a time series storage. Depending on what kind of time series database you're using, it will allow you to run queries and operations on top of them.

Now, obviously you instrument your app. Next up are probably some dependencies, like your database, your web server, appliances: they all carry important and useful information for you to correlate with your application. Then of course the environment: CPU, memory, your I/O activity. And finally, and that's kind of underappreciated, I think, you can also instrument your business: the number of customers, the number of paying customers (we're not in San Francisco, so the numbers are more similar here), or the daily revenue. Seeing a graph that correlates your frontend latency with your sign-up rates or your revenue can be enlightening sometimes, especially if you are arguing with your boss about whether it's relevant.

None of this is new. People have been doing this for years. I have been doing this for years; I actually talked about it last year, right here. But in the past you had to choose multiple components with various trade-offs, and most notably they were barely integrated. This is a bad situation to be in if you wake up one morning and say "OK, I want to have metrics with alerting": you have to read up on basically everything and choose what you want to use. And some of those tools have really, really bad properties that you don't realize until the fire is burning.

I find that Prometheus is different, and that's because it is an all-around, opinionated metrics and monitoring system which is integrated. It's still flexible, but it is a proven and well-documented starting point. Opinionated can be a problem, obviously, but in this case it probably isn't,
because it's more or less a reimplementation of Google's internal monitoring system, written by ex-Googlers working at SoundCloud who were just missing the features of that monitoring system. To give you an idea of how it works, let's look at the architecture. The core feature, of course, is the storage of time series, and a time series is really
just a named stream of float samples with a timestamp. That's all. But Prometheus wants you to think in terms of four types that are built on top of these streams.

First, counters: they're for counting events, or for counting anything, but the important property of counters is that they only increase. They can increase by anything, though, so you can use them to measure network throughput or to count whatever you like.

If you need to set arbitrary numbers, gauges are for you. Gauges are for exposing numbers, and they can be set to anything, so they're used for things like server load, temperatures, or the number of active requests right now. These two are pretty obvious: they map directly onto a timestamped float stream. The others are more interesting.

A summary takes measurements, so it observes measurements, and allows you to compute the rate of events, like requests per second, and the average measurement, like the average request latency. Some clients, and I think the Python one is not one of them, also allow you to define percentiles, which are then computed on the client. The reason why that's not in there is that it's not really useful, because you cannot meaningfully aggregate percentiles.

So instead you should use histograms. They are kind of the workhorses of metrics. In this case it's also about observing values, and you keep track of averages, but additionally you define buckets, and these buckets should have the typical sizes of the values that you are measuring. From these buckets Prometheus can estimate percentiles on the server side. It also means that you're not deriving numbers in your application while serving some important request, which is a nice property.

I have said "percentiles" twice now because they're very important, so I'll give you a quick rundown just to get us on the same page. I'll start with the premise that averages are probably less useful than you may think. To have something concrete to talk about, let's assume we measure request latencies, and I think it's fair to say that request
latencies are a good indicator of the quality of service: fast requests are good, slow requests are bad. It doesn't matter if it's a web page or a backend API; in any case you want it to be fast. Now, the average is not the typical experience. This is a good example: no user is experiencing the latency of two point something that the average reports. So not only is it not a great answer, it's also mangling all the numbers together, and you don't see that one request is really, really bad while the others are just fine.
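The point about the average can be checked with a few lines of Python. This is only a sketch: the latency values below are invented for illustration and are not the numbers from the slide.

```python
from statistics import mean

# Hypothetical request latencies in milliseconds (not the talk's data):
# nine fast requests and a single very slow one.
latencies_ms = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]

average = mean(latencies_ms)
print(f"average latency: {average} ms")

# Nobody actually experienced the average: it is neither the fast case
# nor the slow case, it is both of them mangled together.
print(average in latencies_ms)
```

With these made-up values the average lands between the fast requests and the outlier, so it describes neither.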
And the problem here is that there are no bell curves in production. Every distribution in production is skewed in some way.
So it basically means that you may be wasting your time optimizing a perfectly good average case while there is just some outlier for some reason, and you will not find it if you don't know it's there at all.

So what is the average experience here? It's 1. If you remember high school, there is a function that tells us that: it's the median, which takes a sorted set and returns the middle value, or the average of the two middle values if it's an even-sized set. Now, the median's strength in representing the average user is in this case also its weakness, because it still returns 1, and I think we can all agree that this is not useful information either.

Fortunately, where the median comes from there's more, and this brings us back to percentiles. They also partition a sorted set, but this time into 100 parts, and you look at the nth value. The nth percentile P is the upper limit of n percent of the data set. That sounds super confusing, but it just means the following: if the 50th percentile is 1 ms, it means that by 1 millisecond 50 percent of the requests are done. That's all it means. If you think about it, it is actually the median again, which, again, is not very useful by itself.
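The median and percentile definitions above can be written down directly. The sketch below uses the nearest-rank method, which is one of several common percentile definitions and an assumption here; the `percentile` helper and the sample values are made up for illustration.

```python
import math
from statistics import median

def percentile(values, n):
    """Nearest-rank percentile: the smallest value that has at least
    n percent of the data set at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(n * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 10 % of the requests are slow.
latencies_ms = [1] * 9 + [20]

print(median(latencies_ms))          # middle of the sorted set
print(percentile(latencies_ms, 50))  # the median again
print(percentile(latencies_ms, 95))  # this one exposes the outlier
```

The 50th percentile hides the slow request just like the median does, while a higher percentile brings it to the surface.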
But now we have a parameter we can tweak. So let's look at how long the fastest 95 percent of all requests take, and we see that there is a problem: something is very, very wrong, and somewhere between 5 and 50 percent of the values are affected by it. At this point you can drill deeper because, as I said before, Prometheus computes percentiles on the server side, so they are not fixed: you can always look at different percentiles. With just the average you wouldn't have got anything useful either; you would just think that all requests take forever.

Now, the problem with percentiles, and not a lot of people talk about it, is that they throw away most of the data. That's a problem if you want a representation of your service health or service quality. So in the end you still need the average, to have a number that has distilled everything and doesn't just look at certain values.

Now that we have the math out of the way, let's talk about naming. Whoever has used Graphite or statsd will have seen something like this: they put the metadata into the metric name, which is kind of annoying. Every modern time series database, and Prometheus is one of them, has moved away from that. The best practice here is to prepend the app name, which is not metadata, it's just a namespace. Then comes a short name: use base SI units, and add the unit, or _total if it's a counter. If you are measuring times, you would have _seconds or something like that, so the name is a bit self-explaining. The metadata is added using so-called labels, which look like this. Each new label combination gives you a new time series, or, as Prometheus calls it, a dimension. That means that you don't get fewer time series, but it's much more readable and structured, so you can formulate aggregations, conditions, and queries in a much nicer way, as we will see. But before that: how do you get those values in?
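To make the label model concrete, here is a minimal sketch of how a metric name plus a label set identifies one time series. It is a toy data structure, not how Prometheus stores anything; the metric name `myapp_http_requests_total`, the labels `dc` and `backend`, and the helpers `inc` and `select` are all invented for illustration.

```python
from collections import defaultdict

# A time series is identified by its metric name plus its full label set.
series = defaultdict(float)

def inc(name, amount=1.0, **labels):
    """Increase the counter identified by name and labels."""
    series[(name, frozenset(labels.items()))] += amount

def select(name, **matchers):
    """Return all series of a metric whose labels include every matcher."""
    wanted = set(matchers.items())
    return {
        labels: value
        for (metric, labels), value in series.items()
        if metric == name and wanted <= labels
    }

inc("myapp_http_requests_total", dc="ams", backend="api1")
inc("myapp_http_requests_total", dc="ams", backend="api2")
inc("myapp_http_requests_total", 2, dc="fra", backend="api1")

# Three label combinations, so three time series under one metric name.
print(len(series))

# Filtering is a structured match instead of parsing dot-separated names.
api1_total = sum(select("myapp_http_requests_total", backend="api1").values())
print(api1_total)
```

The number of series is the same as it would be with the metadata baked into the name; what changes is that selecting and grouping no longer depends on string conventions.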
This is kind of interesting because, contrary to most metrics systems today, Prometheus is pull-based, which means that each instrumented system exports its metrics via HTTP and Prometheus scrapes them for you. To use the metaphor from before: you add instruments to your system, and Prometheus reads them at a regular rate and writes them down with a timestamp. This means
a lot of things. First of all, you can adjust the resolution of each single target by configuring how often its metrics are scraped. If you scrape more frequently you get more precise data, but it also needs more disk space, so it's a trade-off. It also means that if scrapes fail for some reason, say because of high load, you don't lose data, only resolution, which is kind of important because your average rates still make sense. Compare that to a push-based approach: there, lost samples actually mean that your rate is sinking, so it looks like things are going down although the load is rising beyond the capacity of the system to report metrics.

This makes Prometheus really, really great for monitoring, but a bad fit for things like exact accounting. That's a common question on the mailing list: you do not get the values of every single request, just a good average of them. You can do useful things on top of that, but it's not the accounting core of your system. If you need each single number, you have to go for something like PostgreSQL.

Now, there are a few problems too, of course. One is short-lived jobs like a backup script: you're not going to convert your cron scripts into web services just so someone can scrape their metrics. There's a solution for that, and it's called the Pushgateway, which will receive the data from your cron script and retain it for Prometheus to scrape. Problem solved.
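The earlier claim that a failed scrape costs resolution rather than data, while a lost push distorts the rate, can be illustrated with a small sketch. The sample values and the `average_rate` helper are invented for illustration; the scraper is assumed to read a cumulative counter every 15 seconds.

```python
def average_rate(samples):
    """Per-second increase of a counter between its first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Pull: the scraper reads the cumulative counter as (seconds, value) pairs.
scrapes = [(0, 0), (15, 150), (30, 300), (45, 450), (60, 600)]
scrapes_with_gap = [s for s in scrapes if s[0] != 30]  # one scrape failed

pull_rate = average_rate(scrapes)
pull_rate_with_gap = average_rate(scrapes_with_gap)

# Push: the application sends the events of each interval as a delta.
# A delta that gets lost on the way is gone for good.
pushed_deltas = [150, 150, 150, 150]
received_deltas = pushed_deltas[:1] + pushed_deltas[2:]  # one push lost
push_rate_with_gap = sum(received_deltas) / 60

print(pull_rate, pull_rate_with_gap, push_rate_with_gap)
```

Because the counter is cumulative, the next successful scrape still contains everything that happened during the gap; the pushed deltas have no such safety net, so the computed rate drops even though the real load did not.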
Then there's of course the problem of targets: to scrape something, you have to know that it actually exists. Some people consider this a problem, but it just moves the problem of knowing what your production systems are from your monitoring into your metrics system, because your monitoring also needs to know about all these systems. You're not getting around telling some system what your systems are. You can do it by configuration: this will tell Prometheus to scrape itself, which gives you a number of time series about things like buffer usage. An exported target is an instance, they mean the same thing, and multiple of those together form a job. So for example, if you have multiple Prometheus servers, you could scrape them all there, or if you have multiple backends, they are one job but multiple instances. You get these two values automatically as labels on each scraped metric, so you can do filtering and aggregation on top of them. In practice, of course, you're not
going to use a static configuration; you will use some kind of service discovery. We personally use Consul, which works great, but other people have been using it with other systems very successfully. Now just one final
problem, and this one actually is a problem: closed or NATed systems, such as appliances that run in the local network of a customer. You can't expose those, really, and if you do, people get really mad at you. There have been talks about an official solution for this case, but as far as I know there's nothing concrete yet, and other than that there's no really good solution. Prometheus is not a good fit for this. Generally speaking, Prometheus is intended to run on the same network as its targets; if you cannot do that, you probably have to use something else.

But there are a lot of advantages to the pull approach. First, high availability is super easy: you just run multiple Prometheus servers and point them at the same exporters. Done. This also means that you can have a production and a test environment. For example, we had an intern, and we didn't want him to break our metrics system, so he never had access to the production Prometheus. He had a Prometheus on his notebook, got access to the metrics endpoints of the systems relevant to his work, and could do everything he needed. That's a very nice property.

Outage detection is also really easy: if a scrape fails, you know something is fishy. Reasoning about how long you haven't heard from a system is possible with push too, but it's more complicated. What I personally like is the predictable effect on infrastructure, because more traffic does not mean more metrics traffic. It's always the same: you set once how often you want Prometheus to scrape data, and that's it. That also means that you don't congest a network that is already struggling when something is going on in your systems. And finally, it means that instrumenting third parties is pretty easy, because any production-ready system has some kind of instrumentation that it exposes to its users. Any database has a special table with performance metrics, web servers have status pages, Java has its JMX. You just have to take these metrics and transform them into something that Prometheus
understands, and it turns out that what Prometheus understands is pretty easy for you to understand too. So let's look at what an exporter exports. There's always at least the option of a human-readable format, and in this case it is the first part of a histogram about request latencies (again a very bad metric name; it's a short name so that it fits in a big font). This is the first part of it. The first time series is the number of measurements that have been observed, so how many requests have been observed. The second one is the sum of the measured times, the total time observed. So in this case we had 390 requests that altogether took 177 point something seconds. This is super cheap to keep track of: we're just adding up numbers. If you are using the summary type in Python, this is all you get.

For percentiles you also need buckets, and they look like this. In this case there are six buckets. The le label is the upper limit that a sample has to fit into, and it trickles down: something that fits into 0.5 also fits into 2.0. The value is the number of samples that fit into this bucket. Prometheus can estimate percentiles from this, and it's good enough in practice. You can always increase the precision of the percentiles by adding more buckets, but you have to make sure that your values are distributed evenly over the buckets, or distributed at all, because if they all land in just one bucket, Prometheus cannot compute anything meaningful out of it. So please define buckets based on the latencies you have, and not on those you would like to have.

So we have metrics in a database. What do we do with them? We query them, and for that Prometheus
has a query language called PromQL. I don't have the time to give you a proper intro, and there's really amazing stuff going on: you can implement the Game of Life in it. But I can give you a few impressions.
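The histogram layout described above (a count, a sum, and cumulative buckets labelled with their upper limit `le`) is simple enough to sketch in plain Python. This is a toy model, not the official client: the `Histogram` class, the bucket limits, the metric name, and the interpolation in `estimate_quantile` are assumptions chosen for illustration, loosely modelled on what a server-side percentile estimate has to do.

```python
import math

class Histogram:
    """Toy histogram: cumulative buckets, a count, and a sum."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.buckets = [0] * len(self.bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        # Cumulative: a sample counts in every bucket whose limit it fits under.
        for i, le in enumerate(self.bounds):
            if value <= le:
                self.buckets[i] += 1

    def render(self, name):
        """Render the series as human-readable text lines."""
        lines = []
        for le, n in zip(self.bounds, self.buckets):
            label = "+Inf" if le == math.inf else repr(le)
            lines.append(f'{name}_bucket{{le="{label}"}} {n}')
        lines.append(f"{name}_count {self.count}")
        lines.append(f"{name}_sum {self.total}")
        return "\n".join(lines)

def estimate_quantile(q, bounds, buckets):
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, seen in enumerate(buckets):
        if seen >= rank:
            if bounds[i] == math.inf:
                return bounds[i - 1]  # cannot interpolate into +Inf
            lower = bounds[i - 1] if i else 0.0
            below = buckets[i - 1] if i else 0
            in_bucket = seen - below
            if in_bucket == 0:
                return lower
            return lower + (bounds[i] - lower) * (rank - below) / in_bucket

# Five finite limits plus +Inf give six buckets; all values are hypothetical.
h = Histogram([0.1, 0.25, 0.5, 1.0, 2.0])
for value in [0.05] * 50 + [0.2] * 30 + [0.4] * 10 + [1.5] * 10:
    h.observe(value)

print(h.render("request_latency_seconds"))
print(estimate_quantile(0.95, h.bounds, h.buckets))
```

The estimate can only be as precise as the bucket that the requested rank falls into, which is why the bucket limits should follow the latencies that actually occur.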
You will usually have a lot of related time series that you want to aggregate into one or a few. For example, say you have many backends in multiple data centers and you want the total request rate over all of them. We will work ourselves from the inside out. We use the counter again, which we saw on the slide before, and to compute the rate, the rate function needs a so-called range vector. This returns a vector, or an array, of the values of the past one minute; how many there are depends on the granularity of the samples, so on how often you scraped the targets in that minute. The rate function computes how fast this counter is rising. At this point we have the request rate for every single backend in every single data center, and if I just sum them up, I have one value: the total request rate over everything.

Now, if you want to know the rate of just one backend, you just add a filter, which looks like this, and you can see how nicely this works if you have labels in a structured form instead of having to work with dot-separated names. The rest is all the same. If you want the request rate broken down by data center, you drop the filter again and tell the sum function to retain the dc label. In this case you get as many rates as you have DCs. Simple.

What else is interesting? Percentiles. Prometheus uses so-called quantiles, which, simplified, are percentiles divided by 100. So this is the 90th percentile: we take the rates of the buckets we just saw before and hand them to the histogram quantile function, and the rest is the same. Of course this gives us as many percentiles as you have time series, or label combinations, so you may want to aggregate them. But note that you now have the percentiles that you always wanted to have.
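The inside-out evaluation described above can be mimicked in a few lines of Python. This is a sketch of the idea only: the metric name `myapp_requests_total`, the label values, and the sample data are invented, and the `rate` helper ignores details that a real implementation has to handle, such as counter resets. The PromQL expressions in the comments show the kind of query each step corresponds to.

```python
from collections import defaultdict

# Hypothetical counter samples scraped during the past minute:
# label set -> [(timestamp in seconds, counter value), ...]
samples = {
    (("backend", "api1"), ("dc", "ams")): [(0, 1000), (30, 1300), (60, 1600)],
    (("backend", "api2"), ("dc", "ams")): [(0, 500), (30, 650), (60, 800)],
    (("backend", "api1"), ("dc", "fra")): [(0, 0), (30, 600), (60, 1200)],
}

def rate(points):
    """Per-second increase over the range (simplified: no counter resets)."""
    (t0, v0), (t1, v1) = points[0], points[-1]
    return (v1 - v0) / (t1 - t0)

def sum_by(rates, *keep):
    """Sum the rates, retaining only the labels named in keep."""
    grouped = defaultdict(float)
    for labels, value in rates.items():
        key = tuple((k, v) for k, v in labels if k in keep)
        grouped[key] += value
    return dict(grouped)

rates = {labels: rate(points) for labels, points in samples.items()}

# sum(rate(myapp_requests_total[1m]))
total = sum_by(rates)

# sum(rate(myapp_requests_total{backend="api1"}[1m]))
only_api1 = sum_by(
    {labels: r for labels, r in rates.items() if ("backend", "api1") in labels}
)

# sum(rate(myapp_requests_total[1m])) by (dc)
per_dc = sum_by(rates, "dc")

print(total, only_api1, per_dc)
```

Dropping all labels yields a single total, while retaining `dc` yields one rate per data center, which is the same move the talk describes for the sum function.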
So I hope I have given you some idea of how powerful PromQL is, and it is used by all of Prometheus's consumers, most notably the visualizations.

There's the internal one, which is not pretty, but it's nice for playing around and drilling in: if something is going on, you can figure out what it could be and then use the query elsewhere. It is limited because it allows only one expression per graph, so you cannot do any correlations whatsoever. You can also build dashboards with templates, but that's not my thing.

PromDash still has the best integration because it used to be the official visualization, but it's deprecated now because Grafana has merged official Prometheus support. It's deprecated, so don't bother.

For the real thing: I think everyone knows Grafana. I think a good share of the people in this room use Grafana on top of Graphite or whatever, because it's the best-looking dashboard software right now. It has many, many integrations and works with different sources, so you can introduce Prometheus first and still keep your InfluxDB or Graphite and integrate them all in one dashboard, which is really nice, especially because it gives you a step-by-step migration path. So just use it; there's no reason to use anything else.

The final piece of the
puzzle learning and you can use prompt you of from with other conditions and from the most of push them into a separate unit called the manager so again the example and let's have a morning
from for full-disk because once it is is full it's too late but I don't think of some random threshold can lead noise which is to all of so let's use a crystal ball to be notified in time with awards and for that moment alert that fire so it is this going to be full in 4 hours and this is all a crystal ball it's kind of it's more high school mathematics and it's called linear regression so in this court case if given the samples of the past 1 hour but this will have less than 0 capacity in 4 hours and the condition is true for 5 minutes so a small spike doesn't just fire of some other to than we want to be alerted how do want to learn again it's completed
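The "crystal ball" in the alert above is an ordinary least-squares line. The following sketch shows the arithmetic in plain Python; it is a simplification under stated assumptions, not Prometheus's implementation: it extrapolates from the last sample rather than from the evaluation time, and the disk readings (in GiB) are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical free disk space in GiB, one sample every 10 minutes for an hour.
FILLING_FAST = [(i * 600, 20.0 - i) for i in range(7)]        # loses 1 GiB per 10 min
FILLING_SLOW = [(i * 600, 20.0 - 0.05 * i) for i in range(7)]  # loses 0.05 GiB per 10 min

FOUR_HOURS = 4 * 3600


def disk_will_be_full(samples):
    """Alert condition: predicted free space in four hours is below zero."""
    return predict_linear(samples, FOUR_HOURS) < 0
```

In a real setup the condition would additionally have to hold for five minutes before the alert fires, which is what keeps a short spike from paging anyone.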
How you want to be alerted is, again, completely pluggable. The Alertmanager integrates with a lot of notification backends: email of course, PagerDuty, and yes, even Slack.

So how do you get this to scale? The final part is federation: a Prometheus server can get data from other Prometheus servers. Typical use cases are aggregation and downsampling. Aggregation can mean that you have one Prometheus per data center, per team, or per type of service, and you aggregate the data from these servers into one big one. As for downsampling, say you have one really fast Prometheus server that scrapes all your targets and holds the high-resolution data you need, but you also want to keep the history of the behavior over years. In this case you just sample it down to a lower resolution for long-term storage on a second server with big disks. That's all there is to it. You should have a general idea of how Prometheus works now, so let's get data into it.

There's a lot you can do without touching a single line of code. Prometheus has been public for over a year and has a very active ecosystem. Version 1.0, by the way, has been released, I think this week. I have already pointed out that it's easy to write exporters for third-party things, and that's the reason why there are so many already. That includes bridges, which is really cool because it means you can take your existing instrumentation, point it at one of these exporters, and it will transform whatever you're doing right now into the Prometheus format, so Prometheus can give you the nice alerting and graphing and whatnot.

So let's start with your platforms. First, fully featured servers: there's the official node exporter for that, and it instruments your server from the inside, be it bare metal, KVM, or LXC. Now you know what picture comes next: containers. Those are instrumented from the outside using the container APIs, and the tool for that is called cAdvisor. It's not Prometheus-specific, and it's from Google. So depending on how you run your systems, you pick one, and installing it gives you full system insight: you get statistics on CPU, memory, network I/O, and much, much more. This is very useful if you want to put your own metrics into context. Installing it should be an automatic part of provisioning servers and not something you have to remember only when you think about it.

Another non-intrusive method is mtail. It follows your log files and computes metrics on the fly based on regular expressions. That's very powerful, and in some cases, say for a web server where you set a custom log format and use the right regular expressions to extract metrics from it, it's better than the status page the server offers. So you should definitely check it out. No matter whether you parse log files, you should always instrument the outer edge of your infrastructure, which usually is some web server or, better, something like HAProxy.

If you want to look from the outside, there are also black-box exporters, which probe your system using HTTP, TCP, or even ICMP, also known as ping. But they add additional load, which nothing we saw before does. Then databases: practically every database has some way to get metrics out of it, and there are exporters for other pieces of your infrastructure as well.

At this point we already have detailed information about our platform, you know how to get metrics from the outside by analyzing logs or probing, and you know that you can instrument third-party dependencies. Assuming you have instrumented your web server, you can already correlate request times with platform metrics, like system load, and with dependency metrics. This is good, but we need to go deeper, and for that we have to touch code.
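As a rough illustration of the log-tailing idea mentioned above (metrics computed from log lines with regular expressions), here is a small Python stand-in. It is not mtail and does not use mtail's own program syntax; the access-log layout, the regular expression, and the sample lines are assumptions chosen for the example.

```python
import re
from collections import Counter

# Assumed access-log layout (common log format); adjust the pattern to your own logs.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_requests(lines):
    """Count requests per (method, status) pair, skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["method"], match["status"])] += 1
    return counts


# Hypothetical log lines for demonstration.
SAMPLE_LOG = [
    '127.0.0.1 - - [10/Jul/2016:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '127.0.0.1 - - [10/Jul/2016:12:00:01 +0000] "GET /cats.png HTTP/1.1" 200 2048',
    '127.0.0.1 - - [10/Jul/2016:12:00:02 +0000] "POST /analyze HTTP/1.1" 500 17',
    'this line is not an access log entry',
]

request_counts = count_requests(SAMPLE_LOG)
```

Each (method, status) pair would map naturally onto one labeled counter, which is the shape of data a log-based exporter hands to Prometheus.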
To make things interesting, let's use an example, and since this is a computer conference, the example is about cats. Let's assume you have built a ground-breaking product that determines whether a photo contains a cat. Now you need to deploy it as an HTTP service: users POST a picture and you reply depending on what the picture contains. How hard can it be?

Let's build a Flask service. You don't really need to know Flask to understand this. You just check authentication, which, because your colleagues like to be cool, is a microservice written in Go and deployed on Docker, and then you have an expensive computation, the actual business logic, which answers the important question: is it a cat or not? A lot of you have probably written a view like this. It's really fast to write. Now let's instrument it.

For that we use the official Prometheus client for Python. For the first change we do the least we can do: we just start the HTTP endpoint, which runs in a separate thread. Why? Because on Linux you immediately get process statistics for free. That includes your memory usage, the timestamp of when your process started, your CPU time, the number of open files, and the maximum number of file descriptors. So without changing a line of real code you can already detect memory and file descriptor leaks, which are really painful when your server just stops accepting connections and you don't know why, and you can monitor whether you are approaching the file descriptor limit.

Now let's instrument the view. For that we define three metrics: first a histogram that measures our request latency, then a histogram that looks at how long the actual analysis takes, and finally a gauge that tells us how many requests are active right now.
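To show what "starting the HTTP endpoint in a separate thread" amounts to, here is a standard-library sketch that serves two process-level values in the Prometheus text format. It is a stand-in for illustration, not the official client: the helper names are hypothetical, only two metrics are exposed, and the start timestamp is simply taken when the module is loaded.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()  # approximation: captured at import, not the real process start

# name -> (type, help text, function returning the current value)
METRICS = {
    "process_cpu_seconds_total": ("counter", "CPU time used by this process.", time.process_time),
    "process_start_time_seconds": ("gauge", "Unix time at which the process started.", lambda: START_TIME),
}


def render(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, read_value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {read_value()}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = render(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the application's own output free of scrape noise


def start_metrics_server(port):
    """Serve metrics from a daemon thread so the application keeps running normally."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_metrics_server(8000)` once at startup would be the whole change to the application; the port number is an arbitrary choice here.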
We just add these two decorators. They do exactly what they sound like: one tracks how many calls of our function are in progress, and the other measures the time that is spent in the function. Now you might be saying that middleware would be much better, because then you can have labels that give you the view name and the status code, and you'd be completely right. I did that, but writing middleware is a bit out of scope here.

Additionally we measure the time it takes to analyze the picture, because for all you know all the time sinks into the authentication, which in turn is not instrumented yet.
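The two decorators can be pictured with the following standard-library sketch. The official Python client offers decorators of this kind on its Gauge and Histogram types (`track_inprogress()` and `time()`); the classes below are simplified stand-ins written for this example, they are not thread-safe, and the timer records only a count and a sum where a real histogram would also fill buckets. The view function and its logic are hypothetical.

```python
import functools
from time import perf_counter


class InProgressGauge:
    """Stand-in gauge: counts how many decorated calls are currently running."""

    def __init__(self):
        self.value = 0

    def track_inprogress(self):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                self.value += 1
                try:
                    return func(*args, **kwargs)
                finally:
                    self.value -= 1
            return wrapper
        return decorator


class Timer:
    """Stand-in for a histogram: records number of calls and total seconds."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0

    def time(self):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    self.count += 1
                    self.sum += perf_counter() - start
            return wrapper
        return decorator


IN_PROGRESS = InProgressGauge()
REQUEST_TIME = Timer()


@IN_PROGRESS.track_inprogress()
@REQUEST_TIME.time()
def analyze_view(picture):
    # Hypothetical placeholder for the expensive "is it a cat?" computation.
    return b"cat" in picture
```

Because both wrappers use `try`/`finally`, a view that raises still decrements the gauge and still records its duration.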
So let's instrument the authentication too. Because I have decided to make it its own package, you should instrument the package itself: if you use something ten times, why should you instrument it ten times? Again we define a metric for the time spent talking to the service. And because I've said it's a microservice, which makes this a distributed system, which means it fails in the most inconvenient ways at the most inconvenient times, you also have to count errors. Whenever a call fails, we increment the error counter and, yes, we try again. By the way, this is not how you should retry in a distributed system. If the error rate goes up, you have a problem, and a big one.

We also count invalid logins, because those are interesting too: either you may be under attack, or you have some failure in your authentication server that manifests itself as wrong credentials but actually just means that someone changed a format or something. These metrics have the same name in every app that uses the package, and you differentiate them using the job label.
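A small sketch of the error and invalid-login counting described above, using only the standard library. The counter class, the exception, the function names, and the injected `call_auth_service` callable are all hypothetical; the immediate retry mirrors the naive approach the talk itself warns against and is kept only to show where the counters are incremented.

```python
class Counter:
    """Stand-in for a monotonically increasing counter metric."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1


class AuthUnavailable(Exception):
    """Raised when the (hypothetical) authentication service cannot be reached."""


AUTH_ERRORS = Counter()   # failed calls to the auth service
AUTH_INVALID = Counter()  # calls that succeeded but rejected the credentials


def authenticate(credentials, call_auth_service, attempts=2):
    """Ask the auth service, counting failures and invalid logins."""
    for _ in range(attempts):
        try:
            ok = call_auth_service(credentials)
        except AuthUnavailable:
            AUTH_ERRORS.inc()
            continue  # naive immediate retry, for illustration only
        if not ok:
            AUTH_INVALID.inc()
        return ok
    raise AuthUnavailable("auth service kept failing")
```

Keeping the counters inside the shared package means every application that imports it exports the same metric names without any extra work.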
So if it's done properly, which means you instrument your shared libraries and put the request-related metrics into middleware or even into your WSGI container, because both gunicorn and especially uWSGI offer a lot of possibilities to hook into them, you are left with one extra line in your view, which is tolerable. I really think we should not feel ashamed of our instrumentation. I'm kind of allergic to having a lot of instrumentation that repeats itself in your code and occludes everything, and you should always try to pull things out into decorators and middleware. But in the end any serious production software has instrumentation for anything it connects to, be that databases or whatever else you write. So do it thoroughly: nobody ever regretted having too much information when things go sideways.

Now you may be asking: what about asyncio? Or you may not, but I do, and that's why I've written prometheus_async, which supports asyncio and Twisted and does the right thing with Deferreds and coroutines. Because I'm lazy, I did not reimplement the metric logic but simply wrap the metrics from the official client, and that's all there is to it. This allows you to use the official client in asyncio applications.

It also comes with a few goodies. It has an aiohttp-based metrics exporter that is much more flexible and configurable than the one that comes with the official client, and you can start it in a separate thread, which means it's useful with any Python 3 application; you do not have to use it with an asyncio application. I personally use it a lot just for the configurability. It also includes automatic registration with the Consul agent, because we use Consul, but the service discovery is kept completely generic, so whatever you use, you just have to write two functions to integrate it with your favorite one. It basically means that in your own code you just start the metrics endpoint and register it.
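Why asynchronous code needs its own wrapper: a plain timing decorator around a coroutine function would only measure how long it takes to create the coroutine object, not how long it runs. The sketch below shows an await-aware timer using the standard library only. It is an illustration of the idea, not the prometheus_async API; `Timer`, `time_coroutine`, and `analyze` are hypothetical names.

```python
import asyncio
import functools
from time import perf_counter


class Timer:
    """Stand-in metric that records the number of observations and their sum."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0

    def observe(self, seconds):
        self.count += 1
        self.sum += seconds


def time_coroutine(metric):
    """Decorator that measures the full awaited duration of a coroutine function."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                metric.observe(perf_counter() - start)
        return wrapper
    return decorator


ANALYZE_TIME = Timer()


@time_coroutine(ANALYZE_TIME)
async def analyze(picture):
    await asyncio.sleep(0.01)  # placeholder for real asynchronous work
    return b"cat" in picture
```

Running `asyncio.run(analyze(b"cat"))` records one observation that includes the time spent suspended in `await`, which is the part a synchronous wrapper would miss.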
As soon as your app is up, Consul knows about it, and Consul is very well integrated with Prometheus, so it's very little overhead for you to get this working once you have put the pieces in place.

Time is running out, so the wrap-up, really fast. Instrument everything. Use Prometheus's predictions: use predict_linear or, even better, holt_winters, which allows you to apply a smoothing factor that favors older or newer data depending on how you set it. Use Grafana. Do your alerting with the Alertmanager: it's a very powerful way to deal with alerts, and it integrates with almost everything. If you instrument widely, you will have data on everything: you can build dashboards, you can play with PromQL, you have everything you need.

And this is not just theoretical. Last week we had a really big operational emergency in our company, which was not our fault: we ran into a very obscure bug that only happens on old-school platforms. While the operations staff (I'm on the developer side) were busy trying to contain the fire, I built a dashboard for it, so we could immediately see: we tried this, what happens? It's still rising, so let's try something different. It's very useful if you don't have to keep pressing F5 or staring at top.

I believe I have covered everything, so I hope you're eager to measure all the things. Please study the talk page; as always, it contains links to all the projects. Follow me on Twitter and get your domains from Variomedia. I'm not taking questions, because I'm really bad at understanding questions on stage, but if you have any, I'm here until Sunday: just come and chat with me. Thank you.