Providing Monitoring Result Data to Chef

Formal Metadata

CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Content Metadata

Abstract
Monitoring systems generate a wide variety of data relating to the health and state of services and data all over the network. This data is often useful to resources and recipes, but the check results themselves may reside on a separate server. Chefs are then forced to reimplement the checks themselves, leading to duplication of effort and the opportunity for confusion (when the reimplementation results do not match the original in all cases). In this talk, we will explore ways to make monitoring results easily available to Chef, leading to simpler code, better visibility, and faster, more reliable development.
Hi, everybody. If you did not intend to attend a monitoring talk, this is not the right one for you, and this is your chance to leave before I get upset that you got up and walked out in the middle. Somehow I made a right turn instead of a left and wound up here instead of at Monitorama. So let's talk about why I think monitoring is important. I love monitoring. Monitoring is the most foundational thing that we can do in our infrastructure. It is more important than configuration management. Now, [inaudible] because I just said that, but: monitoring, monitoring, monitoring. I almost wanted to put on my sweaty shirt and dance around like Ballmer, because monitoring is super important. You get this wealth of information in a monitoring system, and so this talk is going to be about how we get access to that wealth, that state of our infrastructure, from within the chef-client, which typically only has a small view of what's going on around it. I'm also going to show you a little bit of code I wrote that illustrates how to do what I'm saying is so important. So let's move forward and see if I can get my second slide. There we go.
OK, this is a really common infrastructure architecture. I draw a lot of pictures. I'm the chief architect at New Context, and customers want to see pictures. I get involved in the pre-sales as well as the post-sales process, which means I don't just get to draw pictures, I also have to go implement the thing that I drew, so I care what it looks like. As you can see, this is simple. This is a small site. It looks like there are three load balancers with different IPs. You've got six servers doing something or other; those are the orange boxes, and we don't really have to care what they're doing at this moment, as I just made this picture up. There's lots of complicated routing that happens between them. It looks like there's a cache and a database server with a read-only replica, and maybe some S3 buckets or something like that for storage. You've all probably seen something like this before.

OK. Now, we monitor servers. We monitor a lot of servers, but we don't really care about servers. As we were hearing this morning, we care about outcomes. We want to ensure a certain level of service, and so we monitor servers as a proxy for services, which are in turn a proxy for the outcomes that are important to us. So let's talk about how monitoring
systems work. We know that there is push versus pull. What does that mean exactly? In a push situation, the monitored services push interesting information to a central monitoring server or servers, which then collect that state information, filter it, chart it, and maybe determine whether there's something actionable, some reason to notify an administrator that something has to happen. A pull-based system means that the monitoring system is polling. This is more common: you have a monitoring system that has a list of a thousand servers or something like that, and it sends out a predefined list of queries to all these servers and gathers all this information up. There's a polling loop, and often it's parallelized as well. But in general, the more servers you have, the longer it takes to collect a single snapshot of information. When you monitor a large number of metrics, you're pretty much guaranteed that the ones you obtained at the beginning of your polling run reflect a different moment than the ones obtained at the end, just because of that time differential. It's impossible to get a pure snapshot of your infrastructure. So that's the polling I'm talking about here.

Once you've collected all this data, you can put it into a time series database. InfluxDB is an example of this. Prometheus does this. Even Nagios, the old dog of a monitoring system from the nineties, does this: it has its own proprietary database that stores current check results and a little bit of history, so you can query back and figure out what happened in the past.

Finally, there's event filtering, because not every piece of data that comes in is actionable. We only want to alert an administrator if there's something that absolutely requires the administrator's attention. In my opinion, whenever we get alerted by a page, that means there are two problems. Problem number one is whatever I was called on to remediate: get
out of bed, log in to the server with sleep in my eyes, whatever. The second issue is that I got paged in the first place, because why couldn't these incredibly intelligent systems that we build resolve the problem themselves? So, as I just said, the outcome should be correcting the issue and increasing the intelligence of our systems. Now, we're at ChefConf, so we know some really interesting things that we can do with a chef-client.

OK, so let's talk about metrics. A metric might be, for example, the current CPU utilization on a system, or how much virtual set size of memory is assigned to a process. This is fairly traditional, and we don't need to talk about containers here, because at the infrastructure level everything is a process.

So we can talk about first-order metrics, and those are the ones I just called out. CPU utilization on a single node: that's a first-order metric. This is terminology that, as far as I know, I'm making up, so if you've heard of established terminology for this, please come to me after this talk so we can discuss. A first-order metric is one thing on one system at one point in time. These are simple things, things that are easy to get at right now from within the chef-client. Is a service running? That's built in: we can just say it should be running, and chef-client will remediate. Here it is in a pseudo-mathematical expression: m would be the metric, n would be the node, and t would be the time [inaudible].

Now, I think that second-order metrics are more interesting than first-order metrics. That's where I take any one of these points and turn it into a line. Just one of these points, because when you turn multiple points into lines it gets even more complicated, and this is really hard stuff to understand. For example, multiple metrics on one node at one time: one thing turns into a line and the other two stay as points. So far so good. Multiple metrics would be, say, memory utilization and disk I/O
together on one node at one moment (not increasing, sorry). That could mean that the system is thrashing, for example, and we can tell that just from reading two metrics on one node at one point in time.

We can also consume one metric on multiple nodes at one point in time, and this is so valuable in our current infrastructures, because we rarely run a service on just one node if we care even the slightest about it. Having run infrastructure, we care about our systems, so we run clusters of applications. If I have 20 servers running the same application behind a load balancer and one of those servers goes down, I don't want to get woken up. I have 19 more, and I might have to lose 18 more of those before I even start to pay attention. But a lot of monitoring systems, especially if you're only worried about first-order metrics, are only going to alert in a very simplistic way, so the administrator is going to get woken up: service died at 3 AM on node 19, and you still had 19 more that were running the service. This diverges from the principle I mentioned, that we are interested in outcomes. The cluster status would be readable as one metric on multiple nodes at one time: what is the overall health of the cluster? Keeping it simple, that would be whether a service is up, for example, on the whole class of nodes that is responsible for that service at that time.

We can also talk about one metric on one node at multiple times. An example of how this might be useful is looking at log disk utilization. If my log disk is filling up at a rate of 1 percent per month and it has just reached 90 percent, don't wake me up. But if it's filling up at 1 percent every 10 minutes, wake me up and we will deal with it. There are still two problems, because that means my system was misconfigured and I have a potential disk-full situation. So I have two incidents, in effect. And one principle of
monitoring that I care about as well is this: monitoring is descriptive, and configuration management is prescriptive. Configuration management is where we define a specification of what our environment is supposed to look like, and the configuration management software is responsible for making it so. Monitoring, on the other hand, describes that same desired state, but it will not make any alterations in order to make it happen, because the point of monitoring is only to alert the administrator when human intervention is required. Monitoring is capable of doing remediation in many cases, and I believe that functionality overlap should be minimized, because if you're going to use a monitoring system to remediate a service situation, why are you using configuration management to begin with? Just use monitoring for everything.

Also remember that monitoring typically runs as a non-root user, while the chef-client runs as root. So if you do want to use a monitoring system to configure everything, you're either going to wind up running monitoring agents as root, which stinks, because they're historically insecure, or you're going to wind up with a whole bunch of sudo rules that you have to maintain, and you're going to have a brittle infrastructure. It's terrible. So use monitoring for monitoring, use configuration management for configuration management, and let the two live together.

I came up with a scenario this morning during the keynotes, because I thought it was fun. It's a fun little hypothetical scenario about using monitoring with Chef that illustrates how I wouldn't want to do it. OK, so let's pretend for a moment
that a system administrator is responsible for a meth lab. As any smart engineer would, this system administrator uses Chef to manage the configuration of the systems. But a meth lab is full of volatile substances, and it's prone to unwanted law enforcement intervention, so there are a number of sensors. So the system administrator, doing what they know, installs Nagios, hooks it up to the sensors with some Perl glue, and
creates a system that looks somewhat like this. On the top left you can see we have a fence with some sensors attached, in case the fence is breached by some unknown party. We've got intrusion detection, and Nagios does its thing [inaudible]. And then we have Chef retrieving data from this Nagios, and when it runs, it's going to set off an alarm if the fence breach service has gone critical. So far so good. To no
surprise to anyone, the SWAT team shows up. But the system works exactly as designed. The SWAT team comes out and they cut the fence. Nine seconds later, Nagios detects the breach and sets the service to critical. Twenty-seven minutes later, Chef runs and activates the alarm. Success! Works perfectly. Clearly, this illustrates that monitoring data can be dangerous, and that the use case for it is not necessarily alerting. So let's
talk about the dangers, like delayed action. Chef runs by default every half hour, and the solution to this problem is not just to run the chef-client every 15 seconds. I've seen a place that ran it every 5 minutes or something like that, and they just couldn't keep the servers up. So delayed action is a risk.

Stale data: with monitoring systems, it's very likely that the data you're getting was valid as of some moment in the past, and during the interval between when the data was generated and when the chef-client consumes it, changes could have happened. It's important to know that so that you can make good decisions.

It also introduces additional complexity, because you have to think about who's doing what. Where is this data coming from? Does the chef-client see something different locally? You've got more moving parts, which means more things that can break, more things to monitor, and more reasons to get woken up, which stinks. And there are more failure modes, because there are more moving parts you have to think about. What about the link between the chef-client and the monitoring server? What happens if the link between the monitoring server and the things it is monitoring goes down? What happens if there's a bug in the code that I wrote that talks to the monitoring server?

But it's also very useful when we're trying to do things across a set of services. A quorum-based deployment, for example: I want to deploy my application using Chef to a set of nodes, but if the node that the chef-client is running on at that moment is the only working node in the entire cluster, I probably don't want to deploy to it at that moment. Maybe I want to wait until things look a little better, the next Chef run or something like that. Provisioning automation is the same sort of thing: if I have five dead nodes in a six-node cluster, maybe I'd like to tell my provisioner to spin up some more. Master role assignment: we've had situations where
we're spinning up a number of nodes in parallel on a cloud or something like that, and somebody needs to be the master and the others need to be replicas in this cluster of machines. They're all coming up at the same time, so you can either declare that the first one on the list, according to some naming convention, becomes the master every time, or you can ask the monitoring system: do we have a master? If not, I'm going to become the master; otherwise I'm going to become a replica. But monitoring data is inherently at risk of being stale, so you could wind up with two masters, in which case distributed locking becomes useful.

You can also weave together a cheap kind of service discovery. If I need to know the IP of the active load balancer for this situation, I can ask the monitoring server. It should know.
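
The master-election idea above, together with its stale-data caveat, can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: `CheckResult`, `choose_role`, and the 60-second freshness limit are hypothetical names and values, not part of Chef or of any monitoring product.

```ruby
# Hypothetical shape of a check result handed back by a monitoring API:
# masters_up is how many masters the monitor last saw, sampled_at is when.
CheckResult = Struct.new(:masters_up, :sampled_at)

# Assumed freshness limit; data older than this is treated as unknown.
MAX_AGE_SECONDS = 60

# Returns :master, :replica, or :undecided for the node asking the question.
def choose_role(result, now: Time.now, max_age: MAX_AGE_SECONDS)
  # No answer, or an answer that is too old, is not a basis for a decision.
  return :undecided if result.nil? || (now - result.sampled_at) > max_age

  # Nobody is master yet, so this node claims the role; otherwise replicate.
  result.masters_up.zero? ? :master : :replica
end

now = Time.now
puts choose_role(CheckResult.new(0, now - 5), now: now)    # fresh, no master seen
puts choose_role(CheckResult.new(1, now - 5), now: now)    # fresh, master exists
puts choose_role(CheckResult.new(0, now - 600), now: now)  # stale sample
```

A freshness check like this narrows the window but does not close it: two nodes that ask at the same instant can both see "no master", which is the case where the distributed locking mentioned above is still needed.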
So now let's talk about how we can make this monitoring data available to the chef-client. Well, monitoring services have APIs. APIs are awesome, because that means I don't have to store the data anywhere else while it's on its way from the monitoring server to the client.

Data bags are another way to do it. I could write some sort of program that would regularly poll the monitoring server, gather all the information based on current and historical state, write that to data bags, and upload them to the Chef server, so the chef-client only has to pull that state. But I'm not a big fan of that, because in a big, complex environment, say 10 thousand nodes, we have upwards of a million metrics, which means that my poor chef-client might be downloading all this data every time it runs. This could easily overwhelm the monitoring server or the Chef server, just because of the sheer volume of data. So that's not a good fit.

Local Ohai plugins: this is another thing I started trying out. Maybe I can write an Ohai plugin that runs on each of my chef-client nodes and downloads just the relevant bits of information about that node or the services that node cares about. That gets really complicated really quickly and doesn't work so well.

Or, let's say, I can implement the same thing as data bags using local files written to disk or an external database, but that has all the same problems as data bags, so that's not really a solution.

We're trying to minimize overhead. We're trying to maximize simplicity. We want to get access to these metrics so we can act on them, and we don't want extra things to monitor. We don't want extra things to break. You can probably see my perspective as someone who's been woken up a lot of times. So let's implement this
right now. I wrote a little bit of code ahead of time so you don't have to watch me type.
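
As a rough picture of the "just ask the API" option, here is a minimal sketch that uses only Ruby's standard library. It assumes the Prometheus v1 HTTP API (`/api/v1/query`), which may differ from the API version used in the talk; the host name and the helper names `instant_query_uri`, `parse_result`, and `instant_query` are made up for illustration and are not the speaker's code.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Build the URI for an instant query against the Prometheus v1 HTTP API.
def instant_query_uri(base_url, promql)
  uri = URI.join(base_url, '/api/v1/query')
  uri.query = URI.encode_www_form(query: promql)
  uri
end

# Pull the "result" field out of a response body; nil unless the server
# reported success.
def parse_result(body)
  doc = JSON.parse(body)
  doc['status'] == 'success' ? doc.dig('data', 'result') : nil
end

# One HTTP GET per question; nothing is cached or stored in between.
# Needs a reachable server, so it is defined here but not called.
def instant_query(base_url, promql)
  parse_result(Net::HTTP.get(instant_query_uri(base_url, promql)))
end

# Canned body in the shape Prometheus uses for a scalar result.
canned = '{"status":"success","data":{"resultType":"scalar","result":[1435781451.781,"85"]}}'
puts instant_query_uri('http://prometheus.example:9090', 'up')
p parse_result(canned)
```

Keeping URL building and response parsing separate from the network call is a design choice of this sketch: it lets both be exercised without standing up a server.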
Can you see my screen? Is it bright enough? Is it too small for the people in the back? I have terrible eyesight, so I feel for you. Talk to me. All right, thank you.

All right, so you can see that I've got a cookbook called Prometheus metrics, and there's not really a lot going on in this cookbook, because this is a library cookbook. All I want to do is use whatever Chef gives me to talk to my monitoring service, which in this case is Prometheus, and make that data available to whatever chef-client, whatever recipe or cookbook is wrapping it. So all the interesting stuff is going to be in the tests and the libraries. Let's go to the tests first, because we all test first, right? In production. All right, so here's a really quick spec, and let's walk over what this does real quick. I have a fake API URL, since I'm only stubbing out my Prometheus server. I don't actually need a Prometheus server in order to run a test for
something that consumes data from one, and I sure as heck don't want the people who inherit my cookbook to have to stand up their own Prometheus server every time they want to test stuff. My favorite method is the interesting one, so let's skip to that: current cluster health. The scenario I had in mind was that I only want to take an action, for instance deploy my application, when the current cluster is healthy. So then I have to describe what healthy and unhealthy mean: what's the happy case, and what's the sad case. So I made a sad client and a broken client, on top of the happy client, which is the normal one, just called client.

What I've defined is three behaviors that we have to think about. If the cluster is unhealthy, then I don't want to push. If the cluster is healthy, I do want to push. Now, broken is a funny case. Broken means I can't talk to Prometheus, so as an administrator I have to decide how I want my chef-client to behave in that case. Do I want to say, well, I don't know what the cluster status is, so I'm just going to go ahead and deploy? Or do I want to say it's probably not safe to deploy at this point? That depends on your use case. My personal preference is to go ahead and do it, because maybe something I did is what broke contact with the Prometheus server [inaudible]. OK, anyway. So, test first: I've written the tests before everything else. These are deep-in-implementation tests, and, for instance... well, you will
see that, and when you see it you'll understand. All right, let's go to the libraries. You can see I wrote this library called Prometheus metrics. Have you used a library before in Chef? OK, great. Nothing special here: this is just a pile of helper methods that I'm making available to the cookbooks that wrap this one, or that exist in the same environment and declare this one as a dependency. Now, I'm using JSONClient. The JSON
client is a class that comes with the httpclient gem. I picked the httpclient gem not because I thought it was the best thing on earth but because it's already included with Chef, so you don't have to install anything additional on your system, which means less config to manage and less stuff to break, and that's how I like it. I'm using syntax folding, in case you're wondering what this weird thing on the screen is; it lets me control what's shown. Rather than putting this in the Chef recipe class specifically, I make a module full of methods and then include that in the class that people use. I find that it's a little easier to relocate the code and refactor things when I do it this way. Now here's the fun part. So we have
an initialize method that will go into whatever class includes this module, and that just takes the base URL of my Prometheus server. I can pass that in as a node attribute from the recipe that calls it, or something like that, or I can write a fancy resource if I want to go that far. Then, very simply, I have a health method that sends whatever query I give it through the client and gives me the result from the JSON that comes back from the Prometheus API. Also very straightforward. So health expects a query, and a query is anything that's searchable for metrics or status in Prometheus.

The heavy lifting happens over here, in this current cluster health method. For the purposes of this presentation I shoved this all into one class. In a real production implementation I would split this up slightly and allow the wrapper cookbook to inject the correct current cluster health method into the containing class. That way it can be stored right with the code that's responsible for maintaining the services that comprise the cluster.

So what does this code do? It checks the result. In this case the result is a scalar. What that means in Prometheus land is that it's an array with two elements: the first element is a timestamp, and the second element is a number that indicates, in this case, the percentage of nodes in the cluster that are healthy.

The reason I chose Prometheus for this demo is that it's easy to put together second-order metrics: I can make metrics that are purely composed of other metrics. So I can say, hey Prometheus, I want you to monitor the status of this important service on each of these 20 nodes, and I also want you to have another metric that is the percentage of nodes that are reporting OK. That makes it really simple for me to consume the current cluster health. I do one query: cluster type, cluster health. The cluster type might be DB, web, API, whatever it is
you happen to have. This is my made-up naming scheme; you probably have your own. If the result is that less than 50 percent of the cluster is healthy, then I say no: whatever it was you were thinking about doing, don't do it, because that's bad. Otherwise I say OK. And if I can't talk to Prometheus, that's OK too: just do whatever it is you wanted to do and assume that everything's fine. How might I use this in a recipe?
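
Before turning to the recipe, the decision logic just described (below 50 percent healthy means no, otherwise OK, and an unreachable Prometheus also means OK) can be sketched as follows. This is not the speaker's library: the module name, the `<cluster_type>_cluster_health` query naming, the `'OK'`/`'NO'` return strings, and the injected `fetch` callable are all assumptions made so the example runs without Chef, the httpclient gem, or a Prometheus server.

```ruby
# Hypothetical stand-in for the library described above.
module ClusterHealthSketch
  UNHEALTHY_BELOW = 50.0 # percent of nodes reporting OK

  # fetch is any callable that takes a query string and returns the scalar
  # pair [timestamp, value]; it should raise if the server is unreachable.
  def self.current_cluster_health(cluster_type, fetch)
    _timestamp, value = fetch.call("#{cluster_type}_cluster_health")
    Float(value) < UNHEALTHY_BELOW ? 'NO' : 'OK'
  rescue StandardError
    # Fail open: if monitoring cannot be reached (or the value cannot be
    # read as a number), carry on as if the cluster were healthy.
    'OK'
  end
end

# Three fake clients, mirroring the happy, sad, and broken cases.
happy  = ->(_query) { [1435781451.781, '85'] }
sad    = ->(_query) { [1435781451.781, '20'] }
broken = ->(_query) { raise IOError, 'cannot reach monitoring server' }

puts ClusterHealthSketch.current_cluster_health('web', happy)
puts ClusterHealthSketch.current_cluster_health('web', sad)
puts ClusterHealthSketch.current_cluster_health('web', broken)
```

Failing open matches the personal preference stated in the talk. An environment where a blind deploy is riskier than a skipped one would return `'NO'` from the `rescue` branch instead.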
Let's have a quick look at what we've got.
OK. First of all, I always put a comment on code that runs at compile time, because it tends to bite people. I write most of my Chef code for people who don't know Chef as well as I do, and I don't want to bother them if I don't have to. So I say: note, this runs at compile time. I connect to the Prometheus server, referring to a node attribute, the name of which I made up, and I pull the current cluster health. If I get OK, we're good. If I get anything that isn't OK, "we good" will evaluate to false. So far so good. And then I just have a file, "do not push", unless we're good. This is a very simple example, but I think it illustrates the principle that I'm trying to show. However, note that
many monitoring servers have APIs. I picked a few off the top of my head. Prometheus. Sensu. [inaudible], who are downstairs, and if you haven't chatted with them, they're pretty awesome: their API is read-write, so you can in fact change metrics through it. Others, like Nagios: you can add an application that provides API support, so you can actually have a reasonable RESTful API to communicate with Nagios [inaudible]. SolarWinds has a REST API, as does Zenoss. And I'll bet you that your favorite monitoring product also has one if it's not on this list.

All right, and the obligatory "we are hiring": New Context is always looking for sharp engineers. We do a lot of consulting services, and we also develop products around lean security, which has a lot to do with DevOps but isn't only that. And I'll take questions. Questions, anyone? Do we have questions?

So you mentioned that
there were several that this works with on that last slide. Does this work with all of the commercial monitoring providers? For instance, we use the BMC tools.

[inaudible] You can use Chef to query anything that has an API. So as long as your favorite monitoring toolkit has an API, you can talk to it from the chef-client. Any more questions? I know it's getting late in the day, and you've heard a lot of people talk at you. All right, looking around. Last one. What if