
Starting the sysadmin tools renaissance: Flapjack + cucumber-nagios


Formal Metadata

Title: Starting the sysadmin tools renaissance: Flapjack + cucumber-nagios
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract

Monitoring software is ripe for a renaissance. Now is the time to build new tools and rethink our problems. Leading the charge are two projects: cucumber-nagios and Flapjack.

A systems administrator's role in today's technology landscape has never been so important. It's our responsibility to manage provisioning and maintenance of massive infrastructures, to anticipate ahead of time when capacity must be grown or shrunk, and increasingly, to make sure our applications scale. While developer tools have improved tremendously, we sysadmins are still living in the dark ages, other than a few shining beacons of hope such as Puppet. We're still trying to make Nagios scale. We're still writing the same old monitoring checks. Getting statistics out of our applications is tedious and difficult, but increasingly important to scaling.

cucumber-nagios lets you describe how a website should work in natural language, and outputs whether it does in the Nagios plugin format. It includes a standard library of website interactions, so you don't have to rewrite the same Nagios checks over and over. cucumber-nagios can also be used to check SSH logins, filesystem interactions, mail delivery, and Asterisk dialplans. By lowering the barrier of entry to writing fully featured checks, there's no reason not to start testing all of your infrastructure. But as you start adding more checks to your monitoring system you're going to notice slowdowns and reliability problems - enter Flapjack.

Flapjack is a scalable and distributed monitoring system. It natively talks the Nagios plugin format (so you can use all your existing Nagios checks), and can easily be scaled from 1 server to 1000. Flapjack breaks the monitoring lifecycle into several distinct chunks: workers that execute checks, notifiers that notify when checks fail, and an admin interface to manage checks and events. By breaking the monitoring lifecycle up, it becomes incredibly easy to scale your monitoring system with your infrastructure. Need to monitor more servers? Just add another server to the pool of workers. Need to take down your workers for maintenance? Just spin up another pool, and turn off the old one.
Transcript: English (auto-generated)
Hello, I'm from Sydney, Australia. I flew all the way here for FOSDEM. And today, I'm going to be talking about making monitoring delicious again. So obviously, this talk is going to be about monitoring, right?
But first things first, we need to get some terminology out of the way so we're all on the same page. So we have the concept of a check. And a check's purpose is to perform some sort of verification or validation that something is working the way that you expect it to. Developers also know these things as unit tests.
And this is an example check. It's very simple. We're just pinging four times. And generally, what happens at the end of that is it will return good, bad or ugly, depending on whether what you were testing was within the parameters that you were expecting. And a monitoring system is constantly
monitoring for failing checks. So basically, it's running through this gigantic list of things that you want to check. And it's going to notify if something is amiss, something is not the way that you expected it to be. So monitoring systems, then, are essentially asking three questions. They're asking, what is the next check that I need
to perform? Was the check OK after I executed it? And who do we need to notify? Or do we need to notify anybody at all? So we take these three questions. And they actually map into these three distinct phases, the fetch, the test, and the notify phase. So if we represent that in a diagram,
it's basically this gigantic circle that's going around and around and around, right? The fetching, the testing, and the notifying. And within those phases, there are actually some subphases. In the fetching phase, we're doing some sort of lookup, maybe from a database or from a flat file or wherever. Then in the testing phase, you've got the execution of the check and then verifying the result.
And then in the notification phase, you're deciding whether you need to notify anybody. And if you do need to notify, then we need to call out to some other system to do that, whether that be via SNPP or XMPP or whatever the protocol is. And traditionally, monitoring systems have done this within a single process. So are some microphones still going?
Great. So traditionally, monitoring systems have done this within a single process. And it's been treated quite monolithically. You might be using threads or whatnot within that single process. But generally, this is all happening on the same machine. And if you look at other things, like clustered Nagios and whatnot, generally, they're just replicating this across a bunch of different machines.
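The single-process lifecycle being described boils down to something like this; a conceptual sketch only, with a hard-coded check list purely for illustration:

    # Conceptual sketch of a traditional, single-process monitoring loop.
    checks = [{ :name => 'ping gateway', :command => 'ping -c 4 -q 192.168.1.1' }]

    loop do
      checks.each do |check|              # fetch: what do I need to run next?
        output = `#{check[:command]}`     # test: execute the check...
        ok     = $?.success?              # ...and verify the result
        puts "ALERT: #{check[:name]} failed: #{output}" unless ok  # notify
      end
      sleep 60
    end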
But all these different processes are just happening in one place. And the thing that you realize about monitoring when you look at it in these terms is that it's actually what's called an embarrassingly parallel problem. And that's one for which little or no effort is required to separate the problem into a number of parallel tasks. And this is the case when there
are no dependencies between the things that are actually happening within the system. So if we recognize that it's an embarrassingly parallel task, you can start thinking about common data that needs to be sent between all these different components. So in this particular case, in the fetch and the test and the notify phase, we're sending
around an ID of a particular check and the command that we need to execute. So that's being sent here between the fetch and the test phase. And then on the notify phase, we're sending the same ID and the result that we got after executing that test. So we can actually collapse these into single phases themselves.
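Concretely, the two messages are tiny; something like this (the field names are illustrative):

    # fetch -> test: the check's ID and the command to execute
    check_message  = { :id => 42, :command => 'check_ping -H 192.168.1.1' }

    # test -> notify: the same ID and the result of executing the check
    result_message = { :id => 42, :retval => 2, :output => 'CRITICAL - host unreachable' }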
You can't perform a test without having a fetch, right? And in the same way, you can't perform a notify without fetching some data or some description. So this cycle itself can actually be broken out into two distinct cycles. We've got the testing cycle and the notifying cycle. And then you have some sort of transport mechanism in between
to send the data backwards and forwards. And once we've done that, we can actually start making some other assumptions, like pre-compiling the checks that the testing phase needs to do. So we can make that a very computationally inexpensive operation, right? It doesn't cost a lot to actually look up the checks
that we need to perform. We can do other fancy things, like making the transport the scheduler. So the test phase doesn't actually care about when things need to be executed. The workers just know that they need to execute something now. And the transport is actually doing all that scheduling stuff for us. The other thing that we can do is we can remove the data
collection from the monitoring setup entirely. We can use other tools, like Ganglia or collectd, to do that for us. And we can just focus on doing the monitoring itself, the actual notification. So we've got these distinct cycles here, with data going backwards and forwards between them. And this is where Flapjack comes in.
Flapjack is a tool that I've been writing for the last year or so. And it follows exactly the same principle. You have the workers, which are doing the testing phase, and the notifier, which is doing the notifying phase. And then you have BeanstalkD, which sits in the middle, that is doing the communication between all the different bits. And then for the pre-compilation
that I was talking about a second ago, we have a Populator, which is just getting some data out of a database, or however you want to represent your checks, and injecting it onto the Beanstalk. So a worker just needs to go, OK, give me the next check. And the Beanstalk makes it available to it. The nice thing about that is then we can start parallelizing the number of workers that are actually
executing those checks. It doesn't just have to be a single worker. You can spin up as many workers as you want to deal with whatever workload you have. So if we look at Flapjack, Flapjack is written in Ruby. It aims to be distributed, scalable, and it talks the Nagios plugin format, because there isn't a lot of point in reinventing the wheel. It aims to be easy to install, easy to configure,
easy to maintain, and easy to scale. And it should be just as easy to scale your Flapjack instance from one machine to many machines to execute the checks across many machines. So instead of just keeping it on like a single machine and running it, you can distribute the execution of that across as many machines as you want.
So now that we've split up the monitoring lifecycle, we want to look at the individual components that Flapjack uses to achieve this goal. And before that, we actually need to look at Beanstalk, which is the messaging transport system that makes all this possible. So Beanstalk D is a simple fast work queue
service that lets you run time-consuming tasks asynchronously. It's written in C. It's based on the memcache protocol, so it's very, very lightweight. You install it on your operating system using your distribution's package manager. And you start up a daemon here. Generally, your distribution will provide an init script for doing that for you. So within Beanstalk, it's just
like a lot of other messaging systems where you have this whole idea of producers and consumers. So a producer, if we look at the first three lines here, is just connecting into this Beanstalk, and it's putting some information on the Beanstalk. And then the consumer here is connecting into the same Beanstalk, and it's looping forever. And what it's doing down here on this Beanstalk reserve method
here is it's just blocking until a job is made available to it. Then once it's got the job, it will just put out the job body, and then it deletes the job off the queue once it's done. And this is essentially the way that Flapjack itself works. The workers and the notifiers are consumers,
and the admin interface and the populator are the producers. And beanstalkd has a couple of useful features that make this whole thing really easy to do. So by default, when you connect into a beanstalkd, it just connects to a named queue called default. But Beanstalk has the concept of tubes, which are basically named queues, right?
So in this particular case, we have a checks tube and a results tube. And so that means that we can put the workloads on the individual tubes, and they don't ever have to touch one another. So the workers are just connecting into the checks tube, and the notifiers are connecting into the results tube.
The other nice thing that the Ruby bindings for beanstalkd provide is YAML serialization: an easy way to serialize and deserialize actual Ruby objects when you put them onto the tube. So that means that you can deal with Ruby objects at either side of the message queue, and everything is nice.
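As a rough sketch, the producer and consumer pattern being described looks something like this with the beanstalk-client Ruby gem; its yput and ybody helpers do the YAML serialization, and the tube name and job contents here are just illustrative:

    require 'rubygems'
    require 'beanstalk-client'

    beanstalk = Beanstalk::Pool.new(['localhost:11300'])

    # Producer: put a check onto the named 'checks' tube, serialized as YAML.
    beanstalk.use('checks')
    beanstalk.yput({:id => 42, :command => 'check_ping -H 192.168.1.1'})

    # Consumer: watch the same tube, block until a job is available,
    # process it, then delete it off the queue.
    beanstalk.watch('checks')
    loop do
      job = beanstalk.reserve
      puts job.ybody.inspect   # the deserialized Ruby hash
      job.delete
    end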
So if we look at these components again, we've got the flapjack worker. And I like to describe the worker using this little story of the eternally forgetful shopper. So this is the shopper, right? And he goes into the shop, and he wants to buy something. And he's looking around, and he finds the thing that he wants.
And he goes to the checkout and pays for it. And going back to his car, and he's thinking, oh, crap. I forgot something. I have to go back into the store. So he goes back into the store and searches for the next thing, and finds it, and checks out, and blah, and does it again and again and again and again. So this is the way that the flapjack workers themselves work.
So the worker is basically in this gigantic loop that's saying, give me the next check that I need to do something with. Then it will execute that check and capture the output, take the return code from that, and store it. And then it takes the output of all of this, and it puts it onto the results queue as a result.
And it sets the check ID here, puts the output on there, and also puts the return value. But the fancy thing that it does is then it takes the same check, and it recreates it on the tube, or sorry, on the Beanstalk, but at the very, very end of it. And it sets a delay on it. And Beanstalk D won't make that check
available to other workers until that timeout has happened. So for instance, the frequency here might be set to 30 seconds. So the Beanstalk won't make that job available for 30 seconds. And then what it does is it deletes the check off the queue and just goes, and it does the next thing.
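Put together, the worker loop just described looks roughly like this. It is a sketch rather than Flapjack's actual source, and the job fields (:id, :command, :frequency) are illustrative:

    require 'rubygems'
    require 'beanstalk-client'

    checks  = Beanstalk::Pool.new(['localhost:11300'])
    checks.watch('checks')       # consume checks from the 'checks' tube
    checks.use('checks')         # ... and re-queue them onto the same tube

    results = Beanstalk::Pool.new(['localhost:11300'])
    results.use('results')       # hand results to the notifier via the 'results' tube

    loop do
      job   = checks.reserve                  # "give me the next check" (blocks)
      check = job.ybody

      output      = `#{check[:command]}`      # execute the check, capture its output
      return_code = $?.exitstatus

      results.yput({:id => check[:id], :output => output, :retval => return_code})

      # Re-create the check at the very end of the queue with a delay equal to
      # its frequency, so no worker sees it again for e.g. 30 seconds.
      checks.yput(check, 65536, check[:frequency])

      job.delete
    end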
So the worker is very, very simple. It just starts up, attaches to the console by default, and you can pass it a bunch of options. Generally, you're using it with the worker manager, though. So by default, when you run the worker manager here on the first line, that will start up five workers. Then you run it with the workers option passed to it, and that will start up another 10. So that means you have 15 running.
And then you run stop, and that will stop all the workers that are currently running on the system. The nice thing about this approach is that you can do near-linear scaling. So it means that the more checks that you have in your system, the more workers you spin up. And Flapjack copes with that extra load quite well. It also lends itself quite well to failover scenarios, where you have part of your worker cluster go down,
and you just want to be able to get back up and running. So say you have some sort of maintenance window that you need to have, where you need to take down half of your cluster, but you want your monitoring system to keep on running. So you spin up a whole bunch of new workers. You take down the part of the cluster that you don't care about, or sorry,
that you do care about that you want to do your maintenance on. Do whatever work you need to do, then bring them back up, and everything is fine. And the monitoring system keeps ticking over like there aren't any problems, like everything is completely normal. So the next part of the system, and probably the coolest part, is the Notifier itself. So Notifier works just like the workers,
in that it starts up, attaches to the console. There are a few more options that you can pass to it for configuration and whatnot. So, and you also have the manager as well, and that's generally the way that you're starting it. But for debugging, starting it interactively and seeing it works quite well. So we have this recipients configuration file here, which eventually will probably be moved out
into a database, but it's very, very simple. It's just an .ini file. You specify a bunch of stuff here, and all of this information is made available to the Notifiers when they decide that they need to notify. Then we have the Notifier configuration, which sets up all sorts of deep, dark, mystic stuff inside Flapjack, but I'll talk about all these different sections here in a second.
So probably the coolest thing about Flapjack is the APIs, and I truly believe that all parts of the monitoring lifecycle should have as many hooks in it as possible so that you can customize Flapjack to make it as easy as possible to make it fit your environment, basically.
So there are three APIs that Flapjack exposes that make it really easy to customize. We have the Notifiers API, the Filters API, and the Persistence API. So the Notifiers API is very, very simple. You just create a Ruby object, and in the constructor, you get passed a list of options
that you can do with as you please, and then you implement a Notify method, and when the Notify method is called, it will be passed a who, so the person that we need to notify, and the result that we need to notify about. So this lends itself to some really interesting things, like, say, a mock NRPE instance,
where you could use Flapjack to do all the execution of your checks, like with your existing Nagios monitoring system, but it doesn't actually do any of the notification. It just feeds the information back to Nagios, so you can use Nagios at the same time as using Flapjack, and they run in parallel.
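A minimal sketch of such a notifier; the class layout, the options, and the email transport are illustrative assumptions rather than Flapjack's actual code:

    require 'net/smtp'

    class EmailNotifier
      def initialize(options = {})
        @from   = options[:from]   || 'flapjack@example.org'
        @server = options[:server] || 'localhost'
      end

      # who    - the person to notify (assumed here to be a hash with an :email key)
      # result - details of the failing check (assumed here to be a hash)
      def notify(who, result)
        message = "Subject: check #{result[:id]} failed\n\n#{result[:output]}"
        Net::SMTP.start(@server) do |smtp|
          smtp.send_message(message, @from, who[:email])
        end
      end
    end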
The next thing is an elastic notifier. R.I. Pienaar, down here, wrote a fantastic tool called mcollective, and what that allows you to do is large-scale system orchestration. So, in simple terms, what you could do with an elastic notifier is say Flapjack is telling itself that it's not able to keep up
with the number of checks in the system, because you've loaded in a whole heap of extra checks. So an elastic notifier would then send out stuff to machines that are ready to run Flapjack worker and say, okay, you should spin up and create a whole bunch of workers. They will deal with the extra load,
and the system basically sort of self-heals and looks after itself and copes with the load. And it also works in the other direction as well, where you have too many machines running the workers, and say you're running this on EC2 or something like that, and you don't want to be paying for all these extra machines, the elastic notifier could do the opposite, where it goes, okay, shut down all these machines
until we've reached the optimal load for the system. The next API is the persistence API, and there's a whole bunch of methods here, and if you look through the documentation, there's a lot of information about how to build different persistence APIs. Everything is very well tested as well, so the tests are a fantastic source for working out how to write your own persistence APIs.
Right now, there are two persistence backends that are provided with Flapjack. There's a SQLite and a CouchDB. I also have a MySQL one in the works as well. The persistence API gives you a whole bunch of advantages, such as subclassing. So let's just say, hypothetically, you have a MySQL backend, and you're using that on your Flapjack instance
in your business, and you find that there are particular workloads that you need to optimize for to make it run faster. So if we take this MySQL backend, and we subclass it, and we call it a MySQL with memcache backend, and we say take the getCheck method, and what we do is we make a call out to memcache first
to see whether we can get a copy of the check from memcache, which is obviously gonna be faster than hitting the database right. So if we don't get something back from memcache, then we just call the original method, which is the original getCheck method on the MySQL class, and that will do the lookup in the database and get that, and then we store that in the memcache, so the next time somebody needs to get
that particular check, they can just get it out of the memcache.
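A sketch of that subclassing idea, assuming a hypothetical MysqlBackend base class with a get_check method and the memcache-client gem; Flapjack's real persistence API may well differ:

    require 'rubygems'
    require 'memcache'

    # Hypothetical subclass: try memcache first, fall back to the original
    # MySQL lookup, and cache the answer for the next caller.
    class MysqlWithMemcacheBackend < MysqlBackend
      def initialize(options = {})
        super
        @cache = MemCache.new(options[:memcache] || 'localhost:11211')
      end

      def get_check(id)
        cached = @cache.get("check/#{id}")
        return cached if cached

        check = super                  # the original MySQL get_check
        @cache.set("check/#{id}", check)
        check
      end
    end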
The other nice thing about the persistence APIs is that they represent all the information in the system using standard Ruby objects, just hashes and arrays and that sort of thing, which lets you do a lot of nifty things like migration. So if we have, say, some testing here, you want to say, okay, I'm using the SQLite persistence backend, and then I run the standard set of persistence tests, and then I migrate to the CouchDB backend here, and then I run the same tests again, and the results should be the same. This is a great way to verify that if you migrate your monitoring system from one persistence backend to another,
that everything works in the same way that it was working previously. You can also do other things like benchmarking. You can build different loads in the system that go, okay, well, let's say I have 30% of my checks that are failing all the time, and then I have 20% that are sort of warning, and then the other 50% are working all the time, and we run all these different benchmarks
across all the different backends and different configuration options, and you can see for your environment what different backends are going to work best for you. And finally, web interfaces as well. The persistence API makes it very easy to build just a single web interface that doesn't care about how you're storing data in the backend. It's just talking over this API.
So it means you write the web interface once, and then you don't have to customize it for each backend that you're dealing with. And the final set of APIs in the Notifier are the filter APIs, and these are probably the coolest feature of Flapjack. So Flapjack takes the approach that we should always be notifying unless there's something that's blocking us from notifying.
So we have this filters chain here, and what this particular method does is it's going through all the filters and it passes in the result, and if any of those filters block, then we don't notify. So let's just take an example filter here. We have an OK filter, and what the OK filter does is it says, okay, if the result is not warning or is not critical,
then we don't need to notify. And then you can couple that very easily with other things like any parents failed. So in a monitoring system, you're gonna have hierarchies of checks where some checks depend on other checks which depend on other checks and whatnot. So if a child check is failing and its parent is failing, you obviously don't wanna notify about that
because the parent check is more important. So this is really easy to do. You can call out to the Persistence API here, passing in the particular check that we're dealing with right now and asking whether any of its parents are failing, and if they are, then we block, and that means that we don't need to notify. So it handles that problem quite elegantly.
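A sketch of that filter chain; the class and method names here are illustrative rather than Flapjack's actual code:

    # The OK filter: block notification when the result is neither warning nor critical.
    class OkFilter
      def block?(result)
        !(result[:retval] == 1 || result[:retval] == 2)
      end
    end

    # The any-parents-failed filter: block when a parent check is already
    # failing, because the parent is the more important alert.
    class AnyParentsFailedFilter
      def initialize(persistence)
        @persistence = persistence
      end

      def block?(result)
        @persistence.any_parents_failed?(result[:id])
      end
    end

    # The chain itself: notify only if no filter blocks the result.
    def notify?(result, filters)
      filters.none? { |f| f.block?(result) }
    end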
And you can also do other things like filters for downtime or for acknowledged alerts or anything like that. The sky's the limit, basically, when it comes to writing filters. The final component of Flapjack is the admin interface, and I won't really talk about that all that much because basically I've thrown out all the code
that I wrote because it was crap, and I'm working on new stuff that's fantastic. So the next important thing about Flapjack is that it talks the Nagios plugin format, and this is really important for a couple of reasons, mainly because there's not a lot of point in reinventing the wheel, because you're just not going to do it right.
The fantastic thing about Nagios and the Nagios plugin format is that it provides a formal interface for writing plugins and consumers. So the interface being exit zero, exit one, or exit two translates to good, bad, or ugly, and you can provide extra information in there as well with the extra reporting stuff.
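For illustration, a minimal check that speaks this interface: print a status line and exit 0, 1 or 2. The URL here is just a placeholder:

    #!/usr/bin/env ruby
    # Minimal example of the Nagios plugin interface: one line of output,
    # then exit 0 (OK), 1 (WARNING) or 2 (CRITICAL).
    require 'net/http'
    require 'uri'

    response = Net::HTTP.get_response(URI.parse('http://www.example.org/'))

    if response.code == '200'
      puts "HTTP OK: got a #{response.code} response"
      exit 0
    else
      puts "HTTP CRITICAL: got a #{response.code} response"
      exit 2
    end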
And the great thing about this is that it's so easy to implement that that's why there are tens of thousands of Nagios plugins out there. Why ignore all of them and switch to something new when they all do a fantastic job of what they do already? And the other great thing is that it's the industry standard in the monitoring world. Everybody understands and talks the Nagios plugin format,
so there's not a lot of point in switching away and trying to convince people to use something that's better because it works quite well. So the other thing about Flapjack is that it really strives to not do any sort of data collection at all. It is essentially a notification system
that things are bad, whatever those things may be. And it leaves the data collection problems and the actual writing of checks themselves up to other projects that do that much better. And it really subscribes to the Unix philosophy of doing one thing and doing it well. So I posit that there are three different types of checks. I think that there are gauges,
which are for getting sort of low-level statistics, like the things that Ganglia or collectd would provide information on. So low-level stats about CPU usage and network usage and all that sort of thing. Then you have behavioral checks, saying when I interact with the system in this way,
am I getting the result that I expect from it? And things like cucumber-nagios do that quite well, and I'm gonna talk about that in a minute. And then finally, trending. And there's nothing really that does that all that well at the moment. The trending is more a function of the monitoring system itself. And eventually, the filters will probably
implement some sort of trending in some way. There's Reconnoiter as well, which is another monitoring system that is doing some interesting stuff with trending. So if you're interested in trending monitoring systems, that's definitely worth checking out. So we're gonna segue for a tiny bit onto cucumber-nagios,
which is another tool that I wrote. And cucumber-nagios is all about web testing and behavior-driven infrastructure. I'll talk about behavior-driven infrastructure in a minute, because it's sort of an out-there term. So very simply, Cucumber is basically an executable specification. So you write in plain, human-understandable language
how you expect a system to be behaving. So in this particular example here, we're saying that when I visit this particular URL, so Google in New Zealand, and I fill in the query with Wikipedia and I press the Google search button, then I should see this particular string on the page. And internally, what Cucumber does is it maps each of those steps over here
to these little Ruby DSL fragments. And what it will do is it will call out to some other system to do the interaction with the websites. And cucumber-nagios makes all this stuff really, really easy to do. So normally you run Cucumber just by itself; it's traditionally a web testing project, but it works quite well in all these other cases as well.
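The mapping from plain-language steps to Ruby looks roughly like this. These are simplified stand-ins for the kind of built-in steps cucumber-nagios ships with; the Webrat-style helpers (visit, fill_in, click_button, response_body) are assumptions here:

    # Each regexp maps a plain-language line from the feature file onto a
    # small Ruby fragment that drives the website interaction.
    When /^I go to "([^"]*)"$/ do |url|
      visit(url)
    end

    When /^I fill in "([^"]*)" with "([^"]*)"$/ do |field, value|
      fill_in(field, :with => value)
    end

    When /^I press "([^"]*)"$/ do |button|
      click_button(button)
    end

    Then /^I should see "([^"]*)"$/ do |text|
      fail("expected to see #{text}") unless response_body.include?(text)
    end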
All these features exist in a single file. So let's just say this is the search feature here. And when you run that, you'll get a bunch of pretty output that says, you know, I ran through all these steps, and they all worked, and it was fantastic. Cool. So what cucumber-nagios does is it does exactly the same thing. It runs through all those steps, and then if it works, then it will output
in the Nagios plugin format, whether it worked or not. And it means that you can write these high-level tests in plain human language and plug them into your monitoring system. So let's have a very quick look at how it works. So the idea is that you install the cucumber-nagios gem.
It's distributed as a Ruby gem. And you run cucumber-nagios-gen to generate a project, in this particular case, fosdem10. And then we cd into fosdem10, and then we run this gem bundle command. And this gem bundle command takes all the different dependencies that cucumber-nagios requires for it to run, and freezes them into the single application.
So that means you can just tar up that directory and then distribute it on your production monitoring environment, and that's it. So if we actually look at the way that it works. So here's one I prepared earlier. So within that, if we go cucumber-nagios-gen feature, say fosdem.org, and we're gonna look at the navigation.
Right, so this goes and it generates a bunch of stuff for us. You guys can see that okay off the back? Great. So if we look here, it's generated just a bit of scaffolding for us.
And if we run that right now, then hopefully that should work, assuming that FOSDEM.org hasn't just gone down. So cucumber-nagios provides a bunch of built-in steps for doing things like interacting with websites. So this is the built-in library saying when I go to here,
or when I press this button, or when I fill in, or all these different things, right? It also has other things like SSH steps, which I'll talk about in a minute, for interacting with machines over SSH and whatnot. But I'll get to that in a second. Anyway, if we go back here and we go, okay, when I follow, oops, when I follow, say, tracks,
when I follow tracks, then I should see, I should see lightning talks.
Okay, so if we run that, right, you can see here that there were four steps, they passed, and that was all great. And say if we modify that to be, and I should see spoons of doom.
Hopefully that isn't on the page. Great, so we've got a critical here of one. So obviously that string wasn't there. So the cool thing about this is you can actually pass a bunch of other options. So if we pass pretty, it'll run through,
and it shows here that this particular thing failed. And if we go up, we see here, and I should see spoons of doom, expected spoons of doom, didn't see spoons of doom. Great. Okay, so yeah, that's cucumber-nagios. And you can do a bunch of other interesting stuff like this new term called behavior-driven infrastructure.
So just after I presented cucumber-nagios in October last year, Martin Englund from Sun piped up on the Puppet users mailing list saying, hey, I've played around with this cucumber stuff before, and wouldn't it be sort of cool if we could take all this cucumber stuff
and apply it to the idea of configuration management or build management? And he basically put together this blog post describing how he was using cucumber to verify the builds of his system. So the interesting thing that came out of the discussion from this was that you can actually think of Puppet as being a build tool for configuring systems, right?
So the build tool, or like a programming language. And then Cucumber itself being a testing tool to verify that your systems are configured in the way that you expect them to be configured. The other interesting thing about this is that it's not Puppet-centric, right? You could use CFEngine or Chef or do your own hand-rolled configuration. And the hand-rolled configuration thing
is actually quite interesting because let's just say, hypothetically, you have a bunch of machines that aren't puppetized and that have been sort of crafted over the years and nobody really knows what's going on with them, but you wanna migrate to a configuration managed environment. So you could use Cucumber and cucumber-nagios to describe how the system is currently working,
testing that all these different behaviors and interactions work the way that you expect. And then once you've done that, you can build a bunch of stuff with Puppet or Chef or CF Engine or whatever. And you basically iterate in your configuration management tool until all your tests are passing.
So there are a bunch of other things that are in the works like, say, mail server tests. So let's just say I wanna have a bunch of local logins for my mail server. So say that when I don't have any public key set and I SSH to this machine with this username and password, all this stuff should work. It also works for LDAP logins or whatever sort of authentication system that you're using.
And then other things like mail, right? So you're saying that when I am using this mail server and I log in with this username and password and I send this mail to this person, then it should send correctly. And obviously, the next step of this is the receiving at the other end, right? We can check that the delivery works okay, but if the user isn't receiving mail at the other end,
it isn't really all that useful. So the question is then, why would I wanna do this? The thing about monitoring right now is that most checks are actually asking the wrong questions. Most checks are doing some sort of ping or a TCP connect to verify that something is the way that you expect it to be.
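That kind of check looks something like this: a bare TCP connect, with a placeholder host and port:

    # The classic "wrong question" check: it only proves that something is
    # listening on port 80, not that the application behind it is behaving.
    require 'socket'

    begin
      TCPSocket.open('www.example.org', 80).close
      puts 'TCP OK: port 80 is accepting connections'
      exit 0
    rescue SystemCallError, SocketError => e
      puts "TCP CRITICAL: #{e.message}"
      exit 2
    end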
And those things are basically asking, is my server up or can I see my application, right? That doesn't deal with a bunch of edge cases like a VM going down and the network stack being up. Obviously, it's still gonna respond to ping, right? Or it doesn't matter if your web server is up, if you're serving 404s all the time or 500s, it doesn't really matter, right?
And that basically means that your monitoring system is dead in the water. So cucumber-nagios allows you to ask the right questions a lot more easily. Things like, is my app behaving? Can I navigate around my website? Can I place an order? Can I sign in? All these different things. And we can actually start thinking of monitoring
to be sort of like continuous integration. So a traditional CI lifecycle is something like this, where you have the check out, the build, the test, and the notify phase, right? So if we think of monitoring as being continuous integration for production apps, this is actually an interesting idea because we can actually take the CI lifecycle,
strike out the check out and the build phase because somebody's already built the software for us, and we're just doing the testing and the notification. The funny thing about this is that this also looks really similar to those diagrams that I had earlier about what Flapjack is doing. So let's just think, okay, so in your monitoring system, what your checks are currently doing is saying,
can I see my app? Can I do some sort of TCP connect? And you're checking for a string or whatever. And let's think about that check that you're doing in a continuous integration lifecycle. Let's think about the tests that you've written for your code when you're developing and thinking about asking, can I see my app? It doesn't make any sense at all
that when you're developing the application, the only question that you're actually asking is can I see my app? Because yes, of course you can see your app, but it doesn't mean that it's functioning. It doesn't mean that you're making any money, right? The other thing to keep in mind is this is not new.
Other people have done this before. You can already do this with a bunch of different checks. If you're using checkX with checkY with checkZ, you can get the same sort of functionality. But the thing about cucumber-nagios is that it makes all of this reuse really, really trivial. So it means that instead of having to write the same checks again and again,
you can reuse an existing library of checks that other people have written. And this is great because it means that you're writing less code, which means that there will be fewer bugs. Fewer bugs mean fewer alerts, and fewer alerts at 3 a.m., which is obviously what we're all optimizing for. Right.
So this is a great quote that Bradley Taylor wrote. And obviously it's a bit of a jibe, but it's actually quite apt, right? cucumber-nagios is really about building bridges between sysadmins and developers, and increasing the collaboration between the two camps so that we can learn from each other.
So if we take another step back out from cucumber-nagios and we go to collectd as I finish up. So collectd is a lightweight statistics collection daemon with an emphasis on collection, sort of analogous to Ganglia if anybody was in the previous session. It's network aware,
which means that you can collect statistics locally and send them upstream someplace else. It has a plugin interface, and it also talks the Nagios plugin protocol. So that means that any of the statistics that you collect with collectd, you can poke at with collectd-nagios, which means that you can plug it very easily into your monitoring system. And there's a huge list of plugins available for it.
And this is expanding with every release. It's actually really, really cool. So if you're interested in any of these plugins, you should check them out on the collectd website. There is a bucket load of information there. So if we look at some example configuration very quickly, here we're having a collectd client. You can think of a collectd client as being like a Nagios agent
that you're running on a machine. So we're loading up a bunch of plugins, and most of these plugins don't actually need any configuration. And we're saying up here that we want to collect these statistics every 20 seconds. And then we have this network plugin, and we're saying that all statistics that we collect locally, we want to send up to this monitoring.mydomain.org, or you can do multicast stuff,
or you can specify IP addresses or whatever. So then on the server at monitoring.mydomain.org, we're saying we're collecting stats every 20 seconds, and we're using the network plugin and not as many of the other plugins. And we're saying up here that we're listening on this particular address, and all statistics that come in,
whoops, all statistics that come in, we're going to write them out using rrdtool to this particular directory here. And we're holding onto those statistics for 900 seconds before we flush them out to disk. And you can also use other things like rrdcached, which was mentioned in the last talk as well, if you have huge volumes of statistics that you want to log out to disk.
The other awesome thing about CollectD is that there are language bindings for the network protocol. So it means that within your applications, you can instrument statistics from within your web app or within your Tomcat app or whatever, and send them over the network to a running CollectD instance,
which is a great way if you need to instrument statistics within your applications without having to build all sorts of extra crazy stuff on top of it. So finally, going back to Flapjack, some stuff about what's happening in the next few months. So right now, Flapjack is distributed as a Ruby gem, which is really ghetto and inappropriate
for a system administration tool. There are a bunch of people, some of whom are here in the audience, who are building packages for different distributions, and to those of you who are here, I thank you. The other nice thing about Flapjack in the next few months will be implementing nice graphs in the admin interface. It will make it a lot easier to sell to your boss
or whoever when they've got nice, pretty stuff to click on So there's another project that I've been working on called Visage. And, whoops, apparently this link is broken, sorry.
Okay, here we go. So what Visage does is it renders the raw statistics, sorry, the RRDs that collectd writes out, and it renders them in the browser. And not just rendering them in the browser, but it means that everything that you see here on the screen is actually a DOM element,
so it means that you can do funky things like if I put my mouse over this particular thing here, I don't know whether you can see up the back, but sort of fading in and whatnot, that's sort of cute. And you can toggle them in and out, and all that sort of thing. And you can also do other things like that.
Sort of neat. Which, the other thing that's in Visage that I haven't publicly released yet is all this stuff is embeddable. So all these graphs that you see here on the Visage dashboard, there's some code that I've written that you click on this embed link, and it spits out a bunch of HTML that you just paste into a page, which is fantastic if you want to create, you know,
dashboards of all your different statistics that are floating around your system. The last thing is a job insertion API, which if you're interested in hacking on Flapjack, you should come and talk to me about it later. So, thank you very much for listening. Who here has questions?
Do we have any questions at all, or have I dazzled you all with my brilliance?
About Flapjack, whoops, sorry. Can we use it in production? I have an older version of it running in production. I've done a fairly heavy amount of brain surgery to it recently, so it's not really in a production-ready state. But that's certainly changing.
I'm hacking on it quite vigorously. Thank you. Any more questions? Does anybody want to see demos of stuff? I don't know.
No more questions? Oh yes, over there. You have a few components who talk Nagios, but where does it leave Nagios itself in the picture?
So the question was, where does it leave Nagios in the picture? And the answer is it leaves Nagios out of the picture. I see it as, like, a replacement for Nagios, right? And that's what I was aiming for it to be. Right now, you can think of Flapjack as being the infrastructure for building
a monitoring system, but as I'm rounding off the rough edges, eventually the aim is to be like the de facto standard for monitoring in the open source world. No more questions?
Okay, thank you very much.