
Log all the things!


Formal Metadata

Title: Log all the things!
Author: Honza Král
Part Number: 163
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract: Honza Král - Log all the things! Many times these logs are thrown away or just sit uselessly somewhere on disk. I would like to show you how you can make sense of all that data, how to collect and clean them, store them in a scalable fashion and, finally, explore and search across various systems. ----- Centralized logging (and the ELK stack) is proving itself to be a very useful tool in managing a production infrastructure. When combined with other data sources (application logging, business data, ...) it can provide even more insight. This talk is an introduction to the area with an overview of the motivation, tools and techniques that can prove useful. We will show how the open source ELK (Elasticsearch, Logstash and Kibana) stack can be used to implement this. It is geared towards people familiar with the DevOps concept who are looking to improve their lives by introducing smarter tools.
Transcript (English, auto-generated)
So, welcome to this talk about Elasticsearch, Logstash, and Kibana and the rest, of course. Today we have Honza Král. He's a Python developer and a Django core developer. So let's go.
Thank you. So good morning, everyone. So I'm here to talk about logging all the things. And while that seems somewhat obvious, I want to take it a step further and explore what logging is, what it can be, and what the important aspects are to keep in
mind when you're doing logging, especially centralized logging, and also the motivation for it. So, every good talk begins with a definition.
So what is a log? What do we mean by logs when we say it during this talk? Well, essentially, when you come to think about it, a log is any sort of message, any sort of document, any sort of piece of data that has a timestamp and doesn't change
after it's been created. So it can be applied to many more things than just logs as you usually think of them, as lines in a file. A Twitter feed is essentially also a log, any sort of stream of events as they happen, but so is something that happens in your organization on the business side.
All the invoices that you send out, all the money that you get back, all these transactions can also be considered logs and can be treated in very much the same way. You can actually use the same system to treat them, and I'm trying to convince you here that it would be beneficial.
And also metrics can be viewed as logs. So your CPU usage, your free memory, and all of this information that's traditionally stored next to the logs in some other separate system, but in reality it's kind of the same data, except that instead of a textual representation of what's going on, we have a numerical one.
But it's really the same thing, and you want to treat it the same way, and you want to work with it exactly the same way as you would with logs. So I will probably keep saying logs throughout this talk.
Just keep in mind that it applies to anything that has a timestamp and doesn't really change much once created. So that's what logs are, that's what logging means for us today. So why should we care? Why do we talk about these logs and metrics so much?
Well, currently, any company out there generates huge amounts of data. All the information that's going on, all the different events that are happening: an incoming request from a user on the load balancer, then on the web server, then on the server serving the static files, then on the database.
And this is just when we're talking about a simple website. Imagine that you have anything more complicated. Imagine, God forbid, that you have something like microservices, and you need to track all of the different services and all the different requests that are going around. That's a lot of data, and those are just the technical data.
You also have a lot of business data. How is your business doing? How is your traffic? What sort of thing is happening on the business side? So that's all the data, and a lot of the time it's just going to waste,
if it's recorded at all. Some of it you always have to track because your business depends on it, but a lot of it goes to waste when it could be used. But let's start from the simple questions. What happened last Tuesday, preferably at three in the morning?
If a customer comes to you and says, hey, I use your service, I really like it, but last Tuesday at three in the morning, I had this annoying thing happen to me, and it really bugs me. What do you tell them? How do you find out what's actually happening?
Well, the first approach is you typically try to grep it. You have some log file somewhere, you try to grep it, and that breaks down pretty quickly. That's fine on your local machines if you're looking for something that just happened,
but if you want to look on a production system, that's typically not that nice because you would have to go to multiple machines. Sure, we can all do SSH scripts and stuff like that. That's fine. We can go to multiple log files. Even that we can do with grep. It's getting a little hairy, but you can still do that.
What's harder to do is any sort of analysis or discovery because grep, you already have to know what you're looking for. If you don't know what you're looking for, there is no way that grep will help you. Lastly, the crucial part of the question is, Tuesday at 3am, who can grep all the different
log files for what happened at Tuesday 3am? I see no hands raised, and that's probably fairly accurate, because time is fun, especially when you're dealing with computers.
The nice thing about time and time formats is that everybody likes their own, and people don't really like to share, so we have just a sample here of some of the formats that you might see in very common log files.
Some of them are quite interesting. For example, Postfix just assumes that it will never run for more than a year. That's not really the confidence I'm looking for in my systems, but okay. You can see that they're all very different. Some of them are not even sortable.
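As a quick illustration of why this matters, here is a hedged, hypothetical Python sketch (the log line is made up): a syslog/Postfix-style timestamp omits the year and does not sort as text, while an ISO 8601 timestamp does, which is why centralized pipelines normalize to a single format.

```python
from datetime import datetime

# A syslog/Postfix-style timestamp: no year, not lexicographically sortable.
syslog_line = "Jul 19 03:00:01 mail postfix/smtpd[1234]: connect from unknown"
ts = datetime.strptime(syslog_line[:15], "%b %d %H:%M:%S")
print(ts)  # year defaults to 1900 -- we have to guess which year this was

# An ISO 8601 timestamp sorts correctly even as a plain string.
iso_lines = ["2016-07-19T03:00:01Z ...", "2016-07-18T23:59:59Z ..."]
print(sorted(iso_lines)[0])  # the earlier event really comes first
```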
How does this work if I want to grep for something that happened Tuesday 3am? The obvious answer is you cannot. We need something else instead of grep. So what's it going to be? So let's see what we need to do. But also, we want to be able to do more things than just look at the individual log
files or individual events from individual sources. We want to be able to correlate different events, and that's why I gave you such a huge spiel about what log actually is and what it can be.
Because only once you get logs or data from multiple sources do you really get to see some interesting stuff. For example, if you compare the logs from your load balancer and from your web server, and just look at the raw numbers, you can immediately deduce certain things and certain behaviors.
For example, if your traffic on the load balancer is going way up, and the web server traffic is still going steady, that's probably a good thing. That probably means that you have some sort of caching on the load balancer and it just works. But if you see them rising together, that means that the caching that you have in place doesn't work.
And that's something that's nearly impossible to discover without having both of these systems together in one system where you can compare these numbers. The same goes for web server versus database. You also want to have some sort of caching there; you definitely don't want to scale linearly,
where the more requests you serve, the more database queries you make. That doesn't scale really well. So you need to be on the lookout for these kinds of patterns. And this is, again, something very difficult to discover. Also, what happens when you see a rise in errors on your web server?
Does it maybe correlate with a new deploy? Or with a new employee getting on board? Or a new client? Or something like that. But you don't have to stop there; you can go into the more business-y sort
of thing. So we bought these ads, we bought this traffic from someone. Do we really have something to show for it? Sure, for a lot of these things you can go to external services. But external services are external, they don't know your system. So they might be difficult to tie in to the rest of your infrastructure.
So this is sort of everything that we want. We want to be able to look at what happened at Tuesday 3AM and we want to be able to answer all of these questions to do the correlations. So how will it look? What's the ideal state of this proposed system?
So we need a central storage. We need something that can handle the different data that's coming from different sources, that can handle the amount of data. We also need the data to be enriched. We don't just want the raw data, the raw text file from the log.
That's not interesting. We want to parse it and also we want to do some enriching. So for example, if we stick with the example of running a web server, we have a URL in there. We want to map the URL to the article and the author who wrote it or the product in our eShop and the category of that product.
Because once we get that, we can immediately see much more information in our data. The same as when we have the client IP or the user agent. We might want to see which country did they come from. And also additional stuff like we see some cookie in there.
That's cool, but was that user logged in or not? What was the username? All of this information. And once we have this information, of course we want to be able to search on it, to filter on it, to get the results back. So if you know you have an annoying user called Honza and he bugs you, hey, I cannot
find anything on your website. You can just easily do a search, say hey, from this user, did I see any 404s? Maybe there is something wrong with this browser or something. So this is what we want. We also want to be able to analyze all of it, so not just look at individual records,
but look at patterns, visualize the data, and be able to discover some interesting stuff. So essentially what we've designed here, or what our wish list equals to, is centralized logging.
That's a technical term for the system, and it consists of several steps, and those steps will not be surprising at all to you right now. So we need to collect the data. We need to parse the data in case that they are in a textual format. So we need to extract the different fields that are otherwise hidden in the text.
We need to create a structure from the text. Then we need to do the enriching step, so do the GeoIP lookup on the IP address and other stuff. We obviously need to store the data somewhere that's capable of doing the search and aggregations,
and finally, and most importantly, we need to visualize the data, because we as humans, we are pattern recognition machines. It's very easy for us to spot an anomaly in a pattern. It's very hard for a computer to do so. You would have to instruct the computer specifically what to look for, or you would have to have
a very, very smart computer, and smart computers are expensive, especially in time. So how can we accomplish that using the Elastic Stack? So Elastic is the company I work for. We produce all of these things to do all of that, and don't worry, this is not a
sales pitch, everything is open source. So this is how it maps. So in the center of everything, to store and doing the search and analysis, we have Elasticsearch, which is the data store that can handle this amount of traffic.
For visualizations, we have Kibana, we'll see pretty, pretty screen shots later. And for the collection parsing, we have two products, we have Beats and we have Logstash, and they are a little bit different, whereas Beats is more like a lightweight agent that will sit on your machines, collect the data and send them somewhere, either for further
processing into Logstash or directly into Elasticsearch. Logstash is more heavyweight, it has many more options, but it's also much heavier to run. Just to demonstrate what I mean by that, Beats is a small agent written in Go, it's a single statically compiled binary that you can just upload somewhere and it will work, whereas
Logstash runs on JRuby, so it's written in Ruby and runs under the JVM; I'm pretty sure that's fairly popular in this crowd. For more sophisticated setups, or if you really need more from your system, this is typically how
it would look, that you would use Beats to actually collect the data and then Logstash for doing the parsing and enriching, because that's what this is all about. So this is the overview, so now let's get into it. So the first step in the process is Beats. And Beats is sort of just a family of products, there are several different Beats and most
importantly you can create your own Beats. Beats is written in Go, we even have a Beat generator so you can just run a command that will create all the scaffolding, all the boilerplate code for you and you just have to essentially write the one function that actually collects the data.
And we have several different types of Beats out of the box, so let's see some examples. The first one that we have here is MetricBeat, MetricBeat is something that regularly does something and collects some data, it has different modules. This is an example working configuration where we want to monitor Redis, every one
second we want to essentially capture the info from host 1. And we also have one for Apache where every 30 seconds we want to do the same thing. And then Beats have these modules for Redis and Apache, it knows how to go in and fetch
the information. We also have a FileBeat, which is essentially just yeah here is a log file, just keep tailing it and optionally say if you see a line that doesn't begin actually with a hash, just merge it with the line before, so that probably means that there is a stack
tracer or something that spans multiple lines, so we can group them together already on the Beats level when we are first collecting the data. Because doing it later is a hard problem when you have data coming from multiple sources, how do you identify which actually belong together in a simple message.
And then my favorite Beat is PacketBeat. All you have to do with PacketBeat is you say I have this protocol running on this port and then PacketBeat will just keep monitoring the network and logging what's going on in there. And because it understands the protocol it can give you more information.
For example it understands the Postgres protocol, so it can tell you: this is a select, this is a transaction, this is a select going to this table, and log all of that information in a structured manner. And finally, these are all the inputs that you can have, and then you have
an output. The output is either Elasticsearch or a file or standard out, or in this example it's Logstash, so we'll just take it and send it to Logstash for further processing. It's a custom TCP protocol to get it into Logstash so we can do some more stuff.
So let's follow our data and go to Logstash ourselves. So what Logstash is, it's a data ingestion pipeline. There are inputs, then there's a bunch of filters and then there are some outputs. It's really just that, it's really a pipeline. So what are the different options?
There are many many different inputs. The most interesting ones, at least for me, is all the different queues that we have. Redis, Kafka, RabbitMQ, ZeroMQ, all of the different ways how you can get data from a queue. Also how you can get data just from the network. You can just open a TCP socket and listen to whatever comes in or there is specialized
one like the Beats input that's pretty obvious or even syslog or log4j. You can even just go to S3 or SQS or some other systems. So many many different types of inputs, how you can get the data and ingest them into
the pipeline. Then the meat of it is all the different filters that you can apply to your data. This is just a small sampling of the filters that there are and highlighted are the ones
that again I personally consider more interesting. For example, anonymize. If you consume some data that can potentially contain some sensitive information like email addresses and stuff that you don't necessarily want to expose to everyone in your company but you want them to have the ability to inspect the logs, you can just anonymize
everything, which will go through a one-way hash. So all of the same emails will correlate to the same hash, but nobody will be the wiser about which actual user this is. I've talked about the GeoIP filter: it will just take an IP address and give you back the country, the city, where the user came from, so you can visualize very nicely
on a map where your traffic is coming from. And you can even know, for example, which users around the world have the best experience, the best latency or the worst. Grok is if you want to parse text, JSON is kind of obvious if you have data in JSON,
so just parse it as JSON. User agent is if you've ever seen a user agent string in your log files, it's a nightmare to make sense of even for people. So the user agent filter will actually parse that into structured information: this is Chrome version 7.72 and it's running on Windows.
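To make the anonymize idea concrete, here is a minimal conceptual sketch in Python (not the actual Logstash filter, and the field names and salt are made up): the sensitive value goes through a one-way hash, so identical values still correlate while the original email stays hidden.

```python
import hashlib

def anonymize(value: str, salt: str = "some-secret-salt") -> str:
    """One-way hash a sensitive field; equal inputs map to equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

event = {"user_email": "honza@example.com", "status": 404}
event["user_email"] = anonymize(event["user_email"])
# Same email -> same hash, so you can still group and count by user
# without anyone being able to read the actual address.
print(event)
```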
And then Logstash has, again, a number of different outputs. The crucial one is probably Elasticsearch, but there are many others: you can just write it to a different queue to be processed by another system, you can write it to a completely
different storage. If you're so inclined, you can even write it to MySQL or something like that. I don't know why you would do that, but you can. And it actually might make sense for some of the data, because what you can do with Logstash quite easily is, say, put all the data in Elasticsearch and if you see some
critical error, send that to me over email and if it's really, really critical, just ping pager duty and have my pager go off so I can jump on it right away, so we can have multiple different outputs with filters, so you can be alerted immediately what's going on in real-time.
So that's Logstash, it's really not that hard in concept: you have inputs, you have a number of filters and you have some outputs. The only interesting part is that you can have multiple outputs and obviously multiple filters and multiple inputs.
And then the data gets into Elasticsearch. So what is Elasticsearch? Again, just a super high-level overview: it's a distributed search and analytics engine. It's open source, it's document-based; by document we mean everything that
you can express as JSON, we can index and we can search on and analyze. It is based on Apache Lucene which is sort of the library that does all of the heavy lifting and it is very friendly. It speaks JSON over HTTP. Then there are obviously clients in any of your favorite languages.
My guess is that your favorite language would be Python, so we do have a Python client for Elasticsearch that you can use. And the nice part about Elasticsearch is that it is distributed and it has some qualities that make it very well suited for the logging use case.
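As a minimal sketch of what that looks like with the official elasticsearch-py client (the exact call signature differs a bit between client versions, and the index name and fields here are made up for illustration): index one JSON event, then search across all log indices.

```python
from datetime import datetime
from elasticsearch import Elasticsearch

# Connect to a local node; any node in the cluster can answer requests.
es = Elasticsearch(["http://localhost:9200"])

# Index one log event into a daily index (a common convention for log data).
doc = {
    "@timestamp": datetime.utcnow().isoformat(),
    "level": "ERROR",
    "user": "honza",
    "status": 404,
    "message": "page not found",
}
es.index(index="logs-2016.07.20", body=doc)

# Search across all daily log indices for 404s from that user.
result = es.search(
    index="logs-*",
    body={"query": {"bool": {"must": [
        {"term": {"user": "honza"}},
        {"term": {"status": 404}},
    ]}}},
)
print(result["hits"]["total"])
```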
So how does it look inside of Elasticsearch? Again, at the highest level of overview: Elasticsearch is a clustered solution, so you have a number of nodes that work together. From the outside it's completely transparent, you don't really care what's happening
inside and what's in what node. As a client, you can always talk to any of the nodes in your cluster and they will all answer the same questions in the same way. So you don't have to worry about any of this, but it's nice to know how it works so that you can reason about what your expectations can be.
In the cluster, the data is stored in indices and each index is essentially a collection of shards. So what we do is we say we have this index, which is just a logical grouping of documents, and we'll split it five ways, so we'll split it into five shards.
And each of these little shards we'll store twice, you know, in case we lose one node so we can still keep going on. And these shards are actually the unit of scale of Elasticsearch. When we have the cluster, so in this case we have two indices, one with four shards and one replica each, so we have two copies of each shard, and one with only two shards
and no replicas. We don't care really about that index that much. And those shards are what lives actually on the nodes where the cluster keeps rearranging, so if I were to add one more node, the cluster will say, oh, I have a free node
and it will move some of the shards over to that new node. You will have a primary and a replica, which is just a logical difference, it really doesn't matter that shards look exactly the same, they do exactly the same amount of work. So again, something that typically you don't have to worry about.
But what this means is a very important thing. When you search through the orders index, in this case, you will have to go out to search for shards. And that's okay, we can absolutely do it, we can even stand.
And that means that it is the exact same operation if I want to search for shards no matter where they come from. So they can be inside one index or inside four indices. And the only thing that really matters is the number of shards. And this allows for some interesting, interesting things where we can create a new index every
day with any number of shards. Typically you would start with one shard when you're starting the system and you would grow the number of shards as the one shard becomes not enough.
And then when you search, you just search over as many indices as you need data for. So if you want data for the last seven weeks, you just search over the last seven indices. And this also means that you can treat the indices differently. So for the current index, for the index for today, you will have more replicas and
you will have it on those nodes that live on stronger boxes, the boxes with SSD drives and everything because those indices are doing the most work, they're actually actively indexing new data. And as the data gets older, so a week-old index, so you will just back it up, you will
do a snapshot with Elasticsearch, you will store it on S3 or something like that, and you will remove the replicas, which means that at this point if you lose some node, you will lose some data. But that's okay, you have a backup, and also this data is not that important, it's a week-old, it's okay to save a little money sometimes.
Then a month-old data, you might want to move to weaker boxes, so boxes with just huge spinning disks that everything will live on, then you can even close the indices, so they will still live on the disk, they will not be in memory, they will not be available for search, but you can make them available for search very easily just by opening them.
And finally, after some amount of time, you can just delete the data. So you have a very clear plan how to degrade your data and make them use less resources
even while keeping them. Sure, it will mean that if you search older data, it will be slower, but that's okay. 90% of your users will probably just want to search today, or yesterday, or typically actually just the last hour; they just want to see the dashboard for the last hour
that they can put on the wall and have it always there, auto-refreshing every minute. So that's one nice feature of Elasticsearch that's very relevant to the logging use case, how you can make use of it.
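A small sketch of that time-based pattern (the index prefix and retention numbers are just illustrative): generate one index name per day and query only the indices that cover the window you care about.

```python
from datetime import date, timedelta

def daily_indices(days, prefix="logs-", today=None):
    """Index names for the last `days` days, newest first, e.g. logs-2016.07.20."""
    today = today or date.today()
    return [
        (today - timedelta(days=i)).strftime(prefix + "%Y.%m.%d")
        for i in range(days)
    ]

# Search only the last 7 daily indices instead of everything ever collected.
indices = daily_indices(7)
# es.search(index=",".join(indices), body={...})
print(indices)
```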
And speaking of dashboards, the last part of the Elastic Stack is Kibana. Kibana is a small JavaScript application that provides visualizations for your data in Elasticsearch. It doesn't have to be log data, but log data is how Kibana started, and that's really where it shines.
And you can see immediately here what I talked about, you can immediately see a gap here in the data. And you can see it, because again, you're human, I assume. So that's why visualizations are important. And this lower one, you have, so this is split by country, for each country we split
again the users of our website, whether they're authenticated or not, and for each of these two groups, we ask what browser they're using. And immediately you can see very different things for different countries. So we have China here, where we have mostly authenticated users and some not
authenticated users, and in the end we have, I don't know what country, but nobody there is logged in. And you can see that immediately, because it just pops out, because, you know, again, the human thing and all that.
You can, if you use the GeoIP filter in Logstash, see where your users are coming from, just by clicking on a map. And you're not limited to just pretty pictures, you can actually drill down to the individual records and you can do search. So in this case, I'm looking for responses that went to IE6 and are 400 to 600 kilobytes
in size, and I can see the individual records, I can see the individual URLs, you can see that we are using the data from US government, they actually publish this dataset publicly.
So you can drill down, you can click into it, and you can see all the different values. So putting these things all together, this is how it looks logically. You collect the data with beats, you send them to Logstash to enrich them, you store them in Elasticsearch, and you visualize them using Kibana.
This is sort of the ultimate thing; well, the ultimate architecture would be that instead of each arrow you would probably have a queue, like a Kafka between Beats and Logstash, and between Logstash and another Logstash that would then put it into Elasticsearch, but that's only once
we're talking hundreds and thousands of requests per second, like 100,000 requests per second or millions per second. If you only care about thousands per second, you can just do it directly like that and you will be perfectly fine. If you need more capacity, you just add more machines, more nodes at each level,
you can have more than one Logstash, obviously, and more than one Elasticsearch. You should have more than one Elasticsearch to get any sort of high availability and resiliency. So this is how it works, it's really not that difficult to set up, you can just start with everything on one machine.
When you want to start, I recommend you just use beats and Elasticsearch alone, no Logstash, it will just work, and only when you discover more things than you need, like doing the enriching, etc., you can introduce Logstash in the middle, will be minimal change in your configuration, and you can sort of grow from there.
First thing that you will do is probably separate Elasticsearch and Logstash on separate machines, and then sort of keep growing from there. So how does Python come into it? What are the concerns when you're logging from Python? So first important one is to enhance your logs.
Don't just log while this happened, but also tell, for example, how long did it take? How many queries to the database did it take? Or how long did they take? Also include some sort of metadata, so who is the user who requested this? What is the page that we are currently on, again, speaking about the web example.
And ideally, log it just as JSON, because if you log it as text, then you will have to parse it later. So you're both serializing it into text, and then parsing it back out from the text. Both of these things are pretty error-prone, and they take a lot of CPU.
And no human is going to look at the individual message. We will be looking at it through Kibana. We care about the individual fields, not about one textual representation that includes all of that. So log as JSON.
And the way to do it is there is a Python package called structlog that's actually created by Hynek. He's somewhere around, I believe he's now giving a talk at one of the other tracks. And what structlog enables you to do is exactly that. So add structured info to your logging. Add qualified fields with names and values.
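A minimal structlog sketch of that idea (field names like request_id are just examples): every event is rendered as one JSON document with named fields, and bound context is repeated on every event, which is what makes tracking across services possible, as described next.

```python
import structlog

# Render every event as a single JSON line with explicit, named fields.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# Bind context once; it is repeated on every subsequent event,
# so the same request can be followed across log lines and services.
log = log.bind(request_id="abc-123", user="honza")

log.info("request_started", path="/articles/42")
log.info("request_finished", status=200, duration_ms=87, db_queries=3)
# -> one JSON document per event, e.g.
# {"request_id": "abc-123", "user": "honza", "event": "request_finished", ...}
```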
With that, you can track the info through the services. So if you, for example, have your load balancer, you can attach a session ID as an HTTP header, and then track it. Even if it has to go to two different web servers, you can, in the end, put them
back together and track the one request through your different systems. You can add a little comment to each one of your SQL queries, again, to match it back to the request that started it. So you can sort of track the one user action on your front end to everything that happens
on your back end. Then ideally, you want to log that into a file. You could send it directly to a queue or to Beats or to Logstash or something like that, but at that point, what happens if your logging infrastructure goes down or if you want
to upgrade it? The worst case scenario here is that it will actually impact your production, your application. That's not really very acceptable. So what you want to do is sort of use some sort of buffer, and the easiest buffer that you can find that's most universally supported, well, that's a file.
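A hedged sketch of that file-as-buffer setup using only the standard library (the path and field values are made up): the application appends one JSON document per line to a local file, and Filebeat, configured separately, tails that file and ships it onward.

```python
import json
import logging
from logging.handlers import WatchedFileHandler

# Append one JSON document per line to a local file; Filebeat (configured
# separately) tails this file and ships it, so the application never blocks
# on the logging infrastructure being up.
handler = WatchedFileHandler("/var/log/myapp/app.json")
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("myapp")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(json.dumps({
    "@timestamp": "2016-07-20T03:00:00Z",
    "event": "order_created",
    "user": "honza",
    "duration_ms": 42,
}))
```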
So just log it into a file, and then you can have Filebeat sitting there, listening to it and sending it either directly to Elasticsearch or to Logstash for further processing, and you will be perfectly fine. If your logging system goes down because you're just playing with it and you're still
not committed, that's fine. Your application will still run. You will not lose any data. You can backfill them later, and it will give you a lot more flexibility. So I think that this is okay for our overview of what's possible, why you should
do it, and what are the key concepts that you should keep in mind when designing a system like this. And now we have some time for questions.
Can we get the mic? Yeah, it should work. Okay, so any questions?
Is there an open source solution for user authentication in Elasticsearch? Currently, no. But Elasticsearch speaks HTTP, so what you can do is stick an nginx in front of it and do HTTP auth and SSL on it.
It's very difficult to do different levels of access. It's possible, but there might always be the weird corner cases. But that will get you 80% there, and it's very easy to set up. If you need more than that, unfortunately, currently, you have to pay us money.
But we do offer commercial plugins for Elasticsearch, and the one doing security is one of them. Hi, great talk. What's your suggestion for an environment where we are installing a full stack
of products at the client side, and we don't have access, so everything is there? Sometimes we can just pull the log files if we ask clients. So what's your suggestion for this scenario to effectively get the logs stored in Elasticsearch?
So there are two different ways how to do this. One is that you install Elasticsearch and Kibana at the client with every installation, but that's probably only worth it if the client will get some value out of it as well. If that's not the case, just create a pack of the logs, ship them over, and then have
your own stack that will be configured to actually get that pack, read through all the logs, run it through the entire pipeline, get it into Elasticsearch, and then visualize it. And at that point, it's only up to you whether you will just create a temporary one, like on AWS, just for each one of these packs that you received, or you will have
one big one, and you will get the data from all the clients. Make sense? Cool. Okay, another question? We have one here.
So you mentioned that Beats understands different protocols, and we can configure it to listen to a TCP port and send logs down the pipe. Let's say I have my services running in Docker. How do Beats play with them?
So with Docker, there are several ways how to do it. Docker listens on the network interface, so the easiest way for this particular Beat, which is Packetbeat, is to just run it inside the Docker container with the application that you're trying to monitor.
Or you can run it in a separate container where you configure the networking so that it will be able to listen to that traffic. Alternatively, you can skip Packetbeat and just collect things directly using Metricbeat, which can live in its own container and just keep pinging the other services.
Alternatively, Docker has its own logging functionality that you can then feed into Logstash. So you can use Docker to collect all the logs from all your containers, aggregate them together and send them to Logstash for processing and for loading into Elasticsearch.
So there are many different approaches, it depends exactly on what you're trying to do. Okay, thank you. Sure. Another question? Yeah, two. Okay. We have two minutes. Can you? It's not really a question, but a pet peeve of mine is that people log what they're doing
and I can see from other things what you're doing; please log why you're doing things. If something is secret sauce, I don't care, but for the things that you do want to log, if you are able to log why you're doing stuff, please log the why. You heard the man.
About this flow of log messages through time, are there some hooks in Elasticsearch for that, like after one week do this with data or do you have to just write scripts?
Thank you for that question. That's what I forgot. Yes, there is a tool, it's called Curator, it's written in Python actually, and it allows you to do just this. And also in the new version of Elasticsearch, Elasticsearch 5, which will hopefully come later this year, it's already built in, so it's an API inside Elasticsearch.
So Elasticsearch Curator: if you just search for it or pip install Curator, that's the tool. It has a command line interface, so you just stick it into your cron and periodically run Curator: everything older than five days, remove it, or any other actions that you might have. Okay, another question.