Infrastructure Monitoring with Postgres
Formal Metadata

Title: Infrastructure Monitoring with Postgres
Number of Parts: 611
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/42110 (DOI)
Production Year: 2017
Transcript: English (auto-generated)
00:00
The next speaker is Stephen Simpson who will talk to us about infrastructure monitoring. Excellent, thank you very much. So a lot of you probably don't know me as it's been pointed out in the last couple of days. I haven't been particularly active in the Postgres community but I'm looking to change that.
00:25
So this talk is going to be a bit of a mix of everything here. It's not really going to delve into the details of Postgres very much but more how we build something on top of Postgres. So I think it should appeal to people at most levels with Postgres.
00:41
So we're going to talk about monitoring so I'll give you a quick background on specifically what I'm trying to do. Then we'll go in depth a bit on how to use Postgres for metrics. And there's another section on using Postgres for a few other things which you might not ordinarily think to use it for.
01:07
So I'm actually a software developer, I'm not a DBA so I apologise if my SQL and my procedures are a bit shaky. I'll do my best. So I primarily do C and C++ and Python, that kind of thing.
01:20
I've also dabbled with Perl in the past but kind of learnt better after a while. Usually systems level kind of things. I'm based in Bristol in the UK; I don't think I've met anyone else from Bristol here. It's primarily famous for this bridge, it's quite pretty. It's probably more famous for its cider.
01:42
So if you ever go to Bristol you'll see in pretty much every pub more varieties of cider than you can ever imagine. I don't think there's many places else in the world which are like that. So I've done a few bizarre things. I've worked for a couple of start-ups. One building 10 gigabit ethernet switches and then kind of thought hmm, maybe I'll do something completely different.
02:08
Worked for a database vendor for quite a long time, on a product based on the Postgres core. So I spent about five years working with Postgres but not on Postgres specifically. And that database was sort of geared towards big data, analytics, that sort of thing.
02:26
All the buzzwords you can think of, we had them. So I now work for a company called StackHPC and we're a consultancy for HPC on OpenStack. We primarily deal with complex kind of infrastructure and this is kind of where this talk came from because you kind of need to monitor it.
02:48
I'll give you a quick background on OpenStack because there is a booth over in building K. There's some guys you can go and talk to about it if you like. But it's a, the buzzword kind of title for it is cloud orchestration platform.
03:05
But what it really is is it's a bit like AWS but you can do it yourself. That's probably the shortest way to put it. I can talk to you more about that later if you want to come and find me. The thing with OpenStack is it's quite complicated. So I like to think of it as a complex distributed application to run your complex distributed applications.
03:28
So it's quite subjective I think. You hear differing opinions on it. It's a very useful tool and it's gained quite a lot of traction in the past few years. So operational visibility of this is critical because there's a lot of things to go
03:42
wrong and you need to make sure that you notice when it does go wrong. So monitoring. In the context we're thinking of monitoring is this kind of thing. So you've got a load of servers, you've got a website, you've got a database like PostgreSQL.
04:03
By the way everything in this talk you could apply to using PostgreSQL to monitor PostgreSQL. That's fine. In fact I think it should be encouraged. So it could be a cluster of systems, could be a disk drive, anything. We do all these graphs and we make alerts and it emails us when it all goes horribly wrong.
04:24
And generally this tends to be quite different for everyone. Everyone has their own idea of what they want from a system. A few things that we kind of think are quite important. Fault finding and alerting. So you need to be told when something breaks, when something goes down.
04:42
When you need to replace a disk. Fault post-mortem and preemption. So once something's gone wrong you want to prevent that from happening again. And ideally you want as much information as you possibly can about everything that was going on around the fault. So you can work out what caused it and make sure it doesn't happen next time.
05:08
You can also use monitoring to measure utilisation and efficiency of your infrastructure. It's kind of amazing how many people, if you ask them, have got a huge cluster of servers,
05:20
they've spent tens of millions on these servers, and quite often they don't really know how much it's being used. It might actually only be used for a few days a week or a few hours of a day. And it's just sat there wasting electricity the rest of the time. As soon as you kind of graph these things it becomes immediately clear. And you can even take it a step further and use some of these techniques for performance monitoring and profiling.
05:47
So how fast are my database requests? How fast are my web requests? A good example of this in Postgres would be with statement logging. So you could actually monitor the latency of all your database requests.
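For Postgres itself, a minimal sketch of that kind of statement logging, using an illustrative 250 ms threshold:

    -- log any statement slower than 250 ms (threshold picked purely as an example),
    -- then reload the configuration without a restart
    ALTER SYSTEM SET log_min_duration_statement = '250ms';
    SELECT pg_reload_conf();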
06:02
You can take this a little bit further and go into the realms of application profiling as well. And start tracking things like HTTP requests. In OpenStack specifically, every OpenStack API request is linked with a unique request identifier. So you can actually trace all of the HTTP requests around the system.
06:24
In a sort of different direction, you can also use it for auditing, security. So especially things like tracking log files or tracking network requests or SSH logins, things like that.
06:43
That's all things that we kind of want to do. And at a higher level than that even, decision making. So planning what your next system is going to look like based on how much your old system is being used. How well it worked or how well it didn't work.
07:01
It's kind of the manager's dashboard. Dollars per hour, that kind of thing. So, a little bit of background on existing tools you might use for this sort of thing. Before the first category of tool, it might be worth pointing out that I mostly work with open source software. I don't tend to buy a lot of proprietary software, so I'm sure there are
07:20
plenty of proprietary monitoring tools which I don't know about and will never mention in this talk. So I'm sorry if one of them happens to be your favourite. But in the open source world, one category of tools you tend to find are ones that do checking for you. So they'll ping machines, make sure they're up. Make sure that your web server is servicing requests.
07:44
And these systems usually give you a dashboard and they'll quite often store history of the events as well. And a classic example of this is Nagios or Icinga. Someone told me the other day there's another fork as well. A whole category of things.
08:03
And a lot of people tend to write their own as well. Bash scripts like this, I've seen plenty of as well. And sometimes that's all you need, at least you know it kind of works. The issue with these kind of systems is it will tell you when something goes wrong, but it won't necessarily tell you why it's gone wrong.
08:26
What you really want to know is all the information about what happened around that kind of event. So post-mortem analysis is usually the only option. So you go digging around. The system was in a state at the time it went wrong.
08:41
It died. Unfortunately it's now restarted and it's in a fresh new state and you've got to trace back to try and work out what happened. So something else that's gaining popularity at the moment, you're probably aware of this. Kibana, Elasticsearch and Logstash.
09:00
They've done a great job of advertising this as the ELK stack. And this is a way of centralizing all your log files together, which is hugely useful. I mean this seems to be what people like using at the moment, but for a long time rsyslog or the systems before that were all very capable of shipping all of your logs to a central server and grepping them however you like.
09:26
But these systems are quite useful, in that you can do full text searching on your log files and see where your connections happen and how often your log messages are firing. So if you see this big peak in errors at some point in the middle of the night, then maybe a batch job's gone crazy or something like that.
09:46
The Logstash component of it is useful for transforming your data. So a classic example is you've got web requests, you've got the IP address, and you use a Logstash plug-in to determine the geolocation of that request based on the IP.
10:01
And then people draw pretty graphs of the world and they find most of their users are from North America, because most people are in North America. So the next set of systems are these metrics-based systems. So we collect metrics like CPU percentage or disk space over time.
10:26
And what these let us do is these let us get some insight into what happened before your system fell over. The system falls over and just before it fell over, disk space was rising. That's a good sign of what might have gone wrong there or memory rising slowly and things like that.
10:45
So these are becoming very popular, very, very, very popular, as you'll see in a minute. The tool for this in at least the HPC world and a lot of other worlds has been Ganglia. And this kind of gives you these nice little graphs. And it usually comprises a collector component that runs on all your servers with a
11:03
database and an aggregator, which gives you a front end and pulls all the data together. So you get graphs like CPU usage over my entire cluster. And this one's actually Wikipedia's. So you can actually go there and look at all of Wikimedia's server stats, which I had great fun with the other day.
11:22
Just digging around, seeing what's going on there. Although, I wrote this slide quite a few weeks ago, and when I checked it yesterday a big banner came across the top saying Ganglia was deprecated. So make of that what you will. And the thing that they link you to is actually, interestingly, the next thing on my slide.
11:43
And that's Grafana. So this is, I guess you could say, the hipster's tool of choice. It's actually a really good tool. I like it a lot. It's very Web 4.0 or whatever we're at now, very clicky and draggy and really it's quite nice to use.
12:02
I don't mean to dog on it too much, but it does look very hipstery. But don't let that put you off. And it does the same thing. It draws graphs for you, CPU usage, queue depth, whatever you like. But this is only a front end. This doesn't actually store any data for you and it doesn't collect any data for you.
12:24
It only draws graphs. It connects to a service which gives the numbers back to it. And this is kind of where it gets interesting, in that there's a lot of choice of what you can use for this.
12:41
I've listed, I think, about 20 time series metrics databases there, and there's more I haven't listed here. And they're not insignificant projects either. Notice at the bottom there is one that actually uses PostgreSQL to store its data, which I thought was worth mentioning.
13:05
So that kind of makes the rest of the talk slightly invalid, but we'll carry on anyway, seeing as I've already written it. So the interesting thing about all of these databases, they're not insignificant projects. So a lot of them have backing from quite well known companies or quite significant companies.
13:22
Some of them come from companies. Some of them are built by individuals. Some of them are built by companies with investment funding, or by people trying to make startup companies out of them. They've got money and funding behind them.
13:42
So I kind of thought, well, when did all this start happening? I'm sure some of you might see where this is going. So I drew this kind of chart of when all these databases started appearing in the world. And bear in mind, these are only open source ones as well, because that's kind of the field I'm primarily in.
14:01
So, Ganglia has been around a long time, been able to draw your graphs for a long time. Originally started in 2000 at the University of California, I think. And then Graphite arrived in 2010, which is kind of a bit of a replacement for the internals of Ganglia,
14:21
because some people thought RRDtool was getting a bit long in the tooth or maybe they just wanted something fun to do. And then sort of in the 2013 to 2015 timeframe, just an explosion of projects doing this time series monitoring kind of stuff.
14:40
It's kind of interesting: when we think about NoSQL we tend to think about the document stores and things like that. But it seems like time series has also been quite a big component of this, sort of became trendy again. There's even, I think, a new one that was announced just last year, 2016, one at the bottom there.
15:03
So the system, going to talk now about the system we're primarily working on. And this system is geared around OpenStack, but it kind of encompasses what a lot of people are building for their monitoring infrastructure now. They want the alerting, they want metrics, they want logs, which is all the information we want.
15:21
We want it central, we want it pretty. So we start off with our servers, our software, storage, network, all of our bits that we want to monitor. And we've got all these metrics and all these logs coming out of all of this. We need to store it and present it in some way. So the project that we're working on in the OpenStack world is called Monasca.
15:44
And it's a set of APIs for letting you consume metrics and access metrics. And it kind of integrates with OpenStack and all the multi-tenancy that OpenStack gives you. So each tenant has its own set of metrics, its own set of logs and all this kind of stuff.
16:01
So it's quite nice. It's actually quite a nice system. As you'll find out, there's a lot to it. So the first bit of it, it has a MySQL database to store configuration and alerts and things like that. Like most of OpenStack, there's a database behind a lot of the components.
16:21
So it then has a time series database of some sort for the metrics. The common one, the popular one is InfluxDB at the moment. It also supports a few other ones. I think it supports Cassandra. They tried supporting Cassandra and there's a Vertica backend, things like that. Then that feeds Grafana, which has its own internal SQLite database.
16:45
It stores dashboards and things like that. The logs are all stored in Elastic via Logstash and Kibana sits at the front, the UI for that.
17:01
Then someone put Kafka in the middle because we need to handle peak loads, so we need a queue for that. So everything goes to Kafka, goes through Logstash, back into Kafka, into Elastic, into Grafana, into InfluxDB. And of course, because you've got Kafka and Elastic in there, you need a zookeeper as well.
17:23
Another three nodes for that. So you can kind of see where I'm going with this. That's six persistent data stores to monitor your infrastructure. This is quite a common system. A lot of people are doing this outside of OpenStack.
17:42
The Monasca project really encompasses what people want to use at the moment. Having this many databases, each with their own HA protocols, each with their own quirks, persistence layers, each of them you have to back up if you're doing it properly.
18:07
It becomes a lot of overhead there. So it's a commendable right tool for the job attitude, but why not Postgres? Postgres can store data.
18:22
So if we just used Postgres to have fewer points of failure, fewer places to back up, fewer redundancy protocols, a more consistent set of data semantics, and you can reuse your existing knowledge of Postgres as well. So this seems like a good idea.
18:41
So if we look at these components again, well, as it turns out, the Monasca team have already ported this to Postgres. So you can already use Postgres as a backend for this. As with a lot of OpenStack, you can use Postgres with it. Grafana actually also supports Postgres, which is nice, and MySQL, but we don't care about that.
19:03
Why not replace the time series database with Postgres? Postgres can do time series, it's a relational database, it's not a new concept. Why not put the logs in there as well? We've got full text search. We lose Kibana, because Kibana and Elasticsearch are quite tightly coupled, but Grafana is able to display logs quite nicely.
19:25
If we don't have a huge system, there's a possibility we could just shove the data into Postgres as fast as we can, without doing any of the processing and use that as our buffering mechanism.
19:41
And we don't need that. No Elastic, no Kafka. And while we're there, why don't we use Postgres's text processing to get rid of that as well? Satisfying, isn't it?
20:00
So some of that you're probably thinking, oh, some of that's a bit of a push. And you're right. So I'm going to sort of go a bit deeper now into the metrics part, because that seems to be what people are most interested in at the moment. So how would we store our metrics in Postgres? So the system we're building is quite modest.
20:21
It's an 18-node cluster. We take 200-ish metrics every 30 seconds. So quite limited. We want a six-month history. The server we've been given currently to do this on has a terabyte of disk space. So that's all we've got. And we want queries to be fast, because this is all user-facing.
20:42
So you want to click it, and it comes up without noticeable delay. So we want our queries to be less than 100 milliseconds. So with the time series data, we kind of have two categories of query. We have one query, which is, get me all the measurements for this series. So get me all the measurements for a particular host, something like that.
21:04
And then the other kind of category of query is, I want the average CPU load of all my hosts. So you end up sort of averaging them all together, and then you get one line instead of two or 80. I'll go into a bit more depth in a bit.
21:21
So I'm going to move on from that. The other sort of category of queries you end up having to do is you want to find out what metrics you've got in your system. These are all changing dynamically. Hosts are coming and going. VMs are coming and going. Networks are coming and going. It's all changing all the time. So you want to work out what you're actually monitoring, what's available. So we want to be able to list metric names.
21:43
Monasca has this concept of dimensions. So things like host name, mount point, process name. And we want the values of those. So we want to be able to say, list me all my host names. So the data and the queries for this is the next thing we'll look at.
22:00
So the data comes into this system in JSON format, because everything's JSON. The new XML. And the structure is kind of irrelevant for this talk. You get a timestamp, you get a name, you get the value, and you get a set of dimensions, a list of tags, essentially,
22:24
sort of key value pairs. So for a CPU percentage, you get a host name. And there's a value meta, which is kind of an extra bit of data you can store, but never really gets processed in any way. So, fairly simple.
22:42
Stick this in Postgres. You're probably thinking this is a terrible idea, but there's more to the talk. So we could just shove all the data in. The dimensions we'll store as JSONB, because that fits nicely. We want to access them quite fast. The value meta, we don't really care about,
23:00
we just want to pipe it back out again, so JSON's a good fit for that. All of these systems, if you dig into the details, are all double precision. None of them do anything other than double precision, so float8 is fine for all the values. And then the name and the timestamp: please always use timestamptz, nothing else, ever. You'll thank me one day.
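A rough sketch of that denormalized table, with illustrative names:

    -- one row per measurement, exactly as it arrives
    CREATE TABLE measurements (
        time       timestamptz NOT NULL,              -- always timestamptz
        name       text        NOT NULL,              -- e.g. 'cpu.percent'
        value      float8      NOT NULL,              -- everything is double precision anyway
        dimensions jsonb       NOT NULL DEFAULT '{}', -- tags: hostname, mount point, ...
        value_meta json                               -- opaque extra data, piped back out as-is
    );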
23:20
So this is a query we might run to get a single series out of the system. I've got a function there that rounds a timestamp to the nearest number of seconds you pass it. It's not a Postgres built-in, but it's very easy to write.
23:41
You can find examples of it all over the place. So we often want to say, give me all the values between this particular time range for this series, and we might also want to say, make sure it matches these dimensions. So we get that single series of CPU time for host dev 01.
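Sketched out, with a hypothetical ts_round helper standing in for that rounding function:

    -- round a timestamp down to the nearest n seconds (not built in, but easy to write)
    CREATE FUNCTION ts_round(ts timestamptz, secs int) RETURNS timestamptz AS $$
        SELECT to_timestamp(floor(extract(epoch FROM ts) / secs) * secs);
    $$ LANGUAGE sql IMMUTABLE;

    -- one series: CPU percentage for host dev01 over the last hour, in 30-second buckets
    SELECT ts_round(time, 30) AS time, avg(value) AS value
    FROM measurements
    WHERE name = 'cpu.percent'
      AND dimensions @> '{"hostname": "dev01"}'
      AND time BETWEEN now() - interval '1 hour' AND now()
    GROUP BY 1
    ORDER BY 1;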
24:03
And then we might want to get that individual series for all the hosts. So again, we can dig into our dimensions field, grab the host name and then group by the time window and the host name as well.
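That per-host variant might look roughly like this:

    -- one line per host: group by the time bucket and the hostname dimension
    SELECT ts_round(time, 30) AS time,
           dimensions->>'hostname' AS hostname,
           avg(value) AS value
    FROM measurements
    WHERE name = 'cpu.percent'
      AND time BETWEEN now() - interval '1 hour' AND now()
    GROUP BY 1, 2
    ORDER BY 1, 2;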
24:23
And finally, we might want to do this combo query where we sort of roll everything up into one big metric. So CPU percentage for all my hosts. Don't care about dimensions. And so the metric name list,
24:41
well, we could do that. It would work. A lot of you might be thinking, that's going to take a while. Yeah, it's going to take a while. But it's pretty good. Dimension names, a bit more interesting. Postgres has a nice function to get all of the keys out of a blob of JSON for you.
25:02
And likewise, if you want all the values for it, then you can just dig into the JSON and pick out the dimensions. Fairly straightforward so far. Not a very complicated SQL. It's not going to be very fast if we store it like that. So we do need to optimize it a little bit.
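Before any optimization, those listing queries might be roughly:

    -- which metric names exist?
    SELECT DISTINCT name FROM measurements;

    -- which dimension keys does a metric carry?
    SELECT DISTINCT jsonb_object_keys(dimensions)
    FROM measurements WHERE name = 'cpu.percent';

    -- which values does one dimension take?
    SELECT DISTINCT dimensions->>'hostname'
    FROM measurements WHERE name = 'cpu.percent';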
25:22
So if we stick with our denormalized schema, we could just put some indexes on it. We can use a gin index for the JSONB so we can pick out the host names really fast. We can put a index, we can put a multi-column index on name and timestamp. So when we're looking for a particular time range and a particular name, Postgres does that very efficiently.
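Something along these lines:

    -- containment lookups (@>) on the dimensions go through a GIN index,
    -- and (name, time range) scans go through a multi-column b-tree
    CREATE INDEX ON measurements USING gin (dimensions);
    CREATE INDEX ON measurements (name, time);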
25:44
Sort of walks through the B-tree to find the right name and then walks further through the B-tree to find the timestamp, and then just iterates all the way down. Really good structure. So a lot of information here. I won't dwell on it too much, but the queries I decided to look at for performance
26:00
were some of the series queries of varying time window. So over a small period of time, over an hour, over six hours, and over 24 hours. And then the listing queries as well. So the kind of interesting thing about this
26:22
is this kind of rather naive schema we've put together doesn't actually do too bad. We can pick out individual series really fast, but doing these kind of queries over the big time windows, so over the six hours and the 24-hour time windows, it really starts to sort of grind away a bit.
26:43
I should mention all these tests I talk about are over one day of data, and that's about 45 million rows in the table. I don't think I mentioned that earlier. So just rescaling that graph so we can look at it a bit better. Those other queries are kind of on the edge of our 100 milliseconds,
27:00
but a lot of them are actually in there, which is quite interesting. And unfortunately, those other queries where we were picking out metric names and dimension names, as you can imagine, they take a long time. So select distinct over 45 million rows. Yeah, it's not going to be fast, I'm afraid. Let me zoom in on that a little bit. You can see we are well over our requirement
27:23
of 100 milliseconds for all of those queries. And of course, you're all shouting at me, this is stupid. You should have two tables, and you're probably right. So the first kind of thing we can do to improve this is normalize out the two tables. So we've got a separate metrics table
27:40
and a separate measurement values table, and they're joined together with an ID. It has some other advantages as well, which I'll talk about in a minute, but it does make the queries quite a bit faster. It does mean we need to do a little bit of fiddling around
28:03
when we insert the values. So we find the ID of the metric, and then we tag the measurements with that value. I don't expect you to read that now, but it's kind of the rough thing you have to do. Not too tricky.
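A sketch of that normalized layout, again with illustrative names, including the find-or-create step for the metric ID:

    -- one row per distinct (name, dimensions) pair
    CREATE TABLE metrics (
        id         serial PRIMARY KEY,
        name       text  NOT NULL,
        dimensions jsonb NOT NULL,
        UNIQUE (name, dimensions)            -- also serves as the normalization lookup index
    );
    CREATE INDEX ON metrics USING gin (dimensions);

    -- the much narrower values table
    CREATE TABLE measurement_values (
        metric_id  int NOT NULL REFERENCES metrics (id),
        time       timestamptz NOT NULL,
        value      float8 NOT NULL,
        value_meta json
    );
    CREATE INDEX ON measurement_values (metric_id, time);

    -- a view so the queries can look the same as before
    CREATE VIEW measurements_v AS
        SELECT m.name, m.dimensions, v.time, v.value, v.value_meta
        FROM measurement_values v JOIN metrics m ON m.id = v.metric_id;

    -- inserting: find or create the metric, then tag the measurement with its id
    WITH metric AS (
        INSERT INTO metrics (name, dimensions)
        VALUES ('cpu.percent', '{"hostname": "dev01"}')
        ON CONFLICT (name, dimensions) DO UPDATE SET name = EXCLUDED.name  -- no-op so RETURNING fires
        RETURNING id
    )
    INSERT INTO measurement_values (metric_id, time, value)
    SELECT id, now(), 42.0 FROM metric;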
28:22
And the queries that we can do on this are exactly the same as before. It's just we can use that view to do the join for us, just a technicality more than anything else. You could write the queries with the join in if you wanted to. And the index is mostly the same as well, so we have a multi-column index on the metric ID and timestamp,
28:41
and we have just an index on name and dimension, so when we're doing that normalization, we can find the name and dimensions really fast. And that helps a bit. Our queries are down to 500 milliseconds for some of them. For the 24-hour query, we're still way over one and a half seconds, two seconds,
29:05
so we need to do a bit more work. The listing queries, however, because we've already normalized all of those lists out, we've already done that distinct operation effectively, are now well in our requirement.
29:20
So, that's good. So the problem with this is, and this is the problem with all time series data you eventually get, the actual detail that you need for a particular graph becomes less interesting the bigger the time window you're looking at. So if you're looking at a 24-hour time window
29:41
or a six-month time window, you don't need to plot every point from every 30 seconds on that graph. So you really don't have to store all of that detail. And this is what a lot of these time series databases seem to be doing, is they're doing these roll-ups as you push the data in,
30:02
so instead of having a point every 30 seconds, you have a data point for each of your metrics every two minutes or every five minutes. And this really shrinks down the amount of data you need to query for a 24-hour time window. So I call this the summarized schema.
30:22
It's conceptually similar to RRD and tools like that. And again, a lot of the internals of these other databases are doing this thing as well. They're building up coarser roll-ups to query instead. So we can do that in Postgres. We can build these summary tables, and depending on the functions you want to do,
30:41
you can choose what to compute. So I've done sum, count, min, and max here, because then you can kind of aggregate them together higher up. And I've got a unique constraint there just to make the creation of these summaries a bit easier.
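For instance, a five-minute roll-up table might look like this, with the unique constraint keeping one row per metric per bucket:

    CREATE TABLE summaries_5m (
        metric_id   int NOT NULL REFERENCES metrics (id),
        time        timestamptz NOT NULL,   -- start of the five-minute bucket
        value_sum   float8 NOT NULL,
        value_count bigint NOT NULL,
        value_min   float8 NOT NULL,
        value_max   float8 NOT NULL,
        UNIQUE (metric_id, time)            -- lets the summarizing trigger upsert into a bucket
    );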
31:00
So we attach a trigger: when we insert data into our main measurements table, because we want to keep the really fine-grained detail in there, the trigger essentially updates a row, the row for that particular time point for that metric ID.
31:21
And then if there's a summary already there, then we combine it: when we're summing, we add them together, or for the min, we take the min of the one that's already there and the new one we're adding in. And that's it. That's kind of all you've got to do to build these summaries and turn Postgres into a
31:41
time series database. So what I've kind of discussed here isn't the most efficient way of doing it, but it does the job, as we'll see in a second. So a few technicalities just to join the tables together.
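A minimal sketch of such a trigger, reusing the hypothetical ts_round helper and summaries_5m table from the earlier sketches:

    CREATE FUNCTION summarize_measurement() RETURNS trigger AS $$
    BEGIN
        INSERT INTO summaries_5m (metric_id, time,
                                  value_sum, value_count, value_min, value_max)
        VALUES (NEW.metric_id, ts_round(NEW.time, 300),
                NEW.value, 1, NEW.value, NEW.value)
        ON CONFLICT (metric_id, time) DO UPDATE
            SET value_sum   = summaries_5m.value_sum + EXCLUDED.value_sum,
                value_count = summaries_5m.value_count + EXCLUDED.value_count,
                value_min   = least(summaries_5m.value_min, EXCLUDED.value_min),
                value_max   = greatest(summaries_5m.value_max, EXCLUDED.value_max);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER measurement_values_summarize
        AFTER INSERT ON measurement_values
        FOR EACH ROW EXECUTE PROCEDURE summarize_measurement();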
32:02
And the queries, so the queries for the small intervals, we still go to the raw data. For the larger intervals we go to the summaries, and those queries are of course a lot faster. So that's good. So those 24 hour queries that were taking two seconds,
32:21
we're just ploughing through the data as fast as we could. Now we do something a bit smarter, we're only querying the rolled up data, the data we pre-rolled up. Even our 24 hour query comes back in less than 100 milliseconds. Pretty good. And of course, some of the dimension listing queries are the same.
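For example, the 24-hour all-hosts average, served from the roll-ups rather than the raw table, might be roughly:

    SELECT s.time, sum(s.value_sum) / sum(s.value_count) AS value
    FROM summaries_5m s JOIN metrics m ON m.id = s.metric_id
    WHERE m.name = 'cpu.percent'
      AND s.time > now() - interval '24 hours'
    GROUP BY s.time
    ORDER BY s.time;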
32:44
So, the other thing that we want to consider is can we actually get this amount of data in? So we need to be able to get a day's worth of data in less than a day, otherwise we're going to start lagging behind quite significantly. And for this system, so,
33:01
we've got quite a lot of headroom. I think this, the summary scheme, equates to about 15,000 measurements a second. So, you know, it's not a lot, but bear in mind this is a very naive scheme and a very naive sort of set of triggers we've thrown together to make this work.
33:21
So, pretty good result. The other thing worth noticing is the normalization, as you might expect, really does reduce the amount of disk space you need. And in fact, this was necessary for us because the, in order to get to that one terabyte for six months value, we had to hit less than six gigabytes a day of data.
33:43
So we were well over that with the denormalized data. So, this isn't the whole story. I mean, this is an example of how you could do it. To really make this work in production, there's a few things we're going to do, and there's a few things you're going to have to do as well.
34:01
You're probably going to need coarser summaries. So if you want to do a six month average, you're probably going to need maybe a summary for every, a point every hour, or maybe a point every two hours. You're probably going to have to partition the data, and that's kind of the assumption I made. So all of these tests have been on a day of data, so you're probably going to want to partition it by day.
34:24
And that also makes dropping the data very fast. And there's a few tricky ways you can optimize the producing of these summaries, because the way we're doing these summaries, we're doing an update for every measurement. So, B-tree lookup and an update of a row.
34:40
So, not the most efficient way of doing it. We really want to take a whole batch of these values. So for a five minute interval, we want five minutes worth of values, shove them all into a smaller format in one go. And there's a few ways to do that, but I'm not going to go into that detail. This is kind of meant to be an overview. So we can do metrics.
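As an illustration of the partitioning idea, with the declarative partitioning that arrived in PostgreSQL 10 (at the time of this talk you would do the same thing with inheritance and a routing trigger), partitioning the values table by day looks roughly like this, shown standalone rather than as a drop-in for the earlier definition:

    CREATE TABLE measurement_values (
        metric_id  int NOT NULL,
        time       timestamptz NOT NULL,
        value      float8 NOT NULL,
        value_meta json
    ) PARTITION BY RANGE (time);

    CREATE TABLE measurement_values_20170205 PARTITION OF measurement_values
        FOR VALUES FROM ('2017-02-05') TO ('2017-02-06');

    -- expiring an old day of history is then a near-instant DROP, not a slow DELETE
    DROP TABLE measurement_values_20160805;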
35:00
It wasn't too hard. So this next section is kind of a bit wider stretching perhaps. So what else could we put into Postgres? Well, we said earlier we could put logs in there. So, let's do that. So we want some centralized log storage. We want it to be searchable.
35:21
So we want to be able to search for things like connect and HAProxy, and we want to get values back. We want it to be time bounded. And again, they've got to be fast, because we're going to produce interactive graphs and interactive lists of these things. We don't want to keep people waiting around.
35:40
So this is the kind of data we get. It typically comes from rsyslog. You get things like severity, the program name, host name, and you get the message. So again, basic schema: timestamp, message, dimensions.
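A rough sketch of both the table and the kind of search described next, with illustrative names:

    CREATE TABLE logs (
        time       timestamptz NOT NULL,
        message    text NOT NULL,
        dimensions jsonb NOT NULL DEFAULT '{}'   -- severity, programname, hostname, ...
    );
    CREATE INDEX ON logs USING gin (to_tsvector('english', message));
    CREATE INDEX ON logs USING gin (dimensions);

    -- "find me logs containing 'connection' where the program is haproxy"
    SELECT time, message
    FROM logs
    WHERE to_tsvector('english', message) @@ to_tsquery('connection')
      AND dimensions @> '{"programname": "haproxy"}'
      AND time > now() - interval '1 hour'
    ORDER BY time DESC;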
36:02
And the sort of queries we want to do, if you're used to using kind of Elastic and Kibana, you might want to do things like this. So you want to say, find me logs with the word connection, where the program name is HAProxy. And we can do those with Postgres. Postgres has got full text search. It's admittedly a bit longer.
36:21
We can do a tsquery match on the message, and then we can do a contains on the dimensions. As long as we've got those indexed with some GIN indexes, it'll be nice and fast. I haven't dug in as deep for this section of stuff,
36:42
so I've got no performance numbers for this, I'm afraid. A bit of kind of anecdotal fiddling around shows that you can store quite a lot of logs with some basic GIN indexes on them and get some really fast queries. That's good. If any Postgres people want to tell me whether it's a better idea to combine those
37:01
into one multi-column index or not, that would be really interesting. I'm not sure. I haven't tried it yet. So, log parsing. So this is our log message again, our log structure, a bit of JSON that we've got out of rsyslog
37:20
or wherever. What we want to do is we want to notice that that has connect from, and it's from HAProxy, and we want to tag it. So this is the sort of thing that logstash will do for you. We want to tag it with connect because then it makes it really easy to search for it later on and draw kind of graphs with it.
37:41
We then want to get all of this data out, and we want to store it in a structured way so we can search for a service name or protocol. It's nice and reliable and robust, and we're not doing horrible regexes all over the place. I shouldn't have said that because we are going to use
38:01
horrible regexes to do this. Postgres can do regexes. If you've used logstash for a long time, you realise that the core thing is the grok. I mean, that's the really good bit of logstash, and it's a regex, and you match your log messages, and it pulls out all the field data for you.
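The same trick in Postgres might look something like this, with a made-up HAProxy-ish log line and regex:

    -- capture groups pulled out by the regex, garnished into a jsonb blob
    SELECT jsonb_build_object('source_ip',   m[1],
                              'source_port', m[2]) AS dimensions
    FROM regexp_matches(
             'haproxy[123]: Connect from 10.0.0.5:4242 to 10.0.0.1:5432',
             'Connect from (\d+\.\d+\.\d+\.\d+):(\d+)') AS m;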
38:22
So with a bit of extra JSON garnishing, we can make this into a nice little JSON blob as well. That's quite good.
38:41
So we've got our schema for our logs, which we looked at a minute ago. What we really want to do with this data is we want to pass the message against the patterns, and then we want all those dimensions extracted and added on as extra fields. So this is kind of an idea. This might not be the nicest way to do it, but there's certainly a way you can do it.
39:02
We can have a patterns table so we can store our patterns in the table, and we can store the field names alongside them. So that part of the regex ends up being the source IP. You sort of match them one-to-one and then use our little trick earlier.
39:22
And again, just keeping this simple, let's use a trigger. So as we insert logs into our logs table, we pre-process them. So for those people not too familiar with Postgres or SQL, this trigger lets you pre-process the row before it's inserted into the table. So we're using that. So we iterate over all of our patterns,
39:41
match all of our patterns. If one matches, then we add the dimensions on. Then we attach the trigger to the table. So when we insert our log into the table,
40:02
it actually gains all of this extra stuff because of the trigger, which is really nice. And this actually gives us a way to dynamically add patterns to our log parsing, which is really cool. That's not something you get in Logstash: you have to edit config files
40:21
before you can add new patterns to it. Interesting advantage. Again, probably not the most efficient way to do it, but good enough. So, stretching even further now. What do you mean by queuing in these systems? Why do these systems have message queues in them?
40:44
So the point of the queue is really to handle really bursty traffic. So you might have a system that's spewing logs out at quite a rate. And you really want to keep this, and you want to persist this as soon as possible so you don't lose it. But you don't really have to query it straight away.
41:02
It's okay if it sits there for a few minutes. As long as it's there and you can look at it eventually, then that's kind of good enough. And it's really the only option you have if you're getting gigabytes of logs and there's only a certain amount you can process unless you have an entire cluster just for processing logs, which I presume some people do by the way they speak about it. So really all we want to do is
41:20
we want to just write the data to some sort of persistent storage as fast as possible. Don't have to query it, and we'll worry about structuring it later, processing it later. So if we go back to those ingest rates, so I've turned the numbers I had from earlier kind of on their head from ingest time to ingest rate. So our summarised time series data,
41:44
about 15,000 of them a second, because we're doing a lot of processing on it and we're not doing it very efficiently. But the less processing we do, the more we can shove in. So if we go back to the denormalised version, we can actually put in quite a lot more per second.
42:00
So we can take in 140,000 metrics a second. That's quite good. We could simplify this. I mean, if we just stored the raw JSONB that we got from the software, doing even less processing, now we're up to a much higher number.
42:21
I think I've got my scales wrong there, sorry. So that one's about 140. Well, let's do even less processing. Let's store this JSON. The JSON, for those of you who don't know, is just a text version of the JSON stored in the column. The JSONB does some extra fiddling around for you to make it more efficient to get to.
42:42
Well, the thing about the JSON type is it still validates your JSON for you, which is great most of the time, unless you really just want to shove data into a table really fast. So if you store it as varchar, you can squeeze out even more, well over 250,000 measurements a second now.
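A rough sketch of that idea, persisting the raw text first and structuring it later with a background job (names and JSON fields are assumptions based on the format described earlier):

    -- staging table: no parsing, no validation, just persistence
    CREATE TABLE ingest_raw (
        received timestamptz NOT NULL DEFAULT now(),
        payload  varchar NOT NULL       -- the JSON text exactly as it arrived
    );

    -- later, a background job parses it and files it away at its leisure
    -- (assuming an ISO-8601 timestamp field in the payload)
    INSERT INTO measurements (time, name, value, dimensions, value_meta)
    SELECT (payload::jsonb->>'timestamp')::timestamptz,
           payload::jsonb->>'name',
           (payload::jsonb->>'value')::float8,
           payload::jsonb->'dimensions',
           (payload::jsonb->'value_meta')::json
    FROM ingest_raw;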
43:03
Does anyone have any idea of what I'm going to put there next? Copy, anything to do with copy? So if you go to the darkest depths of Postgres and use binary copy, then you can eke even more performance out,
43:20
nearly up to 350,000 of these a second. So 350,000 rows of JSON, and these bits of JSON, they're about 300 bytes long, so the metrics are similar size to the logs. So that's quite a big system. You can handle pretty impressive ingest rate just by stripping all the processing
43:41
away and just shoving it in there and having a sort of background process, doing the processing later on for you and let it catch up over time. So it's possible we could replace the queue with Postgres as well, which would be nice. Have all of our persistent data in one place. So what's kind of the conclusion of this?
44:05
I kind of view Postgres as a, as the hipsters might say, data persistence toolbox that just happens to use SQL. I think a lot of people are scared off by SQL. And if I hadn't worked for a database vendor, I might be equally scared by SQL,
44:22
but I still think it's useful and I still think you can do a lot of things with it that kind of the NoSQL crowd like to do. I don't think you should be scared of it. The batteries aren't always included though. You have to think about your problems and do some work. Postgres isn't a bespoke solution for your time series. It's not a bespoke solution for your log searching.
44:44
It doesn't mean it's hard, but reducing the number of systems can be a huge operational advantage if you've got quite a small team. One bit of software, one bit of persistence to understand instead of six. Good. And use and deploy what you know and what you trust.
45:01
You trust Postgres to store your data for you, then why not use it? That's a good idea to me. So that's it. Thanks for listening. Thank you very much.
45:20
Do we have any questions? I'll go there next. Hello. Can you tell us what kind of hardware and main settings you were running on that 45 million queries, sorry, logs that were entering Postgres?
45:41
Yes, it was my laptop. Oh, okay. Any particular tuning settings? No, I was, I had Postgres running in an LXD container. Stock. Default config, yep, just installed it. So we can make it better. You probably could make it a lot better, yep, definitely.
46:11
Hi. You said you could replace the InfluxDB with Postgres and said that Grafana could read that, but I couldn't find anything about Grafana reading from SQL.
46:24
Ah, yes, you're right. So there is a Grafana pull request, I believe, open where they're actually building in support of SQL databases, which is quite interesting. But what we're actually doing here, so that Monasca component I talked about briefly has an HTTP API for storing metrics
46:43
and for obtaining metrics as well, and running queries. And that's part of an OpenStack project, and it's all written in Python. So what they've done is they've built a Grafana plugin that talks to that API, and that's how you're accessing the query. So we're not actually querying Postgres directly, but we're going through this kind of API layer.
47:01
But if you did, if you were outside an OpenStack context obviously, then it might be worth keeping an eye on those SQL functionality pull requests in Grafana, because they look quite promising. One question. Is this code available on GitHub or something? Not currently, no.
47:21
My employer is more than happy to open source everything we do, so I can, if you drop me an email, I'll let you know when we've made it available. We'll make it available in the coming weeks.
47:43
How did you get all the data into the database in the first place? How we got the data in? Yeah, how you got it in. So the performance tests were run using copy through a little Python shim.
48:01
If you do a lot of Python with Postgres, inserting individual rows or even using executemany is very inefficient. You're much better off using copy, like in psycopg2, and then you get really, really good insert rates into Postgres tables. And can you tell us about the scale of the infrastructure you monitor in production?
48:23
Because people who are listening will wonder, is it applicable to my infrastructure, or is it just a proof of concept? Yeah, so that's sort of an interesting point to make, really. This works because our system, our infrastructure, is quite small. So if you haven't got extreme requirements, you can make your life a lot simpler
48:41
by doing something like this, I think. So the system we're going to deploy this on, the system I've been testing, the scale I've been testing is 80 nodes with about 200 metrics coming from each node every 30 seconds. And the system we're going to run it on is ludicrously overpowered, because all these tests were run on my laptop. So it should be more than a proof of concept.
49:03
I think you should be able to stretch it quite a way. Can't promise that, though. Might I recommend that when you run an ingest system for the high-speed data you're collecting, you use a different Postgres than the one that you actually collect the long-term data on?
49:22
That's a very good idea, yeah, definitely. Actually, I was having a conversation with someone on Friday, actually, about potentially using the logical replication to feed a lot of the... So if you really want to optimize that summarizing process, actually feeding it asynchronously through the logical replication to a different Postgres instance
49:40
means the summarizing and all that extra work we're doing isn't holding up loading the data. So you'd actually be able to load data a lot faster in that case. So yeah, that's a really good point. Any more questions? Okay.
50:02
Thank you very much. Thank you.