Why monitoring sucks, and how to improve it
Formal Metadata
Title: Why monitoring sucks, and how to improve it
Number of Parts: 163
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/50059 (DOI)
Transcript: English(auto-generated)
00:00
Okay. I think we're starting. This is one weird spot to speak at. It's actually like, you know, everybody is just walking from there to there, stopping by, sitting down, disappearing again. Everybody is distributed, like, you know, to the edges, just to jump away very fast if the talk is boring.
00:21
And I have almost nobody in the middle here. So that's weird. Okay, let's do the best. How about Humor here in Norway? Humor. Good. How many of you guys are Ops guys? Ops. Ops. Nobody. No. Oh my God.
00:42
Okay, yeah, this is a developer conference. Whatever the difference is. So yeah, just briefly about me, you probably don't know me. I hope that this is okay that I have, like, margins here, because I didn't expect this widescreen here.
01:04
Oh yeah, anyway, Pavlo Baron, I'm from Germany. Yeah, from Germany. CTO of Instana, we are the next big thing in the area of application performance monitoring and monitoring in general. This is not a sales pitch, I promise.
01:21
It's just an accident. No, definitely it's not a sales pitch, I'm never selling anything. So yeah, okay, if you're not in Ops, how many of you are actually doing monitoring of systems actively?
01:42
Amazing. So all the others are just developing software? Okay. Yeah, that's a cool job, man. Well, I know you. Yeah, well, if we speak of...
02:01
Actually, how many of you would agree that monitoring itself sucks? Probably nobody? Well, one guy? Amazing. Two guys, three guys. Awesome. Let's do the best to convince you that it really does. Because actually it sucks big time.
02:20
And this is what I will focus on in this talk. It's sort of popular these days to rant about things. I'm not only ranting, I'm introducing a couple of concepts that are probably pretty new that have always been forgotten and are more or less obvious these days, how to improve the monitoring which is very essential for IT systems these days.
02:45
So here we go. The number one rant is, yeah, there is not one single tool which can do all that that we need from the monitoring itself.
03:01
And even worse, it's like it's a zoo of incomplete tools. So what actually... I mean, I've seen somebody... It's a pretty large insurance shop. They have 50 of them, 50 different solutions added on top of each other.
03:22
Because, yeah, well, yeah, Graphite doesn't support this, but we need Nagios here. So let's add this. Oh, yeah, Librato probably has a plugin for that and so on and so forth. So you just stumble from one into the other and add on top and on top and on top. Do you guys remember Apollo 13?
03:41
It's pretty cool what they actually have tinkered together there, this CO2 filter thingy. But this is actually how our monitoring solutions seem to work right now. It's just like, you know, it's a patchwork. It's even worse than that. It's more like Jenga. So when you just realize, oh, yeah, well, this technology is not supported,
04:03
but we need it because the CTO said we need this technology. Let's add a brick on top of it. So at some point it probably will crash, but we can go up like 36 or 37 levels.
04:22
Speaking of consistency of tools, the problem in the monitoring area is that it's itself, it's too diverse. So one tool cannot promise everything. But what happens is that, actually, tools do. Do you hear me? Can you hear me good? Okay. I never listen to myself.
04:45
Yeah, it's too diverse. But what is really needed, and this is one of the improvements that need to happen sooner or later, or actually sooner rather than later, is that when a tool which is responsible for monitoring
05:00
doesn't understand something exactly, like, okay, when we reach this threshold, this has this meaning, it needs to, well, sort of mimic a human, like try to reason about this in some analytical way. And this is something I will have on a couple of slides as well.
05:22
Like it goes in the direction of math. You probably expect this. It's probably too boring for math right after lunch. Did you enjoy lunch? Are you getting fed up at this conference? This is amazing, right? You can just go down for food.
05:40
Yeah, we'll have to get a couple of rounds in when I get back home. Yeah, the next thing, model. What is the model of IT? What is that? Yeah, we have servers, right? We have applications.
06:04
The problem is that this is actually how typical monitoring solutions are considering the model of IT systems. It's like there is a computer, well, there is an application, there is a printer, there is a switch, there is something like that. And there is this evil internet cloud.
06:21
Yeah, that's historic, actually. The problem that we have, during the past couple of years, we've added multiple layers, and we go on adding layers and layers and layers and layers on top of everything. So we are in the world of Lego bricks, where we're just like, you know, oh, here we go, hypervisor, then we go with a VM,
06:43
then we go with a container, then in the container. You can actually do container inception, did you know that? Did you ever try this? You run container and container and container. This is fucking amazing, I'm sorry. You can do this, but nobody should. I'm not allowed to swear, right? You probably will cut it out after the talk.
07:04
Yeah, so, speaking of adaptive models, do you guys know this meme? Okay, cool. Yeah, we have very complex taxonomy and geography in the world of IT systems.
07:22
Very complex. And it's all moving parts. Actually, nothing is stable there, because everything can interact with everything. Everything can rely on everything. It's not just like every application needs a database these days, and so on and so forth. It's much more complex than that. So instead of having a written-in-stone relational model,
07:42
which is then sort of projected to a relational database, which is behind almost every single solution, which is available right now, sort of relational database, well, they go with MySQL, whatever, we should embrace graphs. It's too dynamic for stone-hard stuff.
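To make the graph idea concrete, here is a minimal, stdlib-only sketch of infrastructure modeled as a dynamic dependency graph rather than a fixed relational schema. All component and service names are invented for illustration; this is not any particular tool's model.

```python
# A sketch: infrastructure as a dynamic graph. Components come and go,
# and the model just adapts instead of breaking a fixed schema.
class InfraGraph:
    def __init__(self):
        self.edges = {}  # node -> set of nodes it depends on

    def add_dependency(self, src, dst):
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())

    def remove_node(self, node):
        # Containers and VMs disappear all the time: drop the node
        # and every edge pointing at it.
        self.edges.pop(node, None)
        for deps in self.edges.values():
            deps.discard(node)

    def dependents_of(self, node):
        return {src for src, deps in self.edges.items() if node in deps}


g = InfraGraph()
g.add_dependency("checkout-service", "mysql-1")
g.add_dependency("checkout-service", "redis-1")
g.add_dependency("billing-service", "mysql-1")
print(g.dependents_of("mysql-1"))  # both services rely on the database
g.remove_node("redis-1")           # a container dies; the graph adapts
```

The point is only that relationships are first-class and mutable here, which is exactly what a written-in-stone relational model makes painful.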
08:03
It's a dynamic graph, it's ever-changing. So, kinds of relationships will always change, come and go. That's where graphs kick in. Which brings us to the outdated technology. Those of you who are doing monitoring, what tools are you using right now?
08:26
Awesome. Cool. Nagios? No. Graphite? What's that? Okay, yeah, well, ELK, cool, yeah. It's not outdated, of course.
08:43
Yeah, probably going to an ops conference would reveal a little bit more of the tools that are being used. Of course, when you're a developer, you probably don't know what tools are being used in ops. And you probably just don't care. Yes.
09:00
The next thing. Next thing. Cool. Awesome. Yeah, the next thing. Okay, outdated technology is a little bit interesting thing. It's like when you come with a knife to a shootout. I didn't find a picture for that because the Internet is full of pretty weird pictures.
09:23
But you can compare it with something like this or something like that. I mean, it's a mobile phone still, it works. It probably works yet. But we probably shouldn't use it anymore. So, the messages here are generally.
09:43
Technology itself, and you know this better than me, technology itself is crucial for any solution. Actually, for any solution. Actually, for real, any solution. But also for monitoring solutions as well. And very many, even New Relic and then guys like that,
10:01
they have implemented their whole stack like 10 years ago. And I would claim that it's a little bit outdated because otherwise features could be added much faster. And the model of the modern IT could be, well, implemented a little bit more flexible.
10:20
But the thing is that when you look at your monitoring solutions, you should look at them like at any other technology you use in your stack. And it should go in your future, not in your past. Like, let's use something which is 20 years old. But, well, it might reveal some problems if we have some problems in production.
10:43
So let's speak about math. Do we have mathematicians here? No. Really? Oh my God. Cool. Naive. Well, naive math.
11:06
Do you guys still do oil check manually? Awesome. Well, it's Norway, right? It doesn't mean a thing. I'm not judging, I'm just saying.
11:21
I know people who probably would not know how to do that manually. But what I want to speak about is thresholds. Thresholds are completely useless. Because even with oil, I mean, this is a moving liquid. So it can just shoot over, shoot down, and so on and so forth. And then you just have a totally wrong picture when you look at it and it's somewhere here.
11:44
You know, there. This thing doesn't work here. Well, fair enough. But still, it's like on top of this thing there. What do you call it in English? What is the name of it? Measurement something.
12:01
Stick. Yeah, stick. Okay. So it's, in this case, it's on his left hand somewhere, the oil. Well, wrong picture in this case. Too manual. Thresholds are pretty useless because when you look at memory and stuff like that, well, okay, yeah, you're shooting from time to time over a threshold, 80% of memory usage.
12:21
So what? What meaning does it have? Do you want to be alerted with that just because you went over 80% of memory usage in this machine? Ideally, you have 100% usage and never shoot out. Well, depends on the case, of course.
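A tiny sketch of why a bare threshold is noisy, with invented numbers: the naive check fires on every momentary spike, while even a trivial "sustained breach" rule only reacts to a real climb.

```python
# Per-minute memory usage in percent; numbers are made up.
mem_usage = [62, 81, 60, 79, 83, 58, 85, 86, 87, 88, 89, 90]

# Naive: count every sample above 80%.
naive_alerts = sum(1 for v in mem_usage if v > 80)

def sustained_breach(series, threshold=80, hold=3):
    """Alert only when the threshold is exceeded `hold` samples in a row."""
    run = 0
    for v in series:
        run = run + 1 if v > threshold else 0
        if run >= hold:
            return True
    return False

print(naive_alerts)                 # fires on momentary spikes too
print(sustained_breach(mem_usage))  # True: only the sustained climb at the end
```

Even this small change already separates "shot over the line for a second" from "something is actually going on", which is the distinction a hard threshold cannot make.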
12:41
The other thing which is very popular in the area of monitoring is baselining. Baselining is a very ambivalent thing because when you want to do it right, you need a lot of math to measure a real stable baseline. Because when you look at the picture like that, well, this area is completely flat.
13:05
Nothing moves anymore. Everything is stable. So when you measure this as a baseline from the past two hours, it doesn't make sense at all. It's not a baseline because it's real bad state, but nothing changes. So going only after change measurement is also a pretty wrong thing.
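The baselining pitfall can be shown in a few lines (numbers invented): a baseline learned from a saturated-but-stable window declares the bad state normal, because nothing in it changes.

```python
# CPU usage in percent, stuck near 100% for the whole learning window.
saturated_window = [99.1, 99.0, 99.2, 99.1, 99.0, 99.2]

# A naive baseline: the mean of the recent past.
baseline = sum(saturated_window) / len(saturated_window)

# A new sample at 99.1% deviates almost not at all from that baseline...
deviation = abs(99.1 - baseline)
print(round(baseline, 1))  # ~99.1 -- the "normal" the tool just learned
print(deviation < 1.0)     # True: a saturated CPU now looks perfectly fine
```

This is exactly the "real bad state, but nothing changes" case from the talk: change-based measurement alone blesses it.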
13:29
So when we speak of math in the area of monitoring, it's actually the... And this is something that really goes into this direction slowly. When you look at new start-ups popping up, like SignalFx and so on,
13:47
so monitoring is becoming a mathematical domain. It's a mathematical problem. And in this case, it's not only just like simple thresholds. So going with the simple statistical things like, well, it's two standard deviations from the mean, then it's a problem.
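That naive two-sigma rule, sketched with invented data, is literally this: flag anything more than two standard deviations from the mean. It catches the obvious spike, but it knows nothing about why, which is the oversimplification being criticized here.

```python
import statistics

# Request latencies in ms; one obvious spike. All numbers invented.
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 250]

mean = statistics.mean(latencies)
sd = statistics.stdev(latencies)

# The "two standard deviations from the mean" rule.
outliers = [v for v in latencies if abs(v - mean) > 2 * sd]
print(outliers)  # catches the spike, but has no idea what it means
```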
14:05
It's more complex than that. But the quite opposite of it is blind math. Do you remember Pulp Fiction? Does anybody really know what it is in this case?
14:23
Did they ever say what it is? They never did, right? I still have no idea what it is. But we see it. We see something. It's shining gold. Wow, cool. This is also valid for tools. When they look at your metrics, like something is there.
14:43
I need to tell you that something is happening. The other thing is, everybody is speaking of correlation. So we need to correlate like A with B and so on. When you do this, this auto-correlation... no, autocorrelation is the wrong word for it. When you automatically try to correlate things that you have no meaning of,
15:02
in this context, like this pair, what could happen is that you clearly see that the color of the pants people are wearing in Norway correlates very well with the amount of rain in Australia.
15:23
So how much meaning does this have? Your monitoring solution will leave it up to you to decide, does it make sense or not? So what I'm trying to say with that is, whenever you speak of correlations, you need to speak of semantics as well. And this is what is actually partially missing in the area of monitoring,
15:45
which is very important, is that semantic knowledge is a very central concept for monitoring. You need, as whatever tool provider or solutions provider, whatever, you need to understand what is actually happening when you look at things, how things behave together.
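The pants-and-rain joke can be reproduced numerically: two completely unrelated series that merely share an upward trend correlate almost perfectly. All numbers are invented.

```python
# Two unrelated, invented series that both happen to trend upward.
pants_darkness = [10, 12, 11, 14, 15, 17, 16, 19, 21, 22]   # Norway
rain_mm        = [30, 29, 33, 35, 36, 40, 39, 43, 45, 46]   # Australia

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(pants_darkness, rain_mm)
print(round(r, 2))  # very close to 1.0 -- and completely meaningless
```

Without semantic knowledge of what the two metrics are, a high coefficient like this is exactly the trap the talk warns about.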
16:01
Does it make sense that an I/O wait goes together with a memory consumption metric or whatsoever? All these things. And whenever you speak of mathematical models, probably some of you are mathematicians, but you don't say that, when you speak of models, models are something that make sense.
16:21
So you model after a real world. You just try to solve a problem in terms of mathematics. And this is very crucial, not only for monitoring, but, you know, I've been playing around with this whole big data stuff. You literally don't find anything when you don't know beforehand what you're looking for, seriously,
16:41
because at some point it just becomes ridiculous what you find there. Which brings me to the next topic, and I would call it eyeball intelligence. So you have eyeballs and you are intelligent when you sit in front of a bunch of graphs.
17:00
Who's looking at something like this all day long? Probably everybody. Everybody likes it. Come on, guys. We all love charts on black backgrounds.
17:21
So yeah, we don't feel like Lieutenant Data from Star Trek, right? Yeah, except that, actually, he's a robot. When he looks at things, he has a different pace of how he can process stuff. And when we look at hundreds of machines of systems, we don't feel like Data,
17:43
we feel like, hmm, no idea what is happening. And the next thing is that, actually, well, USS Enterprise is on autopilot most of the time.
18:00
They only jump in when it's serious, when there is an issue and they need to do something. I'm not a pilot, but I've been talking about this with a couple of pilots, and it's actually pretty comparable to piloting an aircraft. Most of the time you're on autopilot, when things go wrong, you need to solve this manually yourself.
18:23
So you just need, and this is the direction where the monitoring tools and solutions, whatever you built inside your setup, will go into. You will have more intelligent robots that will support you or partially replace humans, actually,
18:44
when it's about real boring things, because it's about a lot of machines, complex systems, where a human is not able to understand, to grasp everything, not even looking at graphs and charts. Even when your tool will show you, there is an outlier.
19:02
It's still up to you to decide, is this outlier something that makes sense or not. It's still the same. Also, what is a problem right now is a sort of weak resolution. And it's, so there are tools that are resolving like, well, they're sending data from the field,
19:24
like once in a minute or once in 10 minutes or 15 minutes, a sample of 15 minutes. So it's comparable to this. Do you remember this device? Does anybody do it? Or am I the only guy who's old enough for that?
19:42
But you still don't have it, right? You don't have it anymore. I hope so. Yeah, there's one-way camera. You can take pictures with that, but you will throw it away, and pictures are pretty bad quality. The next thing is the Big Bang. When you try to do a resolution like once in a minute, when you sample to a minute,
20:05
Big Bang was something that has happened within a couple of seconds. You will completely overlook something like that. You will not see it. It's probably strong enough that you will see like, there is a peak. Awesome, I see a peak. But you don't have any details anymore. What was that? What is it all about? And so on.
20:24
So we have a high level of pixelation, and pixelation itself, in terms of monitoring, is, well, pixelation itself is good for retro games when you implement a Mario game, a simulated one, or if you have something to hide.
20:43
But what I want to tell is, we need to go for resolution, which is below one second. And on demand, we need to send even more data. This data needs to be pre-processed, prepared, and sent over, once we require it.
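A sketch of how coarse resolution swallows a short event (invented numbers): at full per-second resolution a two-second burst is obvious, but averaged into a one-minute sample it barely registers.

```python
# Per-second request rate, invented: flat at 100 req/s except a short burst.
per_second = [100] * 60
per_second[30] = 5000   # two seconds of chaos
per_second[31] = 5000

# What a tool sampling once per minute would store: one average.
minute_avg = sum(per_second) / len(per_second)

print(max(per_second))    # 5000 -- the event is unmissable at full resolution
print(round(minute_avg))  # ~263 -- at 1-minute resolution, just a small bump
```

This is the "Big Bang in a couple of seconds" case: you might see that there was a peak, but every detail of what happened inside that minute is gone.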
21:04
And speaking of all this real-time, no real-time, near real-time, near near-time, near near real-time, and so on and so forth. Yeah, we're there. Do you guys have a game like this in Norway?
21:20
Some folks in Germany do have. The idea is to, you have two bins on both ends, and you need to, well, to get your water from one bin to the other through this pipe, like, manually. I mean, no pressure, nothing. Like, just spill it in. So this is how solutions that are not being, have been built for real-time processing of data,
21:46
or near real-time processing of data, how they actually look like when they speak of real-time. So I have a database, and, well, somebody will request the database, and then I will spill this information into sort of a stream, maybe.
22:02
But it's not end-to-end stream through. The other example is the snail mail. Do you have it in Norway? You don't have it? No, I'm kidding. But nobody should be using it anymore, seriously. What is it actually for? Yeah, postcards, maybe.
22:23
So real-time thing is, when you have, when you see a, like, previous second or a second before the previous second, it's already useless. It's in the past. You don't win anything. It's just like forensics.
22:41
You just look at the past, and you try to understand what has happened. So everything in this monitoring world, and I'm pretty sure it will be the case more and more, will go into the direction of real-time data processing, like, well, near real-time data processing,
23:00
where the whole data is being streamed from your field back to the backends, and over to the user interface, or whatever, alerting analytics on whatever systems are running behind the scenes. Because this way you don't lose time, where you don't need to lose time. That's what I mean.
23:27
That's pretty cool. Forecasting yesterday's weather. I'm in Oslo for now three days, and I wanted to spend, like, half a weekend here on Saturday. And every time I look at the weather forecast on my iPhone, it's like, I mean, it changes completely.
23:46
So this is one thing that weather forecasts in areas that have quite an amount of water close to them are totally useless. This is a meteorological problem, but this is also a monitoring problem, because, can you guys read it?
24:19
I repeat it, sometimes I want to go back in time and punch myself in the face.
24:25
The meaning of that is pretty simple. Okay, yesterday something evil has happened on my system. Something has crashed totally. Yeah, cool. So what? I mean, it's today now.
24:46
Forecasts need to go into the future, of course. That's what forecast actually means. So when you have past events, you will crunch this data, whatever, again, whatever tools you're using, you can write it straight to your HDFS and then Hadoop around on it, or Spark around.
25:02
You've learned about Sparky as well. You can do whatever you like. You can query it with your ELK stack and so on and so forth. You can do everything. But it's, again, it's in the direction of postmortems and learning a little how your system behaves actually. But what is necessary is like, you know, today is only yesterday's tomorrow.
25:22
We need much more forecasting to prevent issues, and the forecasting should be real, accurate. This is quite a problem, but weird enough, nobody really takes care about this in this whole world of monitoring. Everything is speaking of math with big data. That's where the money is.
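A minimal example of forecasting forward instead of explaining the past: fit a linear trend by least squares and extrapolate when a disk hits 100%. The data is invented and real forecasting needs far richer models; this only shows the shape of the idea.

```python
# Disk usage in percent, one invented sample per hour, growing steadily.
usage = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]

# Ordinary least-squares fit of usage = slope * hour + intercept.
n = len(usage)
xs = range(n)
mx = sum(xs) / n
my = sum(usage) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, usage))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Extrapolate: when does the fitted line reach 100%?
hours_until_full = (100 - intercept) / slope
print(slope)              # 2.0 -- percent per hour
print(hours_until_full)   # the alert you actually want: "full in N hours"
```

"Disk will be full in 30 hours" is actionable before the crash; "disk was full yesterday" is forensics.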
25:42
But on the other hand, we need to do much more in order to prevent systems from crashing, from misbehaving. Okay, another thing is ghost in the machine is probably nothing that has a meaning at the moment.
26:01
So those of you who are on Twitter or whatever social network you use, do you remember this discussion with this weird dress? I mean, I didn't even follow it. It's weird. So the discussion was about what color actually this dress has. Is it gold or is it blue whatsoever?
26:24
Yeah, well, our eyes and our brain is a very complex area. But speaking of Schrodinger's cat, it's like the general simplified idea is, we have a cat in this box and we don't know if it's dead or alive.
26:46
And this is where I want to mention that we speak more and more about immutable infrastructures. We speak of machines that can be added to a data store and then disappear.
27:05
Nobody would really care about this. Look at Riak and Cassandra and data stores like this. Nobody would care about this because the majority still works, you still have your quorum satisfied and so on and so forth. So that means that we have much more flexibility about things that can die
27:21
and come back and disappear again where the current monitoring solutions or the current monitoring world have no real good support yet for that, for this situation. It's like everybody is drawing a map of things that are there. And when you start killing things, returning things, it's suddenly new things that pop up.
27:46
So it's just a replacement of the other one that was previously running there to satisfy this weird Erlang guy here. It's not only about Erlang.
28:03
I'm not evangelizing anything here. I just suggest that you look at the concepts there. This is the most important thing about this conference, what I've learned here. You just get introduced to concepts. Whatever you do with the concepts, it's your job. But one of the concepts from the Erlang world and now the Akka world and so on,
28:21
the idea is just simple, you just cut everything into very small pieces and those small pieces are allowed to crash independently, come back again through a configured strategy. They come back, they crash again, they come back. So whenever I mention let it crash these days, I also mention bring it back
28:40
because ops people don't like the word crash, and this is something that I stumbled upon with ops people that I've been teaching how to operate Riak, for example. It's like every second there is a log entry: crash, crash, crash, crash. So they actually configured Nagios to react to crash as a word in the log.
29:02
So this thing just was going crazy all the time. Okay, the other thing is that identifying something by the IP address or a name or whatever, name is probably not the best example but IP address, it can change but this will be still the same thing.
29:21
So we need better methods of identification or even similarity checks that are again a little bit more mathematics there, like saying this thing that is appearing now here, is this a replacement for the other thing that has disappeared a couple of minutes ago? Because otherwise you just get confronted with two of them
29:41
and you in your brain need to sort of satisfy this relationship, like okay, A is equal to B, yes, but the tool is stupid, it doesn't know it. The other thing is alerting. Who's on duty from time to time?
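The similarity check just mentioned could be as simple as comparing attribute sets instead of identities like IPs. This is only a sketch; every host name and attribute below is invented.

```python
def jaccard(a, b):
    """Similarity of two attribute sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# The host that vanished, and the one that just appeared under a new IP.
old_host = {"role:web", "az:eu-1", "service:checkout", "image:v42"}
new_host = {"role:web", "az:eu-1", "service:checkout", "image:v43"}

similarity = jaccard(old_host, new_host)
print(similarity)        # 0.6: mostly the same attributes
print(similarity > 0.5)  # treat the newcomer as a replacement, not a stranger
```

With a rule like this the tool can satisfy the "A is equal to B" relationship itself instead of showing you two unrelated boxes.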
30:03
Like with a pager, with a classic pager, yeah, cool. Do you like it? Depends, right? I think that what we have with alerting in most of the systems is pretty binary.
30:21
That means that a thing is either dead or it's alive. You can turn it yellow, sort of, but it's still, yellow doesn't mean that I'm forecasting it will get red, it's just probably a threshold which is 60% of something of some resource that is yellow.
30:40
Yeah, okay, I can sleep on. 80% of something, I should better be getting up and cooking some breakfast because I probably will have to go to the office. Or I can do it remotely if I'm lucky enough.
31:05
That's when you get a false positive at 2 a.m. and need to get up to repair something. So, a lot more needs to be done. When we speak about this math and these models that are much softer than what we know from hard thresholds,
31:27
there is much more to do, to be done, in the modern tooling in order to prevent false positives and false negatives. And this is actually something that you would, all these false positives and negatives, when you have an idea how to actually, how to capture these false events,
31:46
you need to learn to train your algorithms with that, you have it as labels. Does anybody know how machine learning works? Sort of have an idea of it? You essentially have an algorithm and you train this algorithm with some stuff
32:01
and when you have a classifier, which is amazing for, of course, classification is amazing for some of the monitoring problems, as well as regression is amazing for the other part, but when we're in the classification, the best thing you can have is, I have labeled records or something like that, that say this is bad, this is good. This is amazing.
32:21
Because a machine can make a decision based on that and say, okay, we have like 94% probability that this is bad. And this is how you can decide. And this is where the monitoring needs to go to, into this softer area, because all that stuff, all that things get more and more and more and more complex and grow and grow and grow. It doesn't need to be Twitter or Facebook.
32:42
But everybody who's working with containers will start like working with minimal things and these minimal things are like much more than previously. It's not anymore these three VMs. It's like, it's now hundreds of containers running in parallel. The complexity grows and grows. And what actually should be done is that user supports this,
33:01
because this is the cheapest way how the classifier can learn is that the user will do something. Just say, okay, user tells you this is bad. I'm currently having a problem. I'm currently not having a problem. Which is then the next topic, this feedback loop. Currently, when you need to change some logic against all these false positives and so on,
33:26
for your case, in a, well, modern tool, yeah, you get instructions like IKEA cabinet.
33:41
So you actually, yeah, you can swear around, you can tell the machine, yeah, it cannot be the case. It cannot be good right now. But what you need to do right now, currently, is you actually code or configure. You have like a whole screen of configuration where you need,
34:03
okay, when this and that, but not this, then do that, and so on. This is not like, this is not the way how feedback works. The best feedback that you can get from somebody who knows, who understands how things work in terms of monitoring performance and so on.
34:20
It's just like one click. Okay, I can judge immediately. This is not a problem. Bang. So, and also this is, as I said, this is the basis for classification. This is amazing how much you can do when you allow your users to give fast feedback.
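As a sketch of how one-click feedback becomes training data: each click is a label, and even a trivial nearest-centroid classifier can use them. All features, labels, and numbers are invented; a real system would use far richer models.

```python
# Snapshots of (cpu, error-rate-ish) features, labeled by user clicks:
# "this is a problem" / "this is fine". All values invented.
labeled = [
    ((0.95, 0.90), "bad"),
    ((0.90, 0.85), "bad"),
    ((0.30, 0.02), "good"),
    ((0.35, 0.05), "good"),
]

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(sample):
    """Assign the label whose centroid is closest to the sample."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    groups = {}
    for feats, label in labeled:
        groups.setdefault(label, []).append(feats)
    cents = {label: centroid(pts) for label, pts in groups.items()}
    return min(cents, key=lambda label: dist(sample, cents[label]))

print(classify((0.92, 0.80)))  # looks like what users complained about
print(classify((0.28, 0.01)))  # looks like what users marked as fine
```

The cheap part is exactly what the talk describes: the user only clicks "problem" or "no problem", and the labels accumulate into something a machine can decide with.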
34:45
Speaking of intelligence: more and more tools are moving in this direction. Is anybody old enough, or weird enough, to know this guy?
35:04
No? Okay, John McLaughlin. Don't worry about him, he's a weird musician. The point of this picture is that he's trying to fit a square peg into a round hole.
35:23
So, whenever a monitoring solution is also expected to be the basis for business decisions, I would claim that's totally wrong, because it puts a lot of pressure and expectation onto people who are not really into business.
35:45
I mean, we are there to support the systems they run. Whatever the business is, we support it, but they shouldn't offload their work onto us, onto technicians. Business intelligence and monitoring do, mathematically speaking, go in the same direction.
36:04
Yes, fine. But predicting the behavior of gas in a bucket would also, mathematically, go in the same direction, and that doesn't mean they are the same thing. What the business actually requires from us is stable operations, nothing more.
36:25
Not that you can tell them what a business transaction is by looking at your logs and at the behavior of the systems. That is the wrong way to look at these things. So yes, I claim we shouldn't mix the two, and yet this is what I observe; everybody observes it, actually.
36:43
Of course, this is where the money is: tools are moving in the direction of business intelligence, claiming you can learn about your business by looking at system data. Wrong, you can't. You can if you add custom fields, but then it has nothing to do with classic monitoring anymore.
37:05
And this is probably the hardest part. When we nowadays want to implement a quite sophisticated monitoring solution, which
37:25
also keeps a continuous eye on performance and so on, we need all-rounders. We need people who understand the corresponding platforms very deeply: Unix itself, or the Microsoft world if you're on Windows,
37:42
all the hardware, all the virtualization, everything around it, playing every single instrument. The thing is, that's expensive. These people cost money, and there aren't many of them around, seriously.
38:00
I'm not claiming that nobody here is an expert in this area. I'm just saying that what I've seen, and this is actually what my previous company makes money with, is that you buy a real expert for two days, because you simply don't have time to look at these things yourself, and they solve the problems in your running system.
38:24
So yeah, it's expensive. Not Swedish expensive, Norwegian expensive, like the booze here. It costs a lot of kroner. Such experts are also rare, and that's what I mean by not even 1%.
38:45
And I think what needs to happen to improve this is that the solutions provided to us, the tools that help us monitor, shouldn't be as plain as they currently are.
39:04
Just drawing charts and maybe showing me one outlier makes it a tool for a real expert, who is really expensive, who earns their money in those two days of consulting on your production problems.
39:20
You should be able to do that on your own. So you need mechanisms in the tools that support you and perhaps make those very expensive experts obsolete entirely. I'm not sure that's possible, but in my opinion that's the direction it should go. So yes, this is what I mean when I say that monitoring sucks, and that's what I mean when I say it can actually be improved,
39:46
but it's a lot of work and a long way to get there, because monitoring currently is, or at least two years ago still was, like a forgotten child, an abandoned child. Nobody wanted to take care of it.
40:01
Now we have conferences around monitoring. People care more and more. Some of the mathematical assumptions are naive, but at least it's more than just looking at a simple 80% threshold for memory; tools are starting to look at actual distributions. It's going in the right direction, but it's a long way.
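The difference between a fixed 80% threshold and looking at distributions can be shown in a few lines. This is a sketch under simple assumptions: the host history, the helper names, and the three-sigma rule are all illustrative, and a real tool would use something more robust than mean and standard deviation.

```python
import statistics

def static_threshold_alert(mem_pct, threshold=80.0):
    """Classic rule: alert whenever memory crosses a fixed 80% line."""
    return mem_pct > threshold

def distribution_alert(mem_pct, history, z=3.0):
    """Distribution-based rule: alert only when the value sits far out
    in the tail of what this particular host normally does."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(mem_pct - mean) > z * stdev

# A cache host that normally sits around 85% memory usage:
history = [84.0, 85.5, 86.0, 84.5, 85.0, 86.5, 85.2, 84.8]
print(static_threshold_alert(85.0))       # True: a permanent false alarm
print(distribution_alert(85.0, history))  # False: this is normal here
print(distribution_alert(40.0, history))  # True: unusually low is also odd
```

The static rule fires forever on a host that is supposed to sit at 85%; the distribution-based rule learns what normal looks like for that host and flags deviations in either direction.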
40:21
We will get there. Thank you very much. We have plenty of time for Q&A. I'm not sure if Q&A is a concept here, but we have plenty of time.
40:49
Oh, yeah, I'm not here to sell anything. Can we make a deal? I can show you later a sort of visualization,
41:01
because, I mean, I wouldn't push the product we're building right now, because it's the wrong place to do so, definitely. But you can experiment with everything that you know from gaming. You know, they've sorted out how to do maps, how to do navigation in complex worlds,
41:20
and this is the direction it will go, definitely. It's no longer a static graph that just sits there and never moves. Any other questions? Oh my God, was it that bad? Yeah, just ping me, I'm at the conference until the end.
41:44
I'm still here tomorrow, though I don't expect anybody will talk to me then. I can give you a quick idea of what we're working on, but I've excluded it from the talk completely to keep it clean. I hope you got the ideas, got the vision, and every vision comes with targets on the way toward it.
42:03
It's small steps, and everybody will go in this direction, definitely, I'm pretty sure. So expect a lot of movement in the world of monitoring, and don't build on the really old grandpa tools anymore. There is much more out there that does its job well, seriously. Thank you very much.