
From Zero to Useless to Hero


Formal Metadata

Title
From Zero to Useless to Hero
Subtitle
Make Runtime Data Useful in Teams
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We introduced distributed tracing, central logging with trace correlation, and monitoring with Prometheus and Grafana from the very beginning of a large, internationally distributed software development project. The result: nobody used it. In this talk we share the good and not-so-good lessons we learned while introducing and operating the observability tools, and we show which extensions and conventions were necessary to carry out a cultural change and to spark enthusiasm for these tools. Today the tools are first-class citizens, and people complain loudly when they are not available.
Transcript (English, auto-generated)
Awesome. Well, I'd like to introduce the next speakers: Florian and Robert are going to talk about "From Zero to Useless to Hero". So take it away.
Yeah, thank you. So, as was already said, today we want to tell you how we took a lot of useless runtime data and turned it into somewhat more useful runtime data. My name is Robert.
And if this thing works, you can see me over there. That's Florian. You can look us up on Twitter, and we will be here after the talk, so let's have a chat about it. First, I will give you some context on where we did this. We had this project inside Deutsche Telekom where we basically built a complete voice assistant platform: we wanted something on which we can build our own voice assistants. The idea behind it was that instead of using the assistants you know, like Alexa and so on, we wanted our own platform so we could deliver something that is completely GDPR compliant, where the user has a lot of control over their data, and which gives them the best experience when interacting with other products from our ecosystem, like TV, telephony, or smart home.
Now, such ambitious projects are typically delivered within the so-called innovation hub at Deutsche Telekom, basically an innovation factory that produces such things, and we really take them from prototype to production, the whole way.
You can imagine that if you do that within a short timeframe, your team and everything around it grows immensely. We basically started at zero, with a very small team and a prototype, and in the end (I have some numbers for you) we came out with one international cooperation: we deliver this platform together with our telco friends from Orange. Over 900 direct collaborators on the project, probably many hundred more if you look at the bigger org chart. And finally, I think, over 500 active Git repositories that make up this platform, and over 100 unique services which all do something to deliver this voice assistant. So that's the context. Thank you.
So we started at time point zero, as a greenfield project: nothing to see for miles around.
Everything was crystal clear, what luck, and we could use the tools we wanted, so we could do hype-driven development. I promise the architecture I will show you on the next slide is not as beautiful as this landscape. But let me start: the smart speaker first needs some voice services, so we have a group of services for handling voice commands, applying natural language understanding and so on. We also need some services that interact with the devices, rolling out firmware versions and sending push notifications. In front of these two service groups we have two gateways, one handling admin access and one handling the API; admin meaning mostly administrative calls that are done by humans. Everything runs in a Kubernetes cluster, so we have this shiny, bright, brave new world of cloud-native things. It runs in some cloud; the provider doesn't only provide the virtual machines but also storage technologies, and in front of our cluster we have a proxy that terminates the SSL connection. What's missing is the skill; you will know skills from Alexa or similar. So if you ask for the weather forecast, we process the voice command, and if we figure out that we should call the weather skill,
we call the weather skill, and the weather skill knows where to ask for the current weather forecast, for example by calling an external weather provider. I lied when I said there was nothing to see for miles around: of course there is something in the Deutsche Telekom universe, for example the identity management system, or the content delivery network that we need to ship firmware versions to the speakers. So the speaker is one client, the second client is the mobile application, and we also need to build a secure software system, so we have tokens and certificates and two-factor authentication, everything that's necessary. I feel that is quite a complex architecture, and a complex architecture leads to a complex software system, because everything is distributed, and the more the application is distributed, the more pieces you need to watch and the more likely things are to fail. You also have latency, and you can't scale latency away, so you need to handle it, and you need to understand what happens when an error occurs. Therefore you need to collect all the things, combine them, and do analysis, and I feel that the analysis is also quite complex. But we are not the only ones facing this problem; there are others, and gladly there is tooling available. I feel that the default toolchain within the cloud-native world is really advanced.
For metrics we are using Prometheus and Grafana; metrics being sampling-based operational data. For textual, event-based operational data, the logs, we collect them with Fluentd, store them in Elasticsearch, and use Kibana to analyze them. And for distributed tracing, the spans, the units of work, which are also event-based, we are using Spring Cloud Sleuth, with Zipkin for exploration.
We started this project, I guess, four years ago, and since then we have made some changes to the initial setup. We replaced Zipkin with Jaeger, we want to roll out Humio in the next months to replace our logging stack, and we no longer run Grafana and Prometheus within our cloud platform; we use Grafana Cloud.
In the next step we thought: we need to combine all this information, we need to link it as closely as possible. It's different kinds of operational data, and because we're enterprisey, we simply came up with a solution called "standardize all the things", and everyone was really happy about the standards we rolled out in our development team. So let me present the generic standard random data smart hub service data model.
The name is just for this presentation, but I tried to visualize what it does. In the center there's the service, and the service has a context; the context being, for example, the user who issues the request, the tenant the user belongs to, or the endpoint that was hit by the user's request.
We simply want to ensure that this context is spread across all the technical monitoring data. So we defined a logging concept that defines what goes into the log and what must not be in the log, due to GDPR. We do structured logging, so we defined how the fields are named to ensure that we can correlate the logs of the different services simply by querying for a trace ID. We did the same for tracing: we defined tracing best practices, so parts of the service's context must also be modeled in the trace, for example in specific tag names.
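A minimal sketch of what such a standardized, structured log call could look like; this is our illustration, not the project's actual code. It assumes Logback with the logstash-logback-encoder for JSON output and Spring Cloud Sleuth, which puts the trace and span IDs into the MDC so they end up in every log line automatically; the field names are invented.

```java
// Hypothetical sketch: structured logging with uniform field names across services.
// Assumes logstash-logback-encoder (JSON logs) and Spring Cloud Sleuth (traceId/spanId in the MDC).
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import static net.logstash.logback.argument.StructuredArguments.kv;

public class SkillInvoker {

    private static final Logger log = LoggerFactory.getLogger(SkillInvoker.class);

    void invoke(String tenant, String endpoint, String skill) {
        // "tenant", "endpoint" and "skill" are illustrative field names; the point of the
        // convention is only that every service uses the same names, so a single Kibana
        // query (for example by traceId) correlates the logs of all services.
        log.info("invoking skill", kv("tenant", tenant), kv("endpoint", endpoint), kv("skill", skill));
    }
}
```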
We also give guidelines in terms of completeness: we want to ensure that every important feature within our services is covered by spans. And of course we also standardize metrics.
Every incoming and outgoing request exposes RED metrics following the RED principle: rate, errors, duration. Who here has heard of RED? Okay, quite a few. We also standardize database metrics, and the tags of these metrics always ensure that parts of the service's context are represented. And we don't stop standardizing there: we also standardize readiness and liveness checks. It's up to the developer to implement the check itself, but the surface is uniform.
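As an illustration of that request-metrics standard (and only an illustration, since the metric and tag names here are invented), a small helper like the following could record rate, errors and duration with the context tags attached, assuming Micrometer as the metrics facade:

```java
// Minimal sketch, assuming Micrometer; metric and tag names are illustrative only.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.concurrent.Callable;

public class RequestMetrics {

    private final MeterRegistry registry;

    public RequestMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    /** Wraps a call and records RED data (rate, errors, duration) with context tags. */
    public <T> T record(String endpoint, String tenant, Callable<T> call) throws Exception {
        Timer.Sample sample = Timer.start(registry);
        String outcome = "success";
        try {
            return call.call();
        } catch (Exception e) {
            outcome = "error";                          // the error rate is derived from this tag
            throw e;
        } finally {
            sample.stop(Timer.builder("requests")       // rate and duration come from the timer
                    .tag("endpoint", endpoint)          // parts of the service context ...
                    .tag("tenant", tenant)              // ... are always present as tags
                    .tag("outcome", outcome)
                    .register(registry));
        }
    }
}
```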
The readiness and liveness checks also expose metrics themselves, for example an error count for failed checks. And we defined that every request, successful or not, must return the trace ID, so in any case we have this small piece of information that's necessary to jump into our runtime data lake and find the runtime data that is useful.
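As a hedged sketch of that last rule (every response carries the trace ID), a thin servlet filter could look roughly like this. The Tracer is Brave's, which Spring Cloud Sleuth uses underneath; the header name and the filter itself are our own illustration, not the project's convention.

```java
// Hypothetical sketch: always return the current trace ID to the caller.
// Assumes Brave (the tracer behind Spring Cloud Sleuth) and the javax servlet API.
import brave.Tracer;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

public class TraceIdResponseFilter implements Filter {

    private final Tracer tracer;

    public TraceIdResponseFilter(Tracer tracer) {
        this.tracer = tracer;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Set the header before the body is written, so it is present for success and error alike.
        if (tracer.currentSpan() != null) {
            ((HttpServletResponse) response)
                    .setHeader("X-Trace-Id", tracer.currentSpan().context().traceIdString());
        }
        chain.doFilter(request, response);
    }
}
```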
Well, we forced all the other team members to implement these things, the metrics started coming in, and we were really happy: everything works.
So we understand our software system, we can identify the root cause, and, yeah, we thought this would solve all our problems and they would love it. How could we be so wrong? And yet the reason is simple: our team is colorful. It consists of different experts: testers, product management, data scientists, developers, developers, developers, operation heroes. And if you look at our solution, we built a monochrome solution. It fits us, but not the others, and that's the problem.
So that was the time when it was useless, because we were monochrome. Robert and I are platform developers; we built a solution for us. We know a bit about how the Prometheus query language works and about the Grafana user interface; we know what's in the data and which tags can be used to correlate logs with metrics and traces and all those things. (And I'm really looking forward to Grafana implementing the correlation between metrics and traces.) So this was only a small benefit for the team, because only we used it; the hurdle was too high for everyone else.
So we looked outside the box. This is from "About Face", the interaction design book by Alan Cooper. It simply says that everyone starts as a beginner when using a product; after some time they get better and become an intermediate user; and if they keep using the tool, they eventually become an expert. There are only a few experts, not because people are stupid, but because people do not have the time to use the tool. And our toolchain was simply an expert toolchain. What we needed to do was implement means that allow more people to become intermediate users.
And now it's your turn to present the rest of the talk. Thank you.
Thank you. Yeah, so that slide should look like we put a lot of thought into it. But we didn't in the beginning, and that was our problem. Basically, we had a lot of utility. Is this microphone on? Maybe I have to talk a little bit louder. Thanks. So we had a lot of utility, as Florian said, which means we had a lot of interesting metrics, and we knew that somehow people should want to use them. But on the usability side we really fell flat: we didn't think about how the individual roles in the project would use this data and what their workflows really are. And that's what we tried to tackle.
Now I will show you some examples of what we did to, first and foremost, improve the accessibility of the runtime data. Meaning: how can we make it really easy for our users (developers, testers, operations people) to get started with the metrics and jump into them without thinking about complex query languages or how to transition from one tool to the other.
One simple thing we did here is basically just deep linking. This is a Grafana dashboard with one graph showing, for example, whether a test run was successful; we run tests continuously to see that. Here you can see that I can click this little info button and get a tooltip, and in that tooltip there can be a description and also links, deep links, to other tools. So when I see, for example, that a test is failing and the red line starts to go up, I can directly jump to the test reports, and I can also directly jump to the system logs. And when I do that, there's already a query that, for example, selects all the logs that are somehow related to those tests. That makes it super easy to get started in one tool and then jump to the next.
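For the curious, such a deep link is really just a URL with the query baked in. A hypothetical sketch of building one: the Kibana Discover path and state-parameter format vary between versions, and the host, index and field names here are invented.

```java
// Hypothetical sketch of a "jump to the logs of this test run" deep link.
// The Kibana Discover URL layout (path, rison-style _a parameter) differs between
// versions; host and field names are invented for illustration.
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public final class DeepLinks {

    private static final String KIBANA_DISCOVER = "https://kibana.example.internal/app/discover#/";

    /** Returns a link that opens the log view already filtered to one test run. */
    public static String logsForTestRun(String testRunId) {
        String kql = "labels.testRunId:\"" + testRunId + "\"";
        String state = "(query:(language:kuery,query:'" + kql + "'))";
        return KIBANA_DISCOVER + "?_a=" + URLEncoder.encode(state, StandardCharsets.UTF_8);
    }

    private DeepLinks() {
    }
}
```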
It's actually very low-hanging fruit, because it's super easy to do. Here's another example: we have an in-house pipeline tool to manage software promotions between stages. It's not a full CI/CD system; in the background it's still just GitLab and GitLab CI, it's more like a specialized front end. And again the theme is: how can we make it very easy for the users, in this case the developers, to understand the software promotions and, in the same view, get some runtime data about their software. What you see here is that I can promote my software, for example the admin management service, from one stage to the other, but I can also click that status button. When I do that, I get this view, and in this view I see typical things like, okay, that's the image tag that's currently deployed; of course we're using Kubernetes. But I also see some other high-level metrics; for example, I can immediately spot whether my application is restarting. And again I have deep links, you see them in the middle, to find, for example, the traces or the logs for this particular piece of software.
So again, for a developer who uses this basically every day, it's very simple to go from there into the observability stuff. Here's another example: we have a landing page, we call it the gangway, which means that when you log in as an administrator to one of our platforms, you get this overview: short links to all the diagnosability tools, but also some quick actions. Here you can see that if I want to search for a certain trace ID, I can just put it in there, click the button below, and again I get a deep link that already describes which field to filter on. So I don't even have to think: what is the trace ID field actually called? Is it named differently? I don't need to know, I can just click. We do the same for the user ID, because that's also a common use case: I want to do a lookup for a certain user who complained about a problem.
Here's another use case that's very cool; this one is more end-user-driven. Imagine a customer calls our hotline and says something is not working with their speaker. Here we again have an integration with traces and logging, because the support agent can ask the user: okay, to fix your problem we want to gather more data, we want to activate tracing for you on every request and write debug logs; am I allowed to do that? That's a typical GDPR thing you want to do. The support agent can then activate this through the web interface they use to troubleshoot the user's problem, and the cool thing is that, in the end, we're again producing the data that developers are accustomed to from development: traces, logs, and so on that they already know.
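One way such a per-user switch could be wired, purely as our sketch and not necessarily how the talk's platform does it, is for the gateway to force trace sampling and flag debug logging on the requests of users who opted in, assuming Zipkin/Brave-style B3 header propagation downstream:

```java
// Hypothetical sketch: force tracing and debug logging for users who gave consent.
// Assumes B3 header propagation (Zipkin/Brave style) downstream and services that honour
// a custom debug header; the consent store and header names are invented for illustration.
import java.util.Map;
import java.util.Set;

public class TroubleshootingHeaders {

    private final Set<String> optedInUsers;   // filled by the support tool, with the user's consent

    public TroubleshootingHeaders(Set<String> optedInUsers) {
        this.optedInUsers = optedInUsers;
    }

    /** Decorates the headers of the request the gateway forwards downstream. */
    public void decorate(String userId, Map<String, String> outgoingHeaders) {
        if (optedInUsers.contains(userId)) {
            outgoingHeaders.put("X-B3-Sampled", "1");        // keep the whole trace
            outgoingHeaders.put("X-Debug-Session", "true");  // services may raise their log level
        }
    }
}
```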
When you do that, this could be one of the outcomes: you have a problem with a user in production, you gather the runtime data because the user agreed, and then somebody from testing or operations raises an internal bug ticket saying something is not working, and there you have the magic link for everybody to understand and get some context, because it basically contains the trace ID of where things went wrong. As you saw with our tools before, this makes it possible for everybody to just take that trace ID and get all the context. So that's very easy.
And finally, as a last example: the ideas so far were catering to developers, operations people, and first-level support, and this one is something very cool that I think caters to everybody. At some point we had the idea: we're building a voice assistant, and it's hard to always have the device at hand for testing and so on, so there need to be other ways to do it. So we thought, okay, we can build a chat bot. That's our Mattermost, the chat tool we're using, and you can basically use it to interact with the assistant. And again, you see that if something goes wrong in the interaction, you get deep links to our diagnosability tools, so you can easily match what you just did to the runtime data you now need to figure out what's wrong. This was very helpful, and the cool thing about it is also that you get learning by example: as people use it in the chat, you can watch that and understand how to actually use the system.
So we saw some changes in our culture. I won't go over them in detail, but in general you can say that visibility and trust increased a lot.
Each developer has more awareness of their own software, because it's easy to get runtime data, and ownership and error culture definitely improved. Everything is visible to everybody in the project, so you can't hide anything, and that basically changes how you think about failures.
And like this we basically went from zero to, kind of, hero usability on that part, and we made it happen because we listened to our users, our internal people, our teams, and built use cases for them. So this is our last slide, basically our golden advice: if you want to be successful with diagnosability, select your toolchain, standardize all the things as Florian said, link and combine the stuff so it's easy for people to jump between tools, and integrate it into their everyday life, the everyday tools and processes they use, and then they will love you. And maybe you will also love them, because they will ask fewer silly questions. And that's the end.
Thank you. So first I would just ask you to please remain seated. We have some time for questions, so don't raise your table yet, because it's noisy. First of all, do we have any questions? Just raise your hand and I'll come running. You're tired, aren't you? Oh, there is one, I'll be right there.
Hi, nice presentation. This is, I believe, a difficult topic to deal with.
How long did it take you to standardize and basically implement all this everywhere?
I think the standardization had multiple phases, but we started out with the standardization of metrics, logs and so on very early, which was definitely a good choice; I think it's very hard to do this later. All the usability stuff you saw on top, like how to wire things together, took a lot of iterations, and I think we probably burned a lot of money and time building Grafana dashboards, figuring out that the users don't use them, throwing them away, and trying again. But that's something you can do very easily with Grafana, so it was fine.
Anyone else? Going once... Oh, there it is.
So that's a very nice success story, but something must not have gone the way you... You already had something that didn't go as intended and then you fixed it, but was there anything that didn't work at all, where you really had to say, we don't understand how to do this? Is there a big failure that...?
A big failure, let me think. In general, if you want to talk about challenges: for us, honestly, it was a challenge to get the tools in the first place. You might think, yeah, DT, big company and so on, but we actually struggled to get all those tools and to explain to management that we need these diagnosability tools in this distributed world of microservices. That was hard in the sense that you can only ask management; you can't fix it yourself. Everything else was basically under our control and just meant iterating, so that didn't feel hard.
Yeah, and we simply installed all the tools and thought, well, this is a good starting point, and what we figured out was that nobody except us was using the tools, and people came to us with the same silly questions every day. So yes, we installed everything, and from a cultural perspective nothing worked.
Okay, thank you very much. I'll see you in 15 minutes for the next talk. Thank you.