
Practical introduction to OpenTelemetry tracing


Formal Metadata

Title
Practical introduction to OpenTelemetry tracing
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Tracking a request’s flow across different components in distributed systems is essential. With the rise of microservices, their importance has risen to critical levels. Some proprietary tools for tracking have been used already: Jaeger and Zipkin naturally come to mind. Observability is built on three pillars: logging, metrics, and tracing. OpenTelemetry is a joint effort to bring an open standard to them. Jaeger and Zipkin joined the effort so that they are now OpenTelemetry compatible. In this talk, I’ll describe the above in more detail and showcase a (simple) use case to demo how you could benefit from OpenTelemetry in your distributed architecture.
Transcript: English (auto-generated)
Hi, everybody. Thanks for being here for this talk. That's a lot of people. I'm Nicolas Frankel. I've been a developer for a long time. And I would like to ask, how many of you in this room are developers? Quite a lot. Who are ops? Just as many. And who
are DevOps, whatever you mean by it? Yeah! So this talk is actually intended for developers, because I was, or I still think I am, a developer. So if you are an
ops person and this is not that interesting for you, at least you can direct your developer colleagues to the talk so that they can understand how they can ease your work. Well, perhaps you've never seen that, but I'm old or
experienced, depending on how you see it. And when I was starting my career, monitoring was like a bunch of people sitting in front of screens the whole day. And actually, I was lucky. Once in the south of France, I was told,
hey, this is the biggest monitoring site in all of France. And actually, it really looked like this. And of course, there were people watching it. And that was the easy way. Now I hope that you don't have that anymore, that it has become a bit more modern. Actually, there is a lot of talk now about
microservices, right? Who here is doing microservices? Yeah. Yeah, because if you don't do microservices, you are not a real developer. But even if you don't do
microservices, so you are not a real developer, and I encourage you not to be a real developer in that case, you are probably doing some kind of distributed work. It has become increasingly difficult to just handle everything locally. And the problem becomes: if something bad happens, how can you locate how it
works? Or even if something works as expected, how can you understand the flow of your request across the network? I love Wikipedia. And here is the observability definition by Wikipedia, which is long and, in this case, not
that interesting. So I have a better one afterwards for tracing. So basically, tracing helps you to understand the flow of a business request across all your components. Fabian, where is Fabian? Fabian is here, and he talked a lot
about metrics and logging. So in this talk, I will really focus on tracing, because my opinion is that, well, metrics are easy. We have been doing metrics for ages: we take the CPU, the memory, whatever. Now we are trying to
get more business-related metrics, but it's still the same concept. Logging also. Now we do aggregated logging. Again, nothing mind-blowing. Tracing is, I think, the hardest part. So in the past, there were already
some tracing pioneers. Perhaps you've used some of them. And well, now we are at the stage where we want to have something more standardized. So it starts with the Trace Context from the W3C. And the idea is that you start the trace
and then other components will get the trace and will append their own spans to it. So it works very well in a web context. And it defines two
important concepts that Fabian, I think, already described. So now I am done. So I have the same stupid stuff. So here you have... Oh, sorry. Yes. It
reminds me of a story. I did the same to my colleagues. They didn't care about the presentation. They only remembered that. Okay. So here you have a trace. And here you have the different spans. So here the X1 is the
parent one. And then the Y and the Z1 will take this X span as their parent span. So this is a single trace. This is a single request across your services. Web stuff is good, but it's definitely not enough. And so
for that we have the OpenTelemetry stuff. OpenTelemetry is just a big bag of miracles all set into a specific project. So it's basically APIs, SDKs, tools, whatever, under the OpenTelemetry label. It implements the
W3C Trace Context. If you have been doing some kind of tracing before, you might know it, because it's the merging of OpenTracing and OpenCensus. The good thing is that it's a CNCF project. So basically there is some
hope that it will last for a couple of years. The architecture is pretty simple. Basically you've got sources, you've got the OpenTelemetry protocol, and as Fabian mentioned, you dump everything into a collector. The collector should be as close as
possible to your sources. And then some tools are able to read data from it and to display it in the way that we expect to see it. What happens after the OpenTelemetry collector is not a
problem of OpenTelemetry. There are collectors that are compatible, and for example, you can use Jaeger or Zipkin in a way that allows you to dump your OpenTelemetry data into Jaeger or Zipkin in the OpenTelemetry format. So you can reuse, and that is very important, you can reuse
your infrastructure if you're already using those tools, by just switching to OpenTelemetry. And then you are using a standard, and then you can switch your OpenTelemetry backend with fewer issues. Now comes the fun developer part.
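As a rough illustration of the W3C Trace Context idea described earlier: the trace is carried between components in an HTTP header named `traceparent`, made of four dash-separated lowercase hex fields (version, trace ID, parent span ID, flags). The sketch below uses only the Python standard library. The helper names `parse_traceparent` and `child_of` are made up for this example and are not part of any OpenTelemetry API, and the sketch skips some checks from the specification, such as rejecting all-zero IDs.

```python
import re
import secrets
from typing import NamedTuple

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str

    def header(self) -> str:
        """Serialize back to the header value."""
        return f"{self.version}-{self.trace_id}-{self.parent_id}-{self.flags}"


def parse_traceparent(value: str) -> TraceParent:
    """Split an incoming header value into its four fields."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    return TraceParent(*match.groups())


def child_of(parent: TraceParent) -> TraceParent:
    """Keep the trace ID, mint a new span ID for the outgoing call."""
    return parent._replace(parent_id=secrets.token_hex(8))


incoming = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
outgoing = child_of(incoming)
print(outgoing.header())
```

Every component repeats the same step: it keeps the trace ID it received and replaces the parent ID with the ID of its own span, which is how the X, Y and Z spans end up in one trace.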
If you are a developer, you're probably lazy. I know, I'm a developer. So the idea is that OpenTelemetry should make your life as a developer as easy as possible while helping your colleagues diagnose your problems. And the easiest
part is auto-instrumentation. Auto-instrumentation is only possible in cases where you have a platform, where you have a runtime. Fabian mentioned Java. Java has a
runtime, which is the JVM. Python has a runtime. Now if you have Rust, it's not as easy. So in that case, you are stuck. My advice: if you are using a runtime, and probably most of you are using such runtimes, whether
Java or whatever, use it. It's basically free. It's low-hanging fruit, and there is no coupling. So basically you don't need extra dependencies as developers in your projects. So since it's called a practical introduction, let's do some practice. So here I have a bit better
than a hello world. I have tried to model an e-commerce shop with very simple stuff. It starts just by asking for products. I will go through an API gateway, which will forward the request to the catalog, and the catalog doesn't know about the prices, so it will ask
the pricing service for the prices, and it will ask the stock service for the stocks. The entry point is the most important thing, because it gives the parent trace. Everything will derive from that. So in general,
you have a reverse proxy or an API gateway, depending on your use case. I work on the Apache APISIX project. It uses the Nginx reverse proxy. On top you have OpenResty, because you want to have Lua to script and to auto-reload the configuration. Then you have
lots of out-of-the-box plugins. Let's see how it works. Now I have the code here. Is it big enough? Good. So I might be very old, because for me it wouldn't be. Okay, here that's my architecture. I'm using
Docker Compose, because I'm super lazy. I don't want to use Kubernetes. So I have Jaeger. As I mentioned, I'm using the all-in-one image, so I don't need to think about having the OpenTelemetry collector and the web UI to check the traces. I have
only one single image. Then I have APISIX. Then I have the catalog, which I showed you. Of course, I have a couple of environment variables to configure everything. I wanted to focus on tracing, so no metrics, no logs. I'm sending everything to Jaeger,
and then I do the same for pricing, and I do the same for the stock. Normally at this point, I have already started everything, because in general I have issues with the Java stuff. So here I'm doing a simple
curl to the product. I've got the data, which is not that important, and I can check on the web app how it works. So here I will go on the Jaeger UI. I see all my services. I can find the traces. Here you can find the latest one, and here is
the thing. If I click on it, it might be a bit small, right? I cannot do much better. You can already see everything that I've shown you. So I start with the product from the API gateway. It forwards it to the product, to the catalog. Then
I have the internal calls, and I will show you how it works. Then I have the GET request made from inside the application, and then I have the stocks that respond here. Same here, and here we see something that was not mentioned on the
component diagram. From the catalog to the stock, I go directly, but from the catalog to the pricing, I go back through the API gateway, which is also a way to do that, for whatever reason. And so this is something that was not mentioned on the PDF, but you cannot cheat with
OpenTelemetry. It tells you exactly what happens and the flow, and the rest is the same. So regarding the code itself, I told you that I don't want anything to trouble the developer, so here I have nothing regarding OpenTelemetry.
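The flow shown in the Jaeger UI can be thought of as a tree of spans linked by parent IDs. Here is a minimal sketch of that data structure in plain Python. The span IDs, operation names and paths are invented to mirror the demo (catalog calling stock directly, but reaching pricing through the gateway); they are not the actual output of the demo.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str


# Hypothetical spans mirroring the demo flow.
spans = [
    Span("a", None, "gateway", "GET /products"),
    Span("b", "a", "catalog", "GET /products"),
    Span("c", "b", "stock", "GET /stocks"),
    Span("d", "b", "gateway", "GET /prices"),
    Span("e", "d", "pricing", "GET /prices"),
]


def render(spans: list[Span]) -> list[str]:
    """Group spans by parent and walk the tree depth-first."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append(f"{'  ' * depth}{span.service}: {span.name}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


print("\n".join(render(spans)))
```

The second `gateway` entry nested under `catalog` is the detour that was missing from the component diagram: it only shows up because each span records who its parent was.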
If I write otel, you see nothing. If I write telemetry, you see nothing. I have no dependency. The only thing that I have is my Dockerfile, and in my Dockerfile, I get
the latest OpenTelemetry agent. So you can have your developers completely oblivious, and you just provide them with this snippet, and then when you run the Java application, you just tell them, hey, run with the Java
agent. Low hanging fruit, zero trouble. Any Java developer here? Not that many. Python? Okay, so it will be Python. Just the same here. Here it's a bit different. I add
dependencies, but actually I do nothing with them. So here I have no dependency on anything. Here I'm using a SQLite database because, again, I'm lazy. I don't care that much. But here I have no dependency on, no API
call to OpenTelemetry. The only thing that I have is in the Dockerfile again. I have this. Again, I'm using a runtime. It's super easy. I let the runtime intercept the calls and send everything to
OpenTelemetry. And the last fun stuff is Rust. Any Rust developer? Please don't look at my code too much. I'm not a Rust developer, so I hope it won't be too horrible. And Rust is actually, well, not
that standardized. So here I don't have any runtime. So I need to make the calls by myself. The hardest part is to find which library to use depending on which framework you use. So in this case, I found one, and perhaps there are better
options. But I found this OpenTelemetry OTLP stuff. And here, because I'm using Axum, I'm using this library. And so far it works for me. I don't need to do a lot of stuff. I just, like,
copy-pasted this stuff. Copy-paste developer. And afterwards, in my main function, I just need to say this and this. So I added two layers. So if you don't have any platform, any runtime, you actually need your
developers to care about OpenTelemetry. Otherwise, it's fine. Now, we already have pretty good results. But we may want to do better. So we can also ask the developers, once they are more comfortable, to do manual
instrumentation even in the case when there is a platform. Now I will docker-compose down. And it takes a bit of time. I will prepare this. And on
the catalog side, now I can have some additional code. So this is a Spring
Boot application. What I can do is add annotations. I noticed there were a couple of Java developers, so it's the same with Kotlin. It's still on the JVM. So basically, I'm adding annotations. And because Spring Boot can
read the annotations at runtime, it can add those calls. So I don't have to call the API explicitly. I just add some annotations, and it should be done. On the Python side, I import this
trace stuff. And then I can, with the tracer, add some, again, explicit traces. So, internal traces. And from the Rust point of view, because I already did it explicitly, it worked. And now you can see that I am in deep trouble, because this has happened many times. The
Java application doesn't start for a demo, and that's really, really fun. So I will try to docker-compose down the catalog. And docker-compose, hey, what happens? Dash? Are you sure? No, no, no, no, no, no, no, no!
Not with the new versions. Oh, really? Yes. That's fine. We are all here to learn.
Stop. Thanks. The stress, the stress. Yeah. Honestly, if there is any, like, person here able to tell me why this
Java application sometimes has issues starting, because I've added one gig at the beginning, and it's stuck always here. So I can tell you what you
should see normally. If I'm lucky, I made a screenshot, and yes, here, but it's the beginning, it's the Rust one. So here, this is what you can have in
Python. This is what I added explicitly. I have five minutes. Well, if the demo doesn't work, it will be much better. Then I won't have any problems with the timing. Here, you can see that this is the trace that, yeah, this is a trace that I did
manually in Python. And here, we can see that I filled the ID with the value. And on the Java side, again, nope, nope. I think it will be here.
This is not the manual stuff that I added. Yes, it is. You have the fetch here. You have the fetch here. So this is the span that I added manually. I'm afraid that at this point, the demo just refuses to work.
Yes, it's still stuck. I will stop there. I won't humiliate myself further. When it's done, it's done. Perhaps if you are interested, you can follow me on Twitter. You can follow me on Mastodon. I don't know what's the ratio. More importantly, if
you are interested in the GitHub repo, to do that by yourself, perhaps with a better configuration of Docker Compose with the right memory, it would work. And though the talk was not about Apache APISIX, well, have a look at Apache APISIX. It's an API gateway, the Apache way. Great. Are there some questions now? I never
got so much applause with a failing demo. Please remain seated. Please remain seated so we can have a Q&A. Who had a question? Thank you. Very good talk. I have two questions. So
one is about... Let's start with the first one. Right. Yes, yes, yes. How much overhead does this bring in Python and Java or Rust? How heavy is this instrumentation? That's a very good question. And the overhead of each request depends on your own infrastructure. But I
always have an answer to that. Is it best to go fast and you don't know where you are going? Or to go a bit slower and to know where you are going? I think that whatever the costs, it's always easy to add additional resources
and it doesn't cost you that much. Whereas debugging an incident across a distributed system can cost you days or even weeks in engineering costs. And you are very, very expensive, right? Okay. Thank you. And the second one is: have you encountered any funny issues with multi-threading or multi-processing? Something
like when your... Can you come closer to your mic? Your server just now was not starting. So with some software, when you have multi-threading or multi-processing, have you encountered any issues where the instrumentation caused you trouble? This is
not production stuff. This is just better than the hello world. So I cannot tell you about production issues. You should find people who have these issues. As I mentioned, it's a developer-oriented talk. So it's more about pushing the developers to help ops do their job. For production issues, I must admit I have no clue.
Hi. In the case of a runtime, does it always work, even with a badly written application? I mean, how bad can an application be before it stops working? I'm not sure I understood the
question. So how often do you need to do it before it stops working? No, no. I mean, let's say I use deprecated libraries, bad clients, something that doesn't work as it's supposed to from the instrumentation perspective. I mean, I make requests to the network using a UDP client,
something I've written myself, some custom stuff. I'm imagining that the instrumentation sits between some layer of the network, which is going to the
internet, for example. And so how bad can I be before it stops telling a request from junk? You cannot be bad. Okay. Well, it's a moral issue first, but then on the platform side,
the auto-instrumentation works with specific frameworks and tools. It's those frameworks and tools that know how to check what happens and to send the data to OpenTelemetry. So if you don't play in this game, nothing will be sent. Okay.
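The answer above can be pictured with a small sketch: an instrumentation wraps the methods of libraries it knows about, so code that bypasses those libraries is simply invisible to it. This is a simplified model in plain Python, not how any particular agent is implemented. `KnownHttpClient`, `instrument` and `homemade_udp_request` are hypothetical names, and the `calls` list stands in for spans sent to a collector.

```python
import functools

calls = []  # stand-in for spans sent to a collector


class KnownHttpClient:
    """Pretend third-party library that an instrumentation knows about."""

    def get(self, url):
        return f"response from {url}"


def instrument(cls, method_name):
    """Wrap one method of a known class so every call is recorded."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        calls.append((cls.__name__, method_name, args))
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)


def homemade_udp_request(host):
    """Custom code that no instrumentation was written for."""
    return f"datagram to {host}"


instrument(KnownHttpClient, "get")  # roughly what an agent does at startup

KnownHttpClient().get("http://pricing/prices")
homemade_udp_request("10.0.0.5")
print(calls)
```

Only the call through the known client is recorded. The homemade request still works, it just leaves no trace unless someone adds manual instrumentation around it.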
So on the manual instrumentation side, it's an explicit call. So it depends on what you want to send. Yeah. I was thinking of auto-instrumentation. So let's say I do DNS resolution by myself, and then I just throw a request to an IP. Let me
show the Python stuff here. This is what I showed you in the screenshot. This is what I write, and these are the attributes that I want to have, right? So basically if here you
have something that is completely unrelated, it's up to you. That's why it's easier to start with auto-instrumentation. And then once you get a general overview of what you have, and your ops start saying, hey, perhaps we want to have more details here,
then you can come with manual instrumentation. But start with the less expensive stuff. I didn't really answer the question. I understand it, but that's the best I can do regarding it. Sorry.
Okay, and then, for the agent you're using in the Dockerfile, how can you configure it, for example, to export the traces to Jaeger or other stuff? Regarding the Dockerfile, sorry? Yeah, how can you configure
the agent to send the traces, for example, to Jaeger or other... The Dockerfile doesn't mention where you send it. The Dockerfile just says, hey, I will use OpenTelemetry. And it's during configuration, in the Docker Compose file, where I'm using
agreed-upon environment variables, where I'm saying you should send it here or here, or you should use logging or tracing or metrics or whatever. So it's very important to separate those concerns. On one side, in the Dockerfile, in the image, you say, hey, I'm ready for OpenTelemetry. And when you actually deploy it,
you say, okay, OpenTelemetry will go there for the metrics and there for the tracing, and for logging, I will disable it or whatever. Thank you for... Sorry, go ahead. Sorry. And then you have a Docker image that is reusable.
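A small sketch of that separation of concerns: the image stays generic, and the deployment decides through environment variables where each signal goes. The variable names below are the standard OpenTelemetry ones. The function `resolve_telemetry`, the fallback values and the example host `jaeger` are assumptions made for this illustration and are not taken from the demo's Compose file.

```python
from typing import Mapping


def resolve_telemetry(env: Mapping[str, str]) -> dict:
    """Work out, from deployment-time variables, where each signal goes."""
    # Fallback values here are assumptions for the sketch.
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    signals = {}
    for signal in ("traces", "metrics", "logs"):
        exporter = env.get(f"OTEL_{signal.upper()}_EXPORTER", "otlp")
        # "none" is the conventional value for switching a signal off
        signals[signal] = None if exporter == "none" else exporter
    return {
        "service": env.get("OTEL_SERVICE_NAME", "unknown_service"),
        "endpoint": endpoint,
        "signals": signals,
    }


# What a deployment might set for one service (hypothetical values).
deploy_env = {
    "OTEL_SERVICE_NAME": "catalog",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://jaeger:4317",
    "OTEL_METRICS_EXPORTER": "none",
    "OTEL_LOGS_EXPORTER": "none",
}

config = resolve_telemetry(deploy_env)
print(config)
```

In a real service one would pass `os.environ` instead of a hand-written dictionary. The same image can then be deployed with tracing only, as here, or with all three signals, without being rebuilt.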
Thank you for being good FOSDEM citizens and remaining seated. Next question. Thank you for your presentation. So my question is, does OpenTelemetry support error handling like Sentry? If not, are there any plans to do that?
It's really useful to catch crashes and capture the context of the crash. So that's it. Thank you. When you say crashes, do you mean crashes of OpenTelemetry itself or of the components that are under watch? Yeah, of the application
that's monitored, yeah. Well, Fabian showed you how you could log and bind your traces and your logs. So you could have both here. My focus was just on tracing, but you can reuse the same GitHub repo
and just, here, put the logs somewhere in, I don't know, Elasticsearch or whatever. No, because it's not a sponsored room. And then you can introduce some
errors and then you can check how the two are bound and you can drill down to where it failed. Okay, thank you.
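One common way to bind logs to traces is to stamp every log record with the ID of the active trace, so that an error line can be looked up in the tracing backend. The sketch below shows the idea with Python's standard `logging` module only. It is not what was shown in either talk: `TraceIdFilter` is a made-up name, and the trace ID is hard-coded where a real setup would read it from the current span.

```python
import io
import logging


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id() or "-"
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
# Hard-coded ID for the sketch; a real setup would ask the tracer.
handler.addFilter(TraceIdFilter(lambda: "4bf92f3577b34da6a3ce929d0e0e4736"))

logger = logging.getLogger("catalog")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.error("price lookup failed")
print(stream.getvalue(), end="")
```

With the trace ID in each line, searching the log store for a failing request and opening the matching trace becomes a simple lookup on one shared key.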