Practical introduction to OpenTelemetry tracing
Formal Metadata
Number of Parts | 542
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/61982 (DOI)
FOSDEM 2023 | 123 / 542
Transcript: English (auto-generated)
00:06
Hi, everybody. Thanks for being here for this talk. That's a lot of people. I'm Nicolas Frankel. I've been a developer for a long time. And I would like to ask how many of you are developers in this room? Quite a lot. Who are ops? Just as many. And who
00:27
are DevOps, whatever you mean by it. Yeah! So this talk is actually intended for developers, because I was, or I still think I am, a developer. So if you are an
00:44
ops person, and for you this is not that interesting, at least you can direct your developer colleagues to the talk so that they can understand how they can ease your work. Well, perhaps you've never seen that, but I'm old or
01:07
experienced, depending on how you see it. And when I was starting my career, monitoring was like a bunch of people sitting in front of screens the whole day. And actually, I was lucky. Once in the south of France, I was told,
01:23
hey, this is the biggest monitoring site of all France. And actually, it really looked like this. And of course, there were people like watching it. And that was the easy way. Now I hope that you don't have that anymore, that it has become a bit more modern. Actually, there is a lot of talk now about
01:47
microservices, right? Who here is doing microservices? Yeah. Yeah, because if you don't do microservices, you are not a real developer. But even if you don't do
02:00
microservices, so you are not a real developer, and I encourage you not to be a real developer in that case, you probably are doing some kind of distributed work. It's become increasingly difficult to just handle everything locally. And the problem becomes, yeah, if something bad happens, how can you locate how it
02:21
works? Or even if something works as expected, how you can understand the flow of your request across the network. I love Wikipedia. And here is the observability definition by Wikipedia, which is long and, in that case, not
02:41
that interesting. So I have a better one afterwards for tracing. So basically, tracing helps you to understand the flow of a business request across all your components. Fabian, where is Fabian? Fabian is here, so he talked a lot
03:06
about the metrics and the logging. So in this talk, I will really focus on tracing, because my opinion is that, well, metrics are easy. We have been doing metrics for ages, like we take the CPU, the memory, whatever. Now we are trying to
03:24
get more like business-related metrics, but it's still the same concept. Logging also. Now we do aggregated logging. Again, nothing mind-blowing. Tracing is, I think, the hardest part. So in the past, there were already
03:42
some tracing pioneers. Perhaps you've used some of them. And well, now we are at the stage where we want to have something more standardized. So it starts with the trace context from the W3C. And the idea is that you start the trace
04:10
and then other components will get the trace and will append their own trace to it. So it works very well in a web context. And it defines like two
04:26
important concepts that Fabian, I think, already described. So now I am done. So I have the same stupid stuff. So here you have... Oh, sorry. Yes. It
04:46
reminds me of the story. I did the same to my colleagues. They didn't care about the presentation. They only remembered that. Okay. So here you have a trace. And here you have the different spans. So here the X1 is the
05:05
parent one. And then the Y and the Z1 will take this X span as their parent span. So this is a single trace. This is a single request across your service. Web stuff is good, but it's definitely not enough. And so
05:24
for that we have the OpenTelemetry stuff. OpenTelemetry is just a big bag of miracles all set into a specific project. So it's basically APIs, SDKs, tools, whatever, under the OpenTelemetry label. It implements the
05:48
W3C trace context. If you have been doing some kind of tracing before, you might know it because it's like the merging of OpenTracing and OpenCensus. The good thing is that it's a CNCF project. So basically there is some
06:03
hope that it will last for a couple of years. The architecture is pretty simple. Basically you've got sources, you've got the OpenTelemetry protocol, and as Fabian mentioned, you dump everything into a collector. The collector should be as close as
06:20
possible to your sources. And then some tools are able to read like data from it and to display it into the way that we expect to see it. What happens after the OpenTelemetry collector is not a
06:42
problem of OpenTelemetry. There are collectors that are compatible, and for example, you can use Jaeger or Zipkin in a way that allows you to dump your data, your OpenTelemetry data, into Jaeger or Zipkin in the OpenTelemetry format. So you can reuse, and that is very important, you can reuse
07:03
your infrastructure if you're already using those tools, but just switching to OpenTelemetry. And then you are using a standard, and then you can switch your OpenTelemetry backend with fewer issues. Now comes the fun developer part.
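A note on the parent and child spans described earlier: between services, that relationship travels in the W3C Trace Context `traceparent` HTTP header, which carries a version, a trace ID, the ID of the calling span, and flags. The following is only a minimal sketch in Python using the standard library; the function and class names are hypothetical, the header value is an example, and a real OpenTelemetry SDK or agent does this for you.

```python
import re
import secrets
from typing import NamedTuple


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex characters, shared by every span of one request
    span_id: str   # 16 hex characters, unique per span
    flags: str     # 2 hex characters, "01" means the trace is sampled


# Version "00" layout: 00-<trace-id>-<parent-id>-<flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> SpanContext:
    """Read the caller's context from an incoming traceparent header."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a version-00 traceparent header: {header!r}")
    return SpanContext(*match.groups())


def child_of(parent: SpanContext) -> SpanContext:
    """Stay in the same trace, but mint a new span ID for this component."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.flags)


def to_traceparent(ctx: SpanContext) -> str:
    """Format the header to send on outgoing calls."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"


# Example header value (illustrative only).
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
child = child_of(parent)
outgoing = to_traceparent(child)
print(outgoing)
```

Every hop keeps the trace ID and replaces the span ID, which is what lets a backend stitch the spans of one request back together.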
07:23
If you are a developer, you're probably all lazy. I know, I'm a developer. So the idea is OpenTelemetry should make your life as a developer as easy as possible to help your colleague, like diagnose your problems. And the easiest
07:46
part is auto instrumentation. Auto instrumentation is only possible in cases where you have a platform, when you have a runtime. Fabian mentioned Java. Java has a
08:00
runtime, which is the JVM. Python has a runtime. Now if you have Rust, it's not as easy. So in that case, you are stuck. My advice, if you are using a runtime, and probably most of you are using such runtimes, whether
08:21
Java or whatever, use it. It's basically free. It's a low hanging fruit, and there is no coupling. So basically you don't need extra dependencies as developers in your projects. So since it's called practical introduction, let's do some practice. So here I have a bit better
08:43
than a hello world, so I have tried to model like an e-commerce shop with very simple stuff. It starts just asking for products. I will go through an API gateway, which will forward the product to the catalog, and the catalog doesn't know about the prices, so it will ask
09:01
the prices from the pricing service, and it will ask the stocks from the stock service. The entry point is the most important thing, because it gives the parent trace. Everything will be from that. So in general,
09:23
you have a reverse proxy or an API gateway, depending on your use case. I work on the Apache APISIX project. It uses the Nginx reverse proxy. On top you have OpenResty, because you want to have Lua to script and to auto reload the configuration. Then you have
09:40
lots of out of the box plug-ins. Let's see how it works. Now I have the code here. Is it big enough? Good. So I might be very old, because for me it wouldn't. Okay, here that's my architecture. I'm using
10:03
Docker Compose, because I'm super lazy. I don't want to use Kubernetes, so I have Jaeger. As I mentioned, I have all in one. I'm using the all included, so I don't need to think about having the telemetry collector and the web to check the traces. I have
10:22
only one single image. Then I have APISIX. Then I have the catalog, which I showed you. Of course, I have a couple of environment variables to configure everything. I wanted to focus on tracing, so no metrics, no logs. I'm sending everything to Jaeger,
10:45
and then I do the same for pricing, and I do the same for the stock. Normally at this point, I already started, because in general I have issues with the Java stuff. So here I'm doing a simple
11:01
curl to the product. I've got the data, which is not that important, and I can check on the web app how it works. So here I will go on the Jaeger UI. I see all my services. I can find the traces. Here you can find the latest one, and here is
11:21
the thing. If I click on it, it might be a bit small, right? I cannot do much better. You can already see everything that I've shown you. So I start with the product from the API gateway. It forwards it to the product, to the catalog. Then
11:41
I have the internal calls, and I will show you how it works. Then I have the GET request made from inside the application, and then I have the stocks that respond here. Same here, and here we see something that was not mentioned on the
12:02
component diagram. From the catalog to the stock, I go directly, but from the catalog to the pricing, I go back to the API gateway, which is also a way to do that for whatever reason. And so this is something that was not mentioned on the PDF, but you cannot cheat with
12:22
OpenTelemetry. It tells you exactly what happens and the flow, and the rest is the same. So regarding the code itself, I told you that I don't want anything to trouble the developer, so here I have nothing regarding OpenTelemetry.
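The view shown in the Jaeger UI is essentially a tree rebuilt from each span's reference to its parent. As a rough sketch of that idea in Python (standard library only): the span IDs, operation names and paths below are made up for illustration, modeled on the flow described in the demo, where the catalog calls the stock service directly but reaches pricing through the gateway.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str


# Hypothetical spans of a single trace; IDs and paths are illustrative only.
spans = [
    Span("a1", None, "apisix", "GET /products"),
    Span("b1", "a1", "catalog", "GET /products"),
    Span("c1", "b1", "stock", "GET /stocks"),
    Span("d1", "b1", "apisix", "GET /prices"),
    Span("e1", "d1", "pricing", "GET /prices"),
]


def render_tree(all_spans: List[Span]) -> List[str]:
    """Group spans by parent ID, then walk depth-first from the root."""
    children = defaultdict(list)
    for span in all_spans:
        children[span.parent_id].append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append("  " * depth + f"{span.service}: {span.name}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


tree = render_tree(spans)
print("\n".join(tree))
```

The detour through the gateway shows up as an extra level in the tree, which is the kind of detail a hand-drawn component diagram can omit but recorded spans cannot.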
12:46
If I write otel, you see nothing. If I write telemetry, you see nothing. I have no dependency. The only thing that I have is I have my Dockerfile, and in my Dockerfile, I get
13:03
the latest OpenTelemetry agent. So you can have your developers completely oblivious, and you just provide them with this snippet, and then when you run the Java application, you just tell them, hey, run with the Java
13:21
agent. Low hanging fruit, zero trouble. Any Java developer here? Not that many. Python? Okay, so it will be Python. Just the same here. Here it's a bit different. I add
13:46
dependencies, but actually I do nothing on it. So here I have no dependency on anything. Here I'm not using, I'm using a SQLite database because, again, I'm lazy. I don't care that much. But here I have no dependency, no API
14:03
call to OpenTelemetry. The only thing that I have, it's in the Dockerfile again. I have this. Again, I'm using a runtime. It's super easy. I let the runtime, like, intercept the calls and send everything to
14:21
OpenTelemetry. And the last fun stuff is Rust. Any Rust developer? Please don't look at my code too much. I'm not a Rust developer, so I hope it won't be too horrible. And Rust is actually, well, not
14:44
that standardized. So here I don't have any runtime. So I need to make the calls by myself. The hardest part is to find which library to use depending on which framework to use. So in this case, I found one, and perhaps there are better
15:01
options. But I found this OpenTelemetry OTLP stuff. And here, because I'm using Axum, I'm using this library. And so far it works for me. I don't need to do a lot of stuff. I just, like,
15:20
copy-pasted this stuff. Copy-paste developer. And afterwards, in my main function, I just need to say this and this. So I added two layers. So if you are, if you don't have any platform, any runtime, you actually need your
15:41
developers to care about OpenTelemetry. Otherwise, it's fine. Now, we already have pretty good, like, results. But we want, we may want to do better. So we can also ask the developers, once they are more comfortable, to do manual
16:02
instrumentation even in the case when there is a platform. Now I will docker-compose down. And it takes a bit of time. I will prepare this. And on
16:32
the catalog side, now I can have some additional code. So this is a Spring
16:45
Boot application. What I can do is add annotations. Like, I don't, I noticed there were a couple of Java developers, so it's the same with Kotlin. It's still on the JVM. So basically, I'm adding annotations. And because Spring Boot can
17:02
read the annotation at runtime, it can add those calls. So I don't have to, like, call the API explicitly. I just add some annotation, and it should be done. On the Python side, I import this
17:21
trace stuff. And then I can, with the tracer, add some, again, explicit traces. So, internal traces. And from the Rust point of view, because I already, like, did it explicitly, it worked. And now you can see that I am in deep trouble because it happened a lot of time. The
17:42
Java application doesn't start for a demo, and that's really, really fun. So I will try to docker-compose down the catalog. And docker-compose, hey, what happens? Dash? Are you sure? No, no, no, no, no, no, no, no!
18:04
Not with the new versions. Oh, really? Yes. That's fine. We are all here to learn.
18:23
Stop. Thanks. The stress, the stress. Yeah. Honestly, if there is any, like, person here able to tell me why this
18:43
Java application sometimes has issues starting, because I've added one gig at the beginning, and it's stuck always here. So I can tell you what you
19:03
should see normally. If I'm lucky, I made a screenshot, and yes, here, but it's the beginning, it's the Rust one. So here, this is what you can have in
19:21
Python. This is what I added explicitly. I have five minutes. Well, if the demo doesn't work, it will be much better. Then I won't have any problems with the timing. Here, you can see that this is the trace that, yeah, this is a trace that I did
19:40
manually in Python. And here, we can see that I filled the ID with the value. And on the Java side, again, nope, nope. I think it will be here.
20:03
This is not the manual stuff that I added. Yes, it is. You have the fetch here. You have the fetch here. So this is the span that I added manually. I'm afraid that at this point, the demo just refuses to work.
20:20
Yes, it's still stuck. I will stop there. I won't humiliate myself further. When it's done, it's done. Perhaps if you are interested, you can follow me on Twitter. You can follow me on Mastodon. I don't know what's the ratio. More importantly, if
20:41
you are interested in the GitHub repo, to do that by yourself, perhaps with better configuration of Docker Compose with the right memory, it would work. And though the talk was not about Apache APISIX, well, have a look at Apache APISIX. It's an API gateway, the Apache way. Great. Are there some questions now? I never
21:09
got so much applause with a failing demo. Please remain seated. Please remain seated so we can have a Q&A. Who had a question? Thank you. Very good talk. I have two questions. So
21:22
one is about... Let's start with the first one. Right. Yes, yes, yes. How much overhead does this bring in Python and Java or Rust? How heavy is this instrumentation? That's a very good question. And the overhead of each request depends on your own infrastructure. But I
21:41
always have an answer to that. Is it best to go fast and you don't know where you are going? Or to go a bit slower and to know where you are going? I think that whatever the costs, it's always easy to add additional resources
22:00
and it doesn't cost you that much. Whereas debugging an incident across a distributed system can cost you days or even like weeks in engineering costs. And you are very, very expensive, right? Okay. Thank you. And the second one is have you encountered any funny issues with multi-threading or multi-processing? Something
22:22
like when your... Can you come closer to your mic? Your server just now was not starting. So some software when you have multi-threading or multi-processing and have you encountered any issues when the instrumentation caused you trouble? This is
22:41
not production stuff. This is just better than the hello world. So I cannot tell you about production issues. You should find people who have these issues. As I mentioned, it's a developer-oriented talk. So it's more about pushing the developers to help ops do their job. For production issues, I must admit I have no clue.
23:05
Hi. In the case of a runtime, does it always work, also with badly written applications? I mean, how bad can an application be before it stops working? I'm not sure I understood the
23:21
question. So how often do you need to do it before it stops working? No, no. I mean, let's say I use deprecated libraries, bad clients, something that doesn't work as it's supposed to from the instrumentation perspective. I mean, I do requests to the network using a UDP client,
23:42
something I've written myself, some custom stuff that I'm imagining that the instrumentation sits between some layer of the network, which is going to the
24:00
internet, for example. And so how bad can I be before it stops recognizing a request from junk? You cannot be bad. Okay. Well, it's a moral issue first, but then on the platform side, the auto
24:21
instrumentation, they work with specific frameworks and tools. It's those frameworks and tools that know how to check what happens and to send the data to OpenTelemetry. So if you don't play in this game, nothing will be sent. Okay.
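The answer above can be illustrated with a small sketch of the general mechanism: an instrumentation wraps the entry points of libraries it knows about at runtime, and code that bypasses those libraries is simply never seen. This is a simplified Python illustration under that assumption, not how any particular agent is implemented; `known_http_lib`, `custom_udp_send` and `instrument` are hypothetical names.

```python
import functools
import types

# Stand-in for a well-known HTTP client library that an instrumentation supports.
known_http_lib = types.SimpleNamespace()


def _get(url):
    return f"response from {url}"


known_http_lib.get = _get


def custom_udp_send(address):
    """Home-grown client code that no instrumentation knows about."""
    return f"datagram to {address}"


recorded = []  # what the "agent" observed


def instrument(module, func_name):
    """Replace module.func_name with a wrapper that records each call."""
    original = getattr(module, func_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        recorded.append((func_name, args))
        return original(*args, **kwargs)

    setattr(module, func_name, wrapper)


# The agent only patches what it recognizes.
instrument(known_http_lib, "get")

known_http_lib.get("http://pricing/prices")  # goes through the wrapper
custom_udp_send("10.0.0.7:9999")             # invisible to the agent
print(recorded)
```

Only the call through the recognized library is recorded; covering the custom client would take an explicit, manual call.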
24:42
So on the manual instrumentation side, it's an explicit call. So it depends what you want to send. Yeah. I was thinking of auto instrumentation. So let's say I do DNS resolution by myself, and then I just throw a request to an IP. Let me like
25:05
show the Python stuff here. This is what I showed you in the screenshot. This is what I write, and this is the attributes that I want to have, right? So basically if here you
25:23
have something that is completely unrelated, it's up to you. That's why it's easier to start with auto instrumentation. And then once you get a general overview of what you have, and your ops start saying, hey, perhaps we want to have more details here,
25:42
then you can come with manual instrumentation. But start with the like less expensive stuff. I didn't really answer the question. I understand it, but that's the best I can do regarding it. Sorry.
26:02
Okay, and then for the agent you're using in the Dockerfile, how can you configure it, for example, to export the tracing to Jaeger or other stuff? Regarding the Dockerfile, sorry? Yeah, how you can configure
26:21
the agent to send the traces to, for example, Jaeger or other... The Dockerfile doesn't mention where you send it. The Dockerfile just says, hey, I will use OpenTelemetry. And it's during configuration, it's like in the Docker Compose file, where I'm using like
26:41
agreed upon environment variables, where I'm saying you should set it here or here, or you should use logging or tracing or metrics or whatever. So that's very important to like separate those concerns. On one side in the Dockerfile in the image you say, hey, I'm ready for OpenTelemetry. And when you actually deploy it
27:02
to say, okay, OpenTelemetry will go there for the metrics and there for the tracing and for logging, I will disable it or whatever. Thank you for... Sorry, go ahead. Sorry. And then you have a Docker image that can be like reusable.
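As an illustration of that separation, the sketch below resolves telemetry settings purely from environment variables, so the same image can be pointed at different backends at deploy time. The variable names come from the OpenTelemetry SDK environment variable specification; the talk does not show which ones the demo sets, so the values here (service name `catalog`, endpoint `http://jaeger:4317`) and the fallback defaults are assumptions for the example, and `load_config` is a hypothetical helper.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryConfig:
    service_name: str
    endpoint: str
    traces_exporter: str
    metrics_exporter: str
    logs_exporter: str


def load_config(env=os.environ):
    """Build the configuration from the environment; defaults are assumptions."""
    return TelemetryConfig(
        service_name=env.get("OTEL_SERVICE_NAME", "unknown_service"),
        endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
        traces_exporter=env.get("OTEL_TRACES_EXPORTER", "otlp"),
        metrics_exporter=env.get("OTEL_METRICS_EXPORTER", "otlp"),
        logs_exporter=env.get("OTEL_LOGS_EXPORTER", "otlp"),
    )


# What a deployment file might set for one service (hypothetical values):
# traces go to the backend, metrics and logs are switched off.
deployment_env = {
    "OTEL_SERVICE_NAME": "catalog",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://jaeger:4317",
    "OTEL_METRICS_EXPORTER": "none",
    "OTEL_LOGS_EXPORTER": "none",
}

config = load_config(deployment_env)
print(config)
```

Nothing about the destination is baked into the image; changing the backend means changing the deployment's environment, not rebuilding.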
27:21
Thank you for being good FOSDEM citizens and remaining seated. Next question. Thank you for your presentation. So my question is, does OpenTelemetry support error handling like Sentry? If not, are there any plans to do that?
27:41
It's really useful to catch crashes and capture the context of the crash. So that's it. Thank you. If it happens, when you mean crashes of OpenTelemetry itself or of the components that are like under watch? Yeah, of the application
28:00
that's monitored, yeah. Well, Fabian showed you how you could log and like bind your traces and your logs. So you could have both here. My focus was just on tracing, but you can reuse the same Docker GitHub repo
28:21
and just like here, put the logs somewhere in, I don't know, Elasticsearch or whatever. No, because it's not a sponsored room. And then you can check and you introduce some
28:42
errors and then you can check how the two are bound and you can like drill down to where it failed. Okay, thank you.
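One common way to bind logs and traces is to stamp every log record with the ID of the trace that was active when it was written, so that an error line can be looked up in the tracing backend. The sketch below shows that idea with Python's standard `logging` module only; it is not taken from the talk's repository, the trace ID is a fixed example value, and `TraceIdFilter` is a hypothetical name.

```python
import io
import logging


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record passing through the handler."""

    def __init__(self, get_trace_id):
        super().__init__()
        self._get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self._get_trace_id() or "-"
        return True  # never drop the record, only enrich it


# In a real service this would be read from the active span;
# here it is a fixed example value.
current_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

stream = io.StringIO()  # stands in for the log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter(lambda: current_trace_id))

logger = logging.getLogger("catalog")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error("price lookup failed")
output = stream.getvalue()
print(output, end="")
```

With the trace ID in each line, searching the log store for an error gives the key needed to open the matching trace and drill down to the failing span.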