
Are we observable yet?


Formal Metadata

Title
Are we observable yet?
Subtitle
Telemetry for Rust APIs - metrics, logging, distributed tracing
Title of Series
Number of Parts
8
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Is Rust ready for mainstream usage in backend development? There is a lot of buzz around web frameworks while many other (critical!) Day 2 concerns do not get nearly as much attention. We will discuss observability: do the tools currently available in the Rust ecosystem cover most of your telemetry needs? I will walk you through our journey here at TrueLayer when we built our first production backend system in Rust, Donate Direct. We will be touching on the state of Rust tooling for logging, metrics and distributed tracing.
Transcript: English(auto-generated)
Amazing. Good evening, everybody. Thanks for joining for the last day of Rusty Days. We're going to chat for the next 30 minutes or so about observability. In particular, we're going to discuss if the Rust ecosystem at this point in time provides enough tooling
to write observable APIs. And we're going to go through the journey of writing one and see how that came along. My name is Luca Palmieri. I work as a lead engineer at TrueLayer, and we're going to spend some words about that in a second. In the Rust ecosystem, I contribute to the Rust London User Group, where I curate the Code Dojo. I've been a contributor and
maintainer of a variety of crates in the open source ecosystem, Linfa, Wiremock, and some others. And I'm currently writing Zero to Production, which is a book on Rust backend development, which I publish chapter by chapter on my blog, which you can see linked down there.
So let's get to the meat of what we're going to discuss tonight. This is a little bit our agenda. So we're going to see what DonateDirect is. DonateDirect is an application and it's going to drive our whole journey. We're going to see what it entailed to bring that application to production. Then we're going to zoom in on
three types of telemetry data, which are often collected to observe the behavior of applications in production environments: metrics, logging, and distributed traces. If you don't know what they are, or you haven't worked with them before, that's not a problem. We're going to give all the details, and I'll walk you through why they're useful and how we
collect them. So let's start from the very basics. What is DonateDirect? Before that, let's spend two words on what TrueLayer does, which is going to frame the conversation. Now, TrueLayer is a company which operates in the financial technology space. In particular, we provide APIs for people to consume. We mainly provide two types of APIs:
one for accessing banking data on behalf of a user, and one to initiate a bank transfer, once again on behalf of a user. So you pay using your own bank account, without credit cards, without intermediaries of any other type. During the COVID pandemic, like many other people,
we kind of tried to think what we could do in any way to help relieve pressure or contribute to what was happening. So myself with a group of other colleagues put together an application called DonateDirect, which lets you use our payment initiation technology to donate money to charities. So as you can see in the GIF on the left, the flow is very simple. So you select
a charity from a list, you specify how much you want to donate, then you fill in some tax stuff and you get redirected to the flow of your bank. And the money goes through your bank account to the charity without any fee. So TrueLayer did this completely free of charge,
matching some donations. Now, as it happens when you do side projects of different kinds, so things that are a little bit outside the main product line, you have a chance to experiment with technologies which would be considered a little bit too edgy to be used in the core product. And as you might imagine, considering that this is a Rust talk at a Rust
conference, DonateDirect's backend API is written in Rust. Now, it's not our first round with Rust, but it was our first Rust API in production here at TrueLayer. It was the first time we were actually shipping code that was responding interactively to users coming from the
wild web. So I want to see lots of emojis when I re-watch the stream at this specific slide. Now, as I said, we experimented with it before. So we were doing build tooling, we were doing CLIs, we were doing some weird Kubernetes controllers for non-critical stuff and so on
and so forth. But once you actually put an API in front of a user, then the bar for that API needs to be raised significantly. Which brings us to our journey to production. Now, to use the words of someone wiser than myself, one does not simply walk into
production, for a variety of reasons. And reason number one is that, generally speaking, production environments are very complex. So if we look at this diagram, this depicts Monzo's production environment. Each of the blue dots is a microservice in Monzo's cluster.
And each of the lines connecting two dots is two microservices talking to each other over the network. Now, TrueLayer is not Monzo, so it doesn't have one thousand six hundred microservices interacting in production. But you might imagine that our production environment is equally complex in many subtle ways. And what you're generally trying to plan for in a production environment
is not even really the happy case, so "is stuff actually working?". You're trying to predict, or to mitigate, the ways stuff can fail. So what happens if one of those blue dots, for example, in the Monzo cluster goes down? What happens if one of those blue dots starts responding more
slowly than it generally does or is supposed to? Or doesn't elastically react to a surge in traffic? All these kinds of behaviors, in a very connected graph like that, can cause cascading failures. And it becomes very, very difficult, once something like that is happening, to troubleshoot why and fix it, if possible. Now, what is each of those blue dots, actually?
In TrueLayer's case, we run a Kubernetes cluster. So all our production applications are deployed on top of Kubernetes, which means those blue dots are Kubernetes deployments. A Kubernetes deployment runs one or more
copies of the application. Each of those copies is called a pod, and a pod may be composed of one or more Docker containers. The pods are identical to one another, and so they can be dynamically scaled to match increasing traffic. And they can also be placed on different machines, in
order to give us redundancy if one of those machines ends up going down for whatever reason. Now, what does it mean to release something to production? At TrueLayer, especially when you look at things from an operational perspective, you want to have a certain set of guarantees about what each of those applications provides
from an operational point of view. What this means is you want to be sure that a set of best practices is being followed consistently. All those best practices are collected in a huge checklist called the pre-production checklist. Now, if you are an on-call engineer, the pre-production
checklist is in many ways a very nice thing, in the sense that it gives you a baseline level of quality, especially on the observability side, as we're going to see. And you can be sure that those metrics and those logs are going to be there. Now, if you are a developer who's trying to deploy a new application, the pre-production
checklist can be a significant hurdle, because there are a lot of things that you need to do in order to actually see your application out there. And so, keeping with the Lord of the Rings metaphor, it might look a little bit like the Ringwraiths, and it looks quite scary at this point in time. This is like the first movie. So what's the dilemma? On the left, you're an
application developer, or in general you're building something that looks really cool, and you want to ship it. And when you are at the beginning of your startup journey, when you're a scrappy group doing a scrappy app that is only used by a bunch of people, everybody implicitly knows that what you're doing is particularly risky, because it's a new product from a new company; you just need to
iterate fast, and it's fine to just ship it. You put your cowboy hat on, and you just deploy to production. Now, as you mature along your journey, you start to get bigger and bigger customers. And those customers will have enterprise expectations. So they want your service to be up,
you will have SLAs with them. And in general, just your reputation will demand of you a higher level of reliability. Now, if you're TrueLayer and you work in financial technology, that is even truer, so to speak, than for, say, a random consumer app: you don't expect your payments to stop working, they should always be working. And if they don't, that can cause
some serious disruption. So the software has to be treated as mission critical, as much as possible. And to be reliable, there are a lot of bells and whistles that you need to attach to your application: metrics, tracing, logs, horizontal pod autoscaling, alerts to know when something goes wrong, network policies to prevent escalations, liveness and readiness probes
to tell Kubernetes when to restart something, and so on and so forth. The list can get very long. And that is troublesome because, in the end, and that's my personal motto I would say, convenience beats correctness. What this means is that if doing the
right thing is in any way, shape or form more complicated than doing the wrong thing, then someone, at a certain point in time, will find a reason not to do the right thing. They have a deadline next Monday and really need to ship this application. Or they think it's too complex and actually they don't need all that stuff. Like, this is a small thing, it's going
to run in the cluster, it's not going to get big. Then it gets big, then it fails, and then you have problems. So you want people to be able to fall into the so-called pit of success: they should naturally converge to doing the right thing, because doing the right thing is the easiest thing to do. Now, I'm not going to cover all the possible things that we require an application to do,
that would be long and potentially quite boring. We're just going to focus on telemetry data, staying on topic with the talk: are we observable yet? What kind of telemetry data? So we said logs, which we ship into Elasticsearch, metrics, which are scraped by Prometheus from our applications,
and traces that we push into Jaeger for distributed tracing. So we're going to go one by one, look at what they are, why they're useful, and how you collect them in a Rust application. So let's start from metrics. Why do you want to collect metrics? Well, generally speaking, you want
to collect metrics, because you want to be able to produce plots that look exactly like this. So you want to be able to see, well, what's the latency of this application in the last 30 minutes, and potentially break it down by percentile. So the 50th percentile, the 70th percentile, the
90th and the 99th, depending on the type of application and your performance profile. Or you might want to know the response breakdown: how many 200s, how many 500s, how many 400s, and so on and so forth. Metrics, generally speaking, are there to give us an aggregate picture of the system state. So they're there, very often, to answer Boolean questions about how
the system is doing. Is our error rate above or below 10 percent? Is the error rate for requests that come to this specific API, on this endpoint, above or below a certain threshold? Are we breaking our SLAs on latency? And metrics are supposed to be as real-time as
possible, so they tell you what the system state is now, in this very, very moment. What do metrics look like? So how do you actually get those plots that we just saw? Metrics generally look somewhat like this. You have a metric name, which in this case is http_requests_duration_seconds_bucket. It's a mouthful, but it's very precise. So we're talking
about the duration of HTTP requests, and we're looking at a histogram, so we're looking at buckets of requests at different thresholds of latency. On this metric, we have a set of labels that we can use to slice the metric's values. So we have the endpoint that was hit, the
HTTP method, get, post, put, patch, whatever. The status code we returned, so in this case 404, and then you have the bucket that we're looking at. So five milliseconds, 10 milliseconds, 25, and so on and so forth. And then the number of requests falling inside that bucket.
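To make that concrete, the Prometheus exposition format for such a histogram looks roughly like this (the endpoint value and exact label names here are illustrative, not taken from the slide):

```
http_requests_duration_seconds_bucket{endpoint="/donations", method="GET", status="404", le="0.005"} 1601
http_requests_duration_seconds_bucket{endpoint="/donations", method="GET", status="404", le="0.01"} 1601
http_requests_duration_seconds_bucket{endpoint="/donations", method="GET", status="404", le="0.025"} 1601
```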
Now this was a super fast 404, so all 1,601 fell beneath the five milliseconds, but generally it's going to be a little bit more varied. This is basically a time series. A time series with a variety of values you can slice and dice from. These time series are produced by the application and then are aggregated by Prometheus, at least
in our specific setup. So Prometheus hits the slash metrics endpoint on all the copies of an application, it could be on another endpoint, but that's generally the default, aggregates all these metrics, indexes them, and then allows you to perform queries against them. One way to perform
queries is to do alerts. So in Alertmanager, you define a variety of queries which evaluate to a Boolean. So, as we said before: is the error rate, so the number of 500s, above or below 10% for 15 minutes? If yes, then, through PagerDuty, page an on-caller, so an on-call engineer,
to look at the system because something is wrong. Otherwise, you can use Grafana if you just want to do some pretty visualizations. So if we go back to the slide we saw before, which is this one, this is Grafana. So we're looking just at Prometheus queries visualized. This is very, very useful for an on-call team or operation team to actually understand what's
going on. Now, how do you actually get metrics? So how do you get your API to produce metrics? DonateDirect was developed using Actix Web, for a variety of reasons; I wrote a piece about that a couple of weeks ago, if you're curious. It's very, very easy. So there's a package
on crates.io called actix-web-prom, so Actix Web Prometheus. You just plug the middleware inside your application; it's the line where the Prometheus middleware gets cloned and wrapped around the app. The middleware takes some very, very basic configuration parameters, a prefix for the metrics and the endpoint you want to use, and then you're set up: you're just going to expose /metrics. Now, you might want to
customize it for your specific application, because you might need to collect metrics which are non-standard, you might have specific naming conventions, and so on and so forth. actix-web-prom is like a single-file type of crate, so you can go there, use it as some kind of blueprint, and adapt it to do whatever you need to do. So metrics: useful, very easy to collect.
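As a rough sketch of that wiring (assuming the actix-web-prom constructor as it was around that time; newer releases use a builder instead), it looks something like this:

```rust
use actix_web::{web, App, HttpResponse, HttpServer};
use actix_web_prom::PrometheusMetrics;

async fn health() -> HttpResponse {
    HttpResponse::Ok().finish()
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // "api" is the metric name prefix, "/metrics" the endpoint Prometheus will scrape.
    let prometheus = PrometheusMetrics::new("api", Some("/metrics"), None);
    HttpServer::new(move || {
        App::new()
            // Plug the middleware: request counts and latencies are now recorded.
            .wrap(prometheus.clone())
            .route("/health", web::get().to(health))
    })
    .bind("127.0.0.1:8080")?
    .run()
    .await
}
```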
Just plug and play if you're using Actix. Logging. As we saw, metrics are about what is happening in the system, in the aggregate, at this very, very moment.
So low latency, fairly aggregated type of data. Logs are instead useful to answer the question, what is happening to this specific request? Such as, what happened to users who tried to do a payment from, let's say, HSBC to Barclays in the UK between 5pm and 6pm on the 27th of July?
There's no way, unless I'm very lucky and the labels on the metrics are exactly the ones I need, but generally they aren't, because labels are supposed to be low-cardinality on metrics. There's no way I can generally answer this type of question, and I absolutely cannot answer it at this single-request level of granularity, because everything is aggregated in metrics. Logs instead can provide
us with that level of drill-down, and allow us to slice and dice to get that precise level of information. That is key to actually debugging what is going on in a distributed system, especially when things go wrong in a way which you haven't actually accounted for,
the so-called unknown unknowns, or emergent behavior in distributed systems. So, let's look at what it looks like to log in Rust. The classic approach in Rust to unstructured logging is to use the log crate. The log crate is built using a facade pattern.
So, the log crate provides you with a set of macros, debug, trace, info, warn, and error, to actually instrument your application. This is an example taken straight from the log crate documentation, or at least I think it was. You enter into the shave_the_yak function, which takes a mutable reference to a yak. You emit a trace-level statement, so you announce to the world,
we are commencing the yak shaving. It's trace level, so it's at a very verbose logging level; in most cases, it's going to be filtered out. Then you loop and try to acquire a razor. If you get a razor: info-level log statement, razor located, using the Display implementation of the razor; you shave the yak, and you break from the loop,
and you exit the function. If instead you fail to find the razor, then you emit a warning saying "I was unable to locate the razor", and you're going to retry. Now, facade means that you have no idea what is actually going to consume these log statements.
You just instrument your code, and then, generally at the entry point of your binary, you're going to initialize a logging implementation: an actual implementation that takes this log data and then does something with it, where "something" generally means shipping it some place. If you use the simplest possible logger, which is env_logger,
you're going to see something like this: you log to the console, to standard out. In this specific execution, which I ran, you get "unable to locate the razor" three times, so we loop three times, and then you actually locate the razor. You have the log message, the name of the function, and then you have a timestamp and the log level.
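Reconstructed as a hedged sketch (the Yak type and find_a_razor helper are stubs, not part of the log crate), the instrumented function and the env_logger setup look something like this:

```rust
use log::{info, trace, warn};

struct Yak;
impl Yak {
    fn shave(&mut self, _razor: &str) {}
}

// Stub: pretend the razor is found on the first attempt.
fn find_a_razor() -> Result<String, String> {
    Ok("razor-1".to_string())
}

pub fn shave_the_yak(yak: &mut Yak) {
    // Trace level: very verbose, filtered out in most configurations.
    trace!("Commencing yak shaving");
    loop {
        match find_a_razor() {
            Ok(razor) => {
                info!("Razor located: {}", razor);
                yak.shave(&razor);
                break;
            }
            Err(err) => warn!("Unable to locate a razor: {}, retrying", err),
        }
    }
}

fn main() {
    // The facade at work: env_logger is the implementation that actually
    // consumes the log records and prints them to standard out.
    env_logger::init();
    shave_the_yak(&mut Yak);
}
```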
Now, this may work if you're writing command-line applications, so if it's a single main function running and you have a user looking at the logs to understand what is going on. In a backend system, especially in a distributed backend system,
you have applications running on multiple machines. These applications are generally some kind of server, either a web server, or a queue consumer, or something like that. And they're executing many, many requests concurrently. And you want to be able, at a certain point, generally later, so you're not really there tailing the logs, to say what
happened to request xyz, which was about this type of user, as we discussed before. And the only way you can do that with plain logging is using text search. But free text is not easy to search. First of all, it's expensive: it cannot be indexed, and it requires also a lot of
knowledge about how the logs are structured. So you end up, if you want to do anything that is non-trivial, so anything which is not, tell me if this substring is in the log, you end up writing regexes. And writing regexes means that you are coupled to the implementation of the logging inside the application, which makes it very, very complicated for operators
and support people to actually go and use these logs. So all the pressure of operating the software ends up on the shoulders of the developers, whom we do want to be there, but we don't want them to be the only ones who can answer questions about the system. So a much better way is to have structured logs. Structured logs in the sense that to each
log line we associate a context. And that context needs to be searchable, which in very formal terms means that the context is in some machine-readable format that somebody can parse and index, allowing people to filter on it and perform queries. So let's have a look at how
we could do structured logging. So, a similar example, not fully identical. This time the debug macro is coming from the slog crate, slog standing for structured logging. It's not a new crate, this one; it's well established, been there for quite some time. It allows you to specify the log message, very similarly to what we were doing before, and then allows you to
specify, using the o! macro, some key-value pairs to be attached to your logs. Now, slog has been for a very long time the only way to do structured logging in Rust. Recently, if I'm not mistaken, the log crate has added a feature to add key-value pairs to log statements. But once again,
as far as I've seen, at least as of a month ago, almost none of the log implementations actually supported key-value-pair logging. So you're once again down to slog for doing structured logging. So what are we trying to do here? What we're trying to do here is what we
generally want to do in distributed applications. So I want to know when something is beginning, I want to do some stuff, which might be composed of some subroutines, like this subunit_of_work function, which might emit its own logs, like this event here, and then I want to know when that thing has ended. And then, given that we're shaving the yak on behalf of somebody
else, so we're taking this user ID, I want the user ID to be associated with every log line, and I also want to capture how long the whole operation took, so I want to capture that elapsed_milliseconds at the bottom. If once again we plug into it the most basic type of formatter, in this case a Bunyan-style JSON formatter logging to standard out, we get exactly this.
So you see all the log statements, you see all the Bunyan metadata, and everything is JSON. That means that I can parse these lines as JSON and I can filter on user ID very, very fast, very, very easily. Or I can push all these things somewhere else, which is going to index them and make them searchable, and we're going to see that in a few seconds. Now let's go back to the code.
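A minimal sketch of that shape with slog (using slog-json as the drain here; the talk's output came from a Bunyan-style formatter, so the exact fields will differ):

```rust
use slog::{debug, info, o, Drain, Logger};
use std::sync::Mutex;
use std::time::Instant;

fn subunit_of_work(log: &Logger) {
    // This event only carries its own message; the user_id comes from the
    // logger it was handed, not from anything written here.
    debug!(log, "doing some subunit of work");
}

fn shave_yak_for_user(log: &Logger, user_id: &str) {
    // Attach the user ID so that it shows up on every log line of this task.
    let log = log.new(o!("user_id" => user_id.to_string()));
    let started_at = Instant::now();
    info!(log, "yak shaving started");
    subunit_of_work(&log);
    info!(log, "yak shaving ended";
          "elapsed_milliseconds" => started_at.elapsed().as_millis() as u64);
}

fn main() {
    // JSON drain writing to standard out, one JSON object per log line.
    let drain = Mutex::new(slog_json::Json::default(std::io::stdout())).map(slog::Fuse);
    let root = Logger::root(drain, o!("app" => "donate-direct"));
    shave_yak_for_user(&root, "user-123");
}
```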
You may agree with me that this is very verbose. It's very, very noisy, like you have a lot of log statements which are interleaved with the application code, and you don't even see the application code here, but this function is really looking a little bit hairy. And this is
because, generally speaking, for most use cases, at least the ones I encounter in the world, adding orphan log events is generally the wrong abstraction. You reason about tasks, and tasks have a start time, they do something, and then they end. So what you really want to use as
your primary building block when you're doing some kind of instrumentation for structured logging is a span, and a span represents exactly a unit of work done in the system. So let's look at the same function using spans. We are moving away from slog, so we're
leaving slog behind for the time being, and we're moving on to the tracing crate. The tracing crate is part of the Tokio project, and I think it's not an overstatement to say it's one of the most impactful crates, for what we do on a daily basis, that has been released in the past year or so.
It provides an extremely high-quality implementation, and we're going to see it suits our needs perfectly. So, a span. We enter into the function and we create a span. Debug level, so we set the level as if we were doing logging; we say what the name of the span is, yak_shave, and we associate the user ID with the span. Now, the tracing crate uses a guard pattern, so you
call enter to step inside the span. Everything that happens between the enter method invocation and the point where the _enter guard is dropped is going to happen in the context of the same
span, which means there's no need for us to add the user ID once again to each debug statement. There's also no need for us to do anything weird about subunit_of_work: subunit_of_work can ignore the fact that it's part of the yak_shave function and can just go on to do its thing, and it will be able to
emit log statements, and those log statements get the context attached, including the context coming from the parent function, and all of this happens pretty much transparently. What this means is that, if we really want to shrink it, so if we really want to get to the essence of it, we can also remove those two lines of boilerplate, the span creation and the
enter call, and just use the tracing::instrument proc macro, which basically desugars to exactly the same thing and leaves us with this function. Now, what's that? That's like one, two, three, four, five lines, counting the closing bracket, so four or five depending on how you count it.
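A hedged sketch of the two forms described, the manual span plus guard and the #[instrument] sugar:

```rust
use tracing::{debug, debug_span, instrument};

// Manual form: create a span and hold the guard; everything emitted while the
// guard is alive is recorded within the yak_shave span, user_id included.
fn yak_shave_manual(user_id: &str) {
    let span = debug_span!("yak_shave", user_id = %user_id);
    let _enter = span.enter();
    subunit_of_work();
    debug!("yak successfully shaved");
}

// Sugared form: #[instrument] creates and enters an equivalent span,
// capturing the function arguments as span fields.
#[instrument(level = "debug")]
fn yak_shave(user_id: &str) {
    subunit_of_work();
    debug!("yak successfully shaved");
}

fn subunit_of_work() {
    // This event transparently inherits the enclosing span's context.
    debug!("doing some subunit of work");
}
```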
If you go and compare that to the slog version of this, you can clearly see how the diagnostic instrumentation is now much less intrusive. It's, as we were saying before, much more convenient. It's much easier for developers to slap #[tracing::instrument]
on top of a function, and that allows them to build very, very domain-oriented trace spans, and to do that consistently, because it does not involve writing a lot of code, it does not involve polluting their function code, and it is generally transparent to the application. Now tracing, just like log and just like slog, uses a facade pattern. So what you do is
instrument your application using those macros, and then you have subscribers. Subscribers are the ones that actually receive this tracing data and can do something with it. So tracing can be used for structured logging; I think at this point in time it's the best crate if you really want to do structured logging. So you can log all those spans to standard out, or to a file,
or whatever you think is useful to you. At the same time you're using spans, and spans are exactly the concept used by distributed tracing, as we'll see in a second. So with one type of instrumentation, tracing, you're able to get at the same time, writing no extra code, both structured logging
and distributed tracing, and this is extremely powerful and also extremely consistent, because you're going to get the same spans across the two types of telemetry data. So, telemetry data: how do we actually process logs? How do we actually process traces? For logs, we take tracing, then we have a subscriber that prints logs to standard out
in Bunyan format; that's the tracing-bunyan-formatter crate, which I wrote for DonateDirect, and it's there if you want to use it. Then standard out is tailed by Vector. Vector is another Rust log collector that we use to get logs from standard out into AWS Kinesis, which then flows into Elasticsearch, which we then search using Kibana.
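On the application side, initializing that kind of subscriber is just a few lines; a sketch assuming the tracing-subscriber and tracing-bunyan-formatter crates:

```rust
use tracing::subscriber::set_global_default;
use tracing_bunyan_formatter::{BunyanFormattingLayer, JsonStorageLayer};
use tracing_subscriber::{layer::SubscriberExt, Registry};

fn init_telemetry() {
    // Format spans and events as Bunyan-style JSON and write them to stdout,
    // where Vector picks them up.
    let formatting_layer = BunyanFormattingLayer::new("donate-direct".into(), std::io::stdout);
    let subscriber = Registry::default()
        .with(JsonStorageLayer)
        .with(formatting_layer);
    set_global_default(subscriber).expect("failed to set tracing subscriber");
}
```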
So there's a bunch of hops, but we end up in Kibana. And Kibana is fairly good for searching logs, and you don't need to be a developer to search logs in Kibana. So you go there, you have all the possible
fields of your logs on the left, and you can filter on existence, non-existence, on a specific value, or with regexes if you really need to. You can build views and graphs, and in general it's very, very friendly, and we use Kibana at all levels inside the company: from the application developers to the product managers, to the support engineers, to the first
level of support, to customer success managers. That's what allows us to own, in a distributed fashion, the operation of a product. Distributed tracing is more or less the same thing, just from a different perspective. So when you talk about logs, it's generally about a single
application: you have this application that is there, it's doing stuff, and it's emitting logs. Now, in a microservice architecture, such as the one we have here and in many places at this point, to serve a single request hitting the edge of your cluster, that request generally flows through one, two, three, four, five, six different microservices, which
cooperate to fulfill the job. Now when a customer comes to you saying I tried to do x and it didn't work you need to understand where exactly the problem is. So you need to be able to trace that request across the different microservices and it should be easy to do so.
The way you do this, or one of the possible ways, is by adhering to the Jaeger tracing format, which has since evolved into the OpenTracing format, which is now being merged into the OpenTelemetry format. In the tracing ecosystem, you have a tracing-opentelemetry subscriber, which is maintained in the same repository; you can use that, and we do, to ship traces into Jaeger.
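A sketch of that wiring (the opentelemetry-jaeger builder methods have changed between releases, so treat the exact calls here as an assumption about a recent version):

```rust
use tracing::subscriber::set_global_default;
use tracing_subscriber::{layer::SubscriberExt, Registry};

fn init_jaeger_tracing() {
    // Build a Jaeger exporter pipeline via OpenTelemetry...
    let tracer = opentelemetry_jaeger::new_pipeline()
        .with_service_name("donate-direct")
        .install_simple()
        .expect("failed to install the Jaeger pipeline");
    // ...and bridge it into tracing through the tracing-opentelemetry layer.
    let telemetry_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = Registry::default().with(telemetry_layer);
    set_global_default(subscriber).expect("failed to set tracing subscriber");
}
```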
Jaeger is once again backed by Elasticsearch, so it's more or less the same infrastructure, and it allows you to have this kind of view: each of the units of work appears as a bar, you can track how long each of those takes,
and you can see all the different services that a single request coming from the outside flows through. That is very very powerful to understand when something went wrong. So you're able to correlate a request across everything that is happening inside the cluster. So one final recap. As we said production environments are extremely complex
and if you don't have any way to observe what is happening and that generally means adding some kind of telemetry data then your production environment is a ticking bomb. It might be alive
today, but it's going to go off at a certain point in the future, and you're not going to like it. In order to know what is going on, you need to add diagnostic instrumentation, but for that to be there consistently, it has to be easy to add that instrumentation, and making it easy and convenient is your number one priority as an operator and, in general, as an architect of a platform. Now, different types of telemetry data give us different types of
information. So metrics are great to alert on and to monitor system state, while logs, especially structured logging with high-cardinality context, are amazing for detecting and triaging failure modes that you might not have foreseen when you designed the system. To get very high quality structured logs,
spans are generally the type of instrumentation that you want to use, and no matter how good your logging is at the single-service level, you need to be able to trace a request across the different services. Whether you do that with distributed tracing or with just a correlation ID that flows through, you need to have that somewhere. And overall, I guess the lesson
learned is: we were able to get a Rust application into production in less than a couple of weeks, with top-notch observability and telemetry data. And that generally means that the answer to the talk, which generally, if you're doing a talk with a question as a title, the answer is no,
as Steve said on the first day; the answer in this case is yes. So, are we observable yet? Absolutely. Tracing has been a step-change improvement in the quality of the Rust ecosystem when it comes to telemetry, and you can definitely ship high-quality applications with very, very granular telemetry data. Now, DonateDirect was an experiment in using Rust
in a live production application, and we liked it, so, in one way or another, probably the CTO was not fully sober when he said that, but we chose to bet on Rust for some new core projects, in particular writing a core banking application, which in a nutshell means
creating accounts programmatically and moving money in and out, programmatically once again. We're assembling a team, we've already hired a bunch of people, and we're still looking for one more Rust backend engineer, so if whatever we covered here sounds interesting to you, just reach out: that's the job opening there,
that's my Twitter handle, you have many ways to get in touch. And with that, I think this is the end of the talk, and I'm more than happy to take some questions. Okay, we have one:
Solas Waffle from Twitch is asking does this telemetry setup integrate well with distributed
non-Rust applications? Well, it depends on what we mean by integrating well. In our specific use case, we do have some structures that we expect applications to follow
in the type of telemetry data that they produce. So, for example, we expect metrics exposed by APIs to follow a certain format or a certain naming convention. We expect our logs to follow the canonical log line pattern, so generally to emit one log line with a lot of
data that we then use to do a variety of things. So, generally speaking, there needs to be a little bit of coordination, because of course, if somebody goes with the .NET Core default format, and I go with the Rust default format, and you go with the Python default format, it's very unlikely that they're going to match up really nicely. But you can use architectural
decision records to just say "this is how we do logs", and then everybody implements it in such a way that they can interoperate. So it needs a little bit of coordination.
Okay, so another one from Twitch: Chris is asking, the tracing crate looks very powerful, are there any features that you wish you had? They don't have to be easy features,
I'd just like to hear your thoughts on the design space more. Well, yeah, the tracing crate is extremely powerful. I did raise some issues for some of the things that kind of surprised me, and some of those made their way into the tracing crate itself, also a bug fix for a core dump that was nasty. But generally speaking it has been amazing.
Things that I wish would be different: at the moment, the tracing crate has a lot of focus on making telemetry fast, or in general reducing the overhead of doing certain types of operations.
For example, one thing is that the metadata you collect about a span is statically determined at the moment of span creation, and that is great, because then everything is much faster and it consumes less memory, but sometimes, for the way certain
applications are instrumented, you would like to be able to add additional metadata dynamically, even if that means allocating or doing stuff that you might not want to do in a hot loop; maybe for that application and its performance profile it works fairly well. Another thing that we found was a little bit of a slippery slope is the instrument macro,
which is very, very convenient because it captures the name of the function, but it captures by default all the arguments of the function, and that can be somewhat tricky if you're managing secrets, so things that you don't want to log. It's very easy to write a function today and put the instrument macro on top of it; then somebody else comes two weeks from
now, adds another argument which is a JWT token, and then a JWT token ends up in Kibana. So it would be nice to have the possibility, or a different macro or whatever, to have a deny-by-default approach, so that I need to explicitly allow certain fields to be logged, which, for the type of applications that we build, would make us sleep better.
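For reference, what exists today is the opposite, an opt-out: the skip argument on #[instrument] lets you exclude specific arguments (the function and argument names here are hypothetical):

```rust
use tracing::instrument;

// `skip(jwt_token)` keeps the secret out of the span fields; every argument
// that is *not* skipped is still captured by default, which is the slippery
// slope described above.
#[instrument(skip(jwt_token))]
fn initiate_payment(payment_id: &str, jwt_token: &str) {
    // Placeholder for the real call that would use the token.
    let _ = jwt_token;
    tracing::debug!("initiating payment");
}
```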
But generally speaking, tracing is great, and I think it's going to get more and more useful as different subscriber implementations come into play, so not much to say there. Okay, there's another question coming from YouTube, from Jeff Varshevsky, I hope I pronounced that even remotely correctly.
How do you configure tracing to send its data to the various backends? Are there docs, and does it also support cloud distributed tracing backends like AWS X-Ray? So there are docs, absolutely: if you go to the tracing-subscriber crate, there is a little documentation on how to add
different subscribers to the tracing pipeline. At the moment, of course, there are some types of tracing subscribers implemented, but I doubt there are tracing subscribers for all the possible things. If specifically you want
to ship tracing data to X-Ray, I think the work that has been done on OpenTelemetry for Rust means that you probably have an implementation of the standard, and you might have to write your own subscriber, probably using Rusoto; it should not be too complicated to actually ship it to X-Ray, but I haven't used it personally, so I don't know if it's out there already. Okay,
once again from Solas Waffle: do you have an approach to avoid handling fields that contain personally identifiable information in telemetry data? Well, the approach at this point is trying not to put them there, which, as I said before when responding to the other question, can sometimes
be tricky because of the way instrument works. We do have a detection system at TrueLayer: what we do is continuously scan the logs with semantic parsers that look for certain types of secrets that we know might possibly end up in logs, like JWT tokens, AWS
credentials, and other types of secrets that we don't really want people to have. But at the application level, short of switching instrument from allow-by-default to deny-by-default, we don't necessarily have any specific approach. Okay, there's another question, once again on YouTube, from Jeff. A
general last question: what is your preferred strategy for dealing with error handling? Okay, interesting. In most of the applications we're writing at the moment, we use a combination of thiserror and anyhow. We use thiserror for all the places where we need to handle errors, because
it's very nice to get structured enums that you can match on and then do different things depending on the variant. And then, when we just want to report errors, so we just want to have something that we log and then return to people as a response, we use anyhow. And we generally use them in conjunction, so you might have an error enum which is using thiserror to
derive its implementation, and then different variants are actually wrapping an anyhow error. And what we're starting to do recently, once again leveraging the tracing crate, is using tracing-error, so capturing span traces in our errors, so that when we get logs they're actually very detailed about what happened, and that allows us to debug faster.
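A rough illustration of that pattern (hypothetical error type and functions, not TrueLayer's actual code):

```rust
use anyhow::anyhow;
use thiserror::Error;

// A structured enum derived with thiserror: callers can match on the variants
// they care about, while one variant wraps an opaque anyhow::Error for
// failures we only want to report.
#[derive(Debug, Error)]
enum PaymentError {
    #[error("the donation amount must be positive")]
    InvalidAmount,
    #[error("something unexpected went wrong")]
    Unexpected(#[from] anyhow::Error),
}

fn validate_amount(amount_in_pence: i64) -> Result<(), PaymentError> {
    if amount_in_pence <= 0 {
        return Err(PaymentError::InvalidAmount);
    }
    Ok(())
}

fn call_bank_api() -> Result<(), PaymentError> {
    // Any lower-level failure is converted into the opaque variant via #[from].
    Err(anyhow!("the bank returned an unexpected response").into())
}
```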
Okay, I guess that
means it's all for today in terms of questions. I've been asked by the friends of Rusty Days
to pick a winner for a Manning book promo code, I assume in terms of best question: that's going to be Solas Waffle from Twitch, so I think you need to stay online for them to reach out to you.
It seems there's one more question though, once again from YouTube, so that doesn't change the winner anyway. What was your thought process for deciding to build this project in Rust? Were there any attributes that made this project a good fit for the first production Rust application at your company? In terms of the application itself, nothing specifically: we're talking about
basically a client of an API that we expose publicly, which is going to power a UI, so it doesn't necessarily need to be the fastest, and it doesn't necessarily need all the guarantees that Rust gives you. We could have done that in any language, but we were looking to
use Rust for other types of projects, for mission-critical projects in particular, to leverage Rust's very strong type system combined with its very predictable performance profile. But it's somewhat of a big leap to adopt a new language when writing a new mission-
critical project, only to find out when you actually release it that you might have wasted a lot of time. So this was a very nice incremental step to de-risk the technology: for example, look at the whole observability situation and say, is this actually ready for what we need to do? Look at all the things that we need in an API, and can we actually
provide APIs with this? And so on and so forth. So it was very much a de-risking operation, and having de-risked all of these aspects, it became possible for us to say: okay, now we can confidently bet on it for these other new products that we want to build. There's a huge project, and it fits Rust's profile for a variety of reasons, and now we know we're not
risking too much. We're still taking some risk, but it's not as big of a risk as going from a small CLI straight to a mission-critical product would be. Okay, it's goodbye time, so thanks a lot for
tuning in for the rest of Rusty Days, and stay for the next talk from Tim McNamara
on unsafe code. Have a good evening. Bye bye.