
Distributed Tracing: From Theory to Practice


Formal Metadata

Title
Distributed Tracing: From Theory to Practice
Part Number
50
Number of Parts
86
Author
Stella Cotton
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Application performance monitoring is great for debugging inside a single app. However, as a system expands into multiple services, how can you understand the health of the system as a whole? Distributed tracing can help! You’ll learn the theory behind how distributed tracing works. But we’ll also dive into other practical considerations you won’t get from a README, like choosing libraries for Ruby apps and polyglot systems, infrastructure considerations, and security.
Transcript: English (auto-generated)
I am sorry, this is a very technical talk in a very sleepy talk slot, so if you fall
asleep in the middle, I will be super offended, but I won't call you on it too hard. So yeah, I'm Stella Cotton, if you don't know me. I'm an engineer at Heroku, and today we're going to talk about distributed tracing.
So before we get started, a couple of housekeeping notes, I'll tweet out a link to my slides afterwards, so they'll be on the internet, so there'll be some code samples and some links so you'll be able to check that out if you want to take a closer look. And then I also have a favor. If you have seen me speak before, I have probably asked you this favor.
So Ruby Karaoke last night, anybody go? Yeah, totally destroyed my voice. So I'm going to need to drink some water, but otherwise I get really awkward and I don't like to do that. So to fill the silence, I'm going to ask you to do something that my friend Loli Shalin came up with, which is each time you take a drink of water, just start clapping
and cheering. All right, so we're going to try this out. I'm going to do this. Yeah? All right, so hopefully that happens a lot during this talk so that I won't lose my voice. So back to distributed tracing, I work on a tools team at Heroku and we've been working
on implementing distributed tracing for our internal services there. And normally I don't do this whole Brady Bunch team thing with the photos, but I just wanted to acknowledge that a lot of the trial and the error and the discovery that went into this talk was really a team effort across my entire team.
So the basics of distributed tracing, who knows what distributed tracing is? Okay, okay, cool. Who has it at their company right now? Aww, I see you, Hirokai. So if you don't actually know what it is or you're not really sure how you would
implement it, you're in the right place. This is the right talk for you. It's basically just the ability to trace a request across distributed system boundaries. And so you might think, like, Stella, we are Rails developers, this is not a distributed systems conference, this is not Scala or Strangeloop, you should go to those.
But really, there's this idea of a distributed system which is just a collection of independent computers that appear to a user to act as a single coherent system. And so if a user loads your website and more than one service does some work to render that request, you actually have a distributed system.
And technically, because somebody will definitely "well, actually" me on this, if you have a database and a Rails app, that's technically a distributed system. But to simplify things, I'm really gonna talk more about just the application layer today. So a simple use case for distributed tracing, you run an e-commerce site, you want users
to be able to see all of their recent orders. Monolithic architecture, you got one web process or multiple web processes, but they're all running the same kind of code, and they're gonna return information. Users, orders, users have many orders, the orders have many items, very simple Rails app.
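A rough sketch of the setup being described here (the model and controller names are illustrative, and authenticate_user! / current_user are assumed to come from whatever authentication the app already has):

```ruby
# Users have many orders, and orders have many items.
class User < ApplicationRecord
  has_many :orders
end

class Order < ApplicationRecord
  belongs_to :user
  has_many :items
end

class Item < ApplicationRecord
  belongs_to :order
end

# The controller authenticates, grabs the orders and their items, and renders them.
class OrdersController < ApplicationController
  before_action :authenticate_user!  # assumed auth helper

  def index
    @orders = current_user.orders.includes(:items)
  end
end
```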
We authenticate our user, our controller, gonna grab all of the orders, all of the items, render it on a page. Not a big deal, single app, single process. Now we're gonna add some more requirements, we got a mobile app or two. So they need authentication, so suddenly it's just a little more complicated. There's a team dedicated to authentication.
So now, you maybe have an authentication service. And they don't care at all about orders. So, makes sense, they don't need to know about your stuff, you don't need to know about theirs. So it could be a separate Rails app on the same server, or it could be on a different server altogether. It's gonna keep getting more complicated.
Now we wanna show recommendations based on past purchases. So the team in charge of this recommendations, bunch of data science-y folks, they only write Python, bunch of machine learning, so naturally the answer, microservices, obviously. But I mean seriously, it might be services.
So your engineering team and your products grow, and you don't have to jump on the microservices bandwagon to find yourself supporting multiple services. Maybe one is written in a different language, it might have its own infrastructure needs, like for example, our recommendation engine. And as our web apps and our teams grow larger, these services that you maintain might begin
to look less and less like a very consistent garden, and just more like a collection of different plants in different kinds of pots. And so, where does distributed tracing fit into this big picture? So one day, e-commerce app, you go to your website, starts loading very, very slowly.
And if you're gonna look in your application performance monitoring like New Relic or Skylight, or use a profiling tool, you can see recommendation service is taking a really long time to load. But with these single process monitoring tools, all of the services that you own in
your system or that your company owns are gonna look just like third-party API calls. You're getting as much information about their latency as you would about Stripe or GitHub or whoever you're calling out to. And so, from that user's perspective, you know there's 500 extra milliseconds to get their recommendations, but you don't really know why without reaching out to the recommendations
team, checking out, you know, figuring out what kind of profiling tool they use for Python, who knows, and digging into their services. And it's just more and more complicated as your system is more and more complicated. And at the end of the day, you cannot tell a coherent macro story about your application
by monitoring these individual processes. And if you have ever done any performance work, people are very bad guessers at understanding bottlenecks. So what can we do to increase our visibility into the system and tell that macro-level story?
Distributed tracing, that can help. It's a way of commoditizing knowledge. Adrian Cole, who's one of the Zipkin maintainers, talks about how in increasingly complex systems you want to give everyone tools to understand the system as a whole without having to rely on these experts. So cool, you're on board, I've convinced you, you need this, or it at least makes
sense, but what might actually be stopping you from implementing this at your company? A few different things that make it tough to go from this theory to the practice with distributed tracing. And first and foremost is that it's kind of outside the Ruby wheelhouse.
It's not represented, Ruby's not represented in the ecosystem at large. Most people are working in Go or Java or Python. You're not gonna find a lot of sample apps or implementations that are written in Ruby. There's also a lot of domain-specific vocabulary that goes into distributed tracing, so reading
through the docs can feel pretty slow. And finally, the most difficult hurdle of all is that the ecosystem is extremely fractured. It's changing constantly, because it's about tracing everything everywhere, across frameworks, across languages, and it needs to support everything. So navigating the solutions that are out there and figuring out which ones are right
for you is not a trivial task. So we're gonna work on how to get past some of these hurdles today. We're gonna start by talking about the theory, which will help you get comfortable with the fundamentals, and then we'll cover a checklist for evaluating distributed tracing systems.
Yeah! I love that trick. So let's start with the basics. Blackbox tracing. The idea of a blackbox is that you do not know about and you can't change anything inside your applications.
So an example of blackbox tracing would be capturing and logging all of the traffic that comes in and out at a lower level than your application, like at your TCP layer. All of that data, it goes into a single log, it's aggregated, and then with the power of statistics, you just kind of get to magically understand the behavior of your system
based on timestamps. But I'm not gonna talk a lot about blackbox tracing today, because for us at Heroku, it was not a great fit, and it's not a great fit for a lot of companies for a couple of reasons. One is that you need a lot of data to get accuracy based on statistical inference. And because it uses statistical analysis, it can have some delays returning results.
But the biggest problem is that in an event-driven system, like Sidekiq, or a multithreaded system, you can't guarantee causality. And what does that mean exactly? So this is sort of an arbitrary code example, but it helps to show that if you have service
one, kicks off an async job, and then immediately synchronously calls out to service two, there's no delay in your queue, your timestamps are gonna correlate correctly, service one, async job, awesome. But if you start getting queuing delays and latency, then a timestamp might actually
make it consistently look like your second service is making that call. So whitebox tracing is a tool that people use to help get around that problem. It assumes that you have an understanding of the system, you can actually change your system. So how do we understand this path that our request makes to our system?
We explicitly include information about where it came from using something called metadata propagation, and that is a type of whitebox tracing. It's just a fancy way of saying that we can change our Rails apps or any kind of app to explicitly pass along information so that you have an explicit trail of how things go.
And finally, another benefit of whitebox tracing is real-time analysis. It can be almost real-time to get results. A very short history of metadata propagation: the example that everyone talks about when they talk about metadata propagation is Dapper,
and the open source library that it inspired, called Zipkin. The Dapper paper was published by Google in 2010, but it's not actually the first distributed systems debugging tool to be built. And so why is Dapper so influential? Well, honestly, it's because in contrast to all of these other systems that came before
it, those papers were published pretty early in their development. But Google published this paper after it had been running in production at Google scale for many, many years. And so they're not only able to say that it's viable at a scale like Google scale, but also that it was valuable.
And so next comes Zipkin, a project that was started at Twitter during their very first Hack Week, and their goal was to implement Dapper. They open sourced it in 2012, and it is currently maintained by Adrian Cole, who is not actually at Twitter anymore. He's at Pivotal, and he spends most of his time working in the distributed tracing ecosystem.
So from here on out, when I use the term distributed tracing, I'm going to talk about Dapper- and Zipkin-like systems, because "whitebox metadata propagation distributed tracing systems" is not quite as zippy. And if you want to read more about things beyond just metadata propagation, there is
a pretty cool paper that gives an overview about tracing distributed systems beyond this. So how do we actually do this? I'm going to walk us through a few main components that power most systems that are of this caliber. So first is the tracer. It's the instrumentation you actually install in your application itself.
There's the transport component, which takes that data that they collect and sends it over to the distributed tracing collector. That's a separate app that runs. It processes, it stores the data, and it stores it in the storage component. And then finally, there's a UI component that's typically running inside of that that allows you to view your tracing data.
So we'll start with the level closest to your application itself. That's the tracer. It's how you trace individual requests, and it lives inside your application. In the Ruby world, it's installed as a gem, just like any other performance monitoring agent that would monitor a single process, and a tracer's job is to record data from
each system so that we can tell a full story about your request. You can think of the entire story of a single request life cycle as a trace. This whole system here captured in a single trace. Next vocab word, span. Within a single trace are many spans.
It's a chapter in that story. So in this case, our e-commerce app calling out to the order service and getting a response back, that's a single span. In fact, any discrete piece of work can be captured by a span. It doesn't have to be network requests. So if we want to start mapping out the system, what kind of information are we going to start passing along?
So you could start by just doing a request ID so that you know that every single path that this took through, you query your logs and you could see that's all one request. But you're going to have the same issue that you have with black box tracing. You can't guarantee causality just based on the timestamps. So you need to explicitly create a relationship between each of these components.
And a really good way to do this is with a parent-child relationship. The first request in the system doesn't have a parent because somebody's just clicked a button, loading a website. So we know that's at the top of the tree. And then when your auth process talks to the e-commerce process, it's going to modify the request headers to pass along just a randomly generated ID as a parent ID.
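To make that concrete, in Zipkin's B3 propagation format those IDs travel as plain HTTP headers, roughly like this (the values here are made up):

```ruby
# Illustrative headers added to the outgoing request from the auth process
# to the e-commerce process. B3 is the header convention Zipkin uses.
headers = {
  "X-B3-TraceId"      => "d6e9329d67b6146b", # stays the same for the whole trace
  "X-B3-ParentSpanId" => "1",                # the span that made this call
  "X-B3-SpanId"       => "f067aa0ba902b7e2", # a fresh ID for this hop
  "X-B3-Sampled"      => "1"                 # whether this trace is being recorded
}
```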
Here it's set to one, but it could really be anything. And it keeps going on and on with each request. So trace is ultimately made up of many of these parent-child relationships. And it forms what's called a directed acyclic graph. And by tying all of these things together, we're able to actually not just understand
this as an image, but with a data structure. And so we'll actually talk in a few minutes about how the tracer actually accomplishes that in our code. So we've got our relationships. If that's all we wanted to know, we could stop there. But that's not really going to help us in the long term with debugging.
Ultimately, we want to know more about timing information. And we can use annotations to build a richer ecosystem of information around these requests. By explicitly annotating with timestamps when each of these things occurs in the cycle, we can begin to understand latency.
And hopefully you're not seeing a second of latency between every event, and these would definitely not be human-readable timestamps, but this is just an example. So let's zoom in to our auth process and how it talks to the e-commerce process. So in addition to passing along the trace ID, parent ID, and span ID, we'll also annotate the request with a tag and a timestamp.
And by having our auth app annotate that it's sending the request, and our e-commerce app annotate that it received the request, this will actually give you the network latency between the two. So if you see a lot of requests queuing up, you would see that time go up. And on the other hand, you can compare two timestamps between
the server receiving and the server sending back the information. And you would be able to see if your app is getting very slow, you'll see latency increase between those two things. And then finally, you're able to close out that full cycle by indicating that the client has received that final request.
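Conceptually, the recorded span ends up looking something like this. This is not any particular library's data structure; the annotation names follow Zipkin's classic client-send / server-receive / server-send / client-receive convention, and the timestamps are made-up microseconds:

```ruby
span = {
  trace_id:  "d6e9329d67b6146b",
  span_id:   "f067aa0ba902b7e2",
  parent_id: "1",
  name:      "GET /orders",
  annotations: [
    { value: "cs", timestamp: 1_493_049_600_000_000 }, # client sent the request
    { value: "sr", timestamp: 1_493_049_600_050_000 }, # server received it (gap = network/queueing)
    { value: "ss", timestamp: 1_493_049_600_250_000 }, # server sent the response (gap = server work)
    { value: "cr", timestamp: 1_493_049_600_300_000 }  # client received the response
  ]
}
```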
Let's talk about what happens to that data. Each process is gonna send information via the transport layer to a separate application that's gonna aggregate that data and do a bunch of stuff to it. So how does that process not add latency to your system? First, it's only gonna propagate those IDs in
band by adding information to your headers. Then it's gonna gather that data and just report it out of band to a collector. And that's what actually does the processing and the storing. For example, Zipkin is gonna use SuckerPunch to make a threaded async call out to the Zipkin server.
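That out-of-band reporting might look roughly like this if you sketched it yourself with SuckerPunch. This is not Zipkin's actual reporter, and the collector URL is hypothetical:

```ruby
require "sucker_punch"
require "net/http"
require "json"

# Spans get shipped to the collector on a background thread,
# so the user's request never waits on the tracing backend.
class SpanReporter
  include SuckerPunch::Job

  def perform(spans)
    uri = URI("https://zipkin.example.com/api/v2/spans") # hypothetical collector endpoint
    Net::HTTP.post(uri, spans.to_json, "Content-Type" => "application/json")
  end
end

# Once a span is finished, hand it off and move on:
SpanReporter.perform_async([{ name: "GET /orders", trace_id: "d6e9329d67b6146b" }])
```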
And this is gonna be similar to things that you would see in metrics like Librato, any of your logging and metric systems that use threads. So our data collected by the tracer, transported via the transportation layer, collected, finally ready to be viewed in the UI. So this graph that we're viewing here is a good way to understand how the request
travels, but it's not actually good at helping us understand latency, or even understand the relationship between calls within systems. So we're gonna use Gantt charts or swim lanes instead. So the OpenTracing.io documentation has a request tree similar to ours. And looking at it in this format, you'll actually be able to see
each of the different services in the same way that we did before. But now we're able to better visualize how much time is spent in each sub request, and how much time that takes relative to the other requests. You can also, like I mentioned earlier, instrument and visualize internal traces that are happening inside a service,
not just service to service communication. Here you can see billing service is being blocked by the authorization service. You can also see that we have a threaded or parallel job execution inside the resource allocation service. And if there started to be a widening gap between these two adjacent services,
it could mean that there is network request queuing. Doo, doo, doo, doo, doo, doo, doo, doo. I still can't help myself but sing and do a little dance when I do that, so. All right, we know what we want, how are we gonna get it done?
So at the minimum, we wanna record information when a request comes in and when a request goes out. How do we do that programmatically in Ruby? Usually with the power of Rack middleware. If you're running a Ruby app, the odds are that you are also running a Rack app. It's a common interface for servers and applications to talk to each other.
Sinatra, Rails both use it. It serves as a single entry and exit point for client requests that are coming in the system. The powerful thing about Rack is that it's very easy to add middleware. So that can sit between your server and your application and allow you to customize these requests. Basic Rack app, if you're not familiar with it, Ruby object.
It's gonna respond to call, which takes one argument, and in the end returns status, headers, body. That's the basics of a Rack app. And under the hood, Rails and Sinatra are doing this. And the middleware format is a very similar structure.
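Here's a minimal sketch of both pieces in a config.ru; the Tracer here is just a stand-in that prints annotations, not a real gem's API:

```ruby
# config.ru — run with `rackup`

# The smallest possible Rack app: responds to `call`, takes the env hash,
# and returns status, headers, body.
class HelloApp
  def call(env)
    [200, { "Content-Type" => "text/plain" }, ["hello"]]
  end
end

# Stand-in tracer: just prints an annotation with a timestamp.
class Tracer
  def self.record(annotation, env)
    puts "#{annotation} #{env["PATH_INFO"]} #{Time.now.to_f}"
  end
end

# Middleware has the same shape: it wraps the next app and calls it.
class TracingMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    Tracer.record("server_receive", env)    # a new span: we got the request
    status, headers, body = @app.call(env)  # keep calling down the chain
    Tracer.record("server_send", env)       # the response is going back out
    [status, headers, body]
  end
end

use TracingMiddleware
run HelloApp.new
```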
It's going to accept an app, which could be your app itself or another set of middleware, respond to call, and it needs to call app.call at the end so it keeps following down the tree, and at the end return the response. So if we wanted to do some tracing inside of our middleware, what might that method look like? So like we talked about earlier,
we're gonna wanna start a new span on every request. It's gonna record that it received the request with a server received annotation like we talked about earlier. It's gonna yield to our Rack app to make sure that it executes in the next step in the chain and is actually gonna run your code. And then it returns back that the server has sent information back to
the client. So this is just pseudo code, this is not actually a running tracer. But Zipkin has a really great implementation that you can check out online. So then we can just tell our application, use our middleware to instrument our requests. And you're never gonna wanna sample every single request that comes in,
because that is crazy and overkill when you have a lot of traffic. So tracing solutions will typically ask you to configure a sample rate. We got our request coming in. But in order to generate that big relationship tree that we saw earlier, we're also gonna need to continue to record information when our request
leaves our system. So these can be requests to external APIs like Stripe, GitHub, whatever. But if you control that next service that it's talking to, you can keep building up this chain. And we can do that with more middleware. If you use an HTTP client that supports middleware like Faraday or Excon, you can easily incorporate tracing into the client.
I'll use Faraday as an example because it has a pretty similar pattern to Rack. So match the method signature, just like we did with Rack. And honestly, Faraday's is very similar to Rack. If you're using something like Excon, it's gonna look a little bit different, but this is just an example.
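A sketch of that client-side Faraday middleware might look roughly like this. The header names follow Zipkin's B3 convention, and the span-ID bookkeeping is simplified and hypothetical:

```ruby
require "faraday"
require "securerandom"

# Injects tracing headers into outgoing requests before they leave the process.
class TracingClientMiddleware < Faraday::Middleware
  def call(env)
    env.request_headers["X-B3-TraceId"] ||= SecureRandom.hex(8)
    if (parent_id = Thread.current[:current_span_id]) # hypothetical per-request bookkeeping
      env.request_headers["X-B3-ParentSpanId"] = parent_id
    end
    env.request_headers["X-B3-SpanId"] = SecureRandom.hex(8)
    # A real tracer would record "client send" here and "client receive" on completion.
    @app.call(env)
  end
end

# Wiring it into a Faraday connection:
conn = Faraday.new(url: "https://orders.example.com") do |f|
  f.use TracingClientMiddleware
  f.adapter Faraday.default_adapter
end
```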
So pass in our HTTP client app, we'll do some tracing and keep calling down the chain, it's pretty similar. But the tracing itself is gonna be a little bit different. So we're actually gonna need to manipulate the headers to pass along some tracing information. That way, if we're calling out to an external service like Stripe,
they're gonna completely ignore these headers because they don't know what they are. But if you're actually calling to another service that's in your purview, you'll be able to continue that chain and see further down. So each of these colors is gonna represent an instrumented application. So we wanna record that we're starting a client request,
and record that we're receiving the client response, and add in the middleware just like we did with Rack. It's pretty easy. You can even do it programmatically, like automatically for all your requests for some of your HTTP clients. So we've got some of the basics for how distributed tracing is implemented. Let's talk about how to even choose, in this ecosystem,
what system is right for you. So the first question is, how are you gonna get this working? I'm gonna give a caveat that this ecosystem is ever changing. So this information could actually be incomplete right now. And it could be obsolete, especially if you were watching this at home on the web.
But let's talk about whether or not you should buy a system. Yes, if the math works out for you, it's kinda hard for me to really say whether you should do that. If your resourcing is limited, and you can find a solution that works for you, and it's not too expensive, probably.
Unless you're running a super complex system. LightStep and TraceView are examples that offer Ruby support. Your APM provider might actually have it too. Adopting an open source solution is another option. For us, the paid solutions just didn't work.
So if you have people on your team who are comfortable with the underlying framework, and you have some capacity for managing infrastructure, then this really could work for you. So for us, we're a small team, we're just four people, four engineers. We got Zipkin up and running in a couple of months, while also doing a million other things.
But partially because we were able to leverage Heroku to make the infrastructure components pretty easy. And if you wanna use a fully open source solution with Ruby, Zipkin is pretty much your only option, as far as I know. So you may have heard of OpenTracing. You might be like, Stella, what about this OpenTracing thing? That seems cool.
A common misunderstanding here: OpenTracing is not actually a tracing implementation. It is an API. So its job is to just standardize the instrumentation like we kinda walked through before, so that all of the tracing providers that conform to the API are interchangeable on your app side. So if you wanna switch from an open source provider to
a paid provider or vice versa, you don't need to re-instrument each and every service that you maintain. Because in theory, they're all being good citizens. They're conforming to this API that is all consistent. So where is OpenTracing at today? They did publish Ruby API guidelines back in January.
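To give a feel for it, coding against the opentracing gem's API looks roughly like this. By default the global tracer is a no-op, so nothing is actually reported until a vendor's tracer is plugged in; method names are from the gem, but double-check them against the version you install:

```ruby
require "opentracing"

# At boot, a vendor's OpenTracing-compatible tracer would be plugged in here.
# The gem's default is a no-op tracer, so this runs without any backend.
OpenTracing.global_tracer = OpenTracing::Tracer.new

# Application code only ever talks to the generic API:
OpenTracing.start_active_span("fetch_orders") do |scope|
  scope.span.set_tag("user.id", 42)
  # ... do the actual work inside the span ...
end
```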
But only LightStep, which is a product in private beta, has actually implemented a tracer that conforms to that API. So existing tracer implementations like Zipkin are gonna need to have a bridge between the tracing implementation that they have today and the OpenTracing API.
And the other thing that's still just not clear is interoperability. So for example, if you have a Ruby app using the OpenTracing API, everything's great, and you have this paid provider that doesn't support Go. You can't necessarily use two providers that use OpenTracing and still send them to the same
collection system. So it's really only at that app level. Another thing to keep in mind is that for both open source and hosted solutions, Ruby support means a really wide range of things. At the minimum, it means that you can start and end a trace in your Ruby app, which is good.
But you might still have to write all of your own Rack middleware, your HTTP library middleware. It's not a deal breaker. We ended up having to do that for Excon for Zipkin. But it may be an engineering time commitment that you are not prepared to make. And then unfortunately, because this is tracing everywhere,
you're gonna need to rinse and repeat for every language that your company supports. So you're gonna have to walk through all of these thoughts and these guidelines for Go or for JavaScript or for any other language. So some big companies find that like with the custom nature of their infrastructure, they're gonna need to build out some
or all of the elements in house. Etsy, obviously Google, they're running fully custom infrastructure. But other companies are actually building custom components that tap into open source solutions. Pinterest, for example, built an open source add-on to Zipkin, and Yelp did something similar.
So if you're really curious about what other companies are doing, large and small, Jonathan Mace at Brown University published a snapshot of 26 companies and what they're doing. It is already out of date. Like one of those things is already wrong, even though it was like literally published a month ago. So 15 are using Zipkin, nine are using
custom internal solutions. But yeah, most people are actually using Zipkin. So another component about this is what are you running in house? What is your team or your ops team, what do they wanna run in house and are there any restrictions? There is this dependency matrix of the tracer
and the transport layer which need to be compatible with each one of your services, so JavaScript, Go, Ruby. And so both the tracer and the transport layer need to be compatible across the board. So for example, for us, HTTP and JSON is totally fine for a transport layer. We just literally call out with web requests
to our Zipkin collector. But if you have a ton of data and you need to use something like Kafka, you might think that's cool and it's totally supported. But if you look at the documentation, it's gonna say Ruby and then you're gonna be like, wait, no, if I dig in four layers deep into this documentation, it's only JRuby. So that's like a total gotcha. And so for each of these,
you really should just build a spreadsheet because it's pretty challenging to make sure you're covering everything. The collection and the storage layers aren't really related to the services that you run, but they might not be the kind of apps that you're used to running. So for example, Zipkin is a Java app, which is totally different from the apps that my team runs.
Another thing you need to figure out is whether or not you need to run a separate agent on the host machine itself. So for some solutions, and this is why we had to exclude a lot of them, you actually need to install an agent on each host for each service that you run. And because we run Heroku on Heroku if we can,
we can't really do that because we can't just give root level privileges to an agent that's running on a dyno. Another thing to consider is authentication and authorization. Who can see and submit data to your tracing system? For us, Zipkin was missing both of those components.
And it makes sense because it really needs to be everything for everybody. And so also adding on authentication and authorization on top of that for every single company to use that open source library is not really reasonable. So you can run it inside a VPN without authentication. The other option is using a reverse proxy which is what we ended up doing.
So we used two buildpacks, apt and the runit buildpack. And so we were able to get nginx onto our Heroku slug, which is just a bundle of dependencies and your code, using apt, which is just a package manager for Linux. So we can download and install a specific version of nginx to run as a reverse proxy.
runit allows us to run our Zipkin application and nginx alongside each other in the same host. And we didn't want anybody on the internet to just be able to send data to Zipkin. Like if you just suddenly started sending data to our Zipkin instance, that would be pretty weird. So we wanted to make sure
we're only having Heroku applications interacting with it. And so we decided to use basic auth for that. We used htpasswd to set some team-based credentials in a flat file, because we only had about 25 different basic auth configurations that we thought we'd be using. And it ends up looking like this from an architecture diagram standpoint.
The client makes a request, nginx is gonna intercept that, check against the basic auth and make sure it's valid. And then if it is, just forward it along to Zipkin, otherwise it returns an error. And so adding authentication on the client side itself was as easy as going back to that rack middleware file
and updating our host name with basic auth. So that was a really good solution for us. We also didn't want any of y'all to be able to see our Zipkin stuff on the internet, because right now if you just run a Zipkin instance, there is nothing to keep anyone from seeing your data if there's no authorization. So we used Bitly's oauth2_proxy,
which is super awesome. It allows us to restrict access to only people with Heroku.com email addresses. And so if you're on a browser and you try to access our Zipkin instance, we're gonna check to see if you're authorized, otherwise this OAuth2 proxy is gonna handle the full authentication.
So it's configurable with different load balancers slash reverse proxies and OAuth providers. So it's actually really cool if you need to run any kind of OAuth in front of a process. But even if you're going the hosted route and you don't need to handle any of this infrastructure,
you're gonna need to ask about how are you gonna get access to people who need it? Because you don't wanna be the team who has to manage this handoff of sign-ons and sign-ins and oh, you need to email this person. You don't wanna manage all that. So just make sure it's clear with your hosted provider how you're gonna manage access.
Security. If you have sensitive data in your systems, which a lot of people do, there are two places specifically where we had to really keep an eye out for security issues. One is custom instrumentation. So for example, my team, the tools team, added some custom internal tracing of our own services
using prepend to trace all of our Postgres calls. And so, like we did with the middleware earlier, we're wrapping that behavior with tracing. But the problem here is, if you're calling to_s on that SQL and the SQL statement has any kind of private data,
you wanna make sure that you're not just storing that blindly into your system, especially if you have PII, any kind of security compliance information that you're storing. And the second thing is that you need to talk through before it happens what to do when your data leaks. For us, running our own system is a benefit
because if we accidentally delete data or leak data, it's easier for us to validate that we've wiped that data when we own it than having to coordinate with a third-party provider. It doesn't mean you shouldn't use a third-party solution, but you should ask them ahead of time,
what do you do when data leaks? What's the turnaround? How can we verify it? You don't wanna do that when you're in the middle of a crisis. The last thing to consider is the people part. Is everybody on board for this? The nature of the distributed tracing is that the work is distributed. Your job is probably not going to end
when you just get the service up and running. You're gonna actually need to instrument apps. And there's a lot of cognitive load, as you can see from the 30 minutes we've talked about this, that goes into understanding how distributed tracing works. So set yourself up for success ahead of time by getting it onto teams' roadmaps if you can.
Otherwise, start opening PRs; that's the other option. Even then, you're probably gonna need to talk through what it is and why you're adding it, but it's a lot easier when you can show them code and how it actually interacts with their system. So here's the full checklist for evaluation.
We'll cover one last thing before I let y'all go. If you're thinking this is so much information, where do I even go next from here? My advice is if you have some free time at work with a 20% time or hack week, start by trying to get Docker Zipkin up and running. Even if you don't plan to use Zipkin at all,
it includes a test version of Cassandra built in. So you just need to get the Java app itself up and running, and you don't have to worry about all of these different components right off the bat. If you're just instrumenting Ruby apps, then Zipkin is compatible. You can even deploy this onto Heroku.
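And for the Ruby app itself, wiring in the zipkin-tracer gem is roughly this much configuration. This is a sketch based on the gem's README; option names may have shifted between versions, and the collector URL is hypothetical:

```ruby
# Gemfile: gem "zipkin-tracer"
# config/application.rb, inside your Application class:
config.middleware.use ZipkinTracer::RackHandler, {
  service_name: "ecommerce-web",               # how this app shows up in the Zipkin UI
  json_api_host: "https://zipkin.example.com", # your Zipkin collector
  sample_rate: 0.1                             # trace roughly 10% of requests
}
```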
So once you're able to just get this deployed, the UI loaded, just instrument one single app. Even if the only thing that app does is make a third-party Stripe call, it'll help you turn some of these really abstract concepts into concrete concepts. That's all I got today, folks. If you have any questions, I'm actually heading straight to the Heroku booth
after this in the big expo hall. So stop by, I'll be there for about an hour. Come ask me any questions or talk about Heroku or get some stickers. See ya.