
Tracing, Fast and Slow: Digging into & improving your web service’s performance


Formal Metadata

Title
Tracing, Fast and Slow: Digging into & improving your web service’s performance
Title of Series
EuroPython 2017
Number of Parts
160
Author
Lynn Root
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language
English

Content Metadata

Abstract
Tracing, Fast and Slow: Digging into & improving your web service’s performance [EuroPython 2017 - Talk - 2017-07-11 - Anfiteatro 1] [Rimini, Italy]
Do you maintain a Rube Goldberg-like service? Perhaps it’s highly distributed? Or you recently walked onto a team with an unfamiliar codebase? Have you noticed your service responds slower than molasses? This talk will walk you through how to pinpoint bottlenecks, approaches and tools to make improvements, and make you seem like the hero! All in a day’s work. The talk will describe various types of tracing a web service, including black- and white-box tracing, tracing distributed systems, as well as various tools and external services available to measure performance. I’ll also present a few different rabbit holes to dive into when trying to improve your service’s performance.
Transcript: English (auto-generated)
Hello, good afternoon. Does anyone actually get the reference in my title? I see a few, yes. It's a reference to a really awesome book, Thinking, Fast and Slow. I highly recommend it, though it has no real relation to this actual talk.
So yes, my name is Lynn Root. I am a site reliability engineer at Spotify. I also do a lot of open source evangelism internally, and you might know me from PyLadies as well. Also, unfortunately, I'm going to take up the whole time slot, so if you have questions or want to chat, you can come join me for the conveniently timed coffee break right after this.
Okay, another quick question: has anyone read the Site Reliability Engineering book, aka the Google SRE book? I think I see a few hands. All right. Well, I highly recommend that book, but the TL;DR of nearly every chapter seems to be "use distributed tracing."
With the prevalence of microservices, where you may or may not own all the services that a request might flow through, it's certainly imperative to understand where your code fits into the grand scheme of things and how everything operates together. There are three main needs for tracing a system:
performance debugging, capacity planning, and problem diagnosis, although it can help address many other issues as well. While this talk has a slight focus on performance debugging, these techniques are certainly applicable to other needs. I have a bit of a jam-packed talk today.
I'll start off with an overview of what tracing is and the problems we can try to diagnose with it. I'll also talk about some general types of tracing we can use, and what key things to think about when scaling up to larger distributed systems. And then the inspiration for this talk, which stemmed from me trying to improve the performance of one of my own team's services,
which sort of implies we don't really trace at Spotify. So I'll be running through some questions to ask and approaches to take when diagnosing and fixing your service's bottleneck. And finally, I'll wrap up with some tracing solutions for profiling performance.
As I mentioned before, I won't have time for questions, so you can catch me right out there. All right: in the simplest of terms, a trace follows the complete workflow from the start of a transaction or request to its end, including the components that it flows through. For a very simple web application, it's pretty easy to understand the workflow of a request.
But then add some databases, separate the front end from the back end, maybe throw in some caching, have an external API call, put it all behind a load balancer, then scale up tens, hundreds, or thousands of times. It gets kind of difficult to put together the workflows of requests.
Historically we've been focused on machine-centric metrics, including system-level metrics like CPU, disk space, and memory, as well as app-level metrics like requests per second, response latency, database writes, and so on. Following and understanding these metrics is quite important, but it gives no view into a service's dependencies or its dependents,
and it's also not possible to get a view of the complete flow of a request, nor to develop an understanding of how one's service performs at scale. A workflow-centric approach allows us to understand the relationships of components within an entire system.
We can then follow a request from beginning to end to understand bottlenecks, home in on the anomalous paths, and figure out where we need to add more resources. Even when looking at a very simplified system, where we have a load balancer, a front end, a back end, a database, maybe an external dependency on a third-party API, plus redundant systems,
it gets particularly confusing to follow a request. So how do we debug a problem in a rare workflow? How do we know which component of the system is the bottleneck, or which function call is taking the longest? Is there another app on my host causing distortion of machine-centric or performance metrics, something like the noisy neighbor problem?
With so many potential paths that a request can take, and the potential for issues at each and every node and edge, this can be mind-numbingly difficult if we continue to be machine-centric. End-to-end tracing allows us to get a bigger picture
to address these concerns. And looking at the magnitudes we operate at at Spotify, you can see that tracing, if we did it, would help us a lot. So, real quickly, there are a few reasons why we trace a system. The one that inspired this talk is performance analysis.
This is trying to understand what happens at the 50th or 75th percentile, the steady-state problems, and it helps us identify latencies, resource usage, and other performance issues. We're also able to answer questions like: did this particular deploy of the service have an effect on the latency of the whole system?
Tracing can also clue us in on anomalous request flows, the 99.9th percentile. The issues can still be related to performance, or it can help identify problems with correctness, like component failures or timeouts. Profiling is very similar to the first, but here we're just interested in particular components or aspects of the system;
we don't necessarily care about the full workflow. Fourth, we can also answer questions of what a particular component depends on and what depends on it, which is particularly useful for complex systems.
With dependencies identified, we can also attribute particularly expensive work, like component A putting a significant disk-write workload on component B, which can be helpful when attributing costs to teams, service owners, or component owners.
And finally, we're able to create models of our entire systems that allow us to ask what-if questions, like: what would happen to component A if we did a disaster recovery test on component B? So there are various approaches to tracing; I'll only highlight three of them here.
The first is manual. It's also very simplistic: you are just generating your own trace IDs and adding them to your logs. These are very simple things that can be added to your web service, especially one that does not have dependent or depending components that you don't have access to. You won't get any pretty visualizations or help with centralized collection
beyond what you typically have with your logs, but it can still provide insight. So this is a Flask example, super simple, using a decorator: you can simply add a UUID to each request received, as a header, then log at particular points of interest,
like at the beginning and end of every request, and at any other in-between components or function calls where you want to propagate the header. This is exactly what I ended up doing for my service, which made me wish for a better way; hence this talk. I must admit I do a lot of conference-driven development.
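The slide code isn't captured in the transcript, but a minimal sketch of the idea might look like this, assuming a Flask app and the standard logging module; the X-Request-ID header name and the traced helper are just illustrative:

```python
import logging
import uuid
from functools import wraps

from flask import Flask, g, request

app = Flask(__name__)
logger = logging.getLogger("my_service")


def traced(view):
    """Attach a request ID to every request and log its start and end."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        # Reuse an incoming X-Request-ID header, or mint a new UUID.
        g.request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        logger.info("request=%s started path=%s", g.request_id, request.path)
        try:
            return view(*args, **kwargs)
        finally:
            logger.info("request=%s finished", g.request_id)
    return wrapper


@app.route("/home")
@traced
def home():
    # Pass g.request_id along as a header on any downstream calls made here.
    return "hello"
```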
If your app is behind an nginx instance that you're able to manipulate, you can also turn on its ability to stamp each request with a request ID header, as you see here with add_header and proxy_set_header. You can simply add the request ID to nginx's logs as well.
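Again, the slide itself isn't in the transcript; a minimal sketch along those lines, using nginx's built-in $request_id variable, where the upstream address and log format name are placeholders:

```nginx
http {
    # $request_id is a built-in nginx variable (nginx >= 1.11.0).
    log_format traced '$remote_addr [$time_local] "$request" $status $request_id';

    server {
        listen 80;
        access_log /var/log/nginx/access.log traced;

        location / {
            # Return the generated ID to the client...
            add_header X-Request-ID $request_id;
            # ...and pass the same ID upstream so the app can log and propagate it.
            proxy_set_header X-Request-ID $request_id;
            proxy_pass http://127.0.0.1:5000;
        }
    }
}
```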
Next up is black-box tracing. This is tracing with no instrumentation across the components: it tries to infer the workflows and relationships by correlating variables and timing within already-defined log messages. From there, relationship inference is done via statistical or regression analysis.
This is easiest with centralized logging, and if there's somewhat of a standardized schema for log messages that contain something like an ID or a timestamp. It's particularly useful if instrumenting an entire system is too cumbersome, or you can't otherwise instrument components that you don't own, and as such it's quite portable.
There's very little to no overhead, but it does require a lot of data points in order to correctly infer relationships. It also lacks accuracy, given the absence of instrumentation in the components themselves, as well as the ability to attribute causality with
asynchronous behavior and concurrency. Another approach to black-box tracing can be through network tapping, using sFlow, nfdump, or iptables packet data, which I am sure the NSA is quite familiar with themselves. And then the final type of tracing is through metadata propagation.
This approach was made popular by Google's research paper on Dapper. Components are instrumented at particular trace points to follow causality between functions, components, and systems, or even within common RPC libraries like gRPC, which will automatically add metadata to each call.
The metadata that is tracked includes a trace ID, which represents one single trace or workflow, and a span ID for each point in a particular trace, like a request sent from a client, a request received by a server, or a server responding, along with the span's start and end time.
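As a rough illustration of what gets threaded through each call, here is a minimal sketch using Zipkin's B3-style header names; the downstream URL and helper names are just placeholders, not anything from the talk:

```python
import time
import uuid

import requests


def child_trace_headers(incoming_headers):
    """Build trace metadata for an outgoing call, continuing the incoming trace."""
    trace_id = incoming_headers.get("X-B3-TraceId", uuid.uuid4().hex)
    parent_span_id = incoming_headers.get("X-B3-SpanId")
    headers = {
        "X-B3-TraceId": trace_id,               # shared by the whole workflow
        "X-B3-SpanId": uuid.uuid4().hex[:16],   # unique per hop
    }
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


def call_downstream(incoming_headers):
    headers = child_trace_headers(incoming_headers)
    start = time.time()                          # the span's start time
    response = requests.get("http://downstream.example.com/api", headers=headers)
    duration = time.time() - start               # and its duration
    return response, duration
```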
This approach works best when the system itself is designed with tracing in mind, but not many people do that, right? It avoids the guesswork of inferring causal relationships; however, it can add a bit of overhead to response time and throughput, so sampling traces limits the burden on the system and on data-point storage.
Sampling anywhere between 0.01% and 10% of requests is often plenty to get an understanding of a system's performance. When you start to have many microservices and scale out with many more resources,
there are a few points to keep in mind when instrumenting your system, particularly with the metadata propagation approach. In terms of what to keep in mind (and I'll go into detail about each in a second): we want to know what relationships to track, essentially how to follow a trace and what is considered part of a workflow;
how they are tracked, since constructing metadata to track causal relationships is particularly difficult, and there are a few approaches, each with their own fortes and drawbacks; then how to reduce the overhead of tracking, where the sampling approach one chooses is largely defined by what questions you're trying to answer with your tracing, and there may be a clear answer, but not without its own penalties; and
finally, how to visualize, where the visualizations needed will also be informed by what you're trying to answer with tracing. All right, so: what to track. When looking within a request, we can take two points of view, either the submitter point of view or the trigger point of view.
The submitter point of view follows, or just focuses on, one complete request, and doesn't take into account whether part of that request is caused by another request or action. So, for instance, the cache eviction here that was actually triggered by request two is
still attributed to request one, since its data comes from the first request. The trigger point of view focuses on the trigger that initiates the action, where, in the same example, request two evicts the cache from request one, and therefore the eviction is included in request two's trace.
Choosing which to follow depends on the answers you're trying to find. For instance, it doesn't really matter which approach is chosen for performance profiling, but following trigger causality will help detect anomalies by showing critical paths. All right, how to track, or essentially what is needed in your metadata. This boils down to the fact that
it's very difficult to reliably track causal relationships within a distributed system. The sheer nature of a distributed system implies issues with ordering events and traces that happen across many hosts, and there might not be a globally synchronized clock available,
so care must be taken when deciding what goes into the metadata that is threaded through an end-to-end trace. Using a random ID, like a UUID or the X-Request-ID header, will identify causally related activity, but then tracing implementations
must use some sort of external clock to collate traces. In the absence of a globally synchronized clock, or to avoid issues like clock skew, network send and receive messages can be used to construct causal relationships, because you can't exactly receive a message before it's sent,
and a lot of tracing implementations use this very simplistic approach. However, this approach lacks resiliency: there's the potential for data loss from external systems, or an inability to add trace points to components that are owned by others.
Tracing systems can also add a timestamp derived from a local logical clock to the workflow ID, where this isn't exactly the local system's timestamp, but either a counter or a sort of randomized timestamp that is paired with a trace message. With this approach, the tracing system doesn't need to spend time ordering the traces
it collects, since the order is explicit in the clock data, but parallelization and concurrency can complicate understanding these relationships. One can also add the previously executed trace points to the metadata itself to understand all the forks and joins, and it also allows immediate availability of the tracing data as soon as the workflow ends, because there's no need to spend time
collating or establishing the order of causal relationships. But as you can imagine, the metadata will only grow in size as it follows the workflow, adding to the payload. So it basically boils down to this: if you really care about the payload of requests, then a simple unique ID is your go-to,
but at the expense of needing to infer relationships. You can then add a timestamp of sorts to help establish explicit causal relationships, but you're still susceptible to potential ordering issues of traces if data is lost.
You may add the previously executed trace points to tolerate data loss and understand the forks and joins of a trace, while gaining immediate availability of trace data since causal relationships are already established, but then you suffer in payload size. And then there's also the fact that there's no open source tracing system that actually implements this last one.
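To make the "counter paired with a trace message" idea concrete, here is a minimal Lamport-style logical clock sketch (not from the talk; the names are illustrative). Each host bumps a counter at its own trace points and merges the counter carried by incoming messages, which yields a causal ordering without a synchronized wall clock:

```python
class LamportClock:
    """Minimal logical clock: orders trace points causally without wall-clock time."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # Called at every local trace point (e.g. "request received", "db call").
        self.counter += 1
        return self.counter

    def merge(self, received_counter):
        # Called when a message arrives carrying the sender's counter:
        # the receive event must sort after the send event.
        self.counter = max(self.counter, received_counter) + 1
        return self.counter


# Sender side: stamp the outgoing trace metadata with the current counter.
clock = LamportClock()
outgoing = {"trace_id": "abc123", "logical_ts": clock.tick()}

# Receiver side: merge the counter so its own trace points sort after the send.
receiver_clock = LamportClock()
receive_ts = receiver_clock.merge(outgoing["logical_ts"])
```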
End-to-end tracing will have an effect on runtime and storage overhead no matter what you choose. For instance, if Google were to trace all web searches, despite its intelligent tracing implementation, it would impose a 1.5 percent throughput penalty and add 16 percent to the response time.
I won't go into very much detail, but there are essentially three basic approaches to sampling. First is head-based, which makes a random sampling decision at the start of a workflow and then follows it all the way through to completion.
The next one is tail-based, which makes the sampling decision at the end of the workflow, implying some caching is going on here. Tail-based sampling needs to be a little bit more intelligent, but it's particularly useful for tracing anomalous behavior.
And finally, unitary sampling, where the sampling decision is made at the trace point itself, which therefore prevents the construction of a full workflow. Head-based is the simplest and probably the most ideal for performance profiling, and both head-based and unitary sampling are most often seen in current tracing implementations; I'm not quite sure there's a tracing system that actually implements tail-based.
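A minimal sketch of head-based sampling (not from the talk), assuming the decision travels in something like Zipkin's X-B3-Sampled header: the entry point flips a weighted coin once, and every downstream hop just honors that decision:

```python
import random

SAMPLE_RATE = 0.01  # trace 1% of workflows


def sampling_decision(incoming_headers):
    """Decide once at the head of a workflow; downstream hops reuse the decision."""
    if "X-B3-Sampled" in incoming_headers:
        # Not the head of the workflow: honor whatever was already decided.
        return incoming_headers["X-B3-Sampled"] == "1"
    return random.random() < SAMPLE_RATE


def outgoing_trace_headers(sampled):
    # Propagate the decision so the whole workflow is either traced or not.
    return {"X-B3-Sampled": "1" if sampled else "0"}
```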
All right, which visualizations you choose to look at depends upon what you're trying to figure out. Gantt charts are popular and definitely quite appealing, but they only show requests from a single trace,
and you've definitely seen this type before if you've looked at the network tab of your browser's dev tools. When trying to get a sense of where the system's bottlenecks are, a request flow graph, aka a directed acyclic graph, will show workflows as they are executed and, unlike Gantt charts, can aggregate information from multiple requests
of the same workflow. Another useful representation is a calling context tree, used to visualize multiple requests of different workflows. This reveals both valid and invalid paths that a request can take, and is best for creating a general understanding of system behavior.
So the takeaway here is that there are a few things we need to consider when we trace a system. You should have an understanding of what you want to do and what questions you're trying to answer with tracing. Certainly, there will be other realizations and questions that come out of a traced system;
for example, with Dapper, Google is able to audit systems for security, asserting that only authorized components are talking to sensitive services. But without understanding what you're trying to figure out, you might end up approaching your instrumentation incorrectly. The answer to this question will help identify the approach to causality, whether
from the trigger point of view or from the submitter point of view. Then another important question: how much time do you want to put into instrumenting your system, and can you even instrument all parts? This will inform the approach you take to tracing, be it black box or not. If you can instrument all the things, or at least some of them,
it then becomes a question of what data you should propagate through the entire flow. And finally, how much of the flows do you want to understand? Do you want to understand all the requests? Then you should be prepared to take a performance penalty on the service itself, and you can have fun storing all that data.
Or is a percentage of the flows okay? If so, how do we approach sampling? That again comes from your answer to the what-do-we-want-to-know question; for understanding performance, head-based sampling is certainly fine. You also need to think about whether or not you want to capture the full
workflow of requests or only focus on a subset of a system, and this will also inform your sampling approach, be it unitary or not. So, in terms of performance and understanding where bottlenecks are, you want to try and preserve trigger causality rather than submitter causality, as it shows the critical path to that bottleneck.
Head-based sampling is fine, as we don't need intelligent sampling, and even with very low sample rates we can get a good idea of where our problem lies, since we essentially care about the 50th or 75th percentile. And finally, a request flow graph here is ideal, since we don't care about
anomalous behavior right now; we want information about the big picture rather than looking into particular individual workflows. Most often, once you are tracing a system, the problem will reveal itself, as will the solution.
But not always, so I do have a few questions to ask yourself when figuring out how to improve a service's performance. The first one: are you making multiple requests to the same service? Round-trip network calls are expensive, and perhaps there's a way to send batch requests, or to accept batch requests on your end.
Perhaps your service doesn't need to be synchronous, or it unnecessarily blocks. For example, if you're some big social networking site, can you grab a user's profile photo at the same time that you pull up their timeline, while you also grab their messages?
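A minimal asyncio sketch of that idea (not from the talk; the endpoint URLs and the use of aiohttp are just illustrative), where the three fetches run concurrently instead of blocking one after another:

```python
import asyncio

import aiohttp


async def fetch_json(session, url):
    async with session.get(url) as response:
        return await response.json()


async def load_user_page(user_id):
    # Fire off the three lookups concurrently instead of one after another.
    async with aiohttp.ClientSession() as session:
        photo, timeline, messages = await asyncio.gather(
            fetch_json(session, f"https://api.example.com/users/{user_id}/photo"),
            fetch_json(session, f"https://api.example.com/users/{user_id}/timeline"),
            fetch_json(session, f"https://api.example.com/users/{user_id}/messages"),
        )
    return {"photo": photo, "timeline": timeline, "messages": messages}


# asyncio.run(load_user_page(42))  # Python 3.7+
```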
Is the same data being repeatedly requested but not cached? Or maybe you're caching too much, or not the right data. Is the expiration too high or too low? What about your site's assets:
could they be better ordered to improve loading time? Can you minimize the amount of inline scripts, or maybe make your scripts async? Are there a lot of distinct domain lookups that add time with DNS responses? How about decreasing the number of actual files referenced, or minifying and compressing them?
There's a bunch of stuff that can be done on the front-end side. And then finally, perhaps you can use chunked encoding when returning large amounts of data: are you able to have your server produce elements of the response as they are needed, rather than trying to produce all the elements as fast as possible?
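A minimal Flask sketch of that streaming idea (illustrative, not from the talk): returning a generator lets Flask send chunks as they are produced instead of building the whole body in memory first:

```python
from flask import Flask, Response

app = Flask(__name__)


@app.route("/report")
def report():
    def generate():
        # Each yielded chunk is sent as soon as it is ready,
        # instead of materializing the entire body first.
        yield "["
        for i in range(10000):
            prefix = "," if i else ""
            yield f'{prefix}{{"row": {i}}}'
        yield "]"

    return Response(generate(), mimetype="application/json")
```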
All right, now for probably the most interesting part: the current tracing systems that are out there. There is an open standard for distributed tracing, OpenTracing, allowing developers to instrument their code without vendor lock-in, and they do this by standardizing the trace span API. One criticism I have of OpenTracing is that they don't prescribe a way to implement more intelligent sampling
other than a simple percentage and setting a priority. There's also a lack of standardization for how to track relationships, whether submitter or trigger (it's pretty much all submitter), and it's mainly just a standardization for managing the span itself. But mind you, it's a very young specification that's evolving and developing as we speak.
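For a rough idea of what that span API looks like from Python, here is a minimal sketch against the opentracing package; by default the global tracer is a no-op until a concrete implementation (such as a Zipkin or Jaeger client) is plugged in:

```python
import opentracing

# opentracing.tracer is a no-op tracer until a real implementation
# (e.g. a Jaeger or Zipkin client library) is installed in its place.
tracer = opentracing.tracer

with tracer.start_span("handle_request") as parent_span:
    parent_span.set_tag("http.method", "GET")

    # Child spans reference their parent to preserve the workflow.
    with tracer.start_span("query_database", child_of=parent_span) as child_span:
        child_span.set_tag("db.type", "postgres")
        # ... the actual work being traced goes here ...
```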
There are a few popular self-hosted solutions that support the OpenTracing specification. Probably the most widely used is Zipkin, from Twitter, which has implementations in Java, Go, JavaScript, Ruby, and Scala.
The architectural setup is basically that the instrumented app sends data out of band to a remote collector, which accepts a few different transport mechanisms, including HTTP, Kafka, and Scribe. For propagating data from a service, all of the current Python libraries only support HTTP;
there's no RPC support. Zipkin does provide a nice Gantt chart, or waterfall chart, of individual traces, and you can view a tree of dependencies, but it's essentially only a tree, with no information
like latencies or status codes or anything else. Using py_zipkin, on which other libraries are based, you can define a transport mechanism, like I did here with an HTTP transport, which can be as simple as posting a request with the content of the trace; you could otherwise make one for Kafka or Scribe. Otherwise, it's just a simple context manager, placed wherever you want to trace.
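The slide code isn't in the transcript, but a minimal py_zipkin sketch along those lines looks roughly like this (the collector URL and the service and span names are placeholders):

```python
import requests
from py_zipkin.zipkin import zipkin_span


def http_transport(encoded_span):
    # POST the encoded span to a Zipkin collector over HTTP.
    requests.post(
        "http://zipkin.example.com:9411/api/v1/spans",
        data=encoded_span,
        headers={"Content-Type": "application/x-thrift"},
    )


def handle_request():
    with zipkin_span(
        service_name="my_service",
        span_name="handle_request",
        transport_handler=http_transport,
        sample_rate=100.0,  # py_zipkin takes a percentage, 0.0-100.0
    ):
        pass  # the work you actually want traced goes here
```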
Jaeger is another self-hosted system that supports the OpenTracing specification; it comes from Uber. Rather than the application or client library reporting to a remote collector,
it reports to a local agent via UDP, which then sends the traces out to a collector. Unlike Zipkin, which supports Kafka, Elasticsearch, and MySQL, Jaeger only supports Cassandra for its storage. The UI is very similar to Zipkin's, with really pretty waterfall graphs and a dependency tree,
but again, nothing to help aggregate the performance information we're interested in. Their documentation is also horribly lacking, unfortunately, though they do have a pretty decent tutorial to walk through. Their client library for Python is a bit cringe-worthy.
This is a trimmed example from their docs, just meant to give the gist. Basically, you initialize a tracer that the OpenTracing Python library will use, and create a span and a child span with context managers. But their usage of time.sleep at the end, for yielding to the IOLoop, is a bit of a head-scratcher.
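The slide isn't in the transcript; a minimal sketch in the spirit of the jaeger_client docs of that era (the service name is a placeholder, and the trailing time.sleep is the part being criticized, included here only to reflect the documented usage):

```python
import time

from jaeger_client import Config

config = Config(
    config={
        "sampler": {"type": "const", "param": 1},  # sample everything
        "logging": True,
    },
    service_name="my-service",
)
tracer = config.initialize_tracer()

with tracer.start_span("handle_request") as span:
    span.set_tag("http.method", "GET")
    with tracer.start_span("query_database", child_of=span) as child:
        child.log_kv({"event": "query", "rows": 42})

time.sleep(2)  # yield to the Tornado IOLoop so buffered spans get flushed
tracer.close()
```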
Its docs also make mention of supporting monkey-patching libraries like requests, redis, and urllib2, so all I can say is: use at your own risk. After I presented this at PyCon a couple of months ago, like the day after, they created an issue and basically added a comment in their code
reasoning why, but I still don't get it. There are a couple of others I'm not that familiar with, including Appdash and LightStep, and there are a few more that don't have Python client libraries yet.
In case you don't want to host your own system, there are a few services out there to help. There is Stackdriver Trace from Google, not to be confused with Stackdriver Logging. Unfortunately, Google has no Python or gRPC client libraries to instrument your app with, but they do have a REST and RPC interface if you feel so inclined.
They do support Zipkin traces, though, where you can set up a Google-flavored Zipkin server, either on their infrastructure or on yours, and have it forward traces to Stackdriver. They actually make it pretty easy; I was able to spin up a Docker image and start collecting traces within a couple of minutes.
Annoyingly, they have a storage limitation of 30 days, same as their logging. My last criticism is their UI: they have simple plots of response time over the past few hours and a list of all traces automatically provided in the UI, but you have to manually create analysis reports for each time period you're interested in to get all those fancy distribution graphs;
they're not automatically generated, unfortunately. And then finally, Amazon also has a tracing service available called X-Ray. I only set up their demo app, but it looks like they do not explicitly support Python, only Node, Java, and .NET apps. But the Python SDK,
Boto, has support for sending traces to a local daemon, which then forwards them to the X-Ray service. What's nice about X-Ray, despite it being proprietary and not OpenTracing compliant, is that you're able to configure sampling rates for different URL routes of your application, based on either
a fixed number of requests per second or a percentage of requests. However, it's not possible to configure these rules with Boto. Also almost redeeming are their visualizations: while there's the typical waterfall chart, they also have a request flow graph where you can see average latencies,
captured traces per minute, and requests broken down by response status. So basically, AWS X-Ray seems pretty cool and probably the most useful out of all of these, but it'll take some time to instrument your app, and it introduces vendor lock-in. Some honorable mentions that do app performance management and measurement:
I don't have personal experience with these, but Datadog and New Relic might be of interest to some of you. All right, a quick opinionated wrap-up; I've got like a minute here. If you run microservices, you should be tracing them; otherwise, it's very difficult to understand an entire system's performance, anomalous behavior, and resource usage, among many other aspects.
However, good luck. Whether you choose a self-hosted solution or a provided service, documentation is all-around lacking. Granted, it's a very young space, very much growing as the OpenTracing standard is developing. And as I mentioned, language support isn't a hundred percent even, and it might not even be there.
There's a lack of configuration for relationship tracking, for intelligent sampling, and for the available visualizations. But it is indeed an open spec that can be influenced, or you might feel so inclined to implement your own,
to which I say good luck. And then finally, all of this, plus some pretty graphs and stuff, is up in a blog post, up here, if you're interested. Thank you.