
A Tale of Scaling Observability


Formal Metadata

Title
A Tale of Scaling Observability
Series Title
Number of Parts
131
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, adapt, copy, distribute and make the work or its content publicly available for any legal, non-commercial purpose, as long as you credit the author/rights holder in the manner they specify and pass on the work or content, including in adapted form, only under the terms of this license.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
What it's like to keep the lights on in a rapidly growing business - how we've scaled our metrics, logging and tracing beyond processing 50TB+ of telemetry a day, and what we've learned along the way. During this session, we will discuss the challenges of scaling high-load services and give a few pointers to developers to help your chosen open-source observability tool function as a product.
Transcript: English (automatically generated)
Today I'm going to tell you a story about scaling observability, a story of continuous growth. What it's like to keep the lights on in a rapidly growing business and how we've scaled our metrics, tracing and logging systems beyond processing 50 terabytes of data a day.
I'll go over some of the challenges, some key principles we've applied, and some of the lessons we've learned along the way.
A little bit about myself. My name is Thomas Armisen. I'm a staff engineer at Wise. I've been living in the UK for almost six years now, and for the past four and a half of those I've been building and scaling the observability platform at Wise. Before moving to the UK, I specialized in metrics and logging systems at Telia in Estonia.

I wasn't supposed to be here today, but as I'm volunteering at the conference and one of our speakers unfortunately caught COVID and wasn't able to make it, I fortunately had content ready to go that fit today's schedule.

It's my first year at EuroPython, my first time volunteering, and also my first time presenting at the conference. A little about the company I work for as well: Wise is a global technology company building the best way to move and manage the world's money.

We're powering money for people and businesses to pay, get paid, and spend in any currency, wherever you are, whatever you're doing. We were co-founded by Kristo Käärmann and Taavet Hinrikus back in 2011 under our original name, TransferWise. We're one of the world's fastest-growing profitable tech companies, and we're listed on the London Stock Exchange.

Some of the problems we've been solving over this time boil down to fixing international money transfers. We've been building the Wise Account, which is transforming international banking, and having created this new infrastructure, we're now opening it up to banks and businesses to build upon.
In 2024, we had 12.8 million active customers worldwide, we supported 5% of personal cross-border payment volumes globally, 66% of our customers found us via word of mouth, and 62% of our cross-border payments were instant, which for us means less than 20 seconds.

Now, to the topic at hand: today's content was created for someone like myself five years ago, and it was originally created for an observability engineering event. I hope there's still something you can take away from our experiences and learnings of scaling an observability platform. During the talk I'll skip over some parts that are less relevant to a Python engineer; if you're more interested in those bits, there's a version of this talk available on YouTube where I focus more on those areas.
Our journey runs from 2019 to 2024. Back in 2019 we started with a single monolithic app, and over that time we've worked on increasing our reliability and availability. That app got sharded into smaller and smaller pieces doing more specific things, and over those five years we've built our way up to close to a thousand microservices. Roughly speaking, the year-over-year growth in the number of services has been 30 to 35%, and our telemetry growth has followed a similar pattern.

This is roughly the rate of telemetry we get over a normal working day. The data is based on container network transmission, because that's the most neutral metric I can get: on the right you can see the metrics, which are uncompressed; in the middle we have traces; and at the very end the logging data as well. We get around 23 terabytes of metrics, 30 terabytes of traces and around 8 to 9 terabytes of logs per day, which takes us to somewhere between 60 and 70 terabytes of telemetry a day on this graph. Some of that data is compressed.
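As a rough illustration of how a figure like that can be derived, here is a minimal sketch that queries container network metrics over the Prometheus HTTP API; the Prometheus URL and namespace selector are assumptions for illustration, not Wise's actual setup.

```python
"""Estimate daily telemetry volume from container network metrics.

Assumptions: a reachable Prometheus at PROM_URL, cAdvisor metrics available,
and telemetry shippers running in namespaces matching the selector below.
"""
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint
QUERY = 'sum(rate(container_network_transmit_bytes_total{namespace=~"observability-.*"}[1h]))'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

bytes_per_second = float(result[0]["value"][1]) if result else 0.0
terabytes_per_day = bytes_per_second * 86_400 / 1e12
print(f"~{terabytes_per_day:.1f} TB of telemetry transmitted per day")
```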
And what do our metrics systems look like? It's a fairly standard open-source telemetry collection setup: we use Prometheus to collect our data, and Thanos to provide long-term storage on top of it.
I won't go into too much detail about the system, but the Prometheus side that collects the data, and the challenges we had with it, looked something like this. We started out with a single instance (plus a replica) collecting all the data from all of our services, up to the point where it started running out of memory and no longer fit on the Kubernetes worker nodes. So we sharded it by our service tiers, 1 to 4, with shard 0 for everything that was left over. In 2022 we ran into a similar issue where even the sharded setup was just too big, so we ended up sharding our sharded system even more. Having done all these things to keep the lights on (that's what you do), we realized that this generates quite a lot of network traffic, which is something cloud providers really like to charge money for, and something to keep an eye on when you're solving these kinds of problems.
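As a sketch of what tier-based sharding can look like in practice (not our actual configuration), the snippet below generates Prometheus scrape configs that keep only the pods belonging to each shard, with shard 0 catching anything without a tier label. The `tier` pod label is an assumption for illustration.

```python
"""Illustrative tier-based Prometheus sharding: one scrape config per shard."""
import yaml  # PyYAML

def scrape_config_for_shard(shard: int) -> dict:
    # Shard 0 keeps pods with no tier label; shards 1-4 keep only their own tier.
    regex = "^$" if shard == 0 else str(shard)
    return {
        "job_name": f"kubernetes-pods-shard-{shard}",
        "kubernetes_sd_configs": [{"role": "pod"}],
        "relabel_configs": [
            {
                "source_labels": ["__meta_kubernetes_pod_label_tier"],
                "regex": regex,
                "action": "keep",
            }
        ],
    }

config = {"scrape_configs": [scrape_config_for_shard(s) for s in range(5)]}
print(yaml.safe_dump(config, sort_keys=False))
```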
When we talk about logging, we talk about log analytics and real-time logging, and it's important to differentiate the two in my mind, because engineers really, really like Kibana for some reason. I don't know why. But as things stand, the ELK stack (Elasticsearch logs with Kibana) costs about ten times as much to run as a Loki or object-storage-based equivalent. I view Kibana more as a premium feature, and not necessarily something that should come as a birthright to all services.

This is what our logging collection stack looks like in general: we have a collection layer, an aggregation layer, and feature consumers that write data to different backends. The problem with this picture is that whilst we have everything in Kubernetes covered, edge cases start to creep up where different teams come to you and ask, "I really want to see my logs in Kibana, how can I do that?" That's where infrastructure details start leaking out of what was originally intended.

So this is the next evolution of the setup we had. We still had our collection layer, but in front of the aggregation layer we added another one, an ingestion layer, where we abstracted away our infrastructure details and made sure that we controlled both the producers and the consumers of our system, so we could protect it from all the craziness of the world, I guess. That's something that's good to keep in mind: try to abstract away the intrinsics of your infrastructure.
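For illustration only, a thin ingestion layer can be as simple as an HTTP endpoint that validates the tenant and forwards the batch onward, so producers never learn what the aggregation or storage backends are. The endpoint path, header name, tenant list and downstream URL below are all hypothetical.

```python
"""Minimal sketch of an ingestion layer that hides infrastructure details."""
from fastapi import FastAPI, Header, HTTPException
import httpx

app = FastAPI()
AGGREGATION_URL = "http://log-aggregation.internal/push"  # hypothetical downstream
KNOWN_TENANTS = {"payments", "cards", "platform"}         # hypothetical tenants

@app.post("/v1/logs")
async def ingest_logs(batch: list[dict], x_tenant: str = Header(...)):
    # Producers only ever see this endpoint; whatever sits behind it (Kafka,
    # Loki, Elasticsearch, ...) can change without touching any client.
    if x_tenant not in KNOWN_TENANTS:
        raise HTTPException(status_code=403, detail="unknown tenant")
    async with httpx.AsyncClient(timeout=5) as client:
        await client.post(AGGREGATION_URL, json={"tenant": x_tenant, "records": batch})
    return {"accepted": len(batch)}
```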
This is another slide that we'll just skip over; it's about Elasticsearch and what that looks like, but it's not too relevant today. Grafana Loki was actually the first of the Grafana LGTM stack components we adopted. LGTM stands for Loki, Grafana, Tempo and Mimir, which are the backends for logs, visualization, traces and metrics respectively.

We introduced Loki on the premise of providing alerting on top of logs, because we felt more confident providing real-time logs with Loki. On the backend side, I always knew it was going to be cheaper, and it turned out to be about one tenth of what our Elastic Stack cost.
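As a flavour of what alerting on top of logs can look like, the Loki ruler evaluates LogQL expressions in Prometheus-style rule files; the sketch below renders one such rule, with a made-up app name and threshold.

```python
"""Sketch of a Loki ruler alert driven by a LogQL expression."""
import yaml  # PyYAML

rule_group = {
    "groups": [{
        "name": "payments-log-alerts",
        "rules": [{
            "alert": "PaymentsErrorLogSpike",
            # Rate of log lines containing "ERROR" for a hypothetical payments app.
            "expr": 'sum(rate({app="payments"} |= "ERROR" [5m])) > 5',
            "for": "10m",
            "labels": {"severity": "warning"},
            "annotations": {"summary": "Error log rate above 5 lines/s for 10 minutes"},
        }],
    }]
}
print(yaml.safe_dump(rule_group, sort_keys=False))
```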
We also used our own service to deprecate some other services, such as a log server we had that wasn't really functional at that scale. We ended up creating a stateless virtual server that just provided command-line access for engineers to search their logs, mostly for compliance reasons.
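A stateless search front end like that can stay very thin because Loki already exposes a query API. Below is a minimal sketch of the idea; the gateway URL, tenant and example query are assumptions, and in practice Grafana's logcli covers this use case.

```python
"""Minimal stateless log search against Loki's query_range API."""
import time
import requests

LOKI_URL = "http://loki-gateway.internal"  # hypothetical gateway
TENANT = "payments"                        # hypothetical tenant

def search_logs(logql: str, minutes: int = 15, limit: int = 100) -> list[str]:
    end = time.time_ns()
    start = end - minutes * 60 * 1_000_000_000
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={"query": logql, "start": start, "end": end, "limit": limit},
        headers={"X-Scope-OrgID": TENANT},  # multi-tenancy header
        timeout=30,
    )
    resp.raise_for_status()
    lines: list[str] = []
    for stream in resp.json()["data"]["result"]:
        lines.extend(line for _, line in stream["values"])
    return lines

for line in search_logs('{app="payments"} |= "ERROR"'):
    print(line)
```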
On to our tracing systems. Before I joined, a lot of good effort had been put into instrumenting the applications. The previous speaker in this room showed how to do instrumentation using OpenTelemetry; back when we started, OpenTelemetry wasn't mature enough yet, so all of our instrumentation used the Jaeger libraries, with Jaeger and Elasticsearch in the backend to power it.

Because Elasticsearch is as expensive as it is, we were only doing 10% sampling, and engineers often came to us asking, "Where are my traces?" We were doing head-based sampling, so it was a flat 10%. By introducing Grafana Tempo we were able to turn that sampling dial from 0.1 up to 1, that is, to 100%, because once we've already got all these spans, the traces, the telemetry, storing them isn't the expensive part.
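For readers unfamiliar with the term, head-based sampling means the keep-or-drop decision is made up front from the trace ID, so every service in a call chain makes the same decision. The sketch below is a generic illustration of that idea, not the Jaeger client's actual implementation.

```python
"""Generic head-based (probabilistic) sampling decision."""
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    # Hash the trace ID into one of 10,000 buckets so the decision is
    # deterministic per trace and consistent across services.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < sample_rate * 10_000

# At a flat 10% roughly one trace in ten survives; with cheap object storage
# behind Tempo the dial can simply be turned up to 1.0 (keep everything).
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 1.00))
```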
Today the conversation is a bit different: we're collecting a lot, and we're looking into more sophisticated sampling strategies, because 100% is a bit much. So, how do you keep up with a successful business? Year over year, the telemetry our systems produce has kept going up. We've had to think long and hard about how to sustainably scale our product: reducing technical overhead, finding time to improve the data we collect and, most importantly, creating time to educate our users as well.
So, meet the reliability squad. I'm personally most active in the observability part of it, and we provide the underlying infrastructure for the other teams of the tribe to build upon. The application engineering team builds instrumentation libraries, mostly for Java services, and the SRE team uses what application engineering and we provide to deliver user-facing value and dashboards, make sure everything works, and a few other things besides.
The hardest problem around scaling our product isn't really a technical one; it's being able to deal with all the change and chaos around us. We've got around 1,000 different users in Grafana every day, or in a month, and all those users generate lots of requests. We continually go through audits (we're probably one of the world's most audited, or most regulated, companies), there are always incidents with this kind of growth, and making sure we learn from those incidents is very important, as are hiring, continuity of the team, and all the external projects we get pulled into. Throughout all of this change and chaos, we just have to keep calm and carry on.

At one point we realized that we'd either have to become a vendor ourselves or take our business to a vendor and just send everything away.
Whilst sending it all away is a very easy thing to do (just point it at an endpoint), at our rate, where we're getting 70 terabytes of data a day and growing at 30% a year, it doesn't add up. We could think about sending only production data, but that would create a situation where we'd need to maintain shadow IT for everything else, and we'd still need to do all the things we do today; we'd just be paying someone else for some other services as well. If we wanted to send everything away, we'd get into the territory of observability costs surpassing our cloud account costs; it's just that much data. In general, we don't really want to police our engineers on the data they produce, but we do want to enable them and get them thinking about the data, without having to pull the handbrake, and to educate as much as we can.
So what does becoming a vendor actually mean for us? It's about being available and accessible, building for scalability, and educating our users.

Being available and accessible is mission-critical for us, and we need an architecture that lets us be available to all our clients. We don't want our clients to have to ask how to get their telemetry out of system X, Y or Z, and on top of that we want to make sure our systems are compliant, so we keep our licenses and keep growing as we have so far. This is what it looks like in practice, more or less. Being available where our clients are is essential, because if we're not, we'll need to go and create custom solutions that become very hard to maintain. It also means we need to be accessible from all of the different cloud accounts and different clouds, and to enable that we need to be in a lot of regulatory scopes as well, which complicates things. But because we keep our design simple and set up controls to mitigate all the risks we can think of, we can add tests on top of our infrastructure that check all the controls we have in place, and when the auditors come, we should be able to point them at the design, the tests and the test results, and have them come back to us with any questions they have after that.
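One hedged example of what a "control as a test" could look like: a small pytest that asserts a telemetry archive bucket is encrypted at rest. The bucket name is hypothetical; the point is that the control is encoded once and its results can be shown to auditors.

```python
"""Sketch of an infrastructure control expressed as an automated test."""
import boto3
import pytest

TELEMETRY_BUCKETS = ["observability-logs-archive"]  # hypothetical bucket name

@pytest.mark.parametrize("bucket", TELEMETRY_BUCKETS)
def test_bucket_is_encrypted_at_rest(bucket):
    s3 = boto3.client("s3")
    config = s3.get_bucket_encryption(Bucket=bucket)
    rules = config["ServerSideEncryptionConfiguration"]["Rules"]
    algorithms = {r["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"] for r in rules}
    assert algorithms & {"aws:kms", "AES256"}, f"{bucket} is not encrypted at rest"
```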
In terms of building for scalability, another mission-critical component for us: once we're available, we need to be able to scale horizontally and make sure we don't fall over after making our service so easily accessible. This is where the "batteries included" philosophy of PEP 206 comes into play; in my opinion, it's the key driver behind our design decisions. On the client side of the architecture, it means that all of the Kubernetes clusters we provision come with agents that make the data from all the applications easily accessible for the service owners, the engineers, the developers, you. On the cloud account side, it means that if we deploy something in AWS, Microsoft Azure or Google Cloud, all those accounts will have agents running in them, collecting all the data that falls within their scope and sending it to us. We do everything we can to make onboarding our clients as frictionless and seamless as possible.
And once we're at that point, we need to make sure we're ready to accept all that data. In terms of operational and infrastructure scalability, we need to think about metric collection, ingestion, caching and querying; all of these need to be scalable, vertically and horizontally, and all of them can affect the user experience one way or another.

For operational scalability, it's very important for us to simplify our stack. Loki, Tempo and Mimir are designed to scale horizontally and, more importantly, they all follow the same architecture; you can see it's essentially copy-pasted. That makes maintaining this platform much easier, because when we go from logging to metrics to traces we don't have to context-switch as much, and when we solve a problem for one of the components, it's very likely we're solving it for all the others as well. In terms of protecting our platform, it's very important to have tenancies and blast radiuses, so if something goes rogue we keep it contained, and that's why we limit our customers as well.
Limiting customers is also an important step in having a conversation about fair-use policies and, notionally, billing our clients. For us to reach our company mission (mission zero, taking transfer fees as low as we can), we need to know what our products cost and be able to make educated cost-benefit comparisons; we need to know whether a service is helping us towards the mission or taking us further away from it.

"Borrow and extend, don't copy-paste" is another very important aspect of our design decisions, and we basically try to keep the code we need to maintain to a minimum. That means that for deploying our infrastructure we use what we get from open-source libraries. With the Grafana products, all of them come with Helm charts and Tanka, with all the infrastructure components included, so we don't need to solve those problems ourselves, and if we need to improve something, we can make upstream contributions and work with others to solve it.
In terms of developing applications, by using Grafana products we get dashboards, recording rules for metrics and alerting rules for free. The monitoring.mixins.dev webpage goes into what monitoring mixins are. In my mind, it would be nice if all software came with batteries included: dashboards, alerts and, if necessary, recording rules as well.
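As a small, hedged illustration of that "ship the batteries with the service" idea, the sketch below renders a recording rule plus an alert that links to a runbook. The metric names, threshold and runbook URL are made up, and real mixins are usually written in Jsonnet rather than Python.

```python
"""Sketch of mixin-style batteries: a recording rule and an alert with a runbook link."""
import yaml  # PyYAML

rules = {
    "groups": [{
        "name": "payments-service.rules",
        "rules": [
            {   # Recording rule: precompute the request error ratio per job.
                "record": "job:http_requests_error_ratio:rate5m",
                "expr": ('sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))'
                         ' / sum by (job) (rate(http_requests_total[5m]))'),
            },
            {   # Alert that points straight at a runbook when it fires.
                "alert": "PaymentsHighErrorRatio",
                "expr": 'job:http_requests_error_ratio:rate5m{job="payments"} > 0.05',
                "for": "15m",
                "labels": {"severity": "critical"},
                "annotations": {
                    "summary": "More than 5% of payments requests are failing",
                    "runbook_url": "https://runbooks.example.internal/payments/high-error-ratio",
                },
            },
        ],
    }]
}
print(yaml.safe_dump(rules, sort_keys=False))
```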
Writing the code is just one part of the job; it also needs to function and run in production. This is the kind of helicopter view we get with Mimir, and we have the same for Loki and Tempo: dashboards that ship with the software, give a generic overview of how the product is functioning, and let us navigate from one aspect of the service to another. That's a lot of engineering effort taken off our team, and it makes maintaining everything much easier.

With Loki, Tempo, Mimir and much other software, such as Prometheus and Kubernetes, we're talking about software that comes with a means of deployment (Helm and Tanka, or Jsonnet code) and with alerts that link to great runbooks and to dashboards tied to the specific alerts when they fire. If we needed to build all of that from scratch, it would take a lot of time and engineering effort. When I look at those products, I see great software that comes with batteries included, the Python philosophy mantra, and that's why I really like this software. I recommend everyone have a look at the Mimir runbooks: when an alert fires, they go into quite a lot of detail on why that might be happening. They're the best example of runbooks I've seen so far.
At the end of the day, we are building an APM product on top of the LGTM stack (Loki, Grafana, Tempo, Mimir), which are the components of our observability stack, and we have hundreds of engineers helping to build this product, so we need to level up the average Wiser to know what their part in building it is. Education is key.

In terms of making our product usable, we want our engineers to always use query variables, because when data sources change (and they do change) all the dashboards break, at which point we need to know who maintains those dashboards and who their audience is. When we query our observability data, we also have to be specific about what kind of data we're querying: naming a data source "datasource" isn't really useful if we don't know whether we're querying metrics, traces, profiling data or something else. Having correctly named variables helps us build a much nicer product and user experience.

This is roughly what it looks like. Our clients live in tenancies: they select their tenant, and the data sources are hidden away from them; everything just works based on their tenant. One dashboard can have logs, metrics, traces and profiling data, which is becoming more of a thing now with continuous profiling, and you can carry that context over from one dashboard to another if you have standardized naming of those variables.
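A simplified sketch of that dashboard-variable convention follows: per-signal data source variables plus a tenant variable, so panels never hard-code a data source. The variable names and the label used for the tenant query are illustrative, not our exact convention.

```python
"""Simplified Grafana dashboard template variables for per-signal data sources."""
import json

dashboard = {
    "title": "Payments service overview",
    "templating": {"list": [
        {"name": "metrics_datasource", "type": "datasource", "query": "prometheus"},
        {"name": "logs_datasource",    "type": "datasource", "query": "loki"},
        {"name": "traces_datasource",  "type": "datasource", "query": "tempo"},
        {   # Tenant picker driven by the selected metrics data source.
            "name": "tenant",
            "type": "query",
            "datasource": {"uid": "${metrics_datasource}"},
            "query": "label_values(up, tenant)",
        },
    ]},
    # Panels refer to ${metrics_datasource}, ${logs_datasource}, ${traces_datasource}
    # and filter on ${tenant}, so links between dashboards carry the context over.
}
print(json.dumps(dashboard, indent=2))
```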
And thanks, that's far too much to cover in detail, but it's a helicopter overview of what we've been doing, the challenges we've faced, key principles and some lessons we've learned. There are a few QR codes: one for the Wise engineering blog on Medium and one for wise.jobs. Last time I checked, when I searched for Python on wise.jobs, I think there were eight or nine roles available for Python as well. Thank you.

Thank you, Thomas.
If anyone has any questions, please come to this mic and ask.

Thank you for the talk. In your role, how much time do you actually spend writing code? I sort of got the impression from your talk that the actual software that runs the observability platform is now very, very mature and that you're more of an expert in all the operations and the dashboards. I was curious whether that's the paradigm, if I've identified it correctly, and what your thoughts are, basically. Thank you.

I wish I could do more software development, but generally, when I see a problem, I first check if there's something that already solves it. If there isn't, I go and write a POC or something. But a lot of problems have already been solved, and I don't want to duplicate that effort. The only code I've written in my four and a half years at Wise is a Cloudflare exporter. It's available on GitHub as well; I've made it open source, made it available for everyone. And that was only because all the existing exporters were using a deprecated API and Cloudflare had a new GraphQL API that was supposed to be the future. So I don't write a lot of code, but when I do, I usually draft ideas in Python.

Thank you. Any other questions? We have two or three more minutes.

Hey, thanks for the talk. I'm curious about what the average data retention is that you currently have for the different kinds of telemetry data.
For metrics, we usually keep half a year; it's different for raw data versus down-sampled data. For traces, it's around two weeks. But all of that is in S3, so S3 costs aren't a massive issue; the larger issue is compute and networking. For logs it depends, as different systems have different retentions, and because we're quite heavily regulated, we keep data in S3 for years. Last I checked, we had around two petabytes of compressed logs in S3.

Thank you. Any other questions? Okay, so that's all. Let's thank Thomas with a huge round of applause. We'll have a five-minute break.