
Stateful systems on immutable infrastructure

Formal Metadata

Title
Stateful systems on immutable infrastructure
Number of Parts
44
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Lessons learned operating thousands of stateful production clusters on top of Fedora and systemd-nspawn. Aiven is a cloud data platform operating thousands of production clusters on top of different cloud infrastructure providers (e.g. AWS, GCP). We offer the latest open source database & streaming engines to our users around the world, and implement most of our platform using the latest open source software including Fedora and systemd-nspawn. We wanted to base our platform on a fast moving Linux distribution like Fedora to gain quick access to new technology and avoid having to backport a lot of things. Fast moving distributions are typically not supported for a long time, but implementing an immutable infrastructure where deployed machines are not touched afterwards makes it possible to use them in production. In this talk we’ll share the details of our architecture and the lessons we’ve learned as well as problems we’ve faced over the years operating hundreds of thousands of virtual machines and containers with it on top of six different public clouds.
Transcript: English (auto-generated)
A word about me first. I'm one of the co-founders of Aiven, which is a database-as-a-service company. We operate in pretty much all the public clouds, and previously I used to work on large-scale databases and distributed systems. I'm also the maintainer for a bunch of open source projects, mostly around Postgres these days. Not that active anymore, but still doing something every now and then when time permits. Then a word about Aiven, just so you know where we're coming from. We're basically a company that operates databases globally, in six different cloud providers, in 89 different regions around the world. There are eight different open source data engines and messaging systems that we provide, and we started off in early 2016 by providing a managed Postgres service. While we're not a hyperscaler like Google or AWS, we still operate at a pretty fierce scale.

Some definitions before we get into the meat of the matter. Stateful systems typically hold important things, namely the state of your system. That is
what also makes them slightly different from stateless systems, which are really easy to restart, really easy to move around. This is, by the way, where typically systems like Kubernetes have been at their strongest, as in stateless systems you can easily just stop or restart or
start running in a pod somewhere else. But in the case of stateful systems, there's plenty of data and the more data you have, the harder it gets to actually manage them. There's also lots of considerations around durability and accidental changes to the data and those are the kind of
things that you really don't want to see happening accidentally. As for immutable infrastructure, it's basically a paradigm where servers, once they're running, are never changed afterwards. The idea behind this is to make your deployments more consistent and reliable, because you're always doing them the same way. Historically, if people deployed a service and then went in manually to change something in the configuration, eventually, if you had, let's say, a hundred thousand machines, none of them actually looked the same as the others. So while there have been lots of different deployment automation tools for this sort of thing,
typically they still start differing over time. In order to do this, I'm gonna go through some of the tools we have, but you pretty much need quite a bit of automation tooling around this, because operating at this scale is otherwise kind of hard. Anyway, I'll take us back to early
2015, when we started writing our platform. A lot of our team had a background in using Debian at different companies where we had worked previously. Debian at that time had had its issues with
slow release cycles, and because of that a lot of people were using backported packages; we had been backporting tons and tons of things. So one of the things we knew out of the gate was that we really didn't want to backport stuff. At least we didn't want to backport
system components, which would have basically meant rebuilding the whole thing. The other thing we wanted, as a corollary, was something that worked really close to the upstream project, so that when you made a bug report, the fix would actually get upstreamed whenever it went in. And also, back then there was still the hullabaloo around systemd integration into Debian, and systemd was something we actually wanted to use already back then. Debian has a reputation for being "stable"; I've added the quotation marks, but I'll get to that later.
But it has many positives: the open source and free software ethos is really strong in that community. There's no single controlling company around Debian; it's basically lots and lots of volunteers. They may be doing it on company dime, but still, they're working there as individuals, and
there's no single overarching company behind the distribution, unlike, for example, in Fedora's case, where Red Hat is the prime contributor. In Debian there are tons and tons of different packages available; based on debian.org's front page they have some 59,000 different packages. So that would mean that they basically have coverage of pretty
much all the free software out there. Not really but close enough that it doesn't make a difference. Also there's lots of Debian derivatives, Ubuntu being the most famous one. But that basically means there's a lot of people who know how Debian and Debian derivatives actually work. But anyway
it turns out that Debian was not quite the perfect fit for us. Especially back then, stable Debian releases had been few and far between, which meant that you either needed to start backporting stuff or you'd have to live with the old packages, which we really didn't want to do for various reasons. And once you got far enough behind the curve that you needed to start backporting system components, it really wasn't fun anymore, because then you need to do lots and lots of work that you'd rather have somebody else doing, which was the whole reason for the thing. Also, once you go down that path, you're not really running Debian itself anymore, you're running a custom distro, which is fine in itself. But then, why did you want to go with a stable system in the first place? Didn't you want the thing
to be something that other people have proven and battle-tested over time? Anyway, we had really bad experiences with having to do backporting for ages, and we really didn't want to do that anymore. Then the other thing is the systemd hullabaloo back in the day, when they were
choosing which init system should be the default. Even though systemd had been available as a package for quite a while before then, we still felt that the integration wasn't quite there. A lot of the packages still had init scripts and lacked systemd unit files for some time, and
on the whole, at the time it really wasn't that well integrated with systemd, which was definitely something we wanted to use already back then. So we started looking at the alternatives, and eventually we ended up with
Fedora. There were a couple of others, but Fedora was pretty early on the main contender. Fedora has a six-month release cycle; well, it's slipping every now and then, but still a six-month-ish release cycle, which sounded a bit scary for us at first, because it means we need to be continuously updating: Fedora supports the current distro release and the one before that, and the one before that only gets two months of overlapping support. So at least once a year you need to be thinking of upgrading, and if you don't do that, then you are well out of luck as far
as security patches and whatnot go. Anyway, the upside of this is of course that everything is fairly fresh: you don't need to backport a lot of stuff, especially not system libraries or system components, which is great because you don't really want to do that. But on the
other hand, we as a database service company still need to build plenty of packages in order to fix customer-found bugs, ship new minor releases of things, and whatnot. There are tons of things we still need to package, but instead of it being thousands of
packages, now we're left with 150 or so packages, which is great. So it sounded like something we wanted to go with. The other thing was what I mentioned about systemd: systemd has been in Fedora for quite a while, for obvious reasons, but it's also well integrated, it's basically been the default for quite some time, and it really works fairly well. Another anecdotal thing, by the way: my PulseAudio setup on Debian never worked, but it worked out of the box on Fedora, so there's that. Not quite related to us choosing Fedora, well, sort of. Also, the RPM spec files, the way you build RPM packages, are much, much nicer to work with than building Debian packages. This is my personal opinion, so don't take it to the bank as true, but I really recommend building RPM specs rather than doing it the Debian way. Of course, there's not just one way; well, there's an official way of doing it in Debian, but there are like ten other ways of creating packages too. Anyway, what you get out of the box with Fedora is an up-to-date kernel and systemd. The kernel, by the way, also gets
updated over time, so you're not stuck with running whatever the release version was back then; you're actually running something fairly recent, as in a kernel released in the last month or so, so you're getting fairly fresh stuff. Then systemd support is there out of the box and it really does work well; there haven't been any issues around that except for a couple of systemd bugs. Then you get SELinux, which is also integrated fairly well into the system and works; we haven't really had any issues with that over the years. Then there are firewalls there too. This is, by the way, a difference to Debian's packaging philosophy, if you will: when you install a Debian package, let's say you install the Postgres packages, by default it actually starts the server and binds it to port x, whatever the default is, usually on a public interface, and then it starts serving stuff. Of course, at
this point you don't really have any useful configuration for it or anything else. We much preferred that when we install a package, it does absolutely nothing until we tell systemd to actually start the thing. This is one nice thing that's usually not mentioned anywhere. Also, these days Fedora has the latest Python; on the other hand, it has had the latest Python forever, the version number was just different back then. We're a heavily Python-using house: we have some code using Go and Java and C, but the vast majority, 90-something percent, is Python for us.
Anyway, then a word about the topic. Generally speaking, our philosophy on nodes, which for the purposes of this talk are either a virtual machine or a bare-metal machine, we don't really distinguish between those, is that they're disposable: we don't care if they go away, and we expect them to go away, en masse, around the planet, all the time. The other thing is that we really don't put any manual effort into any given single node; we do everything by automation, and that's been the case for quite a while. We also operate in six different public clouds, and they all have different ways of doing things, so we try not to rely on their functionality too much.
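Keeping the cloud integration minimal usually boils down to one thin interface per provider, with the rest of the platform never talking to a cloud API directly. Here's a sketch of that shape in Python; the names and the `FakeCloud` stand-in are invented for illustration, not Aiven's actual code:

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """The minimal surface needed from any cloud: create and destroy a VM."""

    @abstractmethod
    def create_vm(self, region: str, plan: str) -> str:
        """Provision a VM and return its node id."""

    @abstractmethod
    def destroy_vm(self, node_id: str) -> None:
        """Tear the VM down; nodes are disposable."""

class FakeCloud(CloudProvider):
    """In-memory stand-in for a real provider backend (AWS, GCP, ...)."""

    def __init__(self, name: str):
        self.name = name
        self.nodes: dict[str, str] = {}
        self._seq = 0

    def create_vm(self, region: str, plan: str) -> str:
        self._seq += 1
        node_id = f"{self.name}-{region}-{self._seq}"
        self.nodes[node_id] = plan
        return node_id

    def destroy_vm(self, node_id: str) -> None:
        self.nodes.pop(node_id, None)

# The platform only ever talks to the CloudProvider interface, so moving a
# service between clouds means calling the same methods on a different backend.
aws = FakeCloud("aws")
node = aws.create_vm("eu-west-1", "startup-4")
aws.destroy_vm(node)
```

Because the interface is this narrow, provider-specific features (managed disks, cloud-side encryption) stay out of the core platform, which is what makes the clouds interchangeable.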
So even for things like disk encryption and so forth, we actually use LUKS instead of the cloud providers' own functionality; in general, our integration towards the cloud providers is fairly minimalistic. There's another side to that coin, though, because of the way we do persistence, which I'll go into in a bit: it allows us to use things like local SSDs. Many of our competitors use network SSDs, so EBS volumes, Google's persistent disks, or Azure's premium SSDs, but because of the speed of light, those all have severe limitations on how well they perform compared to local SSDs, which are PCI Express NVMe devices connected directly to the machine. That gives us some performance benefits. The idea behind durability is that we always have the data somewhere else than the actual node.
So if a node then dies for whatever reason, which they frequently do, it's still not a biggie. Anyway, a word about persistence and durability. We try not to rely on EBS volumes, persistent disks, or premium SSDs for persistence, because you can't easily move them between clouds, and one of our value props is that you can move your services between clouds with a couple of clicks. Instead of having a migration project, you can just say that you want your service, which is in AWS eu-west-1, moved to us-east-1, or wherever you want to move it. I'll go further into the details of how we do that, but since you can't easily move these network-attached disks, we basically solved the issue another way. The nice thing is that since we're actually using local SSDs, because we're handling the
persistence in a slightly different way. Here are just some example numbers: an EBS volume can do roughly 250 MB/s of reads (let's just go with reads, it's simpler; they have different characteristics for reads and writes), and roughly 10k IOPS, from what I remember. On the other hand, the same cloud vendor's i3 instances, which have local SSDs, can do north of 2 GB/s, actually closer to 3 GB/s, and on read IOPS you can get into the millions range. So it's a completely different ballpark when it comes to hardware characteristics.

Then here's an example of how we do persistence, for Postgres. We have a thing called PGHoard, which was originally written by yours truly. It's a Postgres backup daemon which is on GitHub, the second or third most popular one for Postgres based on GitHub stars. What it does is take our write-ahead log, compress the data, encrypt it, and then send it to an object store. This basically gives us a bounded data-loss window, which is okay, but this still
isn't great. But for all our HA services, the customer gets to choose whether the data is synchronously or asynchronously replicated, so they get to choose how much performance loss they're willing to accept; with this you can get arbitrarily low data-loss windows for a single node loss. Also, since we provision these in all the different cloud vendors, in every one of those which has multiple availability zones we automatically spread the nodes of a given service, so if you have a Kafka cluster it gets spread across multiple AZs. We also make sure that, in the case of clustered systems like Kafka, if you have n copies of a partition, they are always split among the different availability zones automatically. Anyway, our approach to upgrades is that we do rolling-forward upgrades.
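The AZ spreading just described is, at its simplest, round-robin assignment of a service's nodes (and of each partition's replicas) over the available zones. A small Python sketch of the idea, not Aiven's actual scheduler:

```python
from itertools import cycle

def spread_over_zones(nodes: list[str], zones: list[str]) -> dict[str, str]:
    """Assign nodes to availability zones round-robin, so that up to
    len(zones) copies of anything never share a zone."""
    zone_cycle = cycle(zones)
    return {node: next(zone_cycle) for node in nodes}

placement = spread_over_zones(
    ["kafka-1", "kafka-2", "kafka-3"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
# Three nodes over three zones: each node lands in a distinct zone,
# so losing one AZ takes out at most one copy of each partition.
```

Applying the same assignment to a partition's replica list gives the "n copies always split among AZs" property mentioned above.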
So when you have, let's say, a three-node Kafka cluster, or a three-node Elasticsearch cluster, what happens is that we create three new virtual machines side by side, replicate the data over, and do a controlled failover without any downtime for the customer. That's pretty much the way we do all our software upgrades and hardware upgrades. The same thing is used when you upgrade to a larger plan: say you have three machines that all have eight gigs of memory, a couple of CPUs, and x amount of disk; if you upgrade, what happens is we again create a bunch of new nodes, replicate the data there, and then do failovers. And this happens the same way when we change between cloud providers: if you're moving from AWS us-east-1 to Google's South Carolina region, we just use the exact same methodology again and again. So once the actual nodes are up and running, we never touch them again. We also do this at a huge scale, and it's been fairly useful to have just a single way of doing this. We've had
our share of issues with this, but happily there's only one way for us to do any upgrades, so we really rehearse this a lot.

Then a word about systemd-nspawn versus Docker. Docker comes with its own set of baggage; there are some philosophical things we disagree with, mostly like having a single process per container. I mean, you can get around it, but still, that's the general tendency. Also, systemd-nspawn is actually part of the system you're already using: it's already there, it's built in. It's also much more minimalistic and doesn't come with all that much stuff, but it works fairly well. And in the way we build images, our container images are basically just directory trees, more or less; they may come in a tarball or compressed format or whatnot, but they're essentially just directory trees. The other thing is that nspawn integrates really well with systemd, which is pretty neat for us, because we can control stuff from outside the container with systemd itself. We use unit files for a lot of things, and a lot of different directives in those, and then we use journald and its structured logging quite a bit,
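For illustration, the systemd side of this is mostly stock machinery: `systemd-nspawn@.service` boots a container from a plain directory tree under `/var/lib/machines`, and a per-container `.nspawn` file carries the directives. A minimal, illustrative example (not Aiven's actual configuration):

```ini
# /etc/systemd/nspawn/customer.nspawn
# Boots the directory tree /var/lib/machines/customer as a container.
[Exec]
Boot=yes
PrivateUsers=yes

[Network]
VirtualEthernet=yes
```

Such a container is started with `machinectl start customer` (i.e. `systemd-nspawn@customer.service`), shows up in `machinectl list`, and its logs flow into the host's journal, which is what makes controlling and observing it from outside the container straightforward.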
and both have been really good for us.

Then a word about the host machine. The host machine in our case, where we're running the customer service, is also running Fedora, and there's a single container on the same VM, or, well, node, so basically a bare-metal machine or a VM, and there's a bunch of things running on it. Once we provision a VM, we install our management agent there, and the first thing it does is try to refresh the packages it has. We're doing this so that we can have immediate control over any new nodes even before we build new images, which we do frequently; we build our images a lot. But on the other hand, if we want to say that you cannot have version x of a Postgres package, because we decide that this one sucks, rolling a new image out to 89 different cloud regions around the world takes a while; depending on the cloud vendor it may take quite a bit of time. So we've created this mechanism so that just five minutes later, when we create the next node, we can control specifically what kind of packages it's getting.
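That emergency handbrake can be as simple as a deny-list consulted at provision time: new nodes pick the newest package version that hasn't been pulled. A hypothetical sketch, with names invented for illustration:

```python
def select_packages(wanted: dict[str, list[str]],
                    blocked: dict[str, set[str]]) -> dict[str, str]:
    """Pick the newest allowed version of each package.

    `wanted` maps package name -> available versions, oldest first;
    `blocked` maps package name -> versions pulled via the handbrake.
    """
    chosen = {}
    for pkg, versions in wanted.items():
        allowed = [v for v in versions if v not in blocked.get(pkg, set())]
        if not allowed:
            raise RuntimeError(f"no allowed version of {pkg}")
        chosen[pkg] = allowed[-1]  # newest version that isn't blocked
    return chosen

# Say version 12.3 turned out to be bad: new nodes fall back to 12.2
# within minutes, long before a rebuilt image reaches every region.
selection = select_packages(
    {"postgresql": ["12.1", "12.2", "12.3"]},
    {"postgresql": {"12.3"}},
)
```

The key property is that the deny-list is tiny, centrally controlled data, so it propagates to new nodes far faster than a full image rebuild and regional rollout.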
Typically there's nothing to do at this step; the images that are there are usually already good enough, but it's basically an emergency handbrake. Anyway, after this point, after it's installed the packages it needs for the customer service, the machines are immutable; they really aren't changing for the duration of the lifetime of that node. Again, we do have the ability to manually go there and install stuff, but it's not really ever done; it's basically only for debugging purposes, if we need to do something weird. Then the management agent starts up, which we call Prune, which is apparently some sort of plum tree or something; we came up with the name from somewhere which I forget. What it does is set up the machine to operate the customer service: it sets up the disk layout with RAID and encryption; it sets things up so that all the cluster nodes talk to each other over IPv6 over IPsec; and then it restores the data,
either from backups, which are often in object stores, or, the other way around, it could be restoring the data from other nodes in the cluster, depending on the type of cluster that we're serving. Then it typically keeps up our monitoring, reporting the health of the system; there are also other ways of doing that, but it keeps a general sense of whether the thing is still healthy, and there's a heartbeat coming out of it. It also reacts to configuration changes: when we add a new node, we need to create IPsec tunnels to it, and we need to change cluster configuration in the different database services. And then customers are allowed to change some configuration parameters in different services; we allow users to configure some Postgres settings, for example. So while the packages themselves are immutable, the data files on disk are obviously changing with the database.
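Letting users touch only a vetted subset of settings typically means validating requests against a whitelist before anything is written to the conf file. A sketch of that, with a made-up parameter list, not Aiven's actual one:

```python
# Hypothetical subset of user-tunable Postgres settings and their validators.
ALLOWED = {
    "work_mem": lambda v: isinstance(v, int) and 1 <= v <= 1024,
    "log_min_duration_statement": lambda v: isinstance(v, int) and v >= -1,
}

def validate_user_config(requested: dict) -> dict:
    """Return only the settings the user may change; reject everything else."""
    accepted = {}
    for key, value in requested.items():
        check = ALLOWED.get(key)
        if check is None:
            raise ValueError(f"setting {key!r} is not user-configurable")
        if not check(value):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
        accepted[key] = value
    return accepted
```

Anything outside the whitelist (say, `fsync`) is refused outright, which keeps user configuration changes from undermining the durability guarantees of the managed service.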
The other thing is, of course, that the configuration may change within those conf files, like postgres.conf, but otherwise it's completely immutable. Then the management agent sets up a bunch of auxiliary agents, all of which run on the host side. We collect metrics out of the system, tons and tons of metrics, so there are lots of data points coming out of those. Then we're shipping the logs from the journal, also in structured format, so we're able to search through them, and we retain the structure of the logs we put in there; we use structured logging for this. Then there are also backup and HA daemons, like PGHoard, which I mentioned, that we've largely open sourced over time. The one we haven't is actually Cassandra's, and we're hoping to do that too, but when we implemented it, it started relying on some of our internal things, and now we need to rewrite it in a slightly different manner. Besides selling Apache Kafka as a service, we also use it internally a lot, so we have tons and tons of Apache Kafka clusters, and all of these nodes and all of these daemons are talking to Kafka, and all the configuration
changes coming from users are being sent over Kafka to these nodes.

Okay, then there's the container. On those machines, or nodes, it's run through a fairly locked-down systemd-nspawn. It contains only the customer's services, so things like Postgres or Apache Kafka or Elasticsearch or what have you, and none of these actually allow code execution; we haven't taken a single service type into use which would allow arbitrary user code execution. Basically, we're lacking a really good sandbox for that. There are a couple of interesting ones, like gVisor and Firecracker, that we've been looking at, but currently we don't allow any code execution. It's not that it would actually be that bad if somebody managed to get into their own machine, there's nothing that secret there, but it's something we'd rather people not shoot themselves in the foot with. Anyway, after installation the container again is totally immutable, except for the config files, which may change, like postgres.conf, where you may get the user options, and we don't really want to rebuild the whole service again
just to get like a new configuration switch that has different number in it and the data files that the database itself is actually writing to then worried about image building so we support the six different cloud providers and they all have a different way of for you to
register new images and how you do this some like digital ocean actually only have pre-built images that you can only take snapshots of after you change them so you can't actually upload your own images at all so we basically have to make sure that our tooling works with all of them then some public clouds are fairly slow in when you're operating with them and
creating like base images and especially when you're transferring it to all the regions of that cloud provider and the way the ways we do those are basically cloud dependent there's nothing
really shared between the clouds they just have very differing implementations of how you do that anyway now we have i think it's like 89 cloud regions what we currently support among the six different public cloud vendors anyway the pre-installed packages that we put on
the images they some of them are actually fairly large so they actually do take some disk space but the idea behind this is that why we pre-installed them and make the images themselves already contain this stuff is so that when we spin up a new node which we do a lot they basically are much faster ready to serve the customers and their needs and then depending on the cloud
provider again like things like google are fairly fast and things like azure are fairly slow in booting up aws being somewhere in the middle it they usually take somewhere between two minutes to ten minutes from the time we can call the api on the cloud provider side
saying please give me a vm with specs like x anyway testing because we do a lot of these updates so basically we follow fedora's release cycle mostly we basically have tons and tons of
like testing we have unit tests system tests chaos tests with whatever kind of tests but the thing is oftentimes when we've actually hit problems it hasn't been something that tests a particular version of something it's actually just a generic test that starts failing
Then we go investigate and find it's because something changed somewhere. If you want to follow a fast-changing distribution, you really need a fairly wide-coverage test suite, in my opinion at least; otherwise you're sailing blind. I'm not saying our test suite couldn't be better, it could definitely be much better, but it has still found lots of different issues.

The other thing is that with our approach you're enduring quite a bit of pain all the time. The alternative is doing this every three, four, or five years, and then there's a lot of pain when you eventually have to move to the next version of the distro. We'd rather have a bit of pain all the time than an immense amount of pain every X years.
You should also read the release notes of everything with a magnifying glass. It hasn't always been smooth sailing: recently glibc changed their Unicode collations, and we got hit by this because, while we were aware the change was coming, we weren't aware that Fedora had backported it to the previous version of glibc. That came as a bit of a surprise, but it's something we should have read more carefully in the release notes, because it was definitely mentioned there. Also, IPsec, in the kernel or in the tooling, keeps breaking all the time, which in 2019 I'd rather just worked, but we still have issues.

The other thing is that we're running on the public internet, where the networks are not that great, which means we're exposing a lot of this software to an environment it wasn't originally designed for. Things like Apache Kafka were built for in-house data centers where the networks were stable and everything was good. In the public clouds the networks aren't great, and you keep having issues all the time: losing nodes, net splits, or whatnot.
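Surviving that kind of network means retrying transient failures everywhere. A wrapper along those lines might look like this jittered-exponential-backoff sketch; the retried exception types and limits are assumptions for illustration:

```python
import random
import time

def retry_transient(op, attempts=5, base_delay=1.0, max_delay=30.0,
                    transient=(ConnectionError, TimeoutError)):
    """Retry `op` on transient network errors with jittered exponential
    backoff: the kind of wrapper needed around anything talking over
    the public internet (or around a package manager invocation)."""
    for attempt in range(attempts):
        try:
            return op()
        except transient:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```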
Another annoying thing is that dnf is a bit on the slow side (happily there are some improvements coming for this), and it's also not really resilient against temporary network errors, so we use a wrapper around it. But in general we're very happy with Fedora. It has allowed us to focus on what we're doing instead of backporting stuff, and this way of working has also forced us to de-emphasize the meaning of any single node: if a node gets lost, that's fine. We just need to take care of persistence another way, because we don't care about particular nodes anymore. And if you want to do this, you really, really want to automate pretty much everything.

Any questions? The first X questions get socks, so everybody who has a question, come to me later and I'll find you the socks. We have some here, and there are people in the audience with socks.

Q: Have you considered runc and OCI-structured files?

A: We could use something like runc, but back in the day it didn't exist.
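For reference, a locked-down systemd-nspawn invocation along the lines described earlier might be assembled like this. The flags, mount points, and capability choices here are illustrative assumptions, not the actual hardening configuration; the idea is a read-only root with only the data and config directories writable:

```python
def nspawn_command(machine, root, data_dir, config_dir):
    """Sketch of a locked-down systemd-nspawn invocation (hypothetical
    paths and a simplified flag set). The root image stays read-only;
    only the data and config directories are writable, matching the
    immutable-container idea."""
    return [
        "systemd-nspawn",
        "--machine", machine,
        "--directory", root,
        "--read-only",                                    # immutable image
        "--bind", f"{data_dir}:/var/lib/service-data",    # DB data files
        "--bind", f"{config_dir}:/etc/service-config",    # mutable config
        "--drop-capability", "CAP_SYS_ADMIN",
        "--boot",
    ]
```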
Q: How long ago is that?

A: 2015, so it didn't exist. Besides, I think systemd-nspawn is actually trying to become able to run OCI images directly. On the other hand, what we're looking for next is better sandboxing.

Q: My second question is: what language is "pruned" written in, and are there any interesting mechanics in that orchestration piece?

A: It's written in Python. I'm not sure it does a lot of interesting orchestration, and it's also part of our proprietary secret sauce. We're happy to open source a lot of the tools we work on, but that's some of the stuff we keep internal. It runs on every node and doesn't have a database of its own; what we use for our management systems is Postgres. I'm a long-time Postgres fan, and all the founders had written some small piece of code for Postgres back in the day, so we're heavily into Postgres.

Q: You picked Fedora for its release cycle and because it was up to date. Why not go even further, to Rawhide?

A: There's only so much pain that we can take. Rawhide used to be even wilder back in the day; these days it has settled down, in my opinion, but you still have to draw a line somewhere. Any other questions?

Q: How do you react to a host going down? If you use disk encryption, I assume it can't come back on its own. Do you then just spin up a new host?

A: We typically spin up a new host once it's been unresponsive for X amount of time. They all send heartbeats over Kafka, so we have a fairly good idea when a node goes down.
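The replacement decision described here (spin up a new host once heartbeats have been missing for some amount of time) can be sketched as follows; the timeout value is an assumption, not the actual threshold:

```python
import time

def should_replace(last_heartbeat_ts, now=None, timeout_s=300.0):
    """Decide whether a node has been unresponsive long enough to be
    replaced. Heartbeats arrive over Kafka; the 300 s threshold is an
    illustrative default, not the real production value."""
    if now is None:
        now = time.time()
    return (now - last_heartbeat_ts) > timeout_s
```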
We also actively monitor for things like ACPI events, so if a cloud provider deigns to tell us that a node is going away, we get notice beforehand. Usually they don't, the nodes just vanish, but some providers are nicer in this regard than others.

Q: You mentioned gVisor. Do you use it?

A: We've been looking at it, but we don't currently use it. Firecracker is another similar thing we've looked at; the problem with Firecracker in our case is that you can't run it on instances that aren't bare metal. In the case of AWS that's a problem; GCP and Azure allow you to run KVM on KVM, but you can't do that on AWS.

Q: The talk is mostly about state management, and you solve the problem by having the applications do some sort of clustering. Are there any daemons that don't support clustering on their own, where you have tooling to get the state out and sync it somewhere else?

A: That's a fairly broad question, but yes, we do. Come talk to me after this if you want more details.

Q: You said you use LUKS. How do you deal with key management without using KMS services?

A: For us the keys are transient, in the sense that they only have the lifecycle of the node. Once the node is gone, that's it; we don't go back to nodes that are gone, and that includes a reboot. From our point of view a reboot means the node is gone, and we replace it.
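The transient-key scheme described here can be sketched as: generate the passphrase in memory at provisioning time, feed it to cryptsetup on stdin, and never persist it, so a rebooted node can never unlock its disk again. The function names and key size are hypothetical:

```python
import secrets

def transient_luks_key(num_bytes=64):
    """Generate a random LUKS passphrase that lives only in memory for
    the lifetime of the node. It is never written to disk, so once the
    node stops (or reboots), the volume is unrecoverable by design and
    the node is simply replaced."""
    return secrets.token_bytes(num_bytes)

def cryptsetup_format_command(device):
    # The key is passed on stdin (--key-file=-), never on the command line.
    return ["cryptsetup", "luksFormat", "--key-file=-", device]
```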
Any other questions? I think I'm running over time, so I'm cutting into that.

Q: I assume that if you restart a node somewhere, it also needs data. How do you provision the initial data?

A: The answer depends on the service type. For Postgres we typically restore everything from the object store up to the point of the very latest data, and that last part we replicate from the other hosts in the cluster.

Q: But if you provision from an object store, then you need secrets or authentication for that?

A: Yes. All of this is encrypted; we do client-side encryption, and we obviously have the keys for the encrypted data in the object stores. We just restore that, using things like PGHoard or MyHoard or others like them. Those are all open source on GitHub.
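A minimal sketch of the restore planning described above: pick the newest base backup in the object store, replay the WAL written after it, and leave the remaining tail to streaming replication from the surviving cluster members. The data shapes (integer positions) are a simplification for illustration, not PGHoard's actual format:

```python
def plan_restore(basebackups, wal_segments):
    """Choose the newest base backup, then the WAL segments created
    after it. `basebackups` are dicts with an ordering key (simplified
    here to an integer `pos`); `wal_segments` are integer positions."""
    if not basebackups:
        raise ValueError("no base backup available in the object store")
    base = max(basebackups, key=lambda b: b["pos"])
    replay = sorted(s for s in wal_segments if s > base["pos"])
    return base, replay
```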
Anyway, I think that's it. If you have any other questions, please find me after this. Thank you very much.