Migrating a Hosting Infrastructure from Gentoo

Video in TIB AV-Portal: Migrating a Hosting Infrastructure from Gentoo

Formal Metadata

Migrating a Hosting Infrastructure from Gentoo
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
I would like to share our experiences while migrating a VM-based hosting infrastructure at Flying Circus Internet Operations from Gentoo to NixOS. It is a case study of migrating 500+ VMs running diverse customer projects using a wide range of web technologies. I'll cover the following main points:
1. What was our motivation to change to NixOS? Which specific features of NixOS attracted us?
2. Creating a NixOS-based hosting platform: what did we have to implement? What is still missing?
3. What are the pain points? How could NixOS become more attractive?
4. What do our users think? Benefits and stumbling blocks.
Bio: Christian is a systems engineer working with Flying Circus Internet Operations in Halle (Saale), Germany. He works mainly on infrastructure which keeps customers' web applications running. His main programming languages are Python, Nix, and Rust. Christian is a regular contributor to NixOS, best known for vulnix and the vulnerability roundups.
All righty, time for our next speaker, and this is going to be Christian. You might know Christian from the work he has done on vulnix and on the vulnerability roundups for NixOS. But today he is going to talk to us about this annoying situation in which we might end up, where you're not actually running NixOS yet but of course you want to, and he's going to explain to us how to get out of the misery. A quick applause,
of course, please. Thanks. Hi everyone. Sorry, this already used to be working; sorry for the delay. Okay, sorry about that. Okay, I
already said hi. My name is Christian. I'm a systems engineer and have been doing this for a bit more than 10 years. You can reach me by mail or find me on various social media, on IRC, or on GitHub. I want to talk about migrating
a hosting infrastructure from Gentoo to NixOS and answer mostly two questions. First, what motivated us to migrate? You don't do that just on a whim. And second, what experiences did we have? I'd like to share a bit of real-world insight so that perhaps the community can profit from it. When I say "we", I mean Flying
Circus Internet Operations. We are the guys with the cool aircraft, but a bit more than that: we are a small company located
in Halle (Saale) in Germany, and basically we are taking care of customer applications. There are customers who can't or won't concentrate on operational issues because, for example, they are software shops, or they are small, or whatever. So they approach us for deployment concepts, for automation, for telemetry, for monitoring, for incident response, for load and resource management, for upgrade strategies, and much more. Someone said we are sort of DevOps for hire, or DevOps as a service. We are actually working closely together with the devs, usually on individual applications, and we work together to keep the stuff running. For customer projects we have more than 500 VMs with a lot of very diverse stuff installed and running on them.

So why did we migrate, and where did we come from? We have historically been a Gentoo and Puppet shop. Before I criticize this, it's important to state that it ran for quite a while and, well, it was not too bad. At the time we started, Gentoo was a good choice because of its extreme configurability, and Puppet was also a good choice because at that time I think it was one of the best tools you could get for automated systems management. But after doing that for ten years or so, it got too complex. Puppet is not able to manage everything, and that was no problem; we are software engineers after all, so we said: well, let's write a script, let's write a wrapper, let's write a fix-up. And this is a
partial view of our old system management stack. We see a lot of diverse steps and fixes and scripts and additional config runs and very specific scripts for stuff, and this is only one tenth of the whole thing. What you see is that there are partly overlapping functionalities. For example, Puppet can manage 80% of something, or Gentoo's emerge can do 90% of something, but then we have to put in another fix and another script. And, well, it was painfully slow: for a full
configuration run we took more than 5 minutes on a typical VM. Too many moving parts: perhaps you have to emerge a package, but then you have to run revdep-rebuild (I don't know if anyone knows Gentoo), and then you have python-updater, then you have perl-cleaner, and so on and so on. With every step you add, you may get a bit closer to where you want to be, but not really to the point. And this small gap, when you multiply it by 500 machines, becomes an awful lot of work to be solved manually, and we didn't want to do that.

So what was the problem? Was it that we did a bad job, so that we just have to improve the implementation? Or is perhaps the approach not optimal; is our whole model of system management too weak? Perhaps yes. To answer this question we have to go a bit into theory. Some of you may know the really great paper by Steve Traugott from NASA, "Why Order Matters". In this paper he defines three main models of system management: divergent system management, convergent system management, and congruent system management. I'll go through each of these in detail. So what does divergence mean?
Citing from the paper: divergence is characterized by the configuration of live hosts drifting away from any desired or assumed baseline disk content. So what is this? For example, this is when you configure a server by following a checklist: let three admins do that and you get three different results. But also things like install scripts, for example curl piped to a shell: okay, this is divergent system management. But there is also a legitimate use for divergence. For example, user home directories, some sort of database content, or any relevant production data is by definition divergent. If you didn't have that, you could burn your entire data center onto DVDs and it would be static, so there would be no point in that.

The second model is the convergent model. This is where most of the system management tools we have today live. You have some description of the desired state, then you measure the actual state, then you see where the delta is, and for each deviation you have a corrective action which brings the actual system state nearer to the desired system state. Many things like Puppet, Ansible, Salt, you name it, work this way. This also has its use cases and won't go away. For example, you have activation scripts in NixOS, which are constructed exactly around this model, or you have systemd services, which follow this model, or, viewed at a larger scope, container orchestration also follows this model.

So this model is good, but we can get to an even stronger model: the congruent model. Congruence is the practice of maintaining the state in complete compliance with a fully descriptive baseline. For example, system packages are exactly defined in every distribution, and of course everything in the Nix store is exactly defined, but there are also, for example, container images, which are exactly defined. And okay, you could argue whether serverless functions belong to that as well or not; perhaps yes.
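
As an illustration only (not taken from the talk), the difference between the convergent and the congruent model can be sketched in a few lines of Python. The names `converge` and `rebuild`, the dictionary representation of system state, and the package names and versions are all hypothetical; real tools operate on files, packages, and services rather than on a dictionary.

```python
from typing import Dict

# Hypothetical representation: resource name -> version or content.
State = Dict[str, str]


def converge(desired: State, actual: State) -> State:
    """Convergent model: measure the delta and correct each deviation in place."""
    result = dict(actual)
    for name, value in desired.items():
        if result.get(name) != value:
            result[name] = value  # corrective action for one deviation
    # Anything the description does not mention is left untouched,
    # so unmanaged leftovers survive every run.
    return result


def rebuild(desired: State) -> State:
    """Congruent model: the managed state is derived only from the description."""
    return dict(desired)


# Made-up example data.
desired = {"nginx": "1.14", "postgresql": "10"}
actual = {"nginx": "1.12", "leftover-fixup-script": "v3"}

converged = converge(desired, actual)
congruent = rebuild(desired)
```

In this sketch the converged state still carries the leftover fix-up script, while the congruent state contains exactly what the description says. That residue is the small gap that, multiplied by several hundred machines, turns into manual work.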
To get maximum control, you want to make the congruent domain as large as possible, and the other two domains, well, they are still there, but they should not grow too big. So we decided it was time to switch to another model. We wanted to follow the congruent approach, and two interesting candidates for doing so were of course NixOS and of course going with the container thing. Everyone is going with the container thing today, but containers don't do well in a multi-tenant environment because they provide no great isolation. The other thing is that container technology has its strengths but also has its weaknesses, and if you have read some larger Dockerfile, then you know what I'm talking about.

So what attracted us to NixOS? I think in the first place the Nix language is so expressive that you can do nearly everything with it, and this was of course very important for us as software developers. You get a lot of flexibility: you can just mix everything together and all the dependencies are still in place and working. Binary substitution you get basically for free once you manage to get Hydra running. And, well, someone said Nix is one tool to rule them all: you have one approach to packaging, to systems management, and to deployment. And last but not least, the community is very approachable. It was no problem for us to get into the community and, for example, submit pull requests or something like that. This is very great.

Okay, what did we do? Okay, I'm just talking to
some hackers here. So the hacker would think: no problem at all, just install NixOS on the VMs and everyone will be happy. Every one of you would just start to put your whole project into one big default.nix that describes everything, then you would just say nix build, and then the whole project just builds and everything is fine. Well, this is what NixOS hackers think. But we had a large installed base which was not constructed with NixOS in mind, and we had to get that over the fence somehow. So we had to see what's there and how we can get that running. While doing so, we talked to our customers, we reviewed the code, and we found out: well, not everyone is ready for NixOS.

Nix is sort of, well, don't get me wrong, a hacker thing. It's technically very advanced, so I like it very much, but some of our users were frightened. They had quite a hard time with this congruent, immutable approach. For example: "Hey, I just want to edit some /etc/whatever file, and I can't do that. Where is the option for X and Z? Do you have a GUI? What's going on here? I don't understand that. /usr/bin is empty, so where has all the stuff gone? I'm overwhelmed." Or: "Okay, Nix. Nix what? That's a funny language, I don't understand it. Can I program that like shell? No, I can't." Functional programming is very expressive; once you get into it, it's really fun and you don't want to go back and do something else. But if you don't, and for example you're more of an admin guy who started out putting CDs into Microsoft Windows boxes, then it's quite a steep learning curve.

That was the one thing. The other thing: well, it depends on what you are running. We do a lot of web applications, and some of them insist on trying to build themselves in the very moment you are starting them up. For example, we've got some Node.js application which just starts to compile some C code in
the moment you're trying to start it. We have some CMS applications which start to optimize themselves somehow, or I don't know what they are doing, but I do know for sure they are going to fail if they cannot write to the installation directory. We also have some applications which are centered around this model of incremental deployment: for example, some stuff which needs an installed base and then needs to reinstall itself depending on the old installed base, and if you try to install everything from scratch, it takes one and a half hours or so. Also some plugin systems, auto-update systems, and so on.

You could say: that's all bad, rule it out. Yeah, but we are earning our money with running business-critical customer applications, and most of them have some tiny ugly core deep within which, well, isn't really good software design but is absolutely important. Everyone knows it's not the best, but we cannot do without it, and it's our job to keep it running. So how to solve that? Obviously we cannot rewrite everything and Nixify everything right away. So we decided on
a flexible approach. We just take the application in the narrower sense, that is, the custom code, and see whether it matches more the convergent model or another model. And we also separate out components (we also had that in our old Gentoo setup), which bundle commonly used things like nginx, Postgres, Redis, memcached, and Elasticsearch that many projects use. And now we see which part of what
goes into what model. The application deployment depends largely on the project. We have some convergent deployment, for example with Ansible; we didn't want to get rid of that at that point. We even have divergent stuff like data directories. But there are also congruent elements, for example customer-supplied container images, and of course there are a few already Nixified projects. The color code is: green is congruent, yellow is convergent, and red is divergent. As it comes to the
pre-made components, most of them have some glue code, use the NixOS service modules that are in place already, and have some integration points. What does that mean? These are directories or files where user deployments can typically put some snippets into. So this is the equivalent of fiddling with some config files under /etc. I'll give an example: the nginx component has an
integration point. This is some directory, and you can just drop in some virtual hosting configuration files and so on. That stuff gets picked up by the nixos-rebuild run using our glue code and gets incorporated into the configuration running from the Nix store. So we still run everything out of the Nix store, but users find at least some points they can cope with. So that was the plan. What was our
experience? For most projects, with all of this in place, it was quite doable. You have to fiddle around a little bit, but I think 90% of our customer projects went quite fine with that. Of course we have some dependencies which used to come out of the system installation on Gentoo, for example some libraries like SSL and so on, and most of them are now placed into the user's Nix profile using whatever mechanism is fitting. Some projects use nix-env, and others have a default.nix which builds an environment that is then installed there. In the end we see running code bases mixing Nix code from as far back as 15.09 up to the current stuff, and thanks to Nix we are able to mix it and it works out. This is really great. So of course we made some tools to facilitate this. Our
major wrapper, and it's only one wrapper for everything, is fc-manage. This is basically a thin layer above nixos-rebuild that runs regularly from a systemd timer, polls the channel, sees whether changes have occurred, and triggers them. Then, what is perhaps more interesting for the community, there is a little tool called fc-userscan that scans unmanaged installations. For example, a user just compiled code in their home directory, and it uses some libraries from the Nix store. The Nix garbage collector doesn't know about that and deletes the libraries at some point, and then the user's code doesn't run anymore. So the scanner just goes through the installation and registers all the runtime dependencies it finds with the Nix garbage collector. And of course vulnix (I think this is well known), which scans an installation for open CVEs. So what were the main benefits from our
point of view? Going from Gentoo to NixOS gave us significantly higher productivity: we can do stuff in no time which took days in the old environment. We also have tests on the infrastructure level, which is quite good. We got a lot more flexibility, because we don't have to say: okay, this particular version of OpenSSL is the system version and you have to use that and nothing else. And we also see that we can scale things: for example, most nginx-based configs are the same, but we also have the flexibility to bring in customer-specific modifications without losing oversight.

So what do our users think? Well, most users don't care. They just want to have some Linux server: "Distro? Don't know. Okay, get it running, that's fine." We would like to have more projects where our customers just ship a default.nix and say: hey, that's my default.nix, run that in production. But that is only the case for at most ten percent of our projects. So I think we profit more from the migration than our users profit directly, but of course they profit indirectly because we are better off right now.
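
The scanning idea behind fc-userscan mentioned above can be illustrated with a short Python sketch. This is not the tool's actual implementation: the function names are hypothetical, the regular expression assumes that store paths have the form `/nix/store/<32-character base32 hash>-<name>`, and the step of registering the findings with the garbage collector is left out entirely.

```python
import re
from pathlib import Path
from typing import Iterable, Set

# Assumption: a store reference is /nix/store/<hash>-<name>, where the hash
# uses Nix's base32 alphabet (digits and lowercase letters without e, o, t, u).
STORE_REF = re.compile(rb"/nix/store/[0-9a-df-np-sv-z]{32}-[A-Za-z0-9+._?=-]+")


def find_store_refs(blobs: Iterable[bytes]) -> Set[str]:
    """Collect every store path that appears literally in the given byte strings."""
    refs: Set[str] = set()
    for blob in blobs:
        for match in STORE_REF.finditer(blob):
            refs.add(match.group().decode("ascii"))
    return refs


def scan_tree(root: str) -> Set[str]:
    """Scan all regular files below `root` (hypothetical helper, reads whole files)."""
    blobs = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_file():
                blobs.append(path.read_bytes())
        except OSError:
            continue  # unreadable files are skipped in this sketch
    return find_store_refs(blobs)


# Made-up example: a script whose shebang points into the Nix store.
sample_hash = "0123456789abcdfghijklmnpqrsvwxyz"
script = f"#!/nix/store/{sample_hash}-python3-3.6.4/bin/python\n".encode()
found = find_store_refs([script, b"no store references here"])
```

A real scanner would additionally have to tell the garbage collector about every path it found so that those paths are kept alive; how that registration is done is deliberately not shown here.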
To finish, there are some things where I wanted to call the section "pain points", but I think that's a misnomer, because working with NixOS is not painful at all; it's fun. So I call it "things to improve". The
security story: I've been busy providing vulnerability roundups every week or two, but when you look around you see quite a long list of open issues. So I think we need a better approach to actually fixing stuff. Discovering what needs to be done is good, but it's only the first step; the second step is to fix the stuff. In my opinion this is largely a problem of missing manpower, so I was really happy to see a larger security team. I want to use Saturday, for people who are still around then, to see what could be done in the short term to make this better. We also should think about backporting important changes to older releases. I don't know if we need to form a process for that or just start by doing so informally. The other point that is a constant source of
confusion with our customers is that NixOS likes to restart everything, even for minor changes. This comes from the model: if I change any input, the hash changes, the unit file changes, and systemd restarts the unit. When you see that, for example, for Postgres, it's quite a problem, because it terminates all connections and restarts, then the application just has to reconnect, and we've got one minute of useless downtime. So here, too, the question is: can we do better? Can we come up with some clever scheme to avoid that without breaking the overall NixOS model? And, well, the
last point is: I think the community has grown quite a bit over the last years. NixOS is attracting a larger user base, and this is great news, but I think we have to keep up with the community structure. From my perspective, I would really be glad to see more teams with clear structures and responsibilities. Perhaps some teams already exist, but then they are not well discoverable, at least. So I think this is a point where we can and should improve, just to continue the success of NixOS. So as a final
question: if we were right now in the same situation we were in three years ago, would we choose NixOS again? Probably yes. I think it's a really great
piece of software infrastructure. Thank you. [Applause]

I can shout. Okay, hello, hi, thanks. One of your slides was very interesting, one of the first ones: NixOS versus containers. Someone would have to ask that. With Docker and containers there is this whole idea that one container runs one application and so on, so it seems that it would fit quite well with what I understand you are doing, because your customers have one application and they want to run one application in a container. So would you like to elaborate a bit on why that was a bad choice for you?

Yes, sure. Well, one point is that containers are only part of the solution, because in a multi-tenant environment you have to provide decent encapsulation anyway, so we still need some management for the underlying thing, be it VMs or whatever, and we're not large enough to have dedicated hardware per customer. The other thing is that containers impose a specific way of deploying the application, because you have to follow their standard or you aren't going to use them well, and with our customer base this is not possible. For example, we are historically a Python shop, and there are a lot of zc.buildout-based projects, and they don't work together very well. In addition to that, we are now experimenting with running containers inside VMs, just to get the benefit of both worlds.

More questions? Yes, what was the first question?

Yes. So 10% of your user base is already using default.nix, so they kind of Nixify the project for you, in a sense, at least partly. What is the difference in maintainability between this 10% and, let's say, some similar-sized projects?

Okay, it largely depends on the project. In my experience, the maintainability is more dependent on the software code base as such. There are, for example, projects which save part of the code in their databases, and, well, that's a nightmare. And, well, we have quite good convergent
deployment tools. For example, we use batou, our in-house tool, or Ansible and other tools, and if you do that well, it works too. So I think the major benefit of projects which are using a Nixified setup is that you get exact reproducibility on the developers' computers, so you know that exactly what is running here will be running there. I think this is the same promise that containers make, and I think this is the main benefit. For most other projects we have much more of a staging setup, with a lot of VMs that are solely there to see that everything still fits together in the deployment. We have that, too, for Nixified projects, but they don't need it that much.

Second question: do you still remember when you came back to your meeting with the idea "let's try Nix and see", and how your coworkers reacted?

Okay, that was actually Domen, the guy. We had a sprint at gocept (this was the former company), and we were supposed to be hacking on Python stuff, and I remember very well: Domen just put some Nix stuff in there and said, "Here, look at that, that's cool." And we all looked at it and said, "That's weird." Well, you have to free yourself from 30 years of thinking that this is the way a UNIX or Linux is. After seeing that presentation, we were thinking about it for nearly a year and then started to try it.

Maybe time for one last question that doesn't have sub-questions.

Yes, hello, thanks for the presentation. I want to ask you about
the emerge [unclear] in Gentoo compared to the module system that Eelco showed. How do you compare them?

I think there's nothing to compare, really, because the whole emerge ecosystem is solely about installing packages and not really about doing service configuration. So the whole NixOS module system is one abstraction layer up. Did that answer your question?

Not really.

Okay, perhaps we can have a personal talk.

Yeah, all right. You can find him afterwards during what comes up next, which is the next coffee break. So a small round of applause again, please, for Christian and his great job. Thank you. [Applause]