
Delivering Continuously to 10 Million Users


Formal Metadata

Title
Delivering Continuously to 10 Million Users
Number of Parts
170
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
We want feedback - fast and often. This requires quick and frequent high-quality releases. But how do you do that with a platform that 10 teams are working on? Can you do without branches? How do you keep up with testing? And database changes...? This is a look behind the scenes of AutoScout24 - a pan-European online market place - that makes continuous delivery possible with agile methods.
Transcript (English, auto-generated)
Good to go. A supernova is a stellar explosion that easily outshines everything around it. It's an explosion with a predictable impact on galaxies, on planets, on stars. And that's pretty much what our releases looked like a few years ago, probably three to four years ago.
Would users be able to use our website after having released it? Would they be able to sign on and sign in again? And how can we deliver without being offline to about 10 million users frequently?
That's our question today. Our target back then was to establish a smoothly running system of releases. Maybe as smooth as this animated production line, which is supposed to never stop, to run as long as bottles of water are produced. Our releases were to never stop as long as we produce software.
I'm Robert. I'm pretty excited to be here today. It's my first time in Oslo and I'm from Germany. I'm with gutefrage.net. So let's do a first poll: who of you has ever heard of gutefrage.net? That's not too many. I should have included an advertisement slide, then.
But to keep it short: I'm with Germany's biggest question-and-answer community, with all of our users producing the content. We enjoy a little more than 17 million registered users and almost 16 million visits per month. And I've been into agile product development, Scrum, Kanban, XP for about six years now.
I'm in charge of agile product management at gutefrage. And our goal at gutefrage.net is to transform ourselves into an agile enterprise, including marketing, sales, finance, HR and all the other business units, guided by the agile manifesto.
If I'm not presenting at a conference, I do have hobbies and I do have a private project. I'm building a house together with my wife. And I'm pretty excited to be joined today by a good friend of mine whom I met at AutoScout24. The company I was with before I joined gutefrage.
And whom I was with at a Scrum team back at AutoScout24. Okay, my name is Simon. Never mind my last name, it's unpronounceable in any other language than German. Well, Bavarian actually. I'm married, I have a little daughter and I've been developing software for 13 years. I'm with AutoScout24 in Munich since 2010.
And I'm leading eight developer teams there. A few words about the company, so that you have some context about what we're talking about. Has anybody ever heard of AutoScout24? Oh yeah, so I win. BAM! Has anybody maybe even bought a car or sold a car through our website?
No? Okay, that's pretty much the same number. Yeah, okay. Now, AutoScout24 was born as a platform for buying and selling cars. We help people find the right car for them, and services if they already have a car. We are present in 18 countries, unfortunately not in Norway or any other northern European country.
So it's not a surprise that nobody would have heard of us. About 10 million people use AutoScout24 every month. That is through websites and apps on mobile phones. So it's all about cars but I always say at heart we're actually an IT company. We have about 100 employees in IT and we build good software and that's what we want to talk about today.
You can take the seats over there if you want. I'll give you a minute. So back in 2009, about five years ago, we started to introduce Scrum. Let's do the second poll:
who of you has ever made use of the Scrum framework? Ah, those are my people, okay. So back in 2009 this meant splitting our project teams into cross-functional teams, each comprising a product owner, up to six software developers, a quality manager and a Scrum master.
And we saw important improvements. We enjoyed better quality in terms of fewer bugs, and, since as humans we do make mistakes, whenever we made a mistake we realized it earlier, at an earlier stage of development. So we gained in efficiency and in speed.
We had increments of our software ready several times a week. Nevertheless, we were stuck with monthly releases. And this is not what Agile talks about, right? It talks about shipping software of value to our customers as often as possible.
The moment it's done? Not with us. We had to wait for the next scheduled release. So we realized that our releases hurt a lot. They were annoying, they were exhausting and we just didn't like them.
So what to do about it? Why not try to ship software live the moment it's ready? So we wanted to ship our new features when they were ready. We didn't want to wait any longer for a scheduled release.
We didn't want to think of releases as a kind of leftover from the last sprint that someone, not me, not my team, has to ship. There was a lack of motivation and drive; it was exhausting, stressful and of course error prone.
And even worse, we had to realize that this was kind of waterfallish: developing software and releasing it at a very, very late stage. Bad things.
And we always keep in mind, if it's not us who ships the features, it's our competitors. And as change was omnipresent back then with our Agile mindset, we just tried to change things.
Even though we did not know what continuous live delivery was all about, we just kicked it off. The only thing we knew for sure was that there would be lots of changes. Incremental changes. Getting things done one by one. So the first thing we tried to understand was: why are our releases so time consuming?
Why are they annoying? Why do we have to wait for another release to be scheduled? We keep comparing our releases to container ships.
They are large. They're carrying many, many containers. It takes long hours or even weeks for the ship to arrive at the next harbor. What if a container breaks? How can you find it? How can you fix it?
And it's pretty much the same with our releases. We're carrying many, many features per release. A release took us one week to get it live from the start of delivery until, yes, we're live.
And our frequency was about one release per month. And we did find bugs while releasing. And we did have dependencies in our features. So it was just difficult to find them.
And if we found a bug, should we wait for the other teams to fix that single bug, being aware that this would delay the scheduled release? So all in all, there was room to improve our releases. The first idea that popped into our minds was:
if we manage to develop software independently, then we should be able to ship it independently. And that's what we tried. Feature branching.
To us, this meant we'll have dedicated development environments per user story, hence per team, as one team is developing on one user story at a time. And hey, that was great, that was cool, that was fun, because our mainline was green all over the sprint.
But we had to pay our dues at the end of the sprint. Because if up to six teams are developing software in parallel, they all have to merge their code back to the mainline at the end of the sprint, probably on the last day of the sprint.
And it's a huge effort to just orchestrate all these teams merging back their code to the mainline. And to be honest, this very, very late integration of software resulted in more obstacles than ever before.
So we tried to get rid of feature branching, and to focus back on what is known among professional software engineers as continuous integration. Now, continuous integration is nothing new.
We've been hearing that for years and years. I still want to talk about it again because I believe there are two big misconceptions that we still have today with continuous integration. And the first is the one that we had. We thought we could do continuous integration on a branch. But if you use feature branches, you're by definition not integrating continuously
and keeping changes separate over time is not continuous integration and will cause problems. So we do not use branches anymore. We develop everything on a mainline that is straight like this road, and we merge all changes daily onto it. This leaves us with small merges of only a day's worth of changes from a person, from a pair.
So what does this look like? We have TeamCity running, which on every commit, or push in the case of Git, pulls the sources from source control, builds, runs unit and integration tests and static code analysis, and installs it on a test environment.
Now, is this continuous integration? This is technology that works, and it's nice, and I believe here is the second misconception of continuous integration. Continuous integration is not about technology. It's about mindset. It is not a server that compiles and installs something automatically somewhere.
It's about developing software that works all the time, that is always ready to ship. And without this continuity, the best build server doesn't help. Now, of course, changing software is not as smooth as this river. It always comes in chunks. But the more changes you dam up, the more problems you'll have.
And in this way, real continuous integration is the foundation for continuous delivery. So when somebody asks me now, do you do continuous integration, I don't say, yeah, I have Jenkins running, or TeamCity, or TFS, or whatever it is. I say, yeah, we do not do feature branches, and we're sometimes red, but we're mostly green, so good to go live.
So what we do is we do not release every commit, which is what you might understand by continuous delivery, but we want to always be able to release. And as a developer, I know everything that I commit will go live; I just don't know when.
It might be right after I commit or push, or it might be the next day. I only know everything that I commit has to work, because it might go live right away. And this thing that every commit must be releasable requires a lot of discipline, of course. Now, we have one mainline.
We do continuous integration on it. We want to be able to release at any time. But we also have several teams working on code base, and each team maybe develops different features at one time. Now, not all these features are complete at the same time, so we definitely release incomplete features.
How can you do this without creating the big chaos? There are different possibilities. First one is maybe just hide new functionality until it's complete. So if you build a new page, you would put the links into the new page only when the page is ready. Or you can build stuff incrementally. You first build the data access layer, then the business logic, then the display logic,
and you enable editing the data only when everything else is complete. What we also use a lot is feature toggles. Who has heard of feature toggles? Okay, it's also called feature flippers or switches or stuff. I see many of you. Now, let me just give a quick example of how we do this and the learnings from it.
This is a part of the central configuration of one part of our platform, which is called Garage Portal. It's basically a service that enables garages to offer their services online and users to find those services and compare them and book them online.
So this is YAML, just because it's simple and readable, and you can see four feature toggles. Each has a name and the settings for our different environments. Dev would be the local developer machine, CI is the environment where the continuous integration server installs the latest version, and live is the live environment. You can, for example, see the future-offer toggle,
which is only switched on on the local machine, so it's in active development. The support-chat toggle is already switched on on CI, so it's being tested. And the all-service funnel is switched on in all environments, so the switch can probably be taken out of the code base.
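A minimal sketch of what such a configuration could look like; the toggle and environment names follow the description above, but the exact schema is illustrative, not the actual Garage Portal file:

```yaml
# Feature toggles per environment: dev = local developer machine,
# ci = continuous integration environment, live = production.
# The schema shown here is an illustrative assumption.
features:
  future-offer:         # in active development
    dev: on
    ci: off
    live: off
  support-chat:         # being tested on CI
    dev: on
    ci: on
    live: off
  all-service-funnel:   # on everywhere; the switch can be removed
    dev: on
    ci: on
    live: on
```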
How does this look in the code when we use it? This is a part of the main page where the garage can edit its data. It's an ASP.NET MVC Razor view, because .NET is our main development platform, but the whole feature toggle idea is basically technology agnostic,
so you can do it in every technology. When the feature is on, we show a partial where the garage can edit a certain set of data. When the feature is off, the partial is simply not shown, so the garage doesn't even know the data is there and can be edited. What do we do here? We have an additional if in the code,
so we basically branch an additional time in the code, doing the branching in the code instead of in source control. It's still a branch, but it's all in the code base, and that makes it easier to handle. This is a very simple example, but of course you can imagine more complex ones where you might switch out one implementation of an interface for another.
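A minimal sketch of such a toggled Razor view; the Features.IsOn helper, the toggle name and the partial name are illustrative assumptions, not the actual Garage Portal code:

```cshtml
@* A sketch only: Features.IsOn, the toggle name and the partial name
   are illustrative, not the real implementation. *@
@if (Features.IsOn("edit-garage-data"))
{
    @* Toggle on: render the partial where the garage can edit this data set. *@
    @Html.Partial("_EditGarageData", Model)
}
@* Toggle off: the partial is simply not rendered, so the garage never
   sees that the data exists. *@
```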
Feature toggles give us the possibility to develop features in the background without neglecting continuous integration. Of course, there are no silver bullets, and we've had our problems. One is that the configuration you just saw is by definition untestable, because it's different in every environment.
In the beginning, we had a separate setting for the acceptance test environment, and of course we tripped over exactly that: on the acceptance test environment it worked, it went live, boom, it didn't work. So now we have just one setting, and the acceptance test environment takes its setting from the live setting. Not all changes can be handled this way, but it works quite well for us.
And feature toggles, switches, flippers come with another drawback: whenever you add one, you increase the complexity of your code structure. So keep this key takeaway in mind:
monitor the number of feature toggles you're using, and get rid of them when you don't need them any longer. This is an example of how we do it at gutefrage: we just keep track of our feature toggles in a list that shows the name of the toggle, who created it, when it was created, and whether it can be deleted, indicated by yes or no. So keep track of that, and make sure you reduce the complexity of your code once you don't need a toggle any longer. Okay, so it's nice that we can switch our application logic back and forth,
but what about the data? If you change the structure of your persisted data, it's not that simple. Adding fields or changing names in the database is something different. Traditionally, for relational databases, you use update scripts. We do that too, for our Oracle database.
But we had quite a bit of pain with that. The pain has reduced somewhat since we started giving responsibility for database updates to the development teams, and now that we're releasing more frequently with fewer changes in every release. But some pain still remains. So when do you actually run these update scripts? Before you put the new code on the web servers, or after, or during?
Do you need a downtime when you run the update scripts? Do you have dependencies between them, so that you have to run one after the other in a certain order? To avoid all these things, when we started with the Garage Portal, which is our younger product, we decided to use MongoDB.
Who's heard of MongoDB? Okay, who's actually using it? Okay, then I'll say two words about it. MongoDB is a document database, so it does not have tables. It has collections instead, and a record is basically a JSON document in such a collection. The data inside a document is a hierarchical structure,
and the interesting thing is that all the documents in one collection can have different structures, so it's very flexible in that regard. And we've chosen this flexibility explicitly to ease continuous delivery, and in that case, it actually worked. Let me show an example. So, what we do in the Garage Portal is we basically serialize our domain objects
into documents in the database. So, we'd have a document in the Garage collection with all the data of one Garage in it. When we load the data, we basically deserialize it into a CLR object. Now, bear with me through a scenario.
We wanted to make garages able to offer additional free services along with what they offer on the platform. So, for example, when you leave your car with the garage to have some service done on it, they would give you a replacement vehicle free of charge. And to display this online, we added a field to the Garage domain class,
called 'included services'. Now, the documents already in the database didn't have that field, so what we did is add a function to the data access code that is called every time a Garage document is loaded.
For those of you who haven't worked with MongoDB: a Bson document is the binary representation of the JSON document in the MongoDB API. What happens in this code? I think it's simple enough to figure out, even if you're not familiar with C#. We just look: is there already an element 'included services'?
If not, the element is added. That's all. If the element is already there, nothing happens. So when a Garage logs on, the document without the field is read, it's added through this code, and whenever the Garage saves its data for whatever reason, the new structure is persisted to the database.
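The slide code itself isn't in the transcript, but a minimal sketch of such a load hook, using the MongoDB C# driver's BsonDocument type and illustrative method and element names, could look like this:

```csharp
// A sketch of the load hook described above; the class, method and
// element names are illustrative assumptions, not the actual code.
using MongoDB.Bson;

public static class GarageMigrations
{
    // Called every time a garage document is loaded from the collection.
    public static void EnsureIncludedServices(BsonDocument document)
    {
        // Old documents don't have the element yet: add it with an empty default.
        if (!document.Contains("IncludedServices"))
        {
            document.Add("IncludedServices", new BsonArray());
        }
        // If the element is already there, nothing happens. Whenever the
        // garage saves its data again, the new structure is persisted.
    }
}
```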
Again, this is a simple example, but of course you can imagine more complex scenarios. So what are we doing here? We've created backwards compatibility. The application can read the old and the new structure, and it always writes the new structure. And this is possible because we can have documents with different structure in one collection.
And we do not have any interruption of operations. We do not have to run scripts manually; this all works on the go. Now, not every garage will log on within a limited amount of time,
and in the end you want all your documents to have the new structure. So at first we still wrote update scripts in JavaScript, because that's the language of MongoDB, and we ran them at the end, which doesn't need downtime. But still we had to run this update script, and we had duplicate logic, one version in C# that we just saw and the other in JavaScript, and we didn't like that.
And what we do now is: at night, when there are few users on the platform, we just load the whole database into memory and save it back again, so that every document runs through the code we just saw. The next day, all documents are in the new structure and we can delete the migration code.
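A minimal sketch of that nightly read-and-resave run, reusing the illustrative hook from the previous sketch; the driver calls are from the MongoDB C# driver, and the collection name is an assumption:

```csharp
// A sketch of the nightly rewrite described above; collection name and
// the migration hook are illustrative, not the actual implementation.
using MongoDB.Bson;
using MongoDB.Driver;

public static class NightlyMigration
{
    public static void Run(IMongoDatabase database)
    {
        var garages = database.GetCollection<BsonDocument>("garages");

        // Load every document, run it through the same migration hook that
        // is applied on load, and save it back in the new structure.
        foreach (var document in garages.Find(FilterDefinition<BsonDocument>.Empty).ToList())
        {
            GarageMigrations.EnsureIncludedServices(document);
            garages.ReplaceOne(
                Builders<BsonDocument>.Filter.Eq("_id", document["_id"]),
                document);
        }
    }
}
```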
I like the picture of the two-headed snake because it captures this creeping migration that you don't really notice; it has two heads so that it can eat both versions, and they end up in one stomach. So far we've heard about continuous integration, feature toggles, and how to deal with data. Here's another key learning for us. It's about automation.
Automation, for sure, no doubt about that, is key if you want to get to the point where you're able to release continuously. Even though our build was automated, our releases still took long.
The build, the test, the deployment. Now automation came into play. For example, with provisioning of environments, testing, deployment, building, and making sure that the same config is deployed on any environment. But is this all? We didn't think so.
So even though the build was automated... I think there's a question. Oh yes, up there there's a question. This one.
Okay, so that's you, Simon. I'd say there's no recipe for that. You'd have to think about it and establish how you proceed in that case. So I can't give you a general answer; we'd probably do it differently every time.
But the thing is, this works for the most common changes. We have a lot of changes in the front end, little features that we try out, and then we do A/B testing and see what works best. For bigger changes, you'd probably have to put more brainpower into it. Did I answer it in some way? Okay.
Okay, any other questions? Okay. Then let's continue with our build system. The obvious solution to us was: let's invest in hardware, buy new build machines that allow for parallel builds. Well, it was a good idea and we got faster, yes.
But in the end we realized that there was another, even better solution. Why not strip your build system down? Why not make it leaner, and make our code structure less complex? Make it simpler. And that's even more compelling.
Whenever you try to automate things, make sure you keep your code as simple as possible. Well, that sounds cool. But now we come to a crucial point: releasing. And remember, we do not want to be offline when releasing. There are two different scenarios.
One of them is blue-green deployment. Has anyone of you ever heard of blue-green deployment? A couple of hands showing up. So the key point about blue-green deployment is that you split your web servers into two different pools.
They are both online at the same time when you're not releasing. Whenever you release, the load balancer makes sure that one pool, let's say the blue one, is taken offline while the green one is still online. The new code is deployed on the blue half of the pools,
tested, and taken online again, again managed by the load balancer. And as soon as the blue half is online again with the new code, the load balancer takes the other half, the green one, offline, the new code is put on it, and it's taken online again.
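A minimal C# sketch of that sequence; the ILoadBalancer and IDeployer interfaces are hypothetical assumptions for illustration, not a real API or our actual tooling:

```csharp
// A sketch of the blue-green sequence described above.
using System;

public interface ILoadBalancer
{
    void TakeOffline(string pool);
    void BringOnline(string pool);
}

public interface IDeployer
{
    void Deploy(string pool, string version);
    bool SmokeTest(string pool);
}

public static class BlueGreenRelease
{
    public static void Run(ILoadBalancer loadBalancer, IDeployer deployer, string version)
    {
        foreach (var pool in new[] { "blue", "green" })
        {
            loadBalancer.TakeOffline(pool);  // the other pool keeps serving users
            deployer.Deploy(pool, version);  // install the new code on the idle pool
            if (!deployer.SmokeTest(pool))   // verify before exposing it to traffic
            {
                throw new InvalidOperationException("Smoke test failed on " + pool);
            }
            loadBalancer.BringOnline(pool);  // pool is back online, repeat for the other
        }
    }
}
```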
Following this simple principle, you're never offline while releasing. The most crucial part in this scenario is making the load balancer orchestrate these changes. But there's another approach you could try. Simon?
Okay, just to repeat, the question is: how do you go about databases with blue-green deployment? Well, I've heard of people talking about blue-green deployment for databases.
We haven't done it; I don't think it's a really good scenario. So we decouple these things and do them separately. This is for web server deployment. Another one?
So the question is: do you use it also for rollbacks, to see whether it works live? And what do you do when you have database changes included? First question, rolling back: yes, of course. The idea is that while the blue pool is inactive, you can actually test it, see if it works live, and then roll back to the old version if you see that it doesn't, without impacting any users.
On the other question, about database changes: I believe it's a very good idea to decouple your code changes from your database changes, in that you keep, as we saw before, the code backwards compatible with the new version of the database, or better, with the old version. So you can decouple these, and you don't get into problems with blue-green deployment
when you also have database changes interrelated. Okay? Just another one?
Yeah, okay. So the question is, what crap are you talking? You just showed some code where you actually write new data and you can't just switch back to the old version.
I suppose that is correct, yeah. Of course, when you write new data, or when you write it in a new structure, then you are being sort of destructive about your data. So if you want to be backwards compatible here, then you'd probably have to have an intermediate version where you write both versions. Say you rename a field: you still write both fields,
the old name and the new name, and only when you see that it's working live, and after some time, do you remove the field with the old name. I'm pretty sure this doesn't work for every scenario. So blue-green deployment is good if you have a lot of changes in the front end which don't really go into the database, and you have to think about it when you do database changes.
There's another one up there.
So the question is, if I got you right, do you do backups before you release or what other rollback strategy do you have?
Is that correct? Okay, well, we do not actually have a rollback strategy. We only go forward. That is the idea: taking the risk of not being able to roll back, which, of course, means you have to have high quality all the time. I'll get to that in a minute. And the other thing is, of course, we do backups, but not before every release; we do backups regularly, I think every night.
And if something goes wrong and we have to go back to the backup, then we'll have to do it, but that hasn't happened in a long time. Okay, we can take more questions at the end, I suppose. Now, another thing, to be honest: we're talking big about blue-green deployment here, saying, yeah, we've been doing that for years, and just a few weeks ago we actually discovered that we always had exceptions on our live servers when we deployed,
because we still cut off connections: we just told the load balancer to switch hard from one moment to the other. We found that out only recently, so now we're doing it more intelligently, waiting for connections to actually end,
and then we switch over to the other pool. So you have to be intelligent about all this stuff. Okay, we have feature toggles and we have blue-green deployment. Now we want to go one step further. Why? Because we're still doing a big bang release when we switch a feature toggle on: the whole code that is behind the toggle, which can be quite a bit when you have big changes,
goes live to all users at once. And we still need to deploy when we want to switch a toggle. What we want to do is use virtual canary birds. In the old days of coal mining, canary birds were used to detect poisonous gases in the mines,
and the poor birds would fall to the ground before the miners even noticed that there were toxic gases in the air, so they could run when they saw the birds on the ground. This gives the name to the deployment strategy of canary releases, where you switch new features on only for a few users,
and you observe them: when they fall dead to the ground, you'd better switch back to your old version, and when they are happy, you can go on and release it to everybody else. What we also started doing is writing regression tests for both branches of a toggle, so for switched on and off, so that we can actually have both versions live without risk.
And we want to separate the code release from the feature release, so that we can switch between without deployment. And the idea is to go so far that the product owner can actually go to the live service, switch on the feature only for himself, do acceptance, and then switch it on for the desired group of users or for everybody.
So to support this, some clever colleagues of mine have developed this thing called FeatureBee. It's basically a central management of features: you can view and change the feature status. This is an example screenshot. The first column would be features that are in development.
The second one is the features that are switched on for the canaries. And the third is active for all users, so they are basically live for everybody. To select the canaries, you have certain filters: they can be filtered by country or by browser and browser version, or you can assign a feature to a percentage of users.
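To illustrate the percentage filter, here is a minimal C# sketch of stable, bucket-based canary selection; it only illustrates the idea and is not FeatureBee's actual implementation:

```csharp
// A sketch of percentage-based canary selection as described above;
// the hashing scheme is an illustrative assumption.
using System;
using System.Security.Cryptography;
using System.Text;

public static class CanarySelector
{
    // Hash user and feature name into a stable bucket from 0 to 99, so
    // the same user always gets the same answer for the same feature.
    public static bool IsEnabledFor(string userId, string featureName, int rolloutPercent)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(userId + ":" + featureName));
            int bucket = BitConverter.ToUInt16(hash, 0) % 100;
            return bucket < rolloutPercent;
        }
    }
}
```

A five-percent canary rollout would then be a call like IsEnabledFor(userId, "support-chat", 5).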
So this also goes towards A/B testing. There's also a browser plug-in where you can switch on a feature just for yourself, so that you can see for yourself whether it works or not. This is open source and on GitHub; it's implemented in .NET, but of course we welcome contributions for other platforms. Here's another example screenshot.
At gutefrage, we're currently migrating our backend to Scala services, and they're deployed once the developers commit their code to the mainline. It looks pretty much like this: for example, with this common service, as long as the background color of each of these services is green,
everything's fine. When it turns orange or red, there's a problem behind it, and it signals where to check, on that single individual service, to get it fixed and run the deployment again. It's our visual way of realizing when everything's not running smoothly.
It's as easy as that, but your tools have to let you realize the moment something happens. Okay, so we're developing in several teams on one code base, with different features in development in one team at one time,
and we want to release at any time, so we do not have a lockdown anymore or a code freeze, and we do not have a phase where we can check the whole platform and see that everything's working. We don't want to do rollbacks anymore, so we always want to go forward. This all requires continuous quality as well.
What does this mean for us? It starts with pair programming. The two teams that have been working on the Garage Portal have developed all production code in pairs from the beginning, and this not only means four eyes for high quality, but also built-in know-how transfer, and we really discovered the fun of working this way.
And pair programming also has another advantage over the classical code review: when you work by developing, committing, and then doing the code review, the code might go live between the commit or push and the review.
This is the testing pyramid, which shows what we test on what level. The goal is high automation, of course, to have fast and repeatable feedback. The width of the pyramid shows the automated coverage, and of course we use TDD, so that no code is written that is untested.
On the lowest level are unit and small component tests, which are supposed to have very high coverage, of course. Then we have regression tests, which test only our main use cases, but they run continuously, so that we always know how the main use cases are doing. So far we've been using browser tests for that, Selenium, which we've had quite a bit of pain with,
in the sense that the tests are slow, because they run through browsers and through the network, and because we have quite a few brittle, flaky tests that are red although the functionality is actually working correctly. So what we're doing is trying to reduce the number of tests we have, remove the ones that test duplicate things,
and also go down an architecture level, testing the business logic or the application logic directly instead of going through the browser all the time. The next level would be smoke tests. This is the part of the regression tests covering only the most important functionality,
which we run after every delivery, so that we know that whenever we deliver to an environment, it's still working. One example: in the Garage Portal we always had problems with sending out emails, so now we have a smoke test that, after every delivery, confirms that emails are still working, because this is very important for us.
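As an illustration, such a post-delivery smoke test could look like this in NUnit; the EmailService class and its API are hypothetical assumptions, not the actual code:

```csharp
// A sketch of a post-delivery smoke test like the email check above.
using NUnit.Framework;

[TestFixture]
public class EmailSmokeTests
{
    [Test]
    public void EmailsCanStillBeSentAfterDelivery()
    {
        // Runs against the freshly deployed environment after every delivery.
        // EmailService is a hypothetical stand-in for the real mail component.
        var service = new EmailService();
        bool delivered = service.Send(
            "smoke-test@example.com", "Smoke test", "Delivery verification");

        Assert.IsTrue(delivered, "Sending email is broken on this environment");
    }
}
```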
And above the pyramid, beyond automation, you still have manual regression and explorative tests, because no automation can substitute the intuition of a human, and our testers know pretty well which areas of the code are critical, what has changed, and what the coders usually break.
We also have these friendly gentlemen. We call them the cops, after the static code analysis tools StyleCop and FxCop; we also use ReSharper, and JSHint for JavaScript. These are tools for checking formatting and common errors and bugs. We also respect compiler warnings and treat them as errors.
It's all very annoying in the beginning, but once you get used to it, you have the advantage of a uniform formatting, so you can't really recognize who has written which piece of code, although developers all have their different styles of coding. And, of course, you avoid bugs from the beginning.
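For .NET projects, treating compiler warnings as errors is a single, standard MSBuild property in the project file; a minimal sketch:

```xml
<!-- Treat compiler warnings as errors, as mentioned above.
     Goes into the .csproj project file. -->
<PropertyGroup>
  <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
</PropertyGroup>
```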
The question is, when do we run the static code analysis? Is that correct? I would say the actual continuous integration happens on the developer machine,
so before you actually commit, you would run unit tests and static code analysis, so that you know at once if you've broken anything. And, of course, it's run again on the continuous integration server. But it's important to have the feedback as quickly as possible, so you could even run it during coding time;
for us it runs at compile and unit-test time. Okay? So, moving on. Failures are, of course, a great chance to learn, to get better next time. And whenever a failure happens on the website, we do it like this: the entire team stands up and gathers on our floor
and asks three questions, and answers them as well: What happened? What's the impact? Who takes care of it? And when do we meet again to check whether we've been able to fix it?
That's part of our ownership. It's the teams who deliver the software; it's the teams who are doing the releases. And even after that little incident is over and things are fixed again, they gather once more to ask five questions:
five times 'why' in a row. Why, why, why, why, why? To find out the root cause of the failure. So it's just that: it's all about teamwork. And once upon a time, we had a release manager orchestrating
a big, huge release task force for our monthly releases. For us, this has become a fairy tale, because now it's the teams themselves who are in charge of releasing. Each team is able to ship its software independently of the other teams.
And we managed to improve our releases from five weeks down to less than an hour from commit to live. And as Simon just pointed out, we do not have any rollbacks; we only go forward, with roll-forwards.
So is this annoying? Yes, it is. But if it hurts, do it more often and you get trained at it. That was our mantra back then, and so we did. There was another learning: when we achieved continuous live delivery,
we learned a lot about flow, about improving flow. We spotted impediments and got rid of them. And that's why we tried to implement Kanban with a couple of teams instead of Scrum. Nowadays, we do not have teams that follow Scrum by the book,
nor do they follow Kanban by the book. They do anything in between. You'd like to call it Scrumban? I don't care. But we realized that whenever we try to improve any kind of flow, there will be impact on other flows as well.
Okay, so to show you that we're not inventing things, this is a screenshot from the beginning of the week. It's the live delivery build in TeamCity of the Garage Portal that I was talking about.
So what can you see here? A few things. You can see, for example, that right here we had a gap of, what is it, seven days without a delivery. So, of course, you can ask yourself: he's talking about continuous delivery; is this continuous delivery? You can see here that a live delivery broke,
but we actually fixed it an hour later. Well, no, it's actually, what is it? A few minutes. Anyway. This is more the roll-forward scenario: try to fix it quickly, be able to fix it quickly, so you don't have to go back. What else can we see?
There's one day, the 28th of April, where we actually went live three times in a day. So that's possible too. Now, of course, you can ask: is this continuous delivery? I don't like the big break in there either. Probably we were doing server patches, because unfortunately we don't have that included in our delivery pipeline yet. So it could certainly be a lot better, and there are still quite a few things that can be improved.
For us, it is important to be able to go live when we want to. Now, what can still be improved? What are we working on? AutoScout24 turned 15 last year, and the vehicle market is a big application, the kind of big monolith you often get in these cases.
It used to be released in one big bang, as we heard before, which of course prevents teams from working independently and releasing independently. So what we're working on is splitting up this monolith in separate components that can be developed and released independently and that single teams can take responsibility for.
Of course, you have to watch out. It's not all done by splitting it up into a separate code base and pulling up a release pipeline for it. There are, of course, hidden dependencies and you have to figure them out and manage them and we've tripped over a few of them, of course. Another thing is DevOps.
So, we've been working in interdisciplinary teams, but operations is basically still a separate department, and, of course, ops and devs have, by definition, different interests: developers want to change as much as possible, as quickly as possible, and ops want to change as little as possible, of course, because of the stability of the platform.
But our goal is to have one team responsible for release and operation of their product. You build it, you run it and, of course, that needs competence on the team for server configuration, automation, monitoring and all these things. What we have now is one developer in the team has a DevOps role,
so he has additional rights to the live platform and acts as a link to the ops department. What is missing is that basically every developer has that role and has the right to deliver but also the obligation to watch out and to monitor and to run the thing.
So, let me tell you a last story, about our so-called user sign-up bug. Users can sign up on our page for free; there's no charge, and about 10,000 users a day do so. But several weeks ago, those roughly 10,000 users
signed up and were not able to sign in after signing up. So, what happened? The sign-up data was written to the database, check, but unfortunately to the wrong table. And even worse, we did not realize it,
because back then we only monitored operations data, operations KPIs like bugs, latency, performance. And it took us very, very long hours to realize that there was a big, important bug live on our platform.
So the team was blind. They did not check, after releasing, whether users would be able to sign in after they signed up. Today, we do have screens that look like this, showing business KPIs,
like in this case the correlation of sign-ups and sign-ins. And we could have noticed this bug way earlier if we had had that TV screen with the correlation of sign-ups and sign-ins. So monitoring is not only about operations data.
It's also about business, about business KPIs, to find out whether something happened after you released on your platform. And it's crucial, crucial if you want to release frequently, as often as possible, whenever it's done, whenever you want. It's all about flow, about getting things to your customers, again, again, and again.
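A minimal sketch of what such a business-KPI check could look like; the counters, the time window, and the threshold are illustrative assumptions, not the actual monitoring code:

```csharp
// A sketch of a business-KPI check like the sign-up/sign-in correlation
// described above; the 0.5 threshold is an assumption to tune per site.
public static class SignUpSignInMonitor
{
    // Alert when many users sign up but the number of successful sign-ins
    // collapses relative to it, as in the user sign-up bug above.
    public static bool LooksHealthy(int signUpsLastHour, int signInsLastHour)
    {
        if (signUpsLastHour == 0)
        {
            return true; // nothing to correlate in this window
        }
        double ratio = (double)signInsLastHour / signUpsLastHour;
        return ratio >= 0.5;
    }
}
```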
Okay, any more questions? Yes, please. Yes.
Yes, so you're talking about the iOS one, are you? Yeah, continuous delivery. Okay, so the question is: we also have apps on iOS and Android; how do we do continuous delivery for those?
Well, you can't really do it. I mean, on Android you can probably deliver quite fast, but to get through the Apple App Store takes a few days at least. So you can't really deliver to your users a few times a day, but you should still respect the principles behind it, so continuous integration, always being able to deliver
and do a roll forward and these things. It's indeed a bit faster with Android, but you can't deliver a new app when you want. That's true. Any other questions? Yes, there's one over there on the right-hand side.
Actually, no, not that I'd be aware of. We were asked to just blog about it,
but to be honest... Good point, no. Would you prefer a presentation about it? Actually, we want to start a blog to share more of our learnings, but we haven't done so so far, so I'm afraid I can't really tell you anything about it. Any other questions? But I can give you my card, and you can ask me directly afterwards, or via mail, okay?
We had a question over there. Yes, there's another one.
So, Simon, I guess that's yours. The question is about how to deal with caching. Yeah, so the answer is basically we are crap at caching, so that's why we're not talking about it. But thanks for asking. Any other questions?
Crap included. Yeah, there's one down there. Okay, the question is: do you use app services, and do you load balance them and blue-green deploy them? By app services you mean running background services? Yeah. Yes, we have them.
No, we do not blue-green deploy them. Why not? Probably because we can just take them offline and start them up again, and no user notices any impact, for most of them at least. Okay? More questions. Yes, up there, and down there afterwards.
No, it's you. So, there's sort of a process where we'd say: okay, we want to go live. Now, this might be handled differently in every team,
but normally, during the stand-up in the morning, we'd say: okay, we have to go live as soon as possible. And a developer, a tester, whoever, would push a new version to the acceptance test environment and look over it, and he'd know what has changed because he's working with the team. And he might either say: okay, I've seen all the regression tests are green,
or check a few things manually and then say: okay, this is ready to go live. So there's no 100% certainty that everything's working, but the idea is working closely together and knowing all the time what's going on. And what we also do is: when we have an acceptance test environment that is green, in the sense that all the regression tests are green,
and a tester has said, okay, I've looked over it and it's fine, it automatically goes live. Does this help? It looks like I'm facilitating your questions now. Okay, the question is about the tooling for deployment.
So, what we use is TeamCity, Rake, and I think Rake also starts some PowerShell scripts. What else is there to it? Nothing else, really. Well, we used to have an archaic tool called, what was it called?
Refly Web. I don't know if anybody is aware of that. It was a big pain in the thing, and we got rid of it as soon as possible after the one who really wanted it left the company. Okay, I guess one last question down there. Yeah.
Oh, okay. The question is, how much time have you spent on implementing all this?
That's a good question. I'd say a lot, at least in terms of mental cost and motivation, and in terms of the time invested to just get it done. But in terms of the cost... do you need a number, a figure?
No, I think it's really hard to tell, yeah.
Okay. So, all of this is part of developing, of getting a story done. I guess, and that's only a rough guess, it's around 25 to 30 percent per user story, which is, depending on the size of a story,
up to three to four days, maybe only hours. But if you're talking in terms of the investment you have to make to get this, I think we actually saved time, because before, we had a one-week release and basically wasted one week of time.
So now this is all faster and as soon as it's got into people's heads and into the process that they have, then you're actually faster and you save time. It's quite a good example to show that it pays off later in the day. Okay, there's one last question up there. No problem.
Yes, so the question is, how about big architectural refactorings and branching?
So, yes, we do still avoid branching. We try to split it up in little chunks that can go one by one. So we might do branches, but only for trying things out locally and then we do not merge them back, but we avoid branches at all costs, basically. Now, of course, we do have architectural changes and the splitting up the monolith that I was talking about,
this is a really big architectural change and, of course, when you have these big things, you really have to think about them and you can't just say, okay, I'll put a feature toggle and a migrating serializer and that's it. You really have to think about them and you have to plan them ahead and you have to split them up in small chunks. That's all I can really say in a general way about it.
But if you want to discuss specific scenarios, feel free to come down afterwards and we can see what we have experienced. So I guess, Simon, we'd be happy to answer. I think we have like three more minutes to take questions. Is that right? Yeah, that's right. Yeah, up there.
The question is about monitoring tools, what do we use? Do you want to give an answer? Most of ours are self-made. I just can't remember the name of the tool, but it may pop up in a second.
How about AutoScout24? Well, we try not to build our own, so we have Splunk running, but we're not very happy with the monitoring solutions we have at the moment, so we're basically looking for stuff that works better. I think we have PRTG, and then we have something else,
but I think we're on our way to finding something else. I can't remember it right now, but we can check after the presentation if you're interested. But I think you should be able to find stuff that is widely used and adopted, with good plugins and visualization, without building things on your own. You might have built some plugin to probe stuff in your database
so that you can display that, but that's all. Yep. Okay, so the question is, what do we think is important to monitor?
Is that correct? I think the most important thing to monitor is really what your business is about. As Robert just showed earlier, if it's important for you that users can actually sign in and sign up, then you should monitor that, and you should really be aware of it at all times. We do not monitor values like,
is a value in a database correct, or are servers up, or is a certain thing running correctly? It's more important to actually have your business KPIs in your view because then anything that breaks those really means, okay, you have to take action. Everything else is really second priority. Okay?
Any other questions? Any more questions about caching or stuff we don't do? Stuff we did not talk about? Nobody's perfect, so there's loads of things. Or blogging, I don't know. Maybe expose us in some other ways. No. So we'll be happy to just answer any other questions right now.
We're here, we can have lunch together if you want, and thanks for listening and for the good questions. And enjoy making your supernovas more predictable.