
Adding zero-downtime migrations strategy in a SaaS project


Formal Metadata

Title
Adding zero-downtime migrations strategy in a SaaS project
Title of Series
Number of Parts
141
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Zero-downtime migration is a technique for running database migrations without stopping the web app. As clients' databases grow larger, applying necessary updates to the database can become time-consuming or potentially break the database schema. This talk will describe problematic operation types and provide a strategy for writing and running migrations to release new software versions without downtime.
Transcript: English(auto-generated)
Hello everyone, and first of all, thank you very much for being here. Today I will talk about implementing a zero-downtime migration strategy in a software-as-a-service project. Imagine you are a big player on the e-commerce stage, like Amazon, and you are forced to take a maintenance break.
The last time they experienced downtime, they were losing $66,000 per minute. We faced a similar problem in Saleor, which is an open-source, GraphQL-first e-commerce platform.
Obviously not at this scale, because we are not yet as popular as Amazon, but still the problem is the same. I will show you how we handled the situation and the obstacles that we faced along the way. So, let me quickly introduce myself.
I'm Iga Karbowiak. I'm from Wrocław, in Poland. I've been using Python in my work for over five years, and I've been working on the Saleor platform for almost four years. I'm happy to see how this project is changing from just an open-source project into a proper product.
First, I would like to briefly tell you the story behind Saleor. Saleor started as an open-source monolith project built on Django with views. In 2018 the big decision was made and a GraphQL API was added, and a year later the Django views were removed completely, making Saleor a headless app, which basically means that we started to offer just an API without the front-end side.
Over time Saleor grew, we gained a community around it, and the idea to build a product around Saleor came up. This led us to the creation of a full cloud environment that offers Saleor as a SaaS product.
As you can see, the story behind Saleor is constant growth, and when you are getting bigger, you are getting more clients and you are facing bigger problems. One of the clients' requirements was the ability to update versions without any downtime.
So, to sum up the background of the problem: clients of an e-commerce platform like Saleor don't want to stop their stores at any time because, of course, it means losing money. So we need to ensure that the system is working all the time, even during the update, and we need to minimize the migration time. I think it's safe to say that sooner or later each SaaS application will face the downtime problem.
So let's move on to the examples. I will start by showing you the problematic operations, and then I will move to the solutions. For this talk I created a sample project with a simple GraphQL API and a PostgreSQL database, which is the same configuration as we have in Saleor. I will give you the link to this project at the end of the presentation.
Let's assume that we are currently running version v1 of the system. We have multiple app clusters that are using a shared database, and we are planning to introduce some changes to the product that we will release in the next version, v2.
Here is the product model in the current version, v1. The model is just a simple Python class that refers to a single database table, and the class fields refer to table columns. Let's assume that our product model has name, description and created fields. We are planning to add a new unique slug field that will be a human-readable identifier of our product. Additionally, we want to rename the field created to created_at, to be consistent with other parts of the project.
As we previously stated, we are currently on version v1 of the system. Assume that we release the mentioned changes in version v2 and we want to upgrade to this version. Upgrading the app instances is easy: we just need to gradually replace all the app instances with the new ones. The problem is with the database, because it's a shared resource and we cannot just clone it and replace it. Instead, we will upgrade the database in place.
First, we begin by migrating the database to version v2, so we'll be in a stage where we have an upgraded database but are still running app instances of the previous version. Once the migrations are completed, we can start the app workers of version v2 and, finally, stop the app workers of the previous version.
So the two stages that we should worry about are stages two and three, where we have upgraded the database but are still running app workers from the previous version. In other words, we have the old code using the upgraded database. We will analyze these stages in the following examples, so let's move on to the problematic operations.
We will start with adding a unique slug field that will be a human-readable identifier of our product. What you see on your right is a Django migration. For those who don't know, migrations apply object-relational mapping changes to the database schema. In other words, the database needs to know that something has changed in the models. Migrations define operations that will be performed on the database, but they also allow running a Python function to update existing instances.
To add a unique field, we need to follow three steps. The first is adding the new field, so we create a new slug column on the product table. Next, we need to update the existing instances to set proper values for the slug column, so we call a Python function that does that synchronously. After the function is finished, we have the new slug column filled in with proper values and we are ready to perform the last step, which is changing the slug field to be unique. To apply this change, we alter the slug column on the product table.
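The three steps just described can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: SQLite stands in for the PostgreSQL database used in the talk, plain SQL stands in for the Django migration shown on the slide, and the slugify rule and the helper names (make_v1_database, migrate_v1_to_v2) are hypothetical.

```python
import re
import sqlite3


def slugify(name: str, pk: int) -> str:
    """Hypothetical slug rule: lowercased name with dashes, plus the primary key."""
    base = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{base}-{pk}"


def make_v1_database() -> sqlite3.Connection:
    """Product table as the v1 code knows it, with two existing rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "description TEXT NOT NULL, created TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO product (name, description, created) VALUES (?, ?, ?)",
        [("Blue Shirt", "", "2023-01-01"), ("Blue Shirt", "", "2023-01-02")],
    )
    conn.commit()
    return conn


def migrate_v1_to_v2(conn: sqlite3.Connection) -> None:
    # Step 1: add the new column (empty string as a placeholder default).
    conn.execute("ALTER TABLE product ADD COLUMN slug TEXT NOT NULL DEFAULT ''")
    # Step 2: set proper values on existing rows, synchronously.
    rows = conn.execute("SELECT id, name FROM product").fetchall()
    for pk, name in rows:
        conn.execute(
            "UPDATE product SET slug = ? WHERE id = ?", (slugify(name, pk), pk)
        )
    # Step 3: enforce uniqueness once every row has a value.
    conn.execute("CREATE UNIQUE INDEX product_slug_uniq ON product (slug)")
    conn.commit()
```

Note that step 2 runs inside the migration itself, so the migration cannot finish until every row has been updated.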
Now, let's assume that we release those changes in version v2 and we update the database to this version, while the app is still on version v1 of the system, as shown on this diagram. When the API request for product creation is called for the second or a subsequent time, an integrity error is raised. The error message says that a product with an empty slug value already exists. This is because version v1 is not aware that the new unique field was added, and it's trying to save the row with null as the slug.
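A minimal reproduction of this failure, again using sqlite3 as a stand-in for PostgreSQL. The sketch assumes the "empty" slug is stored as an empty-string default; that detail is an assumption, as are all function names. (A true SQL NULL would not collide under a unique index in SQLite or PostgreSQL.)

```python
import sqlite3


def make_v2_database() -> sqlite3.Connection:
    """Database already migrated to v2: slug exists and must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "slug TEXT NOT NULL DEFAULT '')"
    )
    conn.execute("CREATE UNIQUE INDEX product_slug_uniq ON product (slug)")
    conn.execute("INSERT INTO product (name, slug) VALUES ('Blue Shirt', 'blue-shirt')")
    conn.commit()
    return conn


def create_product_v1(conn: sqlite3.Connection, name: str) -> None:
    """What the v1 code does: it has never heard of the slug column."""
    conn.execute("INSERT INTO product (name) VALUES (?)", (name,))
    conn.commit()


def demo():
    conn = make_v2_database()
    create_product_v1(conn, "Red Shirt")  # first call: stored with an empty slug
    try:
        create_product_v1(conn, "Green Shirt")  # second call: empty slug again
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None
```

The first v1 insert succeeds, which is why the problem only shows up on the second or a subsequent request.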
Below you can see the part of the product table that shows the described situation. We have instances with proper slug values set and one row with an empty value, so adding the next row with null as the slug will raise an error, because the values won't be unique anymore. We'll discuss the solution later.
Now let's move on to the second operation, which is renaming the field from created to created_at to be consistent with other parts of the project. It seems to be a pretty simple operation, we are just renaming the column, but it might be problematic as well.
As before, let's assume that we release the mentioned changes in version v2, we upgrade the database to this version, and the app is still on version v1 of the system. The API request for product creation, as we can expect, raises an error, and this time the error message indicates that the column created on the product table does not exist. This is accurate, because we already renamed this column and we have only the created_at column in the database. However, version v1 is not aware that something has changed, and it's trying to save the value in the old created column.
We'll get a similar error when trying to retrieve an existing product from the database. This time the API is trying to fetch the data from the created column, which does not exist anymore. Now let's move on to the last operation that we'll discuss today.
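The rename problem described above can be reproduced in a few lines. As before, this is a sketch with sqlite3 standing in for PostgreSQL (it needs SQLite 3.25 or newer for RENAME COLUMN), and the function names are made up for the illustration.

```python
import sqlite3


def make_renamed_database() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, created TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO product (name, created) VALUES ('Blue Shirt', '2023-01-01')")
    conn.commit()
    # The v2 migration: a plain rename of the column.
    conn.execute("ALTER TABLE product RENAME COLUMN created TO created_at")
    return conn


def v1_create(conn):
    # Old code still writes to the old column name.
    conn.execute("INSERT INTO product (name, created) VALUES ('Red Shirt', '2023-01-02')")


def v1_fetch(conn):
    # Old code still reads the old column name.
    return conn.execute("SELECT name, created FROM product").fetchall()


def failures(conn):
    """Run both v1 operations and collect the database errors they raise."""
    errors = []
    for operation in (v1_create, v1_fetch):
        try:
            operation(conn)
        except sqlite3.OperationalError as exc:
            errors.append(f"{operation.__name__}: {exc}")
    return errors
```

Both the write path and the read path of the old code break, which matches the two errors described in the talk.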
Let's suppose we have a large dataset of products, perhaps a million or more. Updating all existing instances will take a significant amount of time, no matter how hard you try. The problem is that the update blocks the migration process, as we need to wait for the update to finish to continue with the database migration. It also locks the database tables and keeps the database in an unstable state, which may slow down the application or even make it unresponsive. As a result, it significantly extends the time of the database migration.
So these were some of the problematic operations that we should be aware of. Let's map them. First, adding a unique or non-nullable field, like adding the slug field in our example. In Saleor we had such a situation, for example, when adding an expiration date to our order model.
Next, updating a large amount of data. I can say that in Saleor we face this problem most often, as our clients have quite big collections of orders and products, and any field normalization, like recently adding the discounted price, means that we need to update each instance separately.
Then, renaming a field, like renaming created in our example, but also renaming a table, removing a field or table, or moving data from one field to another. All of these operations will cause an error similar to the one from renaming the field. In Saleor we faced this problem when changing the ID to a universally unique identifier.
So the key is to not remove any fields that the previous ORM will use, and to minimize the time of each migration. Now that we know all the problems, I owe you the solutions.
The biggest difficulty in upgrading is changing the database, as it's a shared resource and we can't just clone it and replace it. To ensure zero downtime, we need to ensure that the updated database will work with both the old and the new version of the system.
There are two possible options to do that. The first is to make the old code compatible with the new database schema, and the second is to make the new database schema compatible with the old code. Fixing old code is hard and it requires crafting two releases, so we decided to choose the second path, as it's easier to achieve, and I will describe the solutions that fit this approach.
One more important assumption is that we ensure zero downtime only when changing one version at a time. So in our example, an upgrade from v1 to v2 will be possible without any downtime, but switching from v1 to v3 won't be. Let's start with the solution for our first problem, adding a unique slug field.
We need to ensure the compatibility of the previous version, v1, with the database upgraded to version v2, so we need to apply some of the database changes already in version v1. The solution here is to apply the first two steps of the migration on version v1 of the system.
The first is adding a new nullable field, so we add a new nullable slug column on the product table. The difference is in the second step. Because we want to minimize the time of each migration, instead of updating the existing instances synchronously in the migration code, we delegate it to an asynchronous task that will do it in the background, after the migration process. I will tell you more about that later; for now the most important thing is that we are doing this asynchronously.
We also need to ensure that any new instances created on version v1 will have the proper value set. This leads us to the last step, which is updating the API, so when any new row is added, it will have the proper value set in the slug column.
Just after performing the migration on v1, the database will be in a state where we have the new slug column with empty values. When the asynchronous tasks are finished, the slug column will be filled in with proper values. At that point we'll be ready to safely change the field to unique in our target version.
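The slug solution described above can be walked through end to end in a small sketch. It relies on assumptions: sqlite3 stands in for PostgreSQL, a plain function stands in for the background task, and make_slug and the other names are hypothetical. It depends on the fact that rows with NULL in a uniquely indexed column do not collide in SQLite or PostgreSQL, which is why the old code keeps working against the nullable column.

```python
import sqlite3


def make_slug(name: str, pk: int) -> str:
    # Hypothetical slug rule, used only for this illustration.
    return f"{name.lower().replace(' ', '-')}-{pk}"


def make_v1_database() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO product (name) VALUES ('Blue Shirt')")
    conn.commit()
    return conn


def migrate_to_v1_1(conn):
    # Intermediate release: add a nullable column only; no data update here.
    conn.execute("ALTER TABLE product ADD COLUMN slug TEXT")
    conn.commit()


def create_product_v1(conn, name):
    # Old code: unaware of slug, so the new row gets NULL.
    conn.execute("INSERT INTO product (name) VALUES (?)", (name,))
    conn.commit()


def create_product_v1_1(conn, name):
    # Updated API: every new row gets a proper slug.
    cursor = conn.execute("INSERT INTO product (name) VALUES (?)", (name,))
    pk = cursor.lastrowid
    conn.execute("UPDATE product SET slug = ? WHERE id = ?", (make_slug(name, pk), pk))
    conn.commit()


def backfill_slugs(conn):
    # Stand-in for the background task: fill rows that still have no slug.
    rows = conn.execute("SELECT id, name FROM product WHERE slug IS NULL").fetchall()
    for pk, name in rows:
        conn.execute("UPDATE product SET slug = ? WHERE id = ?", (make_slug(name, pk), pk))
    conn.commit()


def migrate_to_v2(conn):
    # Target version: the column is populated, so uniqueness can be enforced.
    conn.execute("CREATE UNIQUE INDEX product_slug_uniq ON product (slug)")
    conn.commit()
```

Calling create_product_v1 several times after migrate_to_v1_1 raises no error, in contrast to the direct v1 to v2 upgrade.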
So the operation that must be performed on version v2 is to alter the slug column to make it unique.
The second operation was renaming the created field. To ensure compatibility, we have to perform the changes in three main steps. First, we need to add a new field next to the existing one. In the second step, we need to copy the data from the old field to the new one. Finally, we can remove the old field. What is also very important, we need to ensure that at each of these steps the database is compatible with the previous version of the system.
Let's start with the changes on version v1. Here the steps are almost the same as in the case of adding a unique slug field. First, we need to add a new nullable created_at column on the product table. In the next step, we need to copy the data from the old column to the new one to update the existing instances, and as before we delegate this to an asynchronous task that will do it in the background. We also need to update the code, so that when any new instances are created on version v1, they have the proper value set for both the old and the new column.
After the migrations and the asynchronous tasks are finished, we'll be in the state shown below, and we'll be ready to apply the changes on version v2. On version v2 we need to remove the old field from the ORM and from the code, as we don't want to use it anymore. But we cannot remove it from the database, because it's still used by the previous version, v1. Instead, we need to ensure that the old created column is nullable or has a default value set. In our example, we'll make it nullable. So on the right, in the migration, we are separating the database and ORM changes to perform those actions. We are also ready to change the new field to non-nullable.
So in version v2 we'll be in a state where the old field is not used anywhere in the code, but it's still in the database. Finally, in the next version, in our example version v3, when we are sure that only the new field is in use, we can safely remove the old created column from the database. As you can see, there are quite a lot of steps that must be performed just to rename a field.
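One way to check the rename chain above is to model each schema version and each code version and ask which pairs can work together. The sketch below is a simplified model, not the talk's tooling: the intermediate release is called v1.1 as later in the talk, the column flags are read off the description above, and every name in the code is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    required: bool  # True: NOT NULL without a default, so every insert must supply it


# Columns of the product table after each version's migrations (rename example only).
SCHEMAS = {
    "v1": {"created": Column(required=True)},
    "v1.1": {"created": Column(required=True), "created_at": Column(required=False)},
    "v2": {"created": Column(required=False), "created_at": Column(required=True)},
    "v3": {"created_at": Column(required=True)},
}

# Columns each code version reads and writes.
CODE_USES = {
    "v1": {"created"},
    "v1.1": {"created", "created_at"},
    "v2": {"created_at"},
    "v3": {"created_at"},
}


def compatible(code: str, schema: str) -> bool:
    """True if code of one version can run against the schema of another."""
    columns = SCHEMAS[schema]
    used = CODE_USES[code]
    every_used_column_exists = used.issubset(columns)
    every_required_column_is_written = all(
        name in used for name, column in columns.items() if column.required
    )
    return every_used_column_exists and every_required_column_is_written
```

In this model compatible("v1", "v1.1"), compatible("v1.1", "v2") and compatible("v2", "v3") all hold, while compatible("v1", "v2") and compatible("v1", "v3") do not, which illustrates why the upgrade has to go one version at a time.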
Now let's take care of updating large datasets. First of all, the update should be done in the version before the target version, so we can be sure that all existing instances will have the proper value set and we can safely apply the changes in the target version. To minimize the migration time, the data should be updated asynchronously, after the migration process.
Here's an example of a migration that calls the task which copies the data from the old created column to the new created_at column. The task is delayed in the post-migrate signal, which means that it will be called after the migrations are completed, so it doesn't interfere with the migration process in any way.
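The idea of delaying work until after the migrations can be shown with a toy runner. This is not Django's API: in the talk the hook is Django's post-migrate signal and the tasks go to a task queue, while here MigrationRunner, delay_after_migrate and the task names are hypothetical stand-ins that only show the ordering.

```python
class MigrationRunner:
    """Toy stand-in for a migration framework that has a post-migrate hook."""

    def __init__(self, enqueue):
        self._enqueue = enqueue  # hands a task name to the task queue
        self._delayed = []       # tasks requested while migrations run
        self.log = []

    def delay_after_migrate(self, task_name):
        # Called from inside a migration: remember the task, do not run it yet.
        self._delayed.append(task_name)

    def run(self, migrations):
        for migration in migrations:
            migration(self)
            self.log.append(f"applied {migration.__name__}")
        # Post-migrate: all schema changes are done, now hand over the data work.
        for task_name in self._delayed:
            self._enqueue(task_name)
            self.log.append(f"enqueued {task_name}")
        self._delayed.clear()


def add_created_at_column(runner):
    # The schema change itself would go here.
    runner.delay_after_migrate("copy_created_to_created_at")


def add_slug_column(runner):
    # The schema change itself would go here.
    runner.delay_after_migrate("fill_product_slugs")
```

Running both migrations through the runner applies every schema change first and only then enqueues the two data tasks, so the slow data copy never holds up the migration run.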
So the data will be copied in the background. Now take a look at the task code. I have some tips for you that we worked out. First, the update should be done in batches. Second, only instances that haven't been updated yet should be taken for the update, and the instances should be ordered by some unique field, like the primary key. In our example, we take products that have an empty created_at column.
After the batch update, the task should queue itself again if there is still data to be processed, so as not to block the asynchronous task worker for too long. What's also very important, the update of instances should be done in a transaction with locked rows, to avoid a potential deadlock that might happen when multiple asynchronous task workers are in use.
Right, now we know the problematic operations and we know how to write migrations so as not to crash the system. The last missing piece is how to proceed with the update. First, we need to release the changes applied on version v1 as the next minor or patch release.
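The batching tips for the background task (small batches, only rows not yet updated, a stable order, re-queueing while work remains, one transaction per batch) can be sketched as follows. Assumptions: sqlite3 stands in for PostgreSQL, the enqueue callback stands in for the task queue, and the names are hypothetical. SQLite has no row-level locks, so the locked-rows tip is not shown; on PostgreSQL that would typically be SELECT ... FOR UPDATE.

```python
import sqlite3

BATCH_SIZE = 2  # tiny on purpose; a real batch would be much larger


def copy_created_batch(conn: sqlite3.Connection) -> bool:
    """Copy created into created_at for one batch. Returns True if rows remain."""
    with conn:  # one transaction per batch: commit on success, roll back on error
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM product WHERE created_at IS NULL "
                "ORDER BY id LIMIT ?",
                (BATCH_SIZE,),
            )
        ]
        if not ids:
            return False
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE product SET created_at = created WHERE id IN ({placeholders})",
            ids,
        )
    remaining = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM product WHERE created_at IS NULL)"
    ).fetchone()[0]
    return bool(remaining)


def run_task(conn, enqueue):
    # The task handles one batch and queues itself again if work is left.
    if copy_created_batch(conn):
        enqueue(run_task)
```

Because each run touches only one batch and then hands control back to the queue, a worker is never tied up for the whole dataset.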
I will use the minor release v1.1 for simplicity. So in our example, in version v1.1 we'll have two new nullable fields, slug and created_at. Next, we need to upgrade to this minor version and then to the target version: first we switch from v1 to v1.1, and then from v1.1 to v2. Upgrading through this v1.1 version is crucial. In both cases the process looks the same, so I will describe it in general, using the example of switching from v1 to v1.1.
We start from a configuration where we have one app instance that is using the database, and we have two asynchronous task workers, for example Celery workers. The first step is stopping the asynchronous task workers, to make sure that they don't process any tasks during the database upgrade. Then we need to run the database migration to update the database to version v1.1. After finishing the migration, we can start the app workers of version v1.1.
Notice that at this point we have two app instances of different versions using the same database, but we have ensured database compatibility. When a new product is created on version v1, it will have the null value set for both the slug and created_at columns. When a new product is created on version v1.1, it will have the proper values set for both columns, as we updated the API to do that.
In the next step, we can start the Celery workers to run the tasks delayed in the migrations. Finally, we can stop the app workers of the previous version. So that is the zero-downtime upgrade.
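The five upgrade steps just listed can be written down as data and replayed to check their ordering. This is a toy simulation, not deployment tooling: the step names and both functions are hypothetical, and the only things it verifies are that task workers are stopped while the database is migrated and that at least one app version is serving after every step.

```python
def upgrade_steps(old: str, new: str):
    """The order of operations for one upgrade, as (action, version) pairs."""
    return [
        ("stop_task_workers", old),
        ("migrate_database", new),
        ("start_app_workers", new),
        ("start_task_workers", new),
        ("stop_app_workers", old),
    ]


def simulate(old: str, new: str):
    """Replay the steps and record which app versions serve after each one."""
    app_versions = {old}
    task_workers_running = True
    history = []
    for action, version in upgrade_steps(old, new):
        if action == "stop_task_workers":
            task_workers_running = False
        elif action == "migrate_database":
            assert not task_workers_running, "no tasks may run during the migration"
        elif action == "start_app_workers":
            app_versions.add(version)
        elif action == "start_task_workers":
            task_workers_running = True
        elif action == "stop_app_workers":
            app_versions.discard(version)
        history.append((action, sorted(app_versions)))
    return history
```

The same sequence is then repeated for the next hop, for example simulate("v1.1", "v2").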
Now, let me digress a little about what zero in zero downtime really means. During the whole upgrade process, there is no moment when all app instances are stopped. Instead, the database is upgraded in place while the app instances are still in use. This may result in minimal downtime, but it's so brief that it is essentially zero, and it's not visible to the user.
Moving back: after you upgrade to v1.1, any new instances will have the proper values set, and all old instances will have null values at the start and will be updated by the asynchronous tasks in the background. When the tasks are finished, the product table will be filled in with proper values and we'll be ready to upgrade to our target version, v2. To do that, we need to perform all the steps that we did for switching from version v1 to v1.1.
So if you want the upgrade to go smoothly, without any maintenance breaks, remember that the most important thing is to not remove any fields that the previous ORM will use. And that's it.
If you're interested, here's the link to the example project, which contains each of the steps that I explained today and some additional ones, for example adding a database index.
Two more slides, it's okay. If you have questions, we have two microphones, so you can just grab a microphone and shoot from there. Any questions?
Hi, thank you so much, this was really insightful. My question is about something that perhaps wasn't clear to me. One of the potential issues that I see with this is that, while the two versions are running, you might be writing data with the old version that potentially doesn't get migrated when you run the script, for instance when you rename, moving the data from created to created_at. If the old instance is still writing new data, you might get into a situation where you're writing to the created column and that data doesn't get migrated to created_at. Did I get it wrong?
Maybe I will move back a little. Yes, here is this example. Is this the moment when you have these two instances?
Yes, I believe so. The issue that I see, and just correct me if I'm wrong, which might be the case, is that after this you stop app version 1. If you do that at the very last step, you might still be creating data in the old version that doesn't get created in the new version.
You mean you create it at this moment, just before the fifth step?
That's what I mean, the last step. Between steps four and five, you have the old instance that is still writing to the created column.
Yes, but it's not a problem, because the data is copied in the background from the old column to the new one. If any new instance is created from the old app instance, from version v1, the null value will be set for created_at, created will have the proper value, and it will be copied.
So if I understand right, the migration is asynchronous in the sense that it doesn't stop. It's not something that you wait for to end before stopping the old instance, so it keeps running until there is nothing left in the created column. Is that what you mean?
The created column is still there after upgrading to version v1.1.
Yeah, that's exactly the source of my confusion. I don't want to hold things up, so shall we just talk about this later?
Okay, we can talk later.
Thank you so much for this talk, thank you.
Thank you from me as well for the nice talk. I'm also in the e-commerce sector and I recognized quite a bit here today, things that we do very similarly. One thing I noticed, though, is that doing this migration in two steps, as you explained, puts quite a bit of onus on the developers to notice that they're now doing something that requires it. That's something we've had some trouble with. We've tried writing tests for that and so on, but it gets quite complicated, especially when you have indexes and so on. Do you have any experience with how to help developers notice that they're doing something that needs that two-step process?
I can say that we are currently learning that, to be honest. If somebody puts something up for review, all of our team needs to take care of that and say: hello, you need to add zero-downtime support here. We also have some pages in our docs that say that we need to do that. But it's hard to keep up. You need to remember it, you need to keep the pull request for the next version, and you need to create the pull request for the previous version, so it's a lot of additional work for the developers.
I appreciate that I'm not the only one who's having that problem. Thank you.
Thank you very much for the presentation. I have one small question. Why do we stop the Celery workers before the migration? What is the fundamental difference between app v1 and Celery workers v1?
Because in version v1.1 we are defining the asynchronous tasks that will be called in the migrations, and they exist only in version v1.1. So we need to upgrade the Celery workers to version v1.1, which contains the tasks that will be delayed in the migrations.
Yes, but why can't we stop the v1 Celery workers after the database migration? Why can't we switch step one?
Because the Celery workers won't have the information that this task exists, and I'm not sure what will happen when the migration calls this task, because the tasks are called directly in the migration. So if the migration calls a Celery task that does not exist in Celery, it might crash.
Got it. Thank you very much.
Thank you. Yes, I was just wondering what you're using to actually manage this process. For example, when the database is done making all of its changes, you need to know to then stop app version 1, right? So how do you do that? What software or methodology are you using to track that?
Are you asking from the developer's perspective? Sorry, I don't think I understand the question correctly.
I mean, when you make the changes to the database, then when it's done, you need to tell v1 to stop, right? How does it know?
To be honest, that's the cloud developers' work.