5 Years of Rails Scaling to 80k RPS
Formal Metadata

Title: 5 Years of Rails Scaling to 80k RPS
Part Number: 44
Number of Parts: 86
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/31224 (DOI)
Transcript: English (auto-generated)
00:00
My name is Simon. I work on the infrastructure team at Shopify, and today I'm going to talk about the past five years of scaling Rails at Shopify. I've only been around at Shopify
00:27
for four years, so the first year was a little bit of digging, but I want to talk about all of the things that we've learned, and I hope that people in this audience can maybe place themselves on this timeline and learn from some of the lessons that we've had to learn over the past five years.
00:42
This talk is inspired by a guy called Jeff Dean from Google. He's a genius, and he gave this talk about how they scaled Google for the first couple of years: how they ran on a couple of MySQLs, how they sharded, how they went through the NoSQL paradigm and finally arrived at the paradigm that we're now
01:03
starting to see, but this was really interesting to me because you saw why they made the decisions that they made at that point in time. I've always been fascinated by what made, say, Facebook decide that now is the time to write a VM for PHP to make it faster,
01:20
so this talk is about that. It's about an overview of the decisions we've made at Shopify and less so about the very technical details of all of them. It's to give you an overview and mental model for how we evolved our platform, and there's tons of documentation out there on all of the things I'm going to talk about today,
01:42
other talks by coworkers, blog posts, read-mes, and things like that, so I'm not going to get into the weeds, but I'm going to provide an overview. I work at Shopify, and at this point, you're probably tired of hearing about Shopify, so I'm not going to talk too much about it. Just overall, Shopify is
02:01
something that allows merchants to sell things to other people, and that's relevant for the rest of this talk. We have hundreds of thousands of people who depend on Shopify for their livelihood, and through this platform, we run almost 100K RPS at peak. The largest sales hit our Rails servers with almost 100K requests
02:22
per second. Our steady state is around 20 to 40K requests per second, and we run this on tens of thousands of workers across two data centers. About $30 billion have made it through this platform, which means that downtime is costly, to say the least. These are the numbers you should keep in the back of your head as the numbers that we
02:44
have grown with to get to the point that we are at today. Roughly, these metrics double every year, so if you go back five years, you just have to cut them in half five times. I want to introduce a little bit of Shopify vocabulary because I'm going to use it loosely in this talk
03:02
to understand how Shopify works. Shopify has at least four sections. One of those sections is the storefront. This is where people browse collections, browse products, and add to the cart. This is the majority of traffic: somewhere between 80% and 90% of our traffic is people browsing storefronts. Then we have the checkout. This is where it gets a little bit more
03:22
complicated. We can't cache as heavily as we do on the storefront. This is where we have to do writes, decrement inventory, and capture payments. Admin is more complex. You have people who apply actions to hundreds of thousands of orders concurrently. You have people who change billing,
03:41
who need to be billed, and all of these things are much more complex than both checkout and storefront in terms of consistency. And the API allows you to change the majority of the things you can change in the admin. The only real difference is that computers hit the API, and computers can hit the API really fast. Recently I saw an app for people who wanted an offset
04:04
on their order numbers starting at a million, so this app will create a million orders and then delete all of them to get that offset. People do crazy things with this API, and it is our second largest source of traffic after the storefront. I want to talk a little bit about a
04:26
philosophy that has shaped this platform over the past five years. Flash sales are really what has built and shaped the platform that we have. When Kanye wants to drop his new album on Shopify, it is the team that I am on that is terrified. We had a sort of fork in the road five years ago.
04:47
Five years ago was when we started seeing customers who could drive more traffic for one of their sales than the entire platform was otherwise serving. They would drive a multiple of it, so if we were serving 1,000 requests
05:00
per second for all the stores on Shopify, some of these stores could get us to 5,000. And this happens in a matter of seconds. Their sale might start at 2 p.m., and that's when everyone comes in. So there's a fork in the road: do we become a company that supports these sales, or do we kick them off the platform, throttle them heavily, and say, this is not the platform for you?
05:23
That's a reasonable path to take. Ninety-nine point nine something percent of the stores don't have this pattern. They can't drive that much traffic. But we decided to go the other route. We wanted to be a company that could support these sales, and we decided to form a team that would solve this problem
05:41
of customers that could drive enormous amounts of traffic in a very short amount of time. And this, I think, was a fantastic decision, and it happened exactly five years ago, which is why the time frame of this talk is five years. I think it was a powerful decision because it has served as a canary in the
06:00
coal mine. The flash sales that we see today, and the amount of traffic that they can drive, say 80K RPS, are what the steady state is going to look like next year. So when we prepare for these sales, we know what next year is going to look like, and we know that we're going to laugh next year because we're already working on that problem. So they help us stay ahead,
06:21
one to two years ahead. In the meat of this talk, I will walk through the past five years of the major infrastructure projects that we've done. These are not the only projects that we've done. There's been other apps and many other efforts, but these are the most important to the scaling of our Rails application. 2012 was the year where we sat down and decided that we
06:45
were going to go the anti-fragile route with flash sales. We were going to become the best place in the world to have flash sales. So a team was formed whose sole job was to make sure that Shopify as an application would stay up and be responsive under these circumstances. And the first thing you do when you start
07:02
optimizing an application is you try to identify the low-hanging fruit. In many cases, the low-hanging fruit is very application dependent. The lowest hanging fruit on the infrastructure side has already been harvested by your load balancers, by Rails, which is really good at this, and by your operating system. They take care of all the generic optimization and tuning.
07:23
So at some point, that work has to be handed off to you, and you have to understand your problem domain well enough to know where the biggest wins are. For us, the first ones were things like backgrounded checkouts. And this sounds crazy. What do you mean they weren't backgrounded before? Well, the app was started back in 2004 or 2005. And back then,
07:42
backgrounding jobs in Ruby or Rails was not really a common thing. And we hadn't really done it after that either because it was such a large source of technical debt. So in 2012, a team sat down and paid down that massive amount of technical debt to move the checkout process into background jobs. The payments were captured not in a long-running request, but asynchronously, in jobs.
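As a rough illustration only (this is not Shopify's code; the job, model, and method names are hypothetical, and in 2012 this would have been a Delayed Job or Resque worker rather than Active Job), backgrounding the payment capture looks something like this:

```ruby
# Hypothetical sketch: capture the payment outside the request cycle.
class PaymentCaptureJob < ActiveJob::Base
  queue_as :checkout

  def perform(checkout_id)
    checkout = Checkout.find(checkout_id)
    checkout.capture_payment! # slow call to the payment gateway
    checkout.complete!
  end
end

# The controller only enqueues the job and returns immediately, so a slow
# payment gateway no longer ties up a web worker for the whole capture.
class CheckoutsController < ApplicationController
  def create
    checkout = Checkout.create!(checkout_params)
    PaymentCaptureJob.perform_later(checkout.id)
    redirect_to processing_checkout_path(checkout)
  end
end
```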
08:04
This, of course, was a massive source of speedup: now you're not occupying all these workers with long-running requests. Another one of these domain-specific problems was inventory. You might think that inventory is just decrementing one number, and doing that
08:24
really fast even when you have thousands of people buying at once. But MySQL is not good at that. If you try to decrement the same number from thousands of queries at the same time, you run into lock contention. So we had to solve this problem as well. And these are just two of many problems we solved.
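To see why, here is the naive version as a sketch (the table and ID are made up, and this is not how Shopify ended up solving it): every concurrent checkout issues the same UPDATE against the same row, and MySQL serializes them on that row's lock.

```ruby
# Naive inventory decrement: during a flash sale, thousands of buyers all
# target the same row, so every UPDATE queues up behind the same row lock.
Product.connection.execute(<<~SQL)
  UPDATE products
     SET inventory = inventory - 1
   WHERE id = 42
     AND inventory > 0
SQL
```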
08:41
In general, what we did was print out the debug logs of every single query on the storefront, the checkout, and all of the other hot paths, and basically start checking them off. I couldn't find the original picture, but I found one from a talk that someone from the company gave three years ago, where you can see the wall where the debug logs were taped up. And the team at the time would go and cross off and write their name on the
09:00
queries and figure out how to reduce them as much as possible. But you need a feedback loop for this. We couldn't wait until the next sale to see if the optimizations we'd done actually made a difference. We needed a better way than just crashing at every single flash sale; we needed a tighter feedback loop. Just like when you run the tests locally, you know whether it worked pretty much right away. We wanted to do the same
09:21
for performance. So we wrote a load testing tool. What this load testing tool does is simulate a user performing a checkout. It will go to the storefront, browse around a little bit, find some products, add them to its cart, and perform the full checkout. It will fuzz test this entire checkout procedure, and then we spin up thousands of these in parallel to test whether a performance change actually made any difference.
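The load testing tool itself is internal, so this is only an illustrative sketch of the idea with made-up paths and parameters: script a realistic storefront-to-checkout flow and run many of them concurrently.

```ruby
require "net/http"
require "uri"

# Illustrative only: simulate one shopper browsing and checking out.
# A real flow would carry cookies, parse product pages, and fuzz the
# quantities, addresses, and payment details.
def checkout_flow(base)
  http = Net::HTTP.new(base.host, base.port)
  http.use_ssl = (base.scheme == "https")

  http.get("/collections/all")                   # browse the storefront
  http.get("/products/example-product")          # look at a product
  http.post("/cart/add", "id=12345&quantity=2")  # add it to the cart
  http.post("/checkout", "")                     # start the checkout
end

# Spin up many flows in parallel and compare error rates and latency
# before and after a performance change.
base = URI("https://shop.example.com")
threads = 100.times.map { Thread.new { checkout_flow(base) } }
threads.each(&:join)
```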
09:42
This is now so deeply embedded in our infrastructure culture that whenever someone makes a performance change, people ask: well, how did the load testing go? This was really important for us. Something like Siege, just hitting the storefront with a bunch of the same requests,
10:01
it's just not a realistic benchmark. Another thing we did at this time was write a library called IdentityCache. The problem was that we had one MySQL at this point hosting tens of thousands of stores. And when you have that, you're pretty protective of that single database.
10:22
And we were doing a lot of queries to it, and especially these sales were driving such a massive amount of traffic at once to these databases. So we needed a way of reducing the load on the database. The normal way of doing this, or the most common way of doing this is to start sending queries to the read slaves. So you have databases that feed off of the one that you write to,
10:42
and you start reading from those. We tried to do that at the time. Using read slaves has a lot of nice properties over other methods, but when we did this back in the day, there weren't any really good libraries in the Rails world. We tried to fork some and figure something out, but we ran into data corruption issues. We ran into plain mismanagement of the
11:04
read slaves, which was really problematic at the time because we didn't have any DBAs. And overall, mind you, this was a team of Rails developers who had to turn into infrastructure developers, understand all of this stuff, and learn it on the job because we decided to handle flash sales the way that we did.
11:23
So we just didn't know enough about MySQL and these things to go down that path, and we decided to figure out something else. Deep inside of Shopify, Toby had written a commit many, many years ago introducing this idea of an identity cache: managing your cache out of band in Memcached.
11:43
The idea being that if I query for a product, I look in Memcached first to see if it's there. If it's there, I'll just grab it and not even touch the database. If it's not there, I'll read it from the database and put it in the cache so that it's there for the next request. And every time we do a write, we just expire those entries. This has a lot of drawbacks, because that cache
12:03
is never going to be 100% consistent with what is in the database. So when we do a read from that managed cache, we never write it back to the database; it's too dangerous. That's also why the API is opt-in: you have to call fetch instead of find to use IdentityCache, because we only want to use it on these hot paths, and it returns read-only records so you cannot change them and corrupt your database.
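A minimal sketch of that opt-in API, using the open-source IdentityCache gem (the model and fields here are just examples, not Shopify's schema):

```ruby
class Product < ApplicationRecord
  include IdentityCache

  cache_index :handle, unique: true     # enables fetch_by_handle
  cache_has_many :variants, embed: true
end

# Opt-in reads go through Memcached and fall back to the database on a miss:
product = Product.fetch(product_id)           # cached; returns a read-only record
product = Product.fetch_by_handle("t-shirt")  # cached secondary-index lookup

# The regular Active Record path is untouched and always hits the database:
Product.find(product_id)
```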
12:24
This is the massive downside of using read slaves or IdentityCache or anything like this: you have to deal with what happens when the cache is expired or stale. So this is what we decided to do at the time. I don't know if it's what
12:42
we would have done today. Maybe we've gotten much better at handling read slaves and they have a lot of other advantages such as being able to do much more complicated queries, but this is what we did at the time. And if you're having severe scaling issues already, identity cache is a very simple thing to do and use. So after 2012 and what would have been probably
13:03
our worst Black Friday and Cyber Monday ever, because the team was working night and day to make this happen, there's this famous picture of our CTO face-planted on the ground after the exhausting work of scaling Shopify at the time. And someone then woke him up and told him, hey, dude, checkout is down.
13:23
We were not in a good place, but identity cache, load testing, and all this optimization saved us. And once the team had decompressed after this massive sprint to survive these sales and survive Black Friday and Cyber Monday that year, we raised the question: how can we never get into this situation again? We'd spent a lot of time optimizing
13:43
checkout and storefront, but this is not sustainable. If you keep optimizing something for long enough, it becomes inflexible. Often, fast code is hard-to-change code. If you've optimized storefront and checkout and had a team that only knew how to do that, there's going to be a developer who's going
14:01
to come in, add a feature, and add a query as a result. And this should be okay. People should be allowed to add queries without understanding everything about the infrastructure. Often the slower thing is more flexible. Think of a completely normalized schema: it is much easier to change and build upon, and that's the entire point of a relational database. But once you
14:21
make it fast, it is often a tradeoff of becoming more inflexible. Think of, say, an algorithm like bubble sort. N squared is the complexity of the algorithm. You can make that really fast. You can make the fastest bubble sort in the world. You can write a C extension for Ruby with inline assembly, and it's the best bubble sort in the world.
14:43
But my terrible implementation of a quicksort, which is N log N complexity, is still going to be faster. So at some point, you have to stop optimizing, zoom out, and re-architect. So that's what we did with sharding. At some point, we needed that flexibility back, and sharding seemed like a good
15:04
way to do that. We also had the problem that, fundamentally, Shopify is an application that does a lot of writes. During these sales, there are going to be a lot of writes to the database, and you can't cache writes. So we had to find a way to handle that, and sharding was it. So basically, we built this API.
15:20
A shop is fundamentally isolated from other shops. It should be. Shop A should not have to care about Shop B. So we did per-shop sharding, where one shop's data would all be on one shard, and another shop might be on another shard, and the third shop might be together with the first one. So this was the API. Basically, this is all the sharding API
15:40
internally exposes. Within that block, it will select the correct database where that shop's products live. Within that block, you can't reach the other shards; that's illegal. In a controller, this might look something like this. At this point, most developers don't have to care about it. It's all done by a filter that will find the shop, wrap the entire request in the connection for the shard that shop is on,
16:05
and any product query will then go to the correct shard. This is really simple, and it means that the majority of the time, developers don't have to care about sharding. They don't even have to know it exists. It just works like this, and jobs work the same way.
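The talk doesn't show the exact code, so this is a hypothetical sketch of what the described around filter could look like. None of these names are Shopify's real ones, and Shopify built its own connection switching at the time; the sketch uses modern Rails horizontal-sharding support to express the same idea.

```ruby
# Hypothetical sketch of per-shop shard selection in a controller.
class ProductsController < ApplicationController
  around_action :select_shard

  def index
    # Inside the around filter, this query goes to the current shop's shard.
    @products = Product.where(shop_id: current_shop.id)
  end

  private

  def select_shard(&block)
    # The shop -> shard mapping lives in a small global (master) database;
    # everything inside the block is wrapped in that shard's connection.
    ActiveRecord::Base.connected_to(role: :writing,
                                    shard: current_shop.shard_name.to_sym,
                                    &block)
  end
end
```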
16:20
But it has drawbacks. There are tons of things that you now can't do. I talked about how you might lose flexibility with optimization, but with architecture, you lose flexibility at a much grander scale. Fundamentally, shops should be isolated from each other, but in the few cases where you want them not to be, there's nothing you can
16:40
do. That's the drawback of architecture and changing the architecture. For example, you might want to do joins across shops. You might want to gather some data or run an ad hoc query about app installations across shops. This might not seem like something you would need to do, but the partners interface for all of our partners who build applications actually needs to do that. They need to get all the shops and the
17:02
installations from them, so it was just written as something that did a join across all the shops and listed them, and this had to be changed. The same thing went for our internal dashboard that would do things across shops, like finding all the shops with a certain app. You just couldn't do that anymore, so we had to find alternatives. If you can get around it, don't shard. Fundamentally, Shopify is an
17:24
application that does a lot of writes, but that might not be your application. Sharding is really hard, and it took us a year to do and figure out. We ended up doing it at the application level, but there are many different levels where you can shard. If your database is magical, you don't have to do any of
17:41
this. Some databases are really good at handling this stuff, and you can make some trade-offs at the database level, so you don't have to do this at an application level. But there are really nice things about being on a relational database. Transactions and schemas and the fact that most developers are just familiar with them are massive benefits, and they're reliable.
18:00
They've been around for 30 years, and so they're probably going to be around for another 30 years at least. We decided to do that at the application level because we didn't have the experience to write a proxy, and the databases that we looked at at the time were just not mature enough. And I actually looked at some of the databases that we were considering at the time, and most of them have gone out of business. So we were lucky that we didn't
18:23
buy into this proprietary technology and solved it at the level that we felt most comfortable with at the time. Today, we have a different team, and we might have solved this at a proxy level or somewhere else, but this was the right decision at the time. In 2014, we started investing in resiliency. And you might ask, what is resiliency doing in a talk about
18:44
performance and scaling? Well, as a function of scale, you're going to have more failures. And this led us to a threshold in 2014 where we had enough components that failures were happening quite rapidly, and they had a disproportionate impact on the platform. When one of our shards was experiencing
19:03
a problem, requests to other shards and shops that were on other shards were either much slower or failing altogether. It didn't make sense that when a single Redis server blew up, all of Shopify was down. This reminds me of a concept from chemistry, where the reaction rate is proportional to the amount of surface
19:25
area that you expose. If you have two glasses of water and you put a teaspoon of loose sugar in one and a sugar cube in the other, the loose sugar is going to dissolve in the water quicker because its surface area is larger. The same goes for technology. When you have more servers,
19:42
more components, there are more things that can react, potentially fail, and make it all fall apart. This means that if you have a ton of components, and they're all tightly knitted together in a web where one failing drags a bunch of others down with it, and you have never thought about this, then adding a component will probably decrease your availability. And this
20:03
happens exponentially. As you add more components, your overall availability goes down. If you have 10 components, each with four nines, and they're tightly webbed together in a way where each of them is a single point of failure, you end up with a lot more downtime. And we hadn't really, at this point, had the luxury of finding
20:22
out what our single points of failure even were. We thought it was going to be okay, but I bet you, if you haven't actually verified this, you have single points of failure all over your application where one failure will take everything down with it. Do you know what happens if your Memcached cluster goes down? We didn't, and we were quite surprised
20:42
to find out. This means that you're really only as good as your weakest single point of failure. And if you have multiple single points of failure, you multiply the availabilities of all of them together to get the final availability of your app.
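A quick worked example with illustrative numbers (not Shopify's): ten components that each have four nines of availability, wired so that every one of them is a single point of failure, compound like this.

```ruby
# Ten components, each 99.99% available, each a single point of failure.
per_component = 0.9999
overall       = per_component**10  # ~0.9990, i.e. roughly three nines

minutes_per_year = 365 * 24 * 60
puts (1 - per_component) * minutes_per_year  # ~53 minutes of downtime/year each
puts (1 - overall) * minutes_per_year        # ~525 minutes (~9 hours)/year overall
```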
21:05
Very quickly, what looks like hours of downtime per component turns into days or even weeks of downtime globally, amortized over an entire year. If you're not paying attention to this, adding a component will probably decrease your overall availability. The outages look something like this: your response time increases. This is a real graph of the
21:21
incidents at the time in 2014, where something became slow. As you can see here, the timeout is probably exactly 20 seconds. So something was being really slow and hitting a timeout of 20 seconds. If all of the workers in your application are spending 20 seconds waiting for something that's never going to return because it's going to time out,
21:42
then there's no time to serve any requests that might actually work. So if shard one is slow, requests for shard zero are going to lag behind in the queue because these requests to shard one will never ever complete. The mantra that you have to adopt when this starts becoming a problem for you is
22:01
that single component failure cannot compromise the availability or performance of your entire system. Your job is to build a reliable system from unreliable components. A really useful mental model for thinking about this is the resiliency matrix. On the left-hand side,
22:21
we have all the components in our infrastructure. At the top, we have the sections of the application, such as admin, checkout, and storefront, the ones I showed before. Every cell tells you what happens to that section if the component on the left is unavailable or slow. So if Redis goes down, is storefront up,
22:45
is checkout up, is admin up? This is not what it actually looked like in reality when we drew this out. It was probably a lot worse. And we were shocked to find out how red and blue, how down and degraded Shopify looked when what we thought were tangential data stores like Memcache
23:01
and Redis took down everything along with them. The other thing that shocked us when we filled this in was how hard it is to figure out. Figuring out the values of all these cells is really difficult. How do you do that? Do you go into production and just start taking things down? What do you do in development? So we wrote a tool that helps you do this. The tool is called Toxiproxy.
23:25
What it does is, for the duration of a block, emulate network failures at the network level by sitting in between you and that component on the left. This means that you can write a test for every single cell in that grid. So when you flip a cell from being red to being green,
23:41
from being bad to being good, you know that no one will ever reintroduce that failure. These tests might look something like this: when some message queue is down, I GET this section and assert that the response is a success. At this point in Shopify, we have very good coverage of our resiliency matrix with unit tests that are all backed by Toxiproxy.
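With the open-source toxiproxy Ruby client, such a test can look roughly like this; the proxy names and paths are examples, not Shopify's actual setup.

```ruby
require "test_helper"

class StorefrontResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront stays up when the message queue is down" do
    # Toxiproxy sits between the app and the real service and cuts the
    # connection for the duration of the block.
    Toxiproxy[:message_queue].down do
      get "/products/example-product"
      assert_response :success
    end
  end

  test "storefront stays up when Redis is slow" do
    # Inject a second of latency instead of a hard failure.
    Toxiproxy[:redis].downstream(:latency, latency: 1000).apply do
      get "/products/example-product"
      assert_response :success
    end
  end
end
```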
24:04
And this is really, really simple to do. Another tool we wrote is called Semian. Exactly how all of its components work and how they work together is fairly complicated, so I'm not going to go into it, but there's a readme that goes into vivid detail about how Semian works.
24:21
Semian is a library that helps your application become more resilient, and I encourage you to check the readme to find out how it does that. This tool was also invaluable for us in becoming a more resilient application.
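To give a flavor of it anyway, here is roughly the shape of Semian's MySQL adapter from its readme, combining a bulkhead (tickets) with a circuit breaker; the host and thresholds below are made-up values.

```ruby
require "semian"
require "semian/mysql2"

# Protect one MySQL shard: at most 8 workers may talk to it at once, and
# repeated errors open the circuit so calls fail fast instead of piling up.
client = Mysql2::Client.new(
  host: "shard0.mysql.internal",  # hypothetical host
  semian: {
    name: "mysql_shard_0",
    tickets: 8,            # bulkhead: concurrent access limit
    error_threshold: 3,    # 3 errors...
    error_timeout: 10,     # ...within 10 seconds open the circuit
    success_threshold: 2   # 2 successes close it again
  }
)
```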
24:43
The mental model we mapped out for how to work with resiliency was that of a pyramid. We had a lot of resiliency debt because, for 10 years, we hadn't paid any attention to this. The web I talked about before, of certain components dragging everything down with them, was imminent; it was happening everywhere. The resiliency matrix was completely red when we started.
25:01
And nowadays, it's in pretty good shape. So we started climbing the pyramid, figuring out and writing all these tools and incorporating them. And then, when we got to the very top, someone asked the question: what happens if the data center floods? That's when we started working on multi-DC in 2015. We needed a way such that if the data center caught fire,
25:24
we could fail over to the other data center. But resiliency and sharding and optimization were more important for us than going multi-DC. Multi-DC was largely an infrastructure effort of just going from one to n.
25:40
This required a massive amount of changes in our cookbooks. But finally, we had procured all the inventory and all the servers and stuff to spin up a second data center. And at this point, if you want to fail over Shopify to another data center, you just run the script and it's done. All of Shopify has moved to a different data center. And the strategy
26:01
that it uses is actually quite simple and one that most Rails apps can use pretty much as is if the traffic and things like that are set up correctly. Shopify is running in a data center right now in Virginia and one in Chicago. If you go to a Shopify-owned IP, you will go to the data center that is closest to you. If you're in Toronto, you're going to go to the data center
26:22
in Chicago. If you are in New Orleans, you might go to the data center in Virginia. When you hit that data center, the load balancers in that data center inside of our network will know which one of the two data centers is active. Is it Chicago or is it Ashburn? And it will route all the traffic there. So when we do a failover, we tell the load balancers in all the data centers,
26:43
what is the primary data center? So if the primary data center was Chicago and we're moving it to Ashburn, we tell the load balancers in both data centers to route all traffic to Ashburn, in Virginia. When the traffic gets there and we've just moved over, any write will fail. The databases at
27:01
that point are read-only. They are not writable in both locations at once because the risk of data corruption is too high. That means most things actually still work. If you're browsing around a Shopify storefront looking at products, which is the majority of traffic, you won't see anything. Even if you are in the admin, you might just be looking at your
27:20
products and not notice this at all. While that's happening, we're failing over all of the databases, which means checking that they're caught up in the new data center and then making them writable. So the shards recover very quickly, over a couple of minutes; it can be anywhere from 10 to 60 seconds per database, and then Shopify works again. We then move the jobs: when we moved all the traffic, we stopped the jobs in the source data center, so we move them all over to the new data center and everything just ticks along.
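Condensed into a schematic runbook (a paraphrase of the steps described in the talk, not the actual script; every helper here is made up):

```ruby
# Schematic paraphrase of the failover steps; all helpers are hypothetical.
def failover!(from:, to:)
  # 1. Tell the load balancers in every data center about the new primary.
  #    Reads keep working; writes fail while the databases are read-only.
  load_balancers.each { |lb| lb.set_primary_datacenter(to) }

  # 2. Stop the job workers in the old primary so nothing writes from there.
  job_workers(from).stop

  # 3. For each shard: wait for replication to catch up in the new data
  #    center, then make it writable there (roughly 10-60 seconds each).
  shards.each do |shard|
    shard.wait_until_caught_up(datacenter: to)
    shard.promote_writable(datacenter: to)
  end

  # 4. Start the job workers in the new primary and carry on.
  job_workers(to).start
end
```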
27:44
But then, how do we use both of these data centers? Otherwise we have one data center that is essentially doing nothing: very, very expensive hardware sitting there doing absolutely nothing. How can we get to a state where we're
28:02
running traffic out of multiple data centers at the same time, utilizing both of them? The architecture at first looks something like this. It was shared. We shared Redis instances, shared Memcache between all of the shops. When we say a shard, we're referring to a MySQL shard, but we hadn't sharded Redis, we hadn't sharded Memcache and other things. So all of this was shared. What if
28:25
instead of running one big Shopify like this that we move around, we run many small Shopifys that are independent from each other and have everything they need to run? We call this a pod. A pod has everything that a Shopify needs to run: the workers, the Redis, the
28:42
Memcached, the MySQL, whatever else there needs to be for a little Shopify to run. If you have many of these Shopifys and they're completely independent, they can be in multiple data centers at the same time. You can have some of them active in data center one and some of them active in data center two. Pod one might be active in data center two and pod two
29:03
might be active in data center one. So that's good, but how do you get traffic there? For Shopify, every single shop usually has a domain. It might be a free domain that we provide or their own domain. When a request
29:22
hits one of the data centers, the one that you're closest to, Chicago or Virginia, depending on where in the world you are, it goes to this little script that's very aptly named Sorting Hat. What Sorting Hat does is look at the request and figure out what shop, what pod,
29:42
what mini Shopify this request belongs to. If that request is for a shop that belongs to pod two, it will route it to data center one on the left; if it's for another one, it will go to the right. So Sorting Hat is just sitting there, sorting the requests and sending them to the right data center. It doesn't care which data center you land in; it will just route you to the other data center if it needs to.
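The talk only calls Sorting Hat "a little script", so this is just a schematic sketch of the routing decision it makes, with made-up domains and pod assignments.

```ruby
# Schematic sketch of the Sorting Hat routing decision (names made up).
POD_FOR_DOMAIN = {
  "shop-a.example.com" => 1,
  "shop-b.example.com" => 2,
}.freeze

ACTIVE_DATACENTER_FOR_POD = {
  1 => :chicago,  # pod 1 is currently active in Chicago
  2 => :ashburn,  # pod 2 is currently active in Ashburn
}.freeze

def route(request_host)
  pod = POD_FOR_DOMAIN.fetch(request_host)  # domain -> pod (mini Shopify)
  ACTIVE_DATACENTER_FOR_POD.fetch(pod)      # pod -> the data center it is active in
end

route("shop-b.example.com") # => :ashburn, no matter which data center you landed in
```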
30:02
Okay, so we have an idea now of what this multi-DC strategy can look like, but how do we know if it's safe? It turns out that there just need to be two rules that are honored. Rule number one is that any request must be annotated with the shop or the pod that it's going to. All of these
30:22
requests for the storefront are on the shop's domain, so they're indirectly annotated, through the domain, with the shop they're going to. With the domain, we know which pod, which mini Shopify, this request belongs to. The second rule is that any request can only touch one pod. Otherwise,
30:40
it would have to go across data centers. And potentially, this means that one request might have to reach Asia, Europe, maybe also North America, all in the same request. And that's just not reliable. Again, fundamentally, shops and requests to shops should be independent, so we should be able to honor these two rules. So you might think, well, it sounds reasonable. Like, Shopify should just be an application with a
31:02
bunch of controller actions that just go to a single shop. But there were hundreds, if not a thousand, endpoints that violated this. They might look something like this: maybe they go over every shard and count something, or maybe they uninstall a PayPal account and check whether any other stores use it, something like that across
31:22
multiple stores. When you have hundreds of endpoints that are violating something you're trying to do, and you have a hundred developers who are doing all kinds of other things and introducing new endpoints every single day, that's going to be a losing battle if you just send an email because tomorrow, someone joins who's never read that email who's going to violate this.
31:41
Rafael talked a little bit about this yesterday. He called it whitelisting. We called it shitlist-driven development. The idea is that your job, if you want to honor rules one and two, is to build something that gives you a shitlist: a list of all the existing things that violate the rule. Anything that violates the rule and is not on the shitlist raises an error telling people what to do instead.
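A schematic sketch of the idea with made-up names: every existing violator is grandfathered onto the list, and anything new that breaks the rule raises with an actionable message.

```ruby
# Schematic sketch of shitlist-driven development (all names made up).
CROSS_POD_SHITLIST = %w[
  PartnersController#app_installations
  InternalDashboardController#shops_with_app
].freeze

def assert_single_pod!(controller_action, pods_touched)
  return if pods_touched.size <= 1
  return if CROSS_POD_SHITLIST.include?(controller_action)

  raise "#{controller_action} touched #{pods_touched.size} pods. " \
        "Requests must stay within a single pod: scope the query to the " \
        "current shop, or talk to the infrastructure team about an alternative."
end
```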
32:05
This needs to be actionable. You can't just tell people not to do something unless you provide an alternative, even if the alternative is that they come to you and you help them solve the problem. But this means that you stop the bleeding and you can then, going forward, rely on rule one and two, in this case, being honored. When we had this for Shopify,
32:26
with rules one and two honored, our multi-DC strategy worked. And today, with all of this building on top of five years of work, we're running 80,000 requests per second out of multiple data centers.
32:40
And this is how we got there. Thank you. [Audience question: Do you have any global data that doesn't fit into a shard?] Yes. We have a dreaded master database, and that database holds data that doesn't belong to a single shop.
33:01
In there is, for example, the shop model, right? We need something that stores the shop globally because otherwise the load balancers can't know globally where the shop is. Other examples are apps. Apps are sort of inherently global and then they're installed by many shops. It can be billing data because it might span multiple shops, partner data. There's actually a lot of this data. So I didn't go into this at all,
33:24
but I actually spent six months of my life solving this problem. So we have a master database and it spans multiple data centers. And the way that we solve this is essentially we have read slaves in every single data center that feed off of the master database that is in one of the data centers. If you do a write, you do cross-DC writes. This sounds super scary,
33:45
but we eliminated pretty much every path that has a high SLO from writing on this. So billing has a lower SLO in Shopify because the writes have to be cross-DC. But the thing is that billing and partners and the other sections of this master database,
34:01
they're in different sections. They're fundamentally different applications. And as we speak, they're actually being extracted out of Shopify because Shopify should be a completely sharded application. And if they're extracted out of Shopify, then you're also doing a cross-DC write because you don't know where that thing is. So it's not really making the SLOs worse. And it's okay that some of these things
34:21
have lower SLOs than the checkout, storefront, and admin, which have the highest SLOs. So that's how we deal with that. We don't really deal with it. [Audience question: How do you deal with a disproportionate amount of traffic to a single pod or a single shop?] I showed a diagram earlier that shows the workers as isolated per pod.
34:44
This is actually a lie. The workers are shared, which means that a single pod can grab up to 60 to 70% of all of the capacity of Shopify. So what's actually isolated in the pod are all the data stores. And the workers can sort of move between pods, like they're fungible.
35:05
They will move between pods on the fly. The load balancer just sends requests to it, and it will appropriately connect to the correct pod. So this means that the maximum capacity of a single store is somewhere between 60 and 70% of an entire data center. And it's not 100% because that would cause an outage because of a single store,
35:24
which we're not interested in. But that's how we sort of move this around. Does that answer it? [Audience question: How do you deal with large amounts of data, like someone importing 100,000 customers or 100,000 orders?] Well, this is where the multi-tenancy architecture sort of shines.
35:45
These databases are massive. Half a terabyte of memory, many, many tens of cores. And so if one customer has tons of orders, then that just fits. And if the customer is so large that it needs to be moved, that's sort of what this active fragmentation project is around,
36:01
moving these stores to somewhere where there might be more space for them. So basically, we deal with it by having massive, massive data stores that can handle this without a problem. The import itself is just done in a job. Some of these jobs are quite slow for the big customers, and we need to do some more parallelization work. But most of the time, it's not a big deal. If you have millions of orders,
36:21
and it takes a week to import them, you have plenty of other work to do during that time anyway. So this is not something that's been high on the list. How much time do I have? Done? Okay. Thank you.