
Scaling Rails for Black Friday and Cyber Monday


Formal Metadata

Title
Scaling Rails for Black Friday and Cyber Monday
Part Number
15
Number of Parts
94
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Shopify is an e-commerce platform that powers over 150,000 online shops such as Tesla and GitHub. During the weekend of Black Friday and Cyber Monday, the platform gets hit with over a million requests per minute resulting in over 6,000 orders per minute (~50 million dollars.) Learn how we scale our Rails app and the measures we take (load testing, monitoring, caching, etc.) to ensure that there are no hiccups during this important weekend.
Transcript: English (auto-generated)
Scaling Rails at Shopify. This is our quick story of how we survived Black Friday and Cyber Monday
last year and in the past few years. My name is Christian, cjoudrey on Twitter and GitHub. Don't follow me, there's no point. I'm from Montreal, so three months a year this is what Montreal looks like: cars are just buried in snow, people push buses, and there are maple syrup heists. And there's poutine, which is probably the best reason to come to Montreal.
So I work at Shopify. Shopify is a company that is trying to make commerce better for everyone. Our platform allows merchants to sell stuff on any channel, on multiple channels. Primarily on what we call the web channel, which is websites. So we give our merchants the ability to customize the HTML and CSS of their websites.
They also have access to Liquid so that they can really fully customize the look and feel of their site. We also have a point of sale for physical stores, and we have a mobile app for people on the go that want to accept payments. Our stack is a pretty traditional Rails app. If you've seen John Duff's talk earlier, I'm probably going to repeat a lot of things, but we use nginx, Unicorn, Rails 4, and Ruby 2.1. So we're on the latest versions of everything, except for Ruby, I guess. We use MySQL.
We have around 100 app servers running in production, which accounts for roughly 2,000 unicorn workers. We have 20 job servers with around 1,500 Resque workers. What kind of scale are we talking about? As I talk about scaling, I need to throw big numbers at you, otherwise you just won't be impressed, and this whole talk will be kind of useless. So we have
150,000 merchants on Shopify, as of last night's check, and these merchants account for around 400,000 requests per minute on average, but we've seen peaks up to a million requests per minute during what we call flash sales.
These requests add up: we processed around four billion dollars in GMV last year. So if you do the math, that's around seven thousand dollars per minute. So any minute we're down, we're basically burning money, and
someone somewhere is losing money. So because we're in the commerce industry, we have to deal with these really fun days called Black Friday and Cyber Monday. We... Yeah, Black Friday's just crazy. But Cyber Monday, we actually call it the Cyber Fun Day, because usually when Black Friday goes well,
we can just kick back and relax for Cyber Monday, because it won't be any worse than Black Friday. So this kind of stuff happens in malls: people go crazy, they fight with each other to get that TV and stuff. But it turns out Black Friday's pretty crazy on the internet, too.
What do you expect? So we see around, so last year we saw 600,000 requests per minute, so that's about two times our average traffic on a normal day. We also processed three times more money during those four days than on average. So it's a pretty big time of the year for us, and we just can't afford to be down.
Everything has to go perfectly. So in order to understand a bit better the decisions we make to scale Shopify, you have to understand that we use Unicorn, so each request ties up a Unicorn worker. So in order to scale Shopify, we need to either reduce the response time or increase the amount of workers we have.
So I'm just going to go through the various techniques that we've used to reduce the response time. And hopefully you'll be able to take some of this and apply it to your own apps. So our first line of defense is what we call page caching.
The idea here is you make this observation: if, let's say, 10,000 people hit the same page at the same time, chances are what we're going to respond with is going to be the same thing. So it's kind of crappy that we're doing all this computation 10,000 times, right? For 10,000 requests to the same page. It would be cool if we could just do the computation once and serve the same data to the rest of the people. The problem here is that, as you can see on this particular page, there's the number of items in your cart. On some pages, people are also logged in, so the page won't be exactly the same for everyone.
So we wrote this gem called cacheable, and what it is is a generational caching system. What this means is that we don't have to manually bust the cache, because busting caches is the hardest thing to do in computer science, from what I read. And the other hard thing is what, naming things? That's tough. And off-by-one errors. So the idea here is that we don't have to manually bust the cache. The way this works is that the cache key, the key in memcache, is based on the data that you're actually caching in memcache.
So I'm just going to go through what a typical example of cacheable looks like. This is a posts controller with a very simple index action. We're scoping the posts per shop, because we're a multi-tenant app, and we're also doing pagination. You'll notice that we wrapped the action with this thing called response_cache. response_cache does all the nice magic. And you'll notice there's also a method called cache_key_data. Whatever this method returns, we basically call to_s on it and MD5 it, and that becomes the key in memcache. The value will be whatever is yielded there, so the response. Here's an example of how we generate the cache key for this request. You'll notice the shop ID is one, let's just pretend. The path is posts, the format, and whatever params we decided to put in the cache key, like the page param, because you don't want the cache for page one to be the same thing as the cache for page two, right? And you'll notice there's this thing called shop version, and this is what makes it generational. Every time a post is updated, created, or deleted, we increment this counter. So because this shop version is in the cache key for everything that's cached, all of the old cache just goes away, and we start populating new cache keys. Does that make sense?
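As a rough sketch of what that controller might look like (the response_cache and cache_key_data names come from the talk; current_shop and the pagination helper are made-up stand-ins, and the exact gem API may differ):

```ruby
class PostsController < ApplicationController
  def index
    # response_cache does the magic: on a hit it serves the stored (gzipped)
    # response straight from memcache; on a miss it runs the block and caches it.
    response_cache do
      @posts = current_shop.posts.page(params[:page])
      render :index
    end
  end

  private

  # Whatever this returns gets to_s'd and MD5'd to build the memcache key.
  # shop_version is the generational part: bump it and every old key is abandoned.
  def cache_key_data
    {
      action: "#{controller_name}##{action_name}",
      format: request.format.to_s,
      params: params.slice(:page),
      shop_version: current_shop.version
    }
  end
end
```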
The other thing that this library gives us is gzip support. So when we cache the HTML into memcache, we gzip it right away.
So when the request comes in, if we find a key in memcache for that cache key, we take whatever is in memcache and serve it directly to the browser. The nice benefit here is that we're also saving on bandwidth, because we're sending gzipped data to the browser directly. On the topic of saving bandwidth, we also do ETags and 304 Not Modified. So if the browser decides to cache the data, we don't have to send anything to it. We just tell it 304 Not Modified, and it serves it up from the browser cache directly. So let me show you some numbers. This is what our graph looks like for cache hits versus misses.
So the blue line is cache hits, and the misses are the purple line. So we get about a 60% hit rate on this page caching. That's huge. That's like 60% of 400,000 requests per minute. Which is absolutely crazy. So these requests don't hit the database. They don't do any parsing of liquid templates, any compiling of liquid.
We really just take the data from memcache and serve it directly to the browser. The problem with page caching is that when we have a sale, so let's say some shop does this massive sale where lots of people are buying stuff at the same time, you'll notice on the graph that the cache hit rate goes down. And this is because we're continually updating inventory on the products being bought, which bumps the shop version. So basically that shop gets no page cache hits during a flash sale.
But we still get 40% cache hits in that case, so it's still pretty good. Our second line of defense is query caching. So we do around 60,000 queries per second, which is absolutely crazy. And so we want to reduce the stress on the database.
We have this thing called Identity Cache, which is an open source gem. What it does is marshal Active Record objects and cache them directly into memcache, so that we don't have to hit MySQL when we need those records.
The cache is opt-in by design. The idea is that when you want to use the cache, there's a method called fetch, and when you use fetch instead of find, you're actually loading from Identity Cache. The idea here is that in mission-critical areas, like, say, a checkout process, you don't want to rely on the cache, because the cache can be wrong. You want to really hit the database directly. So we decided to make this opt-in by design. The caveat to Identity Cache is that, unlike generational caching, we have to manually bust the cache. So we have an after-commit hook so that whenever a record is modified, or an association of a record, we go and manually delete the keys from memcache. The problem with this is that there could be race conditions where you manage to save to the database, but you don't manage to clear the memcache keys. But it's something that doesn't happen very often, and we're okay with that trade-off.
So what does Identity Cache look like? This is a very simple example. It's a product model that includes IdentityCache. A product has many images, and you'll notice that we're caching the has-many relationship, and we put embed: true there; I'll explain what that means. Basically, you see that instead of doing Product.find with the ID, we're doing fetch. This will actually load the data from the database when it's not cached, and when it does that, it saves it into Identity Cache after the fact. You'll also notice that we're calling fetch_images. You can kind of see what's going on here, right? You replace find with fetch, and that's how you use Identity Cache. The cool thing with embedding is that these two calls result in one memcache call, because the images are embedded within the same record. Does that make sense? It's pretty cool, right? We're trading two MySQL queries for one memcache query, which is really good in the grand scheme of things. Identity Cache also allows us to define secondary indexes, because you don't always want to find a product by ID, right? In our case, we use handles, so your product URL is like /products/the-handle. Secondary indexes let you load a product, in our case, by shop ID and by handle.
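A minimal sketch of the model described above, using the public IdentityCache API (variable names are illustrative):

```ruby
class Product < ActiveRecord::Base
  include IdentityCache

  has_many :images

  # embed: true stores the images inside the product's memcache entry, so
  # fetching a product and its images is a single memcache round trip.
  cache_has_many :images, embed: true

  # Secondary index: look a product up by shop and handle without hitting MySQL.
  cache_index :shop_id, :handle
end

# Opt-in reads (plain find/where still goes straight to MySQL):
product = Product.fetch(product_id)                            # by primary key
images  = product.fetch_images                                 # from the embedded blob
other   = Product.fetch_by_shop_id_and_handle(shop_id, handle) # via the secondary index
```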
Let's look at some graphs again. So this is cache hits and misses for Identity Cache. You can barely see the misses. It's pretty crazy.
So basically, the blue line, every time there's a cache hit, we're saving a call on MySQL, which is pretty crazy. During a flash sale, there's no dip, because during a flash sale, like I mentioned, all we're really doing is we're updating inventory count, so we're doing a single update on a single product. So it's such a small thing in the grand scheme of things that there's no dip at all.
So those are two strategies. The third one is backgrounding things. Because we're doing commerce stuff, we have to deal with payment gateways. I'm not sure if you've dealt with payment gateways before, but this is the 95th percentile of payment gateway response time: 20 seconds. So if our unicorn workers had to wait that long, during a sale we would just be down. There wouldn't be anything we could do. So we background these kinds of things. We background a lot of things.
We background webhooks, email sending, payment processing jobs, fraud analysis, basically anything that doesn't have to be done in that request, so that we can free up the unicorn workers right away and continue processing other requests. A nice benefit of doing this is that, depending on how you set up your queues, you can do throttling with background jobs. You can say, only allocate a maximum number of workers to a specific queue, and you know that only that many jobs will run at the same time.
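A sketch of what such a backgrounded job might look like with Resque (the class and helper names here are made up):

```ruby
class PaymentProcessingJob
  @queue = :payments

  def self.perform(checkout_id)
    checkout = Checkout.find(checkout_id)
    result = PaymentGateway.charge(checkout)   # the slow external call, now off the request path
    checkout.update!(payment_successful: result.success?)
  end
end

# In the controller: enqueue and return right away, freeing up the unicorn worker.
Resque.enqueue(PaymentProcessingJob, checkout.id)

# Throttling falls out of the queue setup: point only, say, 10 workers at this
# queue (e.g. `QUEUE=payments COUNT=10 rake resque:workers`) and at most 10
# payment jobs will ever run at the same time.
```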
So now what? We have everything in place to handle 600,000 requests per minute, right? The thing is, regressions happen, right? And the best way to know if a regression happened is by measuring things. So we have this thing at Shopify where we just measure all the things. We have thousands and thousands of graphs and metrics, and the way we do this is with statsd. I'm not sure if you've all used statsd before, but it's basically a server that you run, you throw numbers at it, and it aggregates those numbers and gives you 95th percentiles, minimums, maximums, counts, you name it. And with this data, you can then plot it on different backends.
So we have this gem that makes it a lot easier for us to instrument our code. It's called statsd-instrument, and this is an example of how we use it. We have this class, Liquid::Template. We can extend it with the module, and then we can call statsd_measure on it. What statsd_measure does is measure the amount of time it takes to call the render method and save that metric under the liquid.template.render statsd key. What this gives us in the end is that we can plot graphs of the 95th percentile of the Liquid template render method, which is pretty cool. The gem also gives us statsd_count, so you can count the number of times things are called. In our case, we count the number of times perform is called on the payment processing job, which gives us the number of payment processing jobs that we run.
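Roughly what that instrumentation looks like with the statsd-instrument gem (metric key names are illustrative, and the job class stands in for Shopify's real one):

```ruby
require 'statsd-instrument'

# Time every call to Liquid::Template#render and report it under this key;
# Datadog can then graph the 95th percentile of liquid.template.render.
Liquid::Template.extend StatsD::Instrument
Liquid::Template.statsd_measure :render, 'liquid.template.render'

# Count how many payment processing jobs we run (perform is a class method
# in the Resque-style job above, hence the singleton_class).
PaymentProcessingJob.singleton_class.extend StatsD::Instrument
PaymentProcessingJob.singleton_class.statsd_count :perform, 'payment_processing_job.perform'
```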
So this is all fun, right? What is this good for? We use this service called Datadog, which is a backend for statsd, and we plot all this data on our dashboards. This is actually our health dashboard. So at a glance, we can see if Shopify is doing well or not,
and we can identify regressions pretty quickly. The cool thing about Datadog is that it does alerts. I was looking for a screenshot, and I found this alert: one of our ops people set up an alert so that whenever the temperature of the ops room goes above 24 degrees Celsius, it fires off these alarms, which is pretty funny. But you can get clever and set up really useful alerts with Datadog. So that all sounds fun and perfect, but it's not perfect. Even though we have all this in place, regressions can still happen, and sometimes you don't find out
until it's too late. And we don't want this to happen, so we do a lot of load testing. We have this tool called Genghis, and basically what it does is simulate Black Friday and Cyber Monday. Sounds pretty crazy. It's actually really simple. It's just a tool that simulates a person going through the checkout process and buying something, and it does that thousands of times concurrently for many, many minutes, and we just see what happens. We're basically DDoS-ing Shopify in production to see what's going to break.
It might as well break before Black Friday if it's going to break. We do this several times a week. It helps us plan for the next week. It ensures us that when Black Friday does happen, that we're going to be totally fine, at least for things that we control.
How many of you use MySQL? Wow, okay, cool. I was expecting a lot of Postgres or something. So we use MySQL at Shopify. One thing that happens sometimes is there's slow queries, right? And MySQL gives us a really nice tool called the MySQL Slow Query Log.
It's a really nice tool, right? It logs slow queries to a file. So useful. Cool. Well, it actually is only useful if you can figure out what causes the slow queries. I want to go through a three-step process for determining the root cause of a slow query,
because I find it pretty interesting, and I figured it would be useful for others to know. Here we go. Step one: if you're using nginx, there's this module called, I put the link there, nginx-request-id. What it does is expose a variable in your nginx config that you can pass along as a header. It's just a unique ID for this specific request. That won't help us, not alone. The second step: there's this thing called log_process_action in Rails, and what it allows you to do is add stuff to the last line of a request's log. You know, when it says Completed 200 OK, you can add stuff there. So we add the request ID there. So we're getting there. Step three: we use Marginalia, which is a Basecamp gem. What it does out of the box is add the name of the controller and the action that performed the query as a comment on the query, but we also add the request ID there. This is pretty crazy, because once all that's done, our slow query log looks like this, and that's way more useful, because we can see exactly which request, starting from nginx, caused the slow query, and that makes it a lot easier to debug the root cause. We actually have this nice nginx-to-Rails-to-slow-query correlation. And there's a bonus, too: we add the request ID whenever we queue a background job, so this allows us to know which request queued the job. Because sometimes it's interesting to know this if you're debugging something.
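A sketch of the three steps wired together (the header name, metric names, and the exact nginx variable are assumptions; they depend on which request-id module you use):

```ruby
# config/initializers/request_id_logging.rb
#
# Step 1 (nginx side): have the request-id module set a header on its way to Rails, e.g.
#   proxy_set_header X-Request-Id $request_id;

# Step 2: tack the request ID onto the "Completed 200 OK ..." log line.
ActionController::Base.class_eval do
  def append_info_to_payload(payload)
    super
    payload[:request_id] = request.headers['X-Request-Id']
  end

  def self.log_process_action(payload)
    super + ["request_id=#{payload[:request_id]}"]
  end
end

# Step 3: add the request ID to every SQL query comment via Marginalia.
module Marginalia
  module Comment
    def self.request_id
      marginalia_controller.request.headers['X-Request-Id'] if marginalia_controller.respond_to?(:request)
    end
  end
end
Marginalia::Comment.components = [:application, :controller, :action, :request_id]
```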
So the next thing I'll talk about is resiliency. Anybody know what that means? No? Or maybe you're shy. I don't know what it means. I'm going to read a quote.
This quote: a resilient system is one that functions with one or more components being unavailable or unacceptably slow. That makes sense. So here's what happens. You start building a Rails app. You're having a really good time, hacking away. You need to use sessions, right? Because you want to remember if someone's logged in or not. So you add this session store. Then you continue coding away. Now you need background jobs, so you add Redis. And then you add memcache. And then your users want to be able to search for whatever reason, so you add Elasticsearch. And then the next thing you know, someone calls you up with a screenshot of this famous 500 error and you're like, oh god. The person's on the phone. They're pissed off, because they can't get to your site. What went wrong? What went wrong is that you just assumed that these services work, right? I mean, Redis doesn't go down. You did sudo apt-get install redis, so it's on the same machine. It shouldn't go down, right? So you assume that things are always up and fast.
But in reality, that's not the case. And basically, don't let minor dependencies take you down. You don't want something like the session store to take your whole app down, right? Cause really, the only thing you need the session store for
is to make sure the customer's logged in or not. So you probably have this code in a before filter that checks if there's a session ID and it loads the customer. The problem with this code is that if the session store is down, this before filter is just going to explode for every single request, right?
And that's bad. So what do we do here? Well, we can rescue the data-store-unavailable error. I mean, it works. It's probably not at the right level of abstraction, but this is something that we should always do.
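A minimal sketch of that rescue, assuming a memcached-backed session store (the exact exception classes depend on your store and client, and the model names are made up):

```ruby
class ApplicationController < ActionController::Base
  before_filter :load_customer

  private

  def load_customer
    @customer = Customer.find_by(id: session[:customer_id]) if session[:customer_id]
  rescue Dalli::RingError, Timeout::Error
    # The session store (memcached here) is unreachable: treat the visitor as
    # logged out instead of turning every single request into a 500.
    @customer = nil
  end
end
```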
We should never take for granted that the session store will be up. And you should do this for every data store except, I guess, your database, because if your database is down, your whole app is down anyway. So sprinkling these rescues in your codebase will help, but if you don't have tests to ensure that these flows do work without these data stores, someone can come along and just remove the rescue thinking, oh, it's useless, and then you're back to square one where your app goes down. So we built this tool called Toxiproxy. It's a very simple TCP proxy, and it's not just Rails specific.
It's really just a, it's written in Go, it's a proxy that you put between your Rails app and your services. And what ToxiProxy does is it allows you to simulate a service being down or even worse, a service being slow. Cause if the service is down, you'll get the response right away, right? Like the service is down, the connection failed.
But if the service is slow, well, that's another thing. It's just slow. The cool thing about ToxiProxy is that we released a Ruby library that allows you to control it. So we have ToxiProxy between our Rails and our minor dependencies in the development environment and in the test environment.
And what this allows us to do is write tests to assert that, for instance, in this case, when the session store is down, a request to slash still responds successfully. Now we're absolutely sure that this flow works, even if the session store is down.
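A sketch of such a test using the Toxiproxy Ruby library (the proxy name memcached_sessions is an assumption about how the proxy was registered):

```ruby
require 'test_helper'

class StorefrontResiliencyTest < ActionDispatch::IntegrationTest
  test "storefront still responds when the session store is down" do
    # Toxiproxy sits between the app and memcached in dev/test;
    # .down severs the connection for the duration of the block.
    Toxiproxy[:memcached_sessions].down do
      get '/'
      assert_response :success
    end
  end

  test "storefront still responds when the session store is slow" do
    # Slowness is the nastier failure mode: add a second of latency instead.
    Toxiproxy[:memcached_sessions].downstream(:latency, latency: 1000).apply do
      get '/'
      assert_response :success
    end
  end
end
```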
There's a really nice blog post, I'll link it in the slides I post after the talk, that describes the process of making Shopify resilient. I'd encourage anybody to read it. But essentially, the TL;DR is that we did what I just described for all the minor dependencies we had.
So we came up with this nice table: here's the Shopify checkout, here's the Shopify web channel, and here are all the services each one depends on. And we just make sure that whenever one of these services is down, there's a proper fallback so that we don't render 500s, and we try to fall back smartly and serve 200s. Because what's worse to the user: seeing a 500, or seeing that they're logged out temporarily? Being logged out temporarily is obviously better, right? So, I mentioned slow resources; this is a tough one.
So we have three shards: we split our data into three MySQL databases. For those of you who don't know what sharding is, basically we have the data for, say, shop one, shop two, and shop three on shard one, which is one MySQL database. We have the same thing for shard two and shard three, so we split our data into three shards. Then we put Rails in front of that, and whenever a request comes in, using the host name, we can determine what shard that shop is on, and we query that database. Sounds really cool, right?
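As a very rough sketch of host-based shard selection (everything here is invented for illustration; Shopify's actual connection switching is internal):

```ruby
require 'mysql2'

# One client per shard; hosts and database names are made up.
SHARDS = {
  1 => Mysql2::Client.new(host: 'mysql-shard1.internal', database: 'shopify_shard_1'),
  2 => Mysql2::Client.new(host: 'mysql-shard2.internal', database: 'shopify_shard_2'),
  3 => Mysql2::Client.new(host: 'mysql-shard3.internal', database: 'shopify_shard_3'),
}

# A small master table maps every shop's domain to the shard it lives on.
def connection_for(request_host)
  shard_id = ShopDirectory.find_by!(domain: request_host).shard_id
  SHARDS.fetch(shard_id)
end
```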
But there's a problem with that. What happens if shard one is slow? Because the same Rails app is serving all three shards, if shard one is slow, your unicorn workers are going to respond slower, and at some point they won't be able to take any more connections, right? Doesn't that kind of defeat the purpose of sharding? Isn't the point of sharding to be able to just kill off one shard and still be able to serve the other two? So we thought about this, and we were thinking, well, how can we make it so that shard one being slow doesn't affect shard two and shard three?
How can we fail fast on this? We have this gem called Semian, which is a smart circuit breaker. The idea here is, I'll just show some code, it'll probably make more sense. We register shard one as a resource, and we say there are five tickets, so you can do five queries on shard one at a time. There's a timeout, so if a sixth query comes in, we wait 0.5 seconds for a ticket to free up. If it doesn't free up, then we just pretend that MySQL shard one is not there. That's our way of failing fast.
So if there's a slow query that's causing shard one to respond slower, we'll only respond slowly to five requests, and the other requests will just fail right away. There are a couple of other settings here. This is our error threshold: the idea is that if we see 100 errors, we're just going to pretend shard one is down. So we open the circuit, and after 10 seconds we put the circuit into a half-open state. The idea there is that we let a bit of traffic go through, and if we see that shard one is healthy again, we close the circuit,
but if shard one is not healthy, we reopen it. So the idea here is that we reduce the impact of one slow database on the rest of the connections. And the way you use this in Rails is you basically acquire the resource and do your query within that block.
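Putting the settings from the talk together, a Semian setup might look roughly like this (the numbers mirror what was just described; the success_threshold value and the connection helper are assumptions):

```ruby
Semian.register(
  :mysql_shard_1,
  tickets: 5,           # at most five in-flight queries against shard one
  timeout: 0.5,         # a sixth caller waits up to 0.5s for a ticket, then fails fast
  error_threshold: 100, # after 100 errors, open the circuit entirely
  error_timeout: 10,    # after 10 seconds, go half-open and let a bit of traffic through
  success_threshold: 2  # successes needed while half-open before closing again
)

# Around every query that targets shard one:
Semian[:mysql_shard_1].acquire do
  shard_one_connection.execute(sql)
end
```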
So if you go back to our example: now if shard one is slow, Semian will fail it fast, and we'll still be able to serve traffic on shard two and shard three successfully. So what else can go wrong?
So many things can go wrong. These are all the things that we depend on. We depend on shipping rate providers like FedEx, UPS. We depend on payment gateways, Stripe, PayPal, fulfillment services, internal services. So during Black Friday,
all these services get thrown the same amount of traffic as Shopify. So even if Shopify can scale its internal services, we're still at the mercy of, say, FedEx for calculating shipping rates. For this, we have manual circuit breakers. Basically, they're just flags. We wrap things with if statements, and we can manually go into a panel and disable a specific service. So let's say PayPal's having a hard time during Black Friday: we just go into the panel, disable PayPal, and Shopify continues working for everybody else that doesn't use PayPal. That's all I have. Any questions?
Are his attacks unexpected? Do you know when he's coming?
So this is pretty cool. Is Genghis predictable? Yes. We have a Google Calendar event where we say, do a Genghis run at this time. So that is predictable. But the cool thing about Genghis, or not Genghis specifically, is that we use it for something else too. I'm not sure if you've heard of Chaos Monkey before. This is a thing that Netflix does: they throw a monkey into the data center that randomly pulls plugs and just wreaks havoc, right? So we started reusing the Genghis flows for this too; we get it for free, right? So we have one that's predictable, but we also have our own Chaos Monkey that just wreaks havoc, at a lower intensity. Our scheduled Genghis runs are really like, here's what we expect for next year's Black Friday, and we run it, and we keep our mouse on the stop button in case anything goes wrong, but in most cases we ramp up gradually, so we've never actually taken Shopify down with Genghis before. We do run it in production.
For the payments, you said you move payment processing to a background job, right? So how does the user know when the payment went through, or if something went wrong with it?
Yeah, so how does the... So when we put payment processing in the background job, how does the user know that the payment went through correctly? So normally when you're on a checkout flow, you hit submit. I mean, you enter your credit card information,
you hit submit, and the next page you see is: thank you, your payment was successful. In our case, we had to add a page that says please wait. What I showed you earlier was the 95th percentile; on average, payment processing takes a second. So on average, users will stay on this page for about a second,
and we just refresh, we poll the page, and our payment processing job sets a flag on our order model, called payment successful or whatever, and once that flag is there, we send the person to the receipt page. We also, I mean, if there's an error,
we just send them back into the checkout process. Again, you do need to add a spinner page, which is not ideal, but if you use Ajax, you can make the experience better by just making it a spinner on a button or something. Yeah? Yeah, actually, that's a good point.
So, caching of external dependencies. One that's a bit obvious is shipping rates. We cache shipping rates because, basically, if you're shipping from point A to point B with a given cart, the prices will always be the same, right? At least for a certain period. So we cache shipping rates for six months, six days, sorry, not six days, six hours, and it's just in memcache; the key is basically the addresses and the cart contents.
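A sketch of that shipping-rate cache (the helper and client names are made up):

```ruby
require 'digest/md5'

def cached_shipping_rates(origin, destination, cart)
  # The key is derived from the addresses plus the cart contents.
  key_material = [origin, destination, cart.line_items.map { |item| [item.sku, item.quantity] }]
  cache_key = "shipping_rates:#{Digest::MD5.hexdigest(key_material.to_s)}"

  Rails.cache.fetch(cache_key, expires_in: 6.hours) do
    FedexClient.rates(origin, destination, cart) # the slow external call
  end
end
```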
I guess I was going to ask something sort of similar. It's always hard to think about minor dependencies, because none of them really feel minor when you're in the middle of it. And the shipping rate providers are the perfect example, right, because you can't calculate that yourself, and you feel like you can't really fall back. Are there other ways to respond to some of these other minor dependencies?
Actually, I would consider those major dependencies. The minor dependencies are, like, honestly, the session storage was a real use case at Shopify: we had this before filter that was trying to load your customer ID from the session store, and when the session store went down, all of Shopify just went down. That one is... you don't think about it, right? You just assume that these things work. For the shipping rate providers, you need to think about something else. If FedEx is down, we can't provide any rates to people, right? At all. People can't check out. That's pretty bad, right?
So what we do for that is we have so much data in our databases that we try to be smart about it and try to estimate the shipping rate. We can look at, like, did anybody order this exact cart to, say, this state from this state yesterday? We can approximate the amount and use that.
But, again, we're really at the mercy of these external services, and there's really nothing we can do besides providing fallbacks. What if you guess wrong, do you just eat the cost? The merchant would, I guess, yeah. I mean, you could be smart about it, like, maybe pad it by a few dollars. I don't know. It's tough. Yeah? When you open the circuit on FedEx, do the shipping options really disappear?
So what we did for last Black Friday is really just very simple. We just wrapped everything with an if, and when you open the circuit, all the UI elements go away. The interesting part here is that some of our merchants have multiple shipping providers, so even if we kill off FedEx, like,
we can still supply some rates. Same thing for payment gateways. A lot of our merchants accept two payment gateways, primarily a credit card gateway like Stripe, and PayPal. So if PayPal has issues, you can still pay with a credit card.
Yeah, but the UI elements do go away. When you're rescuing, yeah, okay.
I'm just going back to this. Well, this could be a nice place to put statsd. You can record this in statsd and have alerts that say, oh, the session storage is having trouble. This is a very simple example, but you could totally see putting a statsd count here and having an alert that says, if there are more than, say, ten errors on sessions within a minute, fire off an alert so someone can look into it. You probably should do that, because otherwise that rescue is like closing your eyes.
Any other questions? You're good? Well, thanks a lot.