
Testing Rails at Scale


Formal Metadata

Title
Testing Rails at Scale
Title of Series
Part Number
58
Number of Parts
89
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
It's impossible to iterate quickly on a product without a reliable, responsive CI system. At a certain point, traditional CI providers don't cut it. Last summer, Shopify outgrew its CI solution and was plagued by 20-minute build times, flakiness, and waning developer trust in CI statuses. Now our new CI builds Shopify in under 5 minutes, 700 times a day, spinning up 30,000 Docker containers in the process. This talk will cover the architectural decisions we made and the hard lessons we learned so you can design a similar build system to solve your own needs.
Transcript: English (auto-generated)
Let's begin. Hi, my name is Emil, and today I'm going to be talking about testing Rails at scale. I'm a production engineer at Shopify; I work on the production pipeline, performance, and DNS. Shopify is an e-commerce platform that allows merchants to set up online stores and sell their products on the internet. To give you a little background on Shopify: we have over 240,000 merchants, over the lifespan of the company we've processed 14 billion dollars in sales, in any given month we see about 300 million unique visitors, and we have over a thousand employees.
When you're testing Rails at scale you typically use a CI system, and I want to make sure we're all on the same page about how I think about CI systems. I like to think of a CI system as having two components: the scheduler and the compute. The scheduler is the component that decides when a build needs to be kicked off, typically triggered by a webhook from something like GitHub, and it orchestrates the work, deciding which scripts need to run where. The compute, in contrast, is where the code and the tests actually run. The compute is also everything that touches the machine: not just the machine itself, but orchestrating it, making sure it's there, getting the code onto it, and everything else involved. If you look at the market of CI systems, you typically see two types. There are managed providers: closed, multi-tenant systems that handle both the compute and the scheduling for you, and you just give them the keys to your code base. Some examples are CircleCI, Codeship, or hosted Travis CI. In contrast, there are unmanaged providers: systems where you host both the scheduling and the compute in your own infrastructure. They're open systems, you have access to the code base, and you can make whatever changes you'd like. Some examples are Jenkins, Travis CI, or Strider.
Today Shopify boots up over 50,000 containers in a single day of testing. In that time we build Shopify 700 times, every build runs 42,000 tests, and the whole process takes about five minutes. But this wasn't always the case. Around winter of last year, Shopify's build times were close to 20 minutes. We experienced serious flakiness, not just from code health but also from the provider we were on. We were that provider's biggest customer and they were running into capacity issues, so we'd hit problems like out-of-memory errors. The provider was also expensive, and not just in the dollar amount: you typically pay a hosted provider by the month, but your typical workload is five days a week, eight to twelve hours a day, so for the rest of the day you're paying for compute you're not using. So we set out on a journey to solve this problem. We were given the directive to lower our build times to five minutes. At that point, because of the flakiness and the long build times, you would have to rebuild a build two or three times, even though the suite should have been green, before you actually got a green build and could deploy. The goal was also to stay within the current budget. So we looked around the market and found an Australian CI provider by the name of Buildkite.
The interesting part about Buildkite is that they're a hosted provider, but they only provide the scheduling component; you bring your own compute to the service. That's very valuable because, for 99% of use cases, the scheduling component is the same for any CI system, and this satisfied our not-invented-here worries about reinventing the wheel. The way Buildkite works is that you run Buildkite agents on your own machines, and those agents talk back to Buildkite. Buildkite also ties into the events for your repo, so when you push code to GitHub, Buildkite knows it needs to start a build. You tell Buildkite exactly which scripts you want the agents to run; the agents pull the code down from GitHub, run the scripts, and report the results back to Buildkite, and Buildkite propagates them to wherever you need them to be sent.
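Conceptually, an agent is just a small polling loop. Here is a minimal Ruby sketch of that flow; next_job and report are hypothetical stand-ins for the scheduler API, not Buildkite's actual agent protocol.

    # Rough sketch of what a CI agent does conceptually (hypothetical API).
    require "open3"
    require "fileutils"

    loop do
      job = next_job              # hypothetical: ask the scheduler for queued work
      next sleep(1) unless job    # nothing to do, poll again shortly

      FileUtils.rm_rf("build")
      system("git", "clone", "--depth=1", job.repo_url, "build")
      output, status = Open3.capture2e(job.script, chdir: "build")

      report(job, passed: status.success?, log: output)  # hypothetical: send results back
    end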
For the compute side, our cluster is c4.8xlarge instances in EC2, which gives us about 5.4 terabytes of memory and over 3,200 cores in total. The cluster is hosted in AWS, it's auto-scaled, and we manage it with Chef and pre-built AMIs. The instances are memory-bound, because the containers we run on them carry all the services Shopify needs to boot. Finally, we had to do some I/O optimizations on these machines because of the write-heavy workload you get when downloading a ton of containers, so we use a RAM-backed filesystem (ramfs) on them.
I mentioned that we auto-scale our compute cluster. We couldn't use Amazon's auto scaler, because it only scales on HTTP request metrics, so we had to write our own. It's just a simple Rails app, called Scrooge. It polls Buildkite for the currently running Buildkite agents and checks how many are required, a number Buildkite calculates from the builds it currently needs to run; Scrooge then boots up new EC2 machines or scales the cluster down. That's basically how it works. We also kept cost in mind as we built the system, with some AWS-specific optimizations: keeping each instance up for its full hour, because Amazon bills by the hour, and using spot instances and reserved instances. We also try to improve utilization: since a machine may stay booted for the hour even when we don't need the capacity, we allocate a dynamic number of agents to builds. At peak we can give up to a hundred agents to a branch build, or up to 200 agents to a master build.
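A minimal sketch of a Scrooge-style polling loop, assuming hypothetical helpers for the Buildkite agent counts and the EC2 calls rather than the real implementation:

    # Sketch of an autoscaler loop (hypothetical helpers, illustrative numbers).
    AGENTS_PER_INSTANCE = 10   # assumption for illustration

    loop do
      required = buildkite_required_agents   # hypothetical: agents Buildkite says it needs
      running  = buildkite_connected_agents  # hypothetical: agents currently online

      needed_instances = ((required - running).to_f / AGENTS_PER_INSTANCE).ceil

      if needed_instances.positive?
        boot_instances(needed_instances)             # hypothetical EC2 call
      else
        # Only retire instances approaching the end of their billed hour,
        # since that hour is already paid for.
        terminate_idle_instances_near_billing_hour   # hypothetical EC2 call
      end

      sleep 30
    end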
Keep in mind that one size does not fit all: for us, AWS and auto-scaling work; for other companies, bare metal might be the right solution. The funny thing about Scrooge is that Buildkite agent demand is an implicit measure of how productive developers are at the company, because when you're pushing more code you're being more productive, so we can track roughly how much work is going on. I took this graph from an average day, and you'll notice three features: two valleys and one peak. Can anybody guess what they are? Lunch. The times at the bottom are in UTC, and lunch at Shopify runs from 11:30 to 1:30, so that first valley is the lunch rush, people getting up and going to lunch. And what's the peak? Well, what do you do before you leave your computer while you're working on something? You commit "work in progress" and push it up to GitHub; that's what the peak is. And then the big dip is everybody out at lunch. Now, I mentioned containers. The big speedup we got on the compute side came from using Docker and running our tests in containers.
We were able to get a big speedup because all the configuration you would normally need is done during the container build, and you only have to do it once: the moment a container lands on a machine, it can instantly start running tests. So at image build time we do things like pulling in all of our dependencies and compiling all of our assets. We also get test isolation from Docker; that isn't as big a deal with Rails, but it's still quite useful. Finally, Docker provides a distribution API: most things speak Docker, so we can put the container anywhere we want as long as we announce where the registry is.
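As a rough illustration of that build-once, run-many split, here is a sketch that shells out to the docker CLI; the image name and commands are illustrative, and Shopify's real builds go through Locutus, described next.

    # Pay the setup cost once at image build time, then every CI container
    # starts from a fully configured image. Names and commands are illustrative.
    IMAGE = "registry.example.com/shop-ci:#{`git rev-parse --short HEAD`.strip}"

    # Build phase (once per revision): the build script is expected to run
    # `bundle install` and `rake assets:precompile` inside the image.
    system("docker", "build", "-t", IMAGE, ".") or abort("image build failed")
    system("docker", "push", IMAGE)             or abort("image push failed")

    # Test phase (on every agent, many times): no setup left to do.
    system("docker", "run", "--rm", IMAGE, "bin/rails", "test")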
Shopify has outgrown Dockerfiles, though, so we have our own internal build system called Locutus. It uses the Docker build API to build containers from bash scripts. At the time of this first iteration it ran on a single EC2 machine, and that machine wasn't dedicated to Locutus; it was one of those machines where you have a bunch of apps that need to run in production but aren't production-critical, so you put an app on it, then another, then another, and eventually the box has a pile of apps and has become production-critical. It was one of those. Building containers for our CI system also forced us to repay a lot of the technical debt the app had accrued; the Shopify code base is ten years old, and you accrue a lot of technical debt in that time. While trying to build containers we ran into delightful issues like asset compilation requiring a MySQL connection. For test distribution in this first iteration, every container ran a slice of the test suite chosen by an offset based on its container index.
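In other words, each container deterministically picks its share of the test files by index. A minimal sketch of that scheme, with illustrative environment variable names:

    # Static test distribution, first iteration: container i of N runs every
    # Nth test file, offset by its own index.
    container_index = Integer(ENV.fetch("CONTAINER_INDEX"))  # 0-based
    container_count = Integer(ENV.fetch("CONTAINER_COUNT"))

    all_tests = Dir.glob("test/**/*_test.rb").sort
    my_tests  = all_tests.select.with_index { |_, i| i % container_count == container_index }

    exec("bin/rails", "test", *my_tests)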
We had two categories of containers: some ran Ruby tests and some ran browser tests. The issue with this is that the Ruby test pool was much larger and Ruby tests are much faster, while browser tests are much slower, so the Ruby tests would finish and the browser tests would take a couple more minutes, resulting in longer build times. For artifacts, at the end of a CI run the agents on the boxes would reach into Docker, grab the artifacts, and upload them to S3. We also had an out-of-band service that would receive webhooks from Buildkite, dump some of those artifacts into Kafka, and emit some metrics to StatsD. At Shopify, all roads through Kafka lead to data land, so we were later able to use those artifacts to find flaky tests and flaky areas of the code base.
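A rough sketch of what such an out-of-band webhook service could look like, as a tiny Rack app; the KAFKA and STATSD clients, topic, and metric names are hypothetical, not the actual Shopify service.

    # config.ru sketch of an out-of-band artifact service.
    require "rack"
    require "json"

    run lambda { |env|
      payload = JSON.parse(Rack::Request.new(env).body.read)

      if payload["event"] == "build.finished"            # illustrative webhook event name
        KAFKA.publish("ci-artifacts", payload.to_json)   # hypothetical Kafka producer
        STATSD.increment("ci.builds.finished")           # hypothetical StatsD client
      end

      [200, { "Content-Type" => "text/plain" }, ["ok"]]
    }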
That was our first iteration, and that's what its final architecture looked like. But then Docker decided to strike back. When we shipped the second provider, we brought a bunch of confusion to the company, because we decided to run both CI systems in parallel. We also noticed that a single Locutus box doesn't scale and ran into capacity issues, and having two different types of containers running tests was making our builds longer than they should have been. First, the confusion: we shipped both CI systems in parallel so we could gain confidence before rolling the new one out and removing the old one. The problem is that we did a bad job of communicating to the whole company what we were planning to do and how. Developers saw two statuses, one green and one red, and they weren't sure which to trust, which unfortunately eroded developer confidence. The solution was to switch fully to the new system, 100%, and take the dive. Next, clustering Locutus: when we outgrew our single Locutus instance, we knew we had to go back to the drawing board, rebuild it to be scalable, and keep it as stateless as possible.
This is what we ended up building. The old version of Locutus was a single instance: it would receive the webhooks, build the container, and push it up to the Docker registry. In the new version, a coordinator instance receives the webhooks and allocates the work to a pool of workers. Each repo is hashed to a particular worker, so the same worker always receives the builds for the same repo.
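The routing idea is simple deterministic assignment by hash. A sketch, using a plain modulo hash and a hypothetical worker list rather than whatever Locutus actually does:

    require "digest"

    # Hypothetical worker pool; Locutus's real routing may differ.
    WORKERS = %w[worker-1 worker-2 worker-3 worker-4]

    def worker_for(repo)
      # Hash the repo name so the same repo always lands on the same worker,
      # keeping that worker's Docker layer cache warm for that repo.
      index = Digest::MD5.hexdigest(repo).to_i(16) % WORKERS.size
      WORKERS[index]
    end

    worker_for("shopify/shopify")  # => always the same worker for this repo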
Now, when I say it's stateless-ish, that's because there's a cache on each of those worker machines. You can lose the cache and the workers will still build the container fine; the problem is that once the cache is lost, it can take upwards of 20 minutes to build a new container, which doesn't work that well. On to our second stab at test distribution: the first container to boot loads all of the tests into Redis, and the rest of the containers watch that Redis queue and pull test jobs off it one by one. We also got rid of test specialization, so every container ran every kind of test. That equalized the running time of the containers, so they finish within tens of seconds of each other.
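A minimal sketch of that queue-based distribution with the redis gem; the key scheme and the leader check are illustrative:

    require "redis"

    redis     = Redis.new(url: ENV["REDIS_URL"])
    queue_key = "build:#{ENV['BUILD_ID']}:queue"   # illustrative key scheme

    # The first container to boot seeds the queue with every test file.
    if redis.setnx("#{queue_key}:seeded", 1)
      redis.rpush(queue_key, Dir.glob("test/**/*_test.rb").sort)
    end

    # Every container then pulls work off the queue until it's empty,
    # which naturally balances fast and slow tests across containers.
    while (test_file = redis.lpop(queue_key))
      system("bin/rails", "test", test_file)
    end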
Finally, that's what the second iteration of our CI system looked like. Now, Docker: the gift that keeps on giving. Nobody tests Docker by starting tens of thousands of containers a day, but that's exactly what we were doing in our CI system. We ran into a bunch of instability with Docker, and because we hadn't accounted for those failures, that too eroded some developer confidence in the new system. Every new version of Docker had major bugs: they would fix old ones but introduce new ones, or bring old ones back. Some examples: we'd see network timeouts happening at random, kernel bugs where Docker would refuse to start if AppArmor was on the machine, and concurrent image pulls causing deadlocks, which was a lot of fun. Since we hadn't allowed for any of this, these failures would fail builds: you would have a green test suite, but your build would fail, and that's very annoying for a developer. The solution was to identify when infrastructure failures were occurring and swallow them, rather than failing the build.
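The core idea is to classify the failure before deciding the build's fate. A sketch of that idea, where the error patterns, retry count, and helpers are illustrative rather than Shopify's actual policy:

    # Distinguish infrastructure failures from real test failures so Docker or
    # network hiccups get retried or swallowed instead of turning a green
    # test suite red.
    INFRA_ERROR_PATTERNS = [
      /error pulling image/i,
      /connection reset by peer/i,
      /cannot connect to the docker daemon/i
    ].freeze

    def with_infra_retries(max_attempts: 3)
      attempts = 0
      begin
        attempts += 1
        yield
      rescue StandardError => e
        raise unless INFRA_ERROR_PATTERNS.any? { |p| e.message.match?(p) }
        retry if attempts < max_attempts
        report_infra_failure(e)  # hypothetical: flag as infra failure, not a test failure
      end
    end

    with_infra_retries do
      pull_and_run_container!  # hypothetical: raises with Docker's error output on failure
    end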
Going into this project you hear stories from Google, where a drive fails every couple of minutes, and you think: well, that's Google, that's not us. But even at our scale we still saw over a hundred containers fail a day, which made us realize we couldn't ignore the problem. The way to approach your infrastructure is to get into the mindset of pets versus cattle. You don't want to treat your servers as pets. You can tell you're doing that if you give each server its own unique name, and when something breaks you SSH in, figure out the problem manually, apply an artisanal fix, and move on. When you treat your servers as cattle, in contrast, each server just has a number: node 1, node 2, node 3. You automate detection of issues, you remove the broken node from the cluster automatically, and the node knows how to clean itself up and put itself back in.
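In practice that means a health-check loop run by each node or a supervisor. A sketch of the cattle-style lifecycle, where every helper is a hypothetical stand-in for a cluster or agent operation:

    # Cattle-style node management: detect, drain, clean, rejoin.
    loop do
      unhealthy_nodes.each do |node|        # hypothetical: e.g. Docker daemon unresponsive
        remove_from_cluster(node)           # stop sending it new builds
        node_cleanup(node)                  # e.g. restart Docker, clear scratch space, reboot
        if healthy?(node)
          rejoin_cluster(node)
        else
          replace_node(node)                # give up and boot a fresh instance
        end
      end
      sleep 60
    end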
We had to go and do this, and until we did, there was a lot of toil on the team: we would manually find a broken node, fix it, and put it back in, and we wasted a lot of time. Side note: while making the slides for this talk I found a bunch of pictures of cats with lasers in space, so I just want to say that I love the internet, and I think we all deserve a round of applause for making that possible.
Our third iteration on test distribution was about stability. The problem we saw with test failures is that you can get into a race condition where a container pulls a test off the queue and then dies before running it; since every test that did run was green, and nobody knows that this one test never ran, the build is green. That's a very scary situation. So what we do now is this: when we dequeue a test, after it has successfully run we insert it into a second set, and at the end of the build we compare the set of tests that should have run with the set of tests that actually ran. If they don't match, we fail the build.
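A sketch of that bookkeeping on top of the Redis queue idea from earlier; redis, build_id, run_test, and fail_build! are assumed helpers, and the key names are illustrative:

    # Track which tests actually ran so a dead container can't silently drop work.
    queue_key = "build:#{build_id}:queue"
    all_key   = "build:#{build_id}:all"   # seeded with every test when the queue is filled
    ran_key   = "build:#{build_id}:ran"

    while (test_file = redis.lpop(queue_key))
      run_test(test_file)               # runs the test and reports pass/fail
      redis.sadd(ran_key, test_file)    # recorded only once the test has actually run
    end

    # A finalizer at the end of the build compares the two sets.
    never_ran = redis.smembers(all_key) - redis.smembers(ran_key)
    fail_build!("tests never ran: #{never_ran.join(', ')}") unless never_ran.empty?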
This is a rare situation and we don't see it often, but it's good to have that safeguard in place. And that's what the final iteration looks like today; that's how Buildkite runs internally at Shopify. So, in conclusion: don't build your own CI if your build times are under 10 minutes. It's not a productive use of your time; it took us a long time to get through this project, with multiple people working on it for months. Also, if you have a small application, the issue typically isn't compute but configuration, and you're likely to find large optimizations there. When should you build your own CI? If your build times are over 15 minutes, you should start considering it. If you have a monolithic application with snowflakes all over the code base and you've already optimized as much as possible, owning your own compute and having more control over it can be very effective. And if you've hit the parallelization limits of your CI provider, having your own compute lets you break past them. If you do decide to build your own CI system, please don't make the same mistakes we did.
Be sure to commit 100% once you've built your new system. Be aware of rabbit holes: we all like to say it'll be done in two weeks, and it's very rarely the case. And finally, think of your infrastructure as cattle, not pets; you'll save yourself a lot of headache and time. Thanks.
So, the first question was: did we spend any time optimizing the code base or the tests instead of just focusing on the CI system? We actually didn't. We found that parallelization was enough at the time: when you have 40,000-plus tests, you're going to have some slow ones, and it evens out in the long run. The issue we did find in the test code base was flakiness. You'd be surprised how many tests assume state left behind by the order tests normally run in, and when you distribute tests from a queue across different containers, that state is different.
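A classic shape of that kind of flake, as a contrived Minitest example: the second test only passes if the first one ran earlier in the same process, which stops being true once tests are pulled from a shared queue onto different containers.

    require "minitest/autorun"

    # Contrived example of an order-dependent test.
    class SettingsTest < Minitest::Test
      def test_enables_beta_flag
        $beta_enabled = true            # leaks global state
        assert $beta_enabled
      end

      def test_beta_checkout_path
        # Passes only if test_enables_beta_flag already ran in this process;
        # flaky once tests are distributed across containers.
        assert $beta_enabled, "expected beta flag to be enabled"
      end
    end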
So we had to spend a lot of time, and build some tooling, to figure out why a test is flaky and fix it; that's where the time went. The next question was: how bound is the system to Docker? I would say most of the speedup we got was actually from Docker, and not Docker itself so much as using containers. The reason is that a lot of the time spent in most CI systems goes into configuring the application so it can run tests: compiling assets, downloading new gems, and so on. With Docker we do all of that once, and then every instance can instantly start running tests. So a lot of the speedup came from Docker, and we also gained a lot from parallelization.
The next question was about the time frame of the project. We started working on this in the winter of last year. By the summer we were around phase two: most of the company was already using Buildkite and the new system, and we had seen the performance gains, but half the team was still busy fixing machines and we were still seeing quite a bit of flakiness because of the test distribution we were doing. That lasted until about September, at which point the project mostly wound down and the team moved on to other things. The next question was: did we maintain our costs? Yes, we kept the same budget; it was more or less a case of being told we could fill up that budget, and we ended up with more compute capacity and faster build times for the same amount of money. The last question was: how big was the team? Around six to eight people; it shifted over time, but roughly in that range. Thank you.