
Testing Rails at Scale

Transcript
Alright — hi, I'm Emil, and today I'm talking about testing Rails at scale. I'm a production engineer at Shopify, working on the production pipeline and performance. Shopify is an e-commerce platform that lets merchants set up an online store and sell products on the Internet. To give a little background on Shopify:
Shopify has over 240,000 merchants. Over the lifespan of the company we've processed 14 billion dollars in sales, in any given month we see about 300 million uniques, and we have over a thousand employees.
When you're testing Rails at scale you typically use a CI system, and I want to make sure we're all on the same page about how I think about CI. I think of CI as having two components: the scheduler and the compute. The scheduler is the component that decides when builds get kicked off — typically a webhook comes in from something like GitHub — and it orchestrates the work: it decides which scripts to run where. The compute, in contrast, is where the code actually runs, and it's everything that task requires: not just the machine, but provisioning the machine, making sure the code gets onto the machine, and everything else involved. If you look at the market of
CI systems, you typically see two types. First there are managed systems: closed, multi-tenant systems that handle both the compute and the scheduling for you, and you just give them your test suite. Some examples are CircleCI and Travis CI. In
contrast, there are unmanaged providers. These are systems where you host both the scheduling and the compute in your own infrastructure, and it's an open system: you have access to the code base and can make whatever changes you like. An example is Jenkins. Today, Shopify boots up
over 50,000 containers in a single day of testing. During that time we build 700 times, for every build we run over 40,000 tests, and the whole process takes about 5 minutes. But this wasn't always the case.
Around this time last year, Shopify's build times were close to 20 minutes. We were experiencing flakiness issues — not just from our code, but also from the provider we were on. We were the biggest customer of this provider and we were running into capacity issues, so we'd get problems like out-of-memory errors on the provider's side. On top of all that, CI was expensive, and not just in the dollar amount: with a typical hosted provider you pay by the month, but a typical workload is 5 days a week, 8 to 12 hours a day, so for the rest of those hours you're paying for compute you're not using. And so we set out on a journey to go and solve this problem.
We were given the directive to bring build times down to 5 minutes. At this point, because of the level of flakiness and the long build times, you'd have to rebuild a build that should have been green multiple times — two or three times — before it actually went green. The goal was also to maintain the current budget. So we looked at the
market and found an Australian CI provider by the name of Buildkite. The interesting part about Buildkite is that they're a hosted provider, but they only provide the scheduling component — you bring your own compute to the service. The reason that's so valuable is that for 90 percent of use cases the scheduling component is the same for any CI system, and this way we weren't reinventing the wheel.
The way it works is you run Buildkite agents on your own machines, and those agents talk back to Buildkite. Buildkite also ties into the webhook events from your GitHub repo, so when you push to GitHub, Buildkite knows it needs to start a build. You tell Buildkite exactly which scripts you want the agents to run; the agents pull the code from GitHub, run the scripts, and propagate the results back. The results are searchable in Buildkite, and Buildkite propagates them to wherever you need them — GitHub statuses included. As for the compute cluster we have:
at peak it's 90 c4.8xlarge EC2 instances. All of that gives us about 5.4 terabytes of memory and over 3,200 cores. The cluster is hosted on
AWS. It's auto-scaled, but we manage it with Chef and pre-built AMIs. The instances are memory-bound, because in the containers we run on these instances we put in all the services Shopify requires too. Finally, we had to do some I/O optimizations on these machines: because of the write-heavy workload we generate downloading and booting containers, we use ramdisks on these machines. I mentioned we auto-scale our compute cluster — we couldn't use Amazon's autoscaler, because Amazon's autoscaler works off HTTP request metrics. So we had to write our own; it's a simple Rails app. The way it works
is that it polls Buildkite for the currently running builds and agents, checks how many agents are required, and based on that computes the number of machines needed to run everything concurrently. It then goes and boots new EC2 machines, or scales the cluster down. That's the basic way it works. We also kept
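The talk only sketches this scaler, so here is a minimal Ruby sketch of what its core loop could look like. The assumptions are mine: the Buildkite REST endpoint and organization, the fixed per-build agent allotment, the agents-per-instance ratio, and the autoscaling-group name are all illustrative, and the real tool also handles scale-down, spot instances, and more.

```ruby
require "net/http"
require "json"
require "aws-sdk-autoscaling" # assumption: the cluster sits behind an autoscaling group

AGENTS_PER_INSTANCE = 8 # hypothetical: agents hosted per c4.8xlarge

def running_builds
  uri = URI("https://api.buildkite.com/v2/organizations/my-org/builds?state=running")
  req = Net::HTTP::Get.new(uri)
  req["Authorization"] = "Bearer #{ENV.fetch('BUILDKITE_API_TOKEN')}"
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body)
end

# Hypothetical sizing rule: a fixed agent allotment per running build.
agents_needed    = running_builds.size * 100
instances_needed = (agents_needed / AGENTS_PER_INSTANCE.to_f).ceil

Aws::AutoScaling::Client.new.set_desired_capacity(
  auto_scaling_group_name: "ci-builders", # hypothetical group name
  desired_capacity: instances_needed
)
```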
cost in mind as we built the system, so there are some specific optimizations. That includes things like keeping instances up for the full hour, because AWS bills by the hour — it makes sense for a machine to stay up for the hour you've paid for even when we don't immediately need the capacity. It also includes using spot instances and reserved instances. We try to improve utilization, and we allocate a dynamic number of agents per build: at peak, branch builds can get up to 100 agents, and massive builds up to 200 agents. Keep in mind that this is not one-size-fits-all — for us, AWS and autoscaling work; for other companies, bare metal might be the correct solution.
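A sketch of the keep-the-full-hour idea (EC2 was billed per instance-hour in 2016): only instances in the last few minutes of a billed hour become candidates for termination. `idle_instances` and `terminate` are hypothetical helpers, and the drain window is an assumption.

```ruby
BILLING_HOUR = 3600
DRAIN_WINDOW = 300 # only terminate in the last 5 minutes of a billed hour

def near_hour_boundary?(instance, now: Time.now)
  # aws-sdk EC2 instances expose launch_time; everything else here is illustrative
  seconds_into_hour = (now - instance.launch_time).to_i % BILLING_HOUR
  seconds_into_hour >= BILLING_HOUR - DRAIN_WINDOW
end

idle_instances.select { |i| near_hour_boundary?(i) }.each { |i| terminate(i) }
```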
The funny thing about graphing running build agents is that it's implicitly a measure of how productive developers are at the company — when you're pushing more code, you're more productive — so we can track activity over the day. I took this graph from an average day, and you'll notice there are three features. Anybody want to guess what the two valleys and the one peak are? ... It's lunch. Shopify is in Ottawa, and the graph is in UTC; lunch at Shopify runs 11:30 to 1:30. The first valley is the first wave of people getting up to go to lunch. The peak right before it is what you always do before leaving your computer: you're working on something, you commit the work in progress and push it up to GitHub. And the second valley is everybody else going for lunch. So — containers.
The largest speedup we got on the compute side was using Docker — using containers to run our tests. We were able to get a big speedup because we do all the configuration of the environment we need during the container build, and we only have to do it once; from the moment a container lands on a machine, all that's left is running tests. So we do things like getting our dependencies onto the machine and compiling all the assets ahead of time. We also get test isolation from Docker, which is a big deal with Rails and quite useful. Finally, Docker provides a distribution API — most things speak Docker — so we can put the container anywhere we want, as long as our nodes can reach the registry.
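To make the "pay the setup cost once" point concrete, here is one way to express it in Ruby using the docker-api gem. This is the idea, not Shopify's implementation — as the next paragraph explains, they had outgrown Dockerfiles — and the commands baked into the image are illustrative.

```ruby
require "docker" # gem "docker-api"

# Build once per commit: bundle install, asset precompilation, service
# configuration and so on all happen here, inside the image build.
image = Docker::Image.build_from_dir(".")

# From then on, every one of the hundreds of test containers for that
# commit skips straight to running tests.
container = Docker::Container.create(
  "Image" => image.id,
  "Cmd"   => ["bin/rails", "test"]
)
container.start
```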
Shopify has outgrown Dockerfiles; we have our own internal container build system, called Locutus, which uses the Docker build API to build containers using bash scripts. At the time of this first iteration, it ran on a single dedicated EC2 machine — one of those machines that starts out as a convenience, ends up with a bunch of apps depending on it, and quietly becomes production-critical. Ours was one of those. And building containers for a CI system forces you to repay a lot of the technical debt that accumulated while the app was growing — Shopify is an old code base by web standards, so there was a lot of it. While writing the container build we ran into issues like compiling assets and getting the code ready for test distribution.
For test distribution we went with a simple solution: every container runs a fixed set of tests, computed as an offset based on the container's index (a sketch of this follows below). We had two categories of containers: some ran Ruby tests and some ran browser tests. The issue with this is that the Ruby test pool was much larger, while browser tests are much slower per test. What happened is that the Ruby tests would finish running and then the browser tests would take a couple more minutes, resulting in builds that were longer than they needed to be. For artifacts: at the end of a CI run, the agents on the boxes would go into the Docker containers, grab the artifacts, and upload them to S3. We also had an artifacts service that would get webhooks from Buildkite, dump some of those artifacts into Kafka, and emit metrics that the rest of Shopify could deal with. That let us query those artifacts later — to go find, say, a flaky test across the whole fleet. This was our first iteration, and this is what
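Here is the index-based split as a minimal Ruby sketch. Buildkite does expose parallel-job environment variables along these lines, but treat the exact wiring — the glob, the runner command — as illustrative.

```ruby
# Each container takes every Nth test, based on its own index in the pool.
index = Integer(ENV.fetch("BUILDKITE_PARALLEL_JOB"))       # 0-based container index
total = Integer(ENV.fetch("BUILDKITE_PARALLEL_JOB_COUNT")) # containers in the pool

all_tests = Dir["test/**/*_test.rb"].sort
my_tests  = all_tests.select.with_index { |_, i| i % total == index }

exec("bin/rails", "test", *my_tests)
```

The weakness described above falls straight out of this scheme: each container's slice is fixed up front, so a pool of slow tests can't borrow idle capacity from a pool that finished early.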
the final architecture of the first iteration looked like.
So we shipped the second provider — and when we shipped it, we brought a lot of confusion to the company, because we decided to run both CI systems in parallel. We also noticed that a single build box doesn't scale; it ran into capacity issues. And the two categories of containers, each running a different half of the suite, made builds longer than they should have been. On the confusion when we shipped: we
decided to run both CI systems in parallel so we could gain more confidence before we rolled out fully and removed the old CI system. The problem is that we did a bad job of communicating to the whole company what we were planning to do and how we were doing it. Developers would see two statuses — one green, one red — and they weren't sure which to trust, which to believe. That unfortunately hurt developer confidence. For us, the solution was to switch fully to the new system, 100 percent, and rip off the band-aid. So when we
outgrew our current Locutus instance, we knew we had to go back to the drawing board and rebuild it to be scalable. We also wanted to keep it as stateless as possible. And so this is what we
ended up building. The old version of Locutus was a single instance that would get the webhooks and run the container builds. In the new version, we have a coordinator instance that gets the webhooks and allocates the work to a pool of workers. Each app is consistently hashed to a particular worker, so the same worker receives the same work — the same app always builds on the same machine. The reason is that each of these machines holds a cache that the container builds depend on, and you can lose that cache. The problem is that once the cache is lost, it can take upwards of 20 minutes to build a fresh container, which just doesn't work that well.
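A toy sketch of that placement idea — a rendezvous hash in Ruby, under the assumption of illustrative worker names; the real coordinator is more involved, but the property is the same: a given repo lands on the same worker as long as the pool is stable, which is what keeps its build cache warm.

```ruby
require "digest"

WORKERS = %w[builder-1 builder-2 builder-3].freeze

# Rendezvous hashing: score every worker against the repo and pick the max.
def worker_for(repo)
  WORKERS.max_by { |worker| Digest::MD5.hexdigest("#{worker}:#{repo}").to_i(16) }
end

worker_for("shopify/shopify") # => the same worker every time while the pool is stable
```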
Then, a second stab at test distribution. Instead of fixed slices, the build loads all the tests into a Redis queue, and the containers look at that queue and pull tests off one by one. We also got rid of test specialization, so every container ran all kinds of tests. This equalized the running time of the containers: they finish within tens of seconds of each other.
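A sketch of that queue, assuming the redis gem; the key name and the `all_test_names`, `coordinator?`, and `run_test` helpers are hypothetical.

```ruby
require "redis" # gem "redis"

redis = Redis.new
queue = "build:#{ENV['BUILDKITE_BUILD_ID']}:tests"

# Once per build, the coordinator enqueues every test...
redis.rpush(queue, all_test_names) if coordinator?

# ...and every container drains the queue until it's empty. Fast and slow
# tests naturally even out, because a container that finishes a test early
# just pops the next one.
while (test = redis.lpop(queue))
  run_test(test) # hypothetical: shells out to the framework for one test
end
```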
Finally, this is what the second iteration of our CI system looked like.
Docker: the gift that keeps on giving. Nobody tests starting tens of thousands of containers a day — Docker certainly doesn't — but that's exactly what we were doing in our CI system. We ended up running into failures we couldn't account for, and this unfortunately eroded developer confidence in the new system. Every new version of Docker had major bugs: they'd fix old ones and bring new ones, or bring back old ones. Some examples: we'd see network timeouts happening randomly, we'd see containers that simply refused to boot, we saw abnormalities on the machines, and we saw issues where concurrent pulls would cause deadlocks. So that was a lot of fun. These would cause builds to fail, and just like a flaky test, a build that fails for no reason is a bad experience for developers. The solution was to actually swallow the infrastructure failures: identify when they occur and retry them. Going into this project you hear stories from Google, where a drive fails every couple of minutes, and you think: that's cool, but that's not us. What we saw was that even at our scale, over 100 containers fail in a day, which made us realize we couldn't ignore this problem. The way to think about infrastructure like this is the "pets versus cattle" mindset. You don't want
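"Swallow the infrastructure failures" can be as small as a retry wrapper that distinguishes our failures (Docker daemon, network) from genuine test failures. A minimal sketch — the error classes in the list and the helpers are stand-ins, not Shopify's actual taxonomy.

```ruby
require "net/http" # defines Net::OpenTimeout

INFRA_ERRORS = [Errno::ECONNREFUSED, Net::OpenTimeout].freeze

def with_infra_retries(attempts: 3)
  yield
rescue *INFRA_ERRORS => e
  attempts -= 1
  record_infra_failure(e) # hypothetical: keep counting, so 100+/day stays visible
  retry if attempts > 0
  raise
end

# Test failures raise other errors and propagate; only infra noise is retried.
with_infra_retries { boot_container_and_run_tests } # hypothetical helper
```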
your servers to be pets. You can identify a pets setup like this: you give each server its own unique name; if something breaks, you SSH in, find out what the problem is manually, apply an artisanal fix, and move on. In contrast,
when you treat your servers as cattle, each server has a number: node 1, node 2, node 3. You want to automate the detection of issues, you want a broken node removed from the cluster automatically, and you want the node to know how to clean itself up and put itself back in. Until we had done this, we had a lot of toil: we'd manually find a broken node, go fix it, and put it back in, and that ate a lot of time.
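As a sketch, the remediation loop that replaces SSH-ing into broken builders by hand looks roughly like this; every helper here is hypothetical shorthand for a real check or action.

```ruby
loop do
  builder_nodes.each do |node|
    next if healthy?(node) # e.g. agent heartbeat plus a docker ping
    drain(node)            # automatically pull it out of the cluster
    recycle(node)          # wipe state, reboot, or replace the instance
    reenroll(node)         # only re-admit it once health checks pass
  end
  sleep 60
end
```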
A side note: while making the slides for this talk, I searched for pictures of cattle with lasers in space. I just want to say that I love the Internet — let's have a round of applause for making this possible.
The third iteration of test distribution was actually about build correctness. The problem is that the queue has a race condition: a container pulls a test off the queue and then dies. Since all the tests that did run were green, nobody knows that this test was never run, and the build passes. That's a very scary proposition for a CI system. So what we do now is: when we dequeue a test, after it has run successfully, we enqueue it again into a second set. At the end of the build we know all the tests that should have run, and we compare the two; if they don't match, we fail the build. It's a rare situation — we don't see it often — but it's good to have that safety net there.
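The safety net as a sketch, reusing the Redis layout from earlier and assuming the enqueue step also records every test name in an "enqueued" set; the key names and `fail_build!` are illustrative.

```ruby
build = ENV["BUILDKITE_BUILD_ID"]

# In each container, only after a test has run successfully:
redis.sadd("build:#{build}:completed", test_name)

# At the end of the build, diff what should have run against what did:
missing = redis.smembers("build:#{build}:enqueued") -
          redis.smembers("build:#{build}:completed")
fail_build!("tests never ran: #{missing.join(', ')}") unless missing.empty?
```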
So this is what the final iteration of the build pipeline looks like today — what it looks like internally at Shopify. In conclusion:
don't build your own CI if your build times are under 10 minutes; it's not a productive use of your time. It took us a long time to get through this project — we had multiple people working on it for months. Also, if you have a small application, the issue is typically in the compute, usually a configuration issue, and you're likely to find small changes with large payoffs. When should you build your own CI?
If your builds take over 15 minutes, you should start considering implementing it on your own. If you have a monolithic application with flakiness all over the code base and you've already optimized as much as possible, owning your compute and being able to have more impact on it can be very effective. The same goes if you're hitting the parallelization limits of your CI provider: owning the compute lets you break past them. And if you do decide to go build your own CI system, please don't make the same mistakes we did. Be sure to commit 100 percent once you've built the new system, and beware of rabbit holes — we'd say work would be done in two weeks, and it was very difficult for that ever to be the case. Finally, make sure to think of your infrastructure as cattle, not pets; it'll save you a lot of headache and time. Thanks.
So the question was: did we spend time optimizing the code base and the tests themselves, versus focusing on the infrastructure? Yes and no. We actually didn't optimize the tests much — we found that parallelization was enough at the time. When you have something like 40,000-plus tests, even if some of them are slow, it evens out in the long run. The issue we did find in the code base is flakiness. You'd be surprised by the number of tests that assume state left over from the order tests used to run in; when you distribute tests from a queue, they run in different containers and the state is different. So we did sink a lot of developer time into going in, figuring out why a test was flaky, and fixing it.

The question was: how bound is the speed of the system to Docker? I'd say most of the speedup we got was not from Docker itself but from using containers. The reason is that in most CI systems, a lot of time goes into reconfiguring the application to run tests — compiling assets, downloading dependencies, and so on. When we use Docker, we're able to do all of that once, and then the containers can immediately start running tests. So a lot of the speedup came through Docker, but we also gained a lot from the parallelization.

The question was: what was the timeframe of the project? We started working on this in the winter of last year. By the summer we were at the point where most of the company was already using Buildkite and this new system, and we had seen the performance gains. During that time we spent a lot of effort learning how to operate these machines, and we still saw quite a bit of flakiness from test distribution. That lasted until about September, at which point the project mostly wound down and we moved on to other things.

The question was: did we maintain the cost? The answer is yes — we kept roughly the same budget, but we got more compute capacity and faster build times for the same money.

The question was: how big was the team? Around six to eight people; it shifted over time. Thank you.

Formal metadata

Title Testing Rails at Scale
Series RailsConf 2016
Part 58
Number of parts 89
Author Stolarsky, Emil
License CC Attribution - Share Alike 3.0 Unported:
You may use, modify, and reproduce, distribute, and make the work or content publicly accessible for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.
DOI 10.5446/31566
Publisher Confreaks, LLC
Publication year 2016
Language English

Content metadata

Subject area Computer Science
Abstract It's impossible to iterate quickly on a product without a reliable, responsive CI system. At a certain point, traditional CI providers don't cut it. Last summer, Shopify outgrew its CI solution and was plagued by 20 minute build times, flakiness, and waning trust from developers in CI statuses. Now our new CI builds Shopify in under 5 minutes, 700 times a day, spinning up 30,000 docker containers in the process. This talk will cover the architectural decisions we made and the hard lessons we learned so you can design a similar build system to solve your own needs.
