'Acing' Infrastructure Testing with Chef

Video in TIB AV-Portal: 'Acing' Infrastructure Testing with Chef

Formal Metadata

CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Subject Area
Sports move fast; infrastructure testing has to keep pace. The team responsible for the cloud that powers Wimbledon, The US Open and some of the world's largest sporting events uses Chef to manage their infrastructure. Developing a testing framework presented many challenges, including: Scaling infrastructure testing to handle a distributed enterprise. Testing in remote locations without the fastest internet speeds. Contending with the Great Firewall in China. The team has spent 100s of hours optimizing test environments to support fast testing by distributed teams. Learn how the team approached test optimization, what worked and what failed.
So to get started: this is my own personal redemption story. It's about the mistakes I made when transitioning our environment to a test-driven infrastructure, and what we have done so far to resolve them. But before I dive into that, what exactly does my team at IBM do? We're known for sports.
We do the Australian Open, we do Roland Garros, the French Open, which is going on right now, Wimbledon, and the US Open. We also do a large golf tournament in Georgia. But it's not just sports.
We also do ibm.com and all of the digital properties that comprise the ibm.com portfolio, and we have a focus on what we call continuous availability. In fact, we've been running ibm.com with zero outages and zero downtime since June of 2001. In that time span we've moved data centers 18 times with zero seconds of downtime.
And we do this across a hybrid cloud: we have 3 private data centers and four public cloud regions in the IBM cloud that we serve these digital properties from.
We do this with a large team, with people in 8 different time zones and 35 work locations, so we need tools and methods that let us work independently yet collaborate when necessary.
The workload is highly mission-critical. Our partners and teams spend 50 weeks a year preparing for 2 weeks or less. There's no "oh, I'll fix this next week"; that doesn't happen. They're going to hold the US Open whether we're ready or not, so we have to be ready. And not only is this the time these properties spend on the world stage, it's also when they derive most of their revenue, so we have to deliver flawlessly.
Now here's the mistake. When I was doing this transformation, I went out there, surveyed the landscape and looked at all the best practices. So, OK, we're moving to Git and code reviews, I have a delivery pipeline, we're into infrastructure as code, we're into local testing. I made all these changes at once, and it did not go over well. There was a lot of change to push through all at once, and the biggest problem we had was that simple tasks that used to take 3 minutes were now taking hours. You used to log in to the server, edit the file and move on to the next thing. Now I'm doing a git clone, I'm editing the file, I'm committing it, it's going to code review, I'm building it locally, and it's taking hours. So a 3-minute change was now taking hours to do, and it was this combination of best practices, which are great practices, and other challenges within our organization and structure that was causing these changes to take a long time.
The other problem that we faced was that we have to put out a lot of enterprise software; it's part of our platform. And this software is very large: we have software that's over a gigabyte in size. If you're working locally and you're trying to do a vagrant up or Test Kitchen, and you have to pull down a one-gigabyte package every time, that can take a long time.
And our team is distributed, with people all over the world. A large part of our team is in China, we have people in India, and we have people in the United States and Europe, and they're not all sitting right next to the source of those binary artifacts, so download time is very important.
And then some of our people live in a remote location, possibly, or we send them to one. We send them to the US Open, and there's a fixed amount of bandwidth available for you to download things constantly and test.
The biggest feedback I got was that this combination of factors was resulting in basic changes taking hours. And, you know, I brought this problem on the team myself, so this was my problem, and I had to work really hard to resolve it. So we had to find a solution.
And our solution was: we moved all of our testing away from local laptops to the cloud, in particular the IBM cloud. We use the services of Blue Box, which is now IBM private cloud as a service; it's an OpenStack-based cloud. To do this we used the kitchen-openstack plugin to let us launch test machines on a cloud and test our changes out there instead of locally. Now, we're still developing locally, we're still doing the Git pulls and the code reviews, but the actual execution of all of our testing now takes place in the IBM cloud. We were able to make this transition easy because of how we manage everyone's laptops. When you come to our team and you start to work on Chef, you simply run a curl-bash command, and from that point on we take over the management of the development tools on your laptop. We install Git at the right level, we install Vagrant, VirtualBox, Berkshelf, ChefDK, all the things that you need to get your environment going at the right level. And this is great because it lets us make changes easily and transition things, like local testing to cloud testing, without a lot of impact or breakage.
And so we introduced kcloud. This was our effort to let you keep the local testing and also bring in the cloud testing, and what it really was was a gem we wrote called kcloud convert. It does a couple of things.
The first thing it does is read in your kitchen YAML. And I should back up a little bit: for kcloud convert, we have access to everyone's bashrc, so we added it to their environment. Instead of typing kitchen to do a test, you just type kcloud, and then under the covers we read in the kitchen YAML. We validate that you're only testing Red Hat Enterprise Linux. That's an implementation detail of ours: we've only moved our Red Hat stuff to the platform, so if you're on different platforms you still test locally, but on our platform, for our stuff, it's all Red Hat, so we validate that. We check the includes and excludes you might have, because you could be testing against only Red Hat 6 or only Red Hat 7, and we want to keep that across the cloud and the local tests. We delete the platforms stanza that is in the kitchen YAML and replace it with the one for OpenStack testing, and then we restore those includes and excludes so you still have them as you test. We also convert the hostname directive: if you're on Vagrant it's one setting and if you're on OpenStack it's another, so we just transparently move that along for you. And then, and this was only found as a learning lesson, we prepend all your suite names with "cloud". The reason we did this was that if you were testing both locally and in the cloud at the same time, say you want to make it work locally and then check it in the cloud, you'd get name conflicts, because the way Test Kitchen works is that the suite name identifies the resource. So you would test it locally, then go to kcloud, and it would think the server was not running. We solved that by prepending all your suite names with "cloud", and because we did that we had to set the verifier name directly. Test Kitchen infers the verifier name based on the suite name, so now that you've changed the suite name, we have to explicitly tell it what the verifier name is. Then we write that out to a kitchen YAML file on disk. This all happens in a matter of milliseconds, and it immediately prepares you to take what you're testing locally and instead test it in our cloud test setup.
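To make the conversion steps above concrete, here is a minimal sketch in Python. It is not the team's gem (which was written in Ruby and is not shown in the talk); it only mirrors the steps as described: validate the platforms, swap the driver and platforms stanza, keep each suite's includes and excludes, prefix the suite names, and pin the verifier name. The function name, the `rhel` prefix check, the `cloud-` prefix and the `verifier_name` key are all assumptions made for illustration, and the hostname conversion is left out.

```python
import copy


def convert_for_cloud(config, cloud_driver, cloud_platforms,
                      allowed_prefixes=("rhel",), suite_prefix="cloud-"):
    """Return a cloud-testing copy of a parsed kitchen config (a plain dict).

    The input dict is left untouched so local testing keeps working.
    """
    # Step 1: refuse anything that is not a Red Hat platform (assumed naming).
    for platform in config.get("platforms", []):
        if not platform["name"].lower().startswith(allowed_prefixes):
            raise ValueError("unsupported platform: " + platform["name"])

    converted = copy.deepcopy(config)

    # Step 2: replace the local driver and platforms stanza with cloud ones.
    converted["driver"] = dict(cloud_driver)
    converted["platforms"] = copy.deepcopy(cloud_platforms)

    # Step 3: includes/excludes live on each suite, so the deep copy above
    # already carried them over; only the names need adjusting.
    for suite in converted.get("suites", []):
        original_name = suite["name"]
        suite["name"] = suite_prefix + original_name
        # Hypothetical key: stands in for whatever setting pins the verifier
        # to the original suite name after the rename.
        suite["verifier_name"] = original_name

    return converted
```

Because the original dict is never modified, the same kitchen file can keep serving local runs while the converted copy is written out for cloud runs, which is the name-conflict problem the prefix is meant to avoid.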
And I kind of thought that we'd end up like: OK, I've done this work, I've got this gem, I've converted it, we're no longer testing locally, things will be really, really quick. And they weren't. It was very disappointing, as I thought I had found, you know, my golden ticket to winning everybody back. We were spending seven and a half minutes of our tests not actually converging anything. We had seven and a half minutes of time spent in the image boot, the create and the prepare for the test before we even got to converge, and seven and a half minutes is a long time to wait for nothing to happen. So we had to attack this, and we attacked it in 4 steps.
We attacked the create, the prepare, the converge and the verify, and those are the Test Kitchen steps that are out there.
The baseline for create was a minute and a half, which means you're typing something, you wait a minute and a half, and apparently nothing is happening. I spent a lot of time trying to optimize this, and I realized after the fact that I over-optimized it in terms of the amount of time that I saved. But to me this was the most important step, because it's the first one and they all depend upon it. If things don't immediately start coming back to your laptop as you're testing, the user's perception is that it's very slow. So if you can optimize this phase, even if it doesn't impact the overall test time all that much, user perception is much better and user experience is much better, and that's what I was trying to win back: the user experience.
So I broke my create optimization into 2 different steps, the launch and the boot. The launch was everything up until the first systemd message, when systemd writes a log entry that it's starting. The initial launch was 20 seconds, which wasn't bad but wasn't great. So I thought about it, and the first thing I did was: OK, I've got this big image, I will shrink it down. There were old kernels in there, so I set up my image build to remove those kernels, wrote a bunch of zeros over where files were deleted, which helps compression, and then ran virt-sysprep. For those unfamiliar, virt-sysprep prepares a VM by clearing things out and making it ready to be relaunched. And I was like, perfect, I've cut about 20% off my file size, my launch time is going to drop by 20%, I'll go home early, today's going to be great. And my image launch time was the exact same.
So what I learned is that the image size doesn't really matter that much. These images are cached locally on the hypervisor, and the OpenStack system uses copy-on-write, so it's never really copying the image so much as creating a clone and tracking the changes. So I learned: OK, don't spend any time on image optimization; where else can we get some wins here? I did notice that the round-trip time to look up the UUIDs used for the various OpenStack entities was taking 6 seconds. If you look at the image reference, the network reference and the flavor reference, for each one of those the Test Kitchen driver was calling OpenStack, saying "OK, resolve this name for me", and it comes back with the UUID. Then it does it again for the flavor, and then for the network. The systems are remote and you have the speed-of-light problem: each request was taking about 2 seconds, and they were happening serially, so we had 6 seconds before it actually told OpenStack "OK, launch the node". So what we did was submit a pull request, which was merged, that lets you hard-code the UUID. This took 6 seconds off of our image create, and this was important because that blocks every other optimization: until you tell your OpenStack system "I need a server", nothing else is happening, so it's important to get that part down as quick as possible. The other thing was GRUB. There's a timeout in there, the thing where, when you boot, you can press Escape to choose your kernel. That's great on a laptop or desktop; on a cloud system there's not a lot of benefit. It sat there for 5 seconds waiting for input that was never going to happen, which was another 5 seconds before it was doing anything useful for users. So we dropped that to 0.
Now it skips right past that; we bypass that timeout. So those 2 minor changes dropped our launch time in half, to 11 seconds, which is when the actual work can get started.
And then the boot. So out of the gate, and this is a Linux system, it was taking like 70 seconds. That's a really long time for a Linux box to boot, so I knew something here was not right; that's not a classic pattern.
I know some people don't like systemd as the init system, but systemd-analyze blame is really, really good for digging in and figuring out why the system is booting the way it is and what is taking time. It's a great tool for diagnosing problems.
What I found looking through our boot logs, actually, was that we had a configuration error, and the system and cloud-init were retrying multiple times. We had essentially set up our images, for when they were local images, to not have DNS managed by NetworkManager, which was great when they were local because we were hitting our own internal DNS servers, and all was great. But in a public cloud, it turns out, NetworkManager needs to do its job. So we reverted that change for these public cloud images and got back 25 seconds. That was a good chunk of the 69 seconds, but I knew we could optimize this further, so I went back in with systemd-analyze and kept diving in to make it better and better.
This was DHCP. Looking at the logs, it was taking like 9 seconds to get my IP. I was no DHCP protocol expert, so I went to Google, then Wikipedia, DHCP protocol, and there was a nice diagram: these are the steps DHCP takes to get an IP. And they were not the steps I was seeing in my logs. So I dug a little bit more and noticed that the system was requesting a specific IP. It wasn't asking DHCP "give me an IP"; it was like "I had this lease", and the DHCP server was like "I'm not talking to you". Then it would ask again, like "please?", and the server was like "I'm still not talking to you". It finally gave up and said "OK, give me a new IP", and it got one right away.
What was behind the behavior was essentially that it was trying to reuse the lease from when the image was created, which I found bizarre, because I ran virt-sysprep, which very loudly stated that it was clearing out all the DHCP leases.
It turned out that there is an incompatibility between our version of virt-sysprep and our version of Red Hat Linux. And rm -rf fixes all problems, so that was the way we went: we had a script remove the lease files, and that dropped the boot down to 15 seconds. So we're making some good progress on the optimization.
But I wanted better still. So, systemd-analyze again: I looked some more, and NetworkManager-wait-online was, to me, still taking a fairly long time. So we looked at what that service was doing. Its job is to block the system from booting, which sounds bad, until the network is stable, and it does this for good reasons: there are lots of services out there that get very upset when they come up and there's no network, or even more upset when there's a network and it disappears on them and a new one comes up behind it. It turns out, though, that sshd doesn't care, which is great, because that's the only network service we care about when we're doing tests.
So we mask that service so that it essentially doesn't run at all, and we bypass the whole wait for the network to come up. When sshd comes up and the network is there, it's well enough behaved to understand that, and that works fine for testing. I wouldn't recommend this for production systems. It's great for test systems, but don't do that in production.
So we dropped the boot by itself from 69 seconds down to just seconds.
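The boot diagnosis above relied on `systemd-analyze blame`, which prints one line per unit with the time it took to initialize. As a rough illustration of how that output can be ranked programmatically, here is a small Python sketch. It assumes the usual text layout (one or more time parts such as `1min`, `9.120s` or `250ms`, followed by the unit name); the function name is an illustrative choice and nothing here comes from the speaker's own tooling.

```python
import re

# Seconds per time suffix as printed by systemd-analyze.
_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6}
_TIME_PART = re.compile(r"^(\d+(?:\.\d+)?)(h|min|ms|us|s)$")


def parse_blame(text):
    """Parse `systemd-analyze blame` output into (unit, seconds) pairs.

    Returns the pairs sorted slowest first. Lines that do not look like
    "<time parts> <unit>" are skipped.
    """
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        unit, parts = tokens[-1], tokens[:-1]
        seconds = 0.0
        for part in parts:
            match = _TIME_PART.match(part)
            if match is None:
                break
            seconds += float(match.group(1)) * _SECONDS[match.group(2)]
        else:
            # Only reached when every time part parsed cleanly.
            entries.append((unit, seconds))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)
```

Feeding it captured output from a freshly booted test node gives a slowest-first list, which is a quick way to spot a unit such as `NetworkManager-wait-online.service` dominating the boot.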
And then the entire create process came down to 20 seconds. In that case you get a good user experience: you're typing kcloud converge and you're starting to see things on your screen within 20 seconds, so it's a much better experience for the user. We weren't done; there were other steps we had to optimize, and next was prepare.
Prepare was taking a very, very long time. Kitchen prepare is where Test Kitchen grabs all of your cookbook files, all of your environment data, your data bags and your roles, and sends them to the remote system for testing. This was taking over 5 and a half minutes, which was peculiar, as locally it took less than a second. So we had to look at the code, and the nice thing about open source is that you can figure out what's going on. It's essentially SCP under the covers, but in the version that we were using it was sending one file at a time. Our role cookbooks have over 1,100 files, so each one of those round trips to the remote system was adding up, to over 5 minutes of sitting there with no feedback in your console as to what's going on. So we looked around for what to do to resolve this, and we found a couple of plugins out there for Test Kitchen that really helped us along.
The first one was speedy SSH. What this one does is tar up those files: it tars up the cookbooks, tars up the data bags, tars up the role data, SCPs them over and then untars them. It dropped the time to 5.6 seconds, which was pretty good; you know, we were at 5 minutes and now it's 5.6 seconds. It could be a little bit faster, though: each part was its own connection, and each of those connections added round-trip time.
So with kitchen-sync we used the rsync transport, and it seemed to just do a single connection, sync everything, and it was done in 3.4 seconds.
So, bringing in those 2 elements, the optimized create and the optimized prepare, we have now gone down to a 20-second boot and a 3.4-second prepare, and then converge can begin, which was really what we were trying to do in the first place: converge and test the stuff. Now, on converge we didn't optimize across 100+ cookbooks to make them all as fast as possible. What we did was implement optimizations that would apply to all of our cookbooks in testing, and a lot of these aren't specific to remote testing; for local testing they could still apply.
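The prepare numbers above suggest a per-file cost of roughly 0.3 seconds (about 330 seconds spread over about 1,100 files), which is an inference from the talk's figures rather than a measurement. The sketch below, in Python, illustrates the bundling idea the speaker attributes to the speedy SSH plugin: pack a whole directory tree into one archive so it can cross the network in a single transfer. It is not the plugin's code; the function names are illustrative, and the transfer itself is deliberately left out.

```python
import io
import os
import tarfile


def per_file_overhead(total_seconds, file_count):
    """Average cost of one file when each file is its own round trip."""
    return total_seconds / file_count


def count_files(root):
    """Number of regular files under root, i.e. transfers needed one by one."""
    return sum(len(files) for _, _, files in os.walk(root))


def bundle_directory(root):
    """Pack everything under root into one gzip-compressed tar, in memory.

    The returned bytes are a single payload: one transfer instead of
    count_files(root) separate ones.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        # arcname="." keeps paths relative so the archive unpacks anywhere.
        archive.add(root, arcname=".")
    return buffer.getvalue()
```

On the receiving side the payload would be unpacked with the matching `tarfile.open(..., mode="r:gz")` call or a plain `tar -xzf`, so the number of round trips no longer grows with the number of cookbook files.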
The first one that we did was to utilize a tool called fpm-cookery. Is anyone here familiar with fpm-cookery, by chance? Go become familiar with it. It's a tool that lets you very easily make packages. What we would do is take all our enterprise software, which usually came with its own separate installer, either a Java installer or a shell script that you would run. Historically, when we started down our Chef path, we would write custom resources for all of these, and have to deal with the fact that the custom resource has to be idempotent, and deal with upgrades, and all this chore just to install software. But it turns out that installing software is a pretty well solved problem, and we don't have to reinvent that. So we turned to this tool, fpm-cookery, which lets you easily make packages from other artifacts. And because we manage everyone's development environment, we can give them a quick command to generate all the scaffolding required to build one of these packages. This is what it looks like: it's a simple script that does a clean, an install, and then runs the actual package command.
Inside of a file we do a couple of things. As I mentioned, we sync over some required directories. Recipes is where your code goes that explains how you make a package. Staging is where your binary artifacts are stored, and we store those in source control under Git LFS for large storage. Packages is the output of the recipe; in our case we're building RPMs, but if you're a Debian shop and want debs, you run it on Ubuntu to build and it works transparently. Then we also run a Chef recipe to set up the server, to give it Git and Ruby and everything else it needs to build these packages, and then it executes the script. And the recipe is really simple. Here's an example recipe; this is how we package a tool we use, uDeploy, the client portion. We point to the zip file to tell it the source, and fpm-cookery will automatically go and uncompress it for us. We tell it it's an omnibus recipe, for which we give a directory, and all that means is that it's just going to take all the content of that directory and put it into a package. We set some metadata such as the version, the revision, stuff like that. But this is the actual recipe here that does all the work to take this piece of software and turn it into an RPM: we essentially copy over a properties file, then we run the install.sh, and we're done. As for the end result: for us, running on Red Hat machines, we get an RPM; if you ran it on, you know, Ubuntu, you'd get a deb. It automatically uses the proper packaging tools.
And that saves a lot of time, because when you are downloading large software packages and then spinning up Java to install them, or whatever method they're using, that just ends up in your test time, and a lot of these packages run across all of our systems, such as our monitoring tools, so it was important to optimize that side of it. Now, the first time on a Red Hat system that you install a package with Chef, what happens under the covers is that yum needs to make a cache. It goes and downloads all the repo data from all its known repos and creates a binary database, and from there it can quickly check which versions are available and where to download them. But it takes about 15 seconds to do that, so you pay this fifteen-second tax on the first package you install on every single test. So we found a way around that tax: we turned to cloud-init. If you're familiar with OpenStack and other cloud providers, there's a tool called cloud-init that's run when the node boots for the first time, and you can pass it some data that it will execute. So we write out a simple bash script calling yum makecache, and then we tell cloud-init to run this script using at, and to do it now, which means it returns immediately. The reason we do that rather than running it directly is that if we ran makecache in cloud-init, all we would do is slow down the cloud-init time by 15 seconds. This way it returns immediately, and in the background, while the node is finishing booting, while Chef is doing the resolution of cookbooks, while the kitchen prepare is happening, the yum cache can be made. These are multiprocessor systems, so we have CPUs to spare. So it's a free optimization, and it takes 15 seconds out of every single test, time where you would otherwise just sit there at the first package wondering what is going on. So that was a good general fix that we had.
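The background cache warm-up described above can be sketched as a small Python helper that renders a cloud-init user-data shell script. This is an illustration under stated assumptions, not the team's script: the helper name and the script path are made up, the `at` daemon is assumed to be installed and running in the image, and `yum makecache` is simply the command named in the talk.

```python
def render_user_data(command="yum makecache",
                     script_path="/usr/local/bin/warm-cache.sh"):
    """Render a user-data shell script that queues `command` in the background.

    cloud-init runs the script on first boot. The script only writes a
    helper file and hands it to `at`, so cloud-init itself is not delayed
    while the command runs. script_path is a hypothetical location.
    """
    lines = [
        "#!/bin/bash",
        "# Write the slow command into its own script.",
        "cat > {path} <<'EOF'".format(path=script_path),
        "#!/bin/bash",
        command,
        "EOF",
        "chmod +x {path}".format(path=script_path),
        "# Queue it with at so this user-data script returns immediately.",
        "at -f {path} now".format(path=script_path),
    ]
    return "\n".join(lines) + "\n"
```

The same helper could render the verifier warm-up mentioned later in the talk by passing a different `command`; which gems to pre-install is not stated in the source, so that command is left to the reader.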
And then data locality. Moving our testing to the cloud was great because all of our binary artifacts were also right there next to it, in object storage. When you're on your laptop, like here, if I go to install an RPM I have to download it from somewhere, and some RPMs are over a gigabyte in size, so it can take a lot of time to pull that down before you install. In the cloud the package was right there, so we got these quick wins for all our packages: converges would happen just like that.
The last phase that we spent some time on was the setup of the verification. The verify phase for Test Kitchen installs a couple of gems and then it executes your tests. So we essentially did the same trick with at: we wrote out a script to install the gems needed for the verification and then ran it with at now, and we saved 5 seconds. So with this optimization, about 50 percent of that time was saved.
Overall reduction: we started at about seven and a half minutes of non-converge time, the time to prepare to converge, and got down to less than 30 seconds. That gave us a really good overall user experience, and it was really one half of the benefit of cloud-based testing.
The other part was that it also allows for concurrent testing. We have recipes and cookbooks that can have like eight test suites across 2 different operating systems, so that's 16 test suites to run and 16 VMs to spin up, and some of them were large, so we had tests that would take 4 hours to push out across all those platforms locally. The benefit of doing it remotely is that we do that in parallel, or concurrently, for free; there's no slowdown doing 16 versus 1. If you tried to do 16 on a laptop it would crush it. You could do it, but we could take things that were taking 4 hours and have them take minutes, just by doing them in parallel and doing them out on remote resources.
So we had to write a couple of tools to support this testing. The first one was that we built a kill switch into our test servers using cloud-init. Again, if you test locally and forget to destroy your instance, it's not a big deal, because you'll remember at some point in time, but if you do it in the cloud you're now blocking the resource for someone else. So the kill switch is set to run 30 minutes after the node is created. It's just using an at job again, and it kills the node from within itself. That was great, except people were like "I was working on this node and it went away in the middle, because for some reason or other I got interrupted". So we had to add a reset to this. Any sort of login into the test system, whether it's a Test Kitchen converge, a verify, or you logging in yourself to look at something, kills that old at job, starts a new one and resets the timer. That works very well for us.
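The kill-switch behavior described above is a resettable idle timer. The Python sketch below models only that logic: a deadline set at creation, pushed back on every login, and a check for whether the node should be destroyed. The class and method names are illustrative; the real mechanism, per the talk, was an at job on the node that gets cancelled and rescheduled.

```python
from dataclasses import dataclass, field

IDLE_LIMIT_SECONDS = 30 * 60  # 30 minutes, the limit named in the talk


@dataclass
class KillSwitch:
    """Model of a self-destruct timer that any login pushes back."""

    created_at: float
    limit: float = IDLE_LIMIT_SECONDS
    deadline: float = field(init=False)

    def __post_init__(self):
        # Equivalent to scheduling the first kill job at creation time.
        self.deadline = self.created_at + self.limit

    def expired(self, now):
        """True once the node should destroy itself."""
        return now >= self.deadline

    def record_login(self, now):
        """Cancel the pending kill and schedule a new one from `now`.

        A login after the deadline changes nothing: the node is already gone.
        """
        if not self.expired(now):
            self.deadline = now + self.limit
```

Times are plain seconds (for example from `time.time()`), which keeps the model easy to test without waiting for a real timer.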
This is tooling that I wrote to help find hot spots for optimization, and it works both remotely and locally, depending on what you're doing. I had hoped to open source this for today. I have it ready, but I'm still in the legal approval system; once I get that, I will fully open source it. It's useful if you want to optimize your local or your remote test runs. It has 3 analysis tools built into it. The first one lets you analyze the create, and splits it between the boot and the image launch. The second one gives you recipe timings, so you run it and see which recipes take the longest across one or more of your test runs. And the last one is the duration of the individual steps: it just takes all your steps, sums them, and tells you which ones take the longest. If you run it across your role cookbooks you can dive in and say "I can spend some time here and get a good optimization for all my tests". This was very helpful, because we had one package that was not built with fpm-cookery; it was a handcrafted, artisanal, hand-made package, and the default for that package was LZMA2 compression, which was great in that it was small. But it was also a gigabyte-sized package, so it took a minute and a half to uncompress. We changed that to gzip, and it went up by 15% in size but installed in 17 seconds. So the data locality I talked about sort of got rid of any benefit of smaller compression.
Right, so I have a lot more details on my blog, where I have the step-by-step optimizations for how we optimized the various phases, and also on Twitter. Questions? No? Thank you.