Testing in Production
Formal Metadata
Title | Testing in Production
Title of Series | RailsConf 2018
Number of Parts | 88
Author | Hammerly, Aja (thagomizer)
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/37341 (DOI)
Language | English
Production Year | 2018
Production Place | Pittsburgh
Transcript: English (auto-generated)
00:14
testing in production. Awesome! Cool. I'm Aja. I'm thagomizer on Twitter, I
00:22
am thagomizer on GitHub, and I blog at thagomizer.com. I did not post the slides yet, but I will post them immediately after the talk; I am a bad presenter. And I really, really like dinosaurs, so Pittsburgh has been amazing. I landed at like 2 a.m., after like two hours of delay in Chicago,
00:41
because it was snowing, and going down the escalator there's a dinosaur there. It was amazing; I love this city. I work on Google Cloud Platform; I am a developer advocate. If you're interested in Google Cloud, Kubernetes, other things like that, I'm happy to answer questions and I have plenty of opinions. But you don't have to ask me; we've got seven of us here, and we have a booth down in the vendor hall, so you can come say hi. I think we might be out of fidget
01:03
spinners, but I got a couple that I stored away in my box over here, so afterwards you can get one from me. And we're here because Google loves Ruby. We love Ruby; it's a group of Rubyists who work on our Ruby support, and I love my Ruby community. So, victory conditions for my talk: these
01:22
are the things that I want you to be feeling or thinking when you leave. First of all, I want folks in this talk to feel comfortable with testing in production. The first time I heard the phrase "testing in production" I had a slightly less polite version of "dear heavens, no," but the more I thought about it, the more I realized that this is actually a good thing, and
01:42
it isn't scary, because in many cases you're already doing it, you just may not be aware of it. So in addition to being comfortable with the idea of testing in production, I want you to walk away from this talk with the ability to be a bit more intentional about your testing. I'm going to do quick definitions. First definition is production: production is
02:02
any environment that is not pre-production. A second definition, testing: for the purposes of this talk, testing is what we as developers call verifying our expectations. Yes, I did just use "expectation"; yes, I am a minitest
02:21
user. It's okay: if it makes you feel better, you can think "verifying behavior" instead of "verifying expectations." So, we're Rubyists, we test all the time. One of the things I love about this community is that we test, we test a lot. You wouldn't dream of pushing a gem without at least a couple tests, not to mention documentation. We're good at testing, and we have
02:43
all of our great test frameworks, where we set up a scenario, do some verification, and then hopefully clean up a little bit, with a call to our method under test somewhere in there. These are "traditional tests," as I'm going to call them. But there's also a huge category of black box testing, which is where I got my career started: I was a black box web tester doing manual testing. You can
03:01
do black box tests automated; we do it a fair amount now. All of this is still testing, and I'm bringing this up because we're gonna use both techniques for testing in production. So why should we test in production, isn't that naughty? The answer is because a real environment gives you real bugs. You find stuff that you just can't find in your pre-prod environments. For
03:24
example, production is where you have real user load. While load testing is awesome, and I highly recommend it, most load testing frameworks I've worked with can't actually simulate real user load, because humans are fantastic entropy machines. Let me tell a quick story. I was talking to some of my
03:41
co-workers last week, and I was objecting to a form I had to fill out for another conference I'm doing, and one of my co-workers is like, yeah, whenever I see something I'm not quite okay with, I find a way to hack around it. She's like, I was signing up for a bike share, and for reasons I don't understand they wanted to know if I was male or female. So I opened up the form and realized they're storing male as one and female as two, and I managed
04:01
to convince the server on the other end that I was a four. Humans are fantastic entropy machines. Other things you can only do in production: you can test your integrations. Who here uses a billing service of some sort? Audience participation is okay. Okay, your billing provider probably has a test
04:21
gateway or a test API endpoint that you can hit when you're testing. They probably also provide some test credit cards that you may or may not be able to use against their production endpoint. But how often do you point your staging environment at the real production gateway and use a real credit card to run a billing transaction through? Whenever I've built something that took credit cards, we did that once or twice before initial
04:42
rollout, but we didn't do it on a regular basis after that. So if we wanted to test that we were actually integrating with this third party correctly, we had to do it in prod. If you don't have a billing service, maybe you have third-party storage, you use some cloud storage; maybe there are other services like an image processing service or OAuth that you're using. Make sure you're testing those, and frequently the only place you can
05:03
test them for real is prod. Or maybe you don't use third-party services, but you work on a large team building a huge app, and your team builds one microservice and there are other microservices built by other parts of the company. When those come together, that's a seam. How often do you test your seams? How often do you run an integration test across all of this?
05:21
One of my most frustrating moments at a previous job: we were three days before big GA, we're gonna go, you know, actually to production with some new stuff, and we had two teams, client and server. We hadn't tested that the two pieces could talk to each other, and out of curiosity I spun it up, and the first thing it did was crash hard, because the person who was
05:40
orchestrating and running the client-side team and the person who was orchestrating and running the server-side team had a misunderstanding in the protocol that they had developed between the two, and so it just exploded. So you have to test your seams. I would hope you test them before production, but sometimes sneaky bugs can get in, and testing your seams in production is also valuable.
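(A minimal sketch of a seam check you could run on a schedule: make the client side actually talk to the server side. The URL, endpoint, and "protocol_version" field here are all hypothetical stand-ins.)

```ruby
# Seam check: verify the client and server agree on the protocol.
require "net/http"
require "json"
require "uri"

EXPECTED_PROTOCOL = "2" # what the client side was built against

response = Net::HTTP.get_response(URI("https://internal-api.example.com/version"))
abort "seam broken: HTTP #{response.code}" unless response.code == "200"

info = JSON.parse(response.body)
unless info["protocol_version"] == EXPECTED_PROTOCOL
  abort "seam broken: server speaks #{info['protocol_version']}, client expects #{EXPECTED_PROTOCOL}"
end
puts "seam ok"
```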
06:04
Who's heard the term heisenbug? For those of you who don't know it, heisenbugs are those bugs that can only be reproduced in production, by that one really important client at that one really important company. Maybe it's an artifact of their network or their browser or some sort of security thing they have, but testing in prod allows you to find these. Same company as the last story: we had a very important
06:24
client who's like, so it mostly works, but, like, we do this one thing and something weird happens; it doesn't crash, but it doesn't seem right. We spent about a month and a half debugging with them remotely, and finally we picked up laptops and went and took a site visit about an hour away. We get
06:41
there, and they're like, well, what are you planning on doing? We're like, we're gonna run a, you know, network speed test, because it appears that you're not getting the full download. And they're like, oh, that's not gonna work, because we cut off any download greater than a specific number of kilobytes. We're like, oh. But we wouldn't have found that unless we had been testing in prod. So the second thing about testing in prod: I heard
07:01
this at a meetup about nine months ago, my favorite meetup in Seattle (I live in Seattle), CoffeeOps. Someone's like, hey, I want to talk about testing in prod today. Awesome, what's testing in prod? And this person goes on and on, and I'm thinking I'm gonna learn new stuff here. And they're talking about monitoring, they're talking about logging, talking about tracing, talking about blue-green deployments and canaries. And I'm like, oh, there's
07:22
nothing new here. This is stuff that's been in common use since the 60s in many cases. Everything I talk about today is techniques that I have seen in use since I got started in tech in 2002, so I guess that means I'm old now, so preemptively I'm telling you all to get off my lawn. So I've talked a
07:42
little bit about the background, but I haven't talked about the how. To keep myself on track, because I'm going to talk about a lot of techniques, I'm gonna throw a lot of words at you. I'm not gonna give you a ton of specifics, but I'm gonna give you enough that you know which ones are interesting and know what to search for if you want to find out more. I've divided this talk into four sections: deployment testing, user-focused testing, re-
08:04
using tests, and my favorite one, implicit testing. Let's dive in: deployment testing. The first technique I'm going to talk about is canaries. A canary is just a phased rollout, where you roll out your release gradually to some of your servers at a time, over a course of minutes, hours, days, or even weeks. You
08:22
have a subset of your users or a subset of your servers that's going to receive the new code. Once you've rolled it out, you monitor vigorously for things like errors, memory, disk, but you also might want to monitor user-based metrics, like your free trial conversions or purchase path completion. If everything is thumbs up, you expand the canary group, and you
08:42
keep doing this: release a bit more, monitor, expand, until you've rolled out your new release to all of your servers. So that's all great, but how do you choose your canary group? You can use internal users; sometimes we call this dogfooding. You can push out to people who don't have a choice but to use your new version and find all the other bugs in it. You can just
09:03
choose randomly: I've got 600 servers, 600 containers, I'm just gonna choose some of them. You can do it geographically; this is how a lot of folks do it. They're like, okay, we're gonna start with a small percentage of the servers in US West, then we're gonna do that entire data center, then we're gonna go to US East, then we're gonna
09:20
go to Europe, then we're gonna go to Asia. You can do it based on a demographic: maybe you only want to roll this out to users who are new, or users who log in 18 times a day and you're not quite sure why they're using your product so much, but yay. You can also ask users to sign up to get access to stuff early; we're gonna get into that a little bit later, but you can also use it for canaries. And the cool thing is you can pick as many
09:42
of these as you like; you can use any sort of slice-and-dice combination. The goal is you start with a small group and you roll it out gradually, to make sure that whatever you're doing is not toxic and does not take down your environment.
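(A minimal sketch of that expand-and-monitor loop in plain Ruby; `deploy_to` and `healthy?` are stubs standing in for whatever deploy tooling and monitoring you actually use.)

```ruby
# Gradual canary rollout: deploy to a growing slice of the fleet,
# soaking and checking health between each expansion.
CANARY_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00].freeze # fraction of servers

def deploy_to(server, release)
  puts "deploying #{release} to #{server}" # stub: call your deploy tool
end

def healthy?(server)
  true # stub: check error rate, memory, disk, conversions, etc.
end

def canary_rollout(servers, release, soak: 15 * 60)
  CANARY_STAGES.each do |fraction|
    group = servers.take((servers.size * fraction).ceil)
    group.each { |server| deploy_to(server, release) }
    sleep soak # minutes here; hours or even weeks in practice
    return :rolled_back unless group.all? { |server| healthy?(server) }
  end
  :complete
end

canary_rollout(%w[web1 web2 web3 web4], "v2.1.0", soak: 1)
```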
10:02
The second deployment strategy is blue-green deployments. All you have is two copies of prod, two copies: one is blue, one is green. In this case the blue is live and the green is idle; one is always live, one is always idle. When you want to roll out new code, you deploy it to the idle side, in this case the green. Once it's up and running, you have your new code on your idle side and your old code on your live side, and you start routing traffic to the new code. So now we end up switching live and idle, and you've done your
10:25
deployment. The nice thing about this is if something goes wrong, it's an easy rollback, because you have the previous known-good version that was live just a couple minutes ago; you can just swap whatever router rule you used to move your traffic, and move it back.
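(A sketch of the flip itself; `FakeRouter` is a made-up stand-in for your load balancer's API.)

```ruby
# Blue-green: deploy to the idle color, then point traffic at it.
class FakeRouter
  def deploy(color, version)
    puts "deploying #{version} to #{color}"
  end

  def point_traffic_at(color)
    puts "traffic now goes to #{color}"
  end
end

class BlueGreen
  COLORS = [:blue, :green].freeze

  def initialize(router)
    @router = router
    @live = :blue # green starts idle
  end

  def idle
    COLORS.find { |color| color != @live }
  end

  # Deploy new code to the idle side, then flip it live.
  def release!(version)
    target = idle
    @router.deploy(target, version)
    @router.point_traffic_at(target)
    @live = target
  end

  # Rollback is just flipping back; the previous release is still warm.
  def rollback!
    target = idle
    @router.point_traffic_at(target)
    @live = target
  end
end

bg = BlueGreen.new(FakeRouter.new)
bg.release!("v2.0") # green goes live with the new code
bg.rollback!        # blue, still on the old release, goes live again
```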
10:40
Depending on how you do your blue-green, it might also be really good for disaster recovery. If your blue and green are in different parts of the same data center and you have a partial data center power outage, which I have been through multiple times, you might be able to move traffic to your other half of prod, because you've got two copies of everything. It's fantastic. Having two copies of everything, though, is not always great: doing databases with blue-green deployment is kind of a pain. So don't use databases, or maybe just leave your databases out of your
11:08
blue-green clusters. If you want your databases to be part of the system, you can do things with snapshotting and replication, but depending on exactly your database setup, and how good you are at setting up your databases, and how
11:21
often you're writing, you may have a little bit of a blip as you flip which one of your databases is the replica and which one is the main one. Or you can use a non-relational database; a lot of the problems with relational databases and replication and such are solved if you use a non-relational database. When I started my career we
11:43
used a variation of this technique when we did rollouts. We divided our server cluster in half, an A half and a B half, and we would deploy to A, test it behind a firewall, and once it was good we would route all the traffic to A, and then we deployed to B. This was not true blue-green, though, because we couldn't actually run the site successfully at peak load on just
12:03
half of the cluster; we had to have at least two-thirds of it up. So we could only use this technique late at night, and we didn't have the hot-swappable backup at all times. But testing in prod: do what works for you. Both of these techniques, plus many of the others I'm going to talk about, work well in conjunction with auto rollback. In auto rollback you have some
12:22
predetermined metrics, and if you ever hit those thresholds, the condition is tripped and your deployment system automatically rolls back to a known-good release. To do this you have to make sure stuff is scripted, but I'm hoping most of you have scripted your deploys at this point. When I started, I was releasing based on a 34-point printed-out checklist, so I hope you guys are doing better than I was.
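(A sketch of the trip-wire, with stubbed-out metrics; the threshold names and values are examples, not recommendations.)

```ruby
# Auto rollback: predefined metric thresholds trip a rollback to the
# last known-good release. `current_metrics` and `rollback_to` are
# stand-ins for your monitoring and deploy tooling.
THRESHOLDS = { error_rate: 0.05, p95_latency_ms: 500 }.freeze

def current_metrics
  { error_rate: 0.01, p95_latency_ms: 220 } # stub: query your monitoring
end

def rollback_to(release)
  puts "rolling back to #{release}" # stub: scripted, not a printed checklist
end

def watch_deploy(known_good, checks: 10, interval: 60)
  checks.times do
    metrics = current_metrics
    tripped = THRESHOLDS.select { |name, limit| metrics.fetch(name) > limit }
    unless tripped.empty?
      rollback_to(known_good)
      return :rolled_back
    end
    sleep interval
  end
  :ok
end

watch_deploy("v1.9.3", checks: 1, interval: 0)
```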
12:42
If you're going to do auto rollback, you need to be very conscious and careful about your data and database migrations. If you go back, will you lose important data? Will the old code actually work against the new schema? These are things you need to consider. Is anyone still using session affinity, or sticky sessions? Okay, I figured there were still
13:03
a couple of us out here. If you're using WebSockets, it's really hard to get around sticky sessions and session affinity. If your user has to hit a specific server because that's where their connection is established, how are you going to deal with that when that server goes away? Things to think about. The biggest thing you can do is separate your data migrations
13:20
from your code pushes: push the code, make sure the code can work with both versions of the schema, then do the data migration once everything's stable, then push code that can only work with the new version. It's a really common pattern; lots of us have been doing it for years, again, get off my lawn. But it's an important thing to know, because it's not the way you're taught when you're doing Rails as a newbie.
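(A Rails-flavored sketch of that pattern for a hypothetical column rename, users.name to users.full_name: ship step 1, migrate in step 2, and only then ship code that drops the fallback.)

```ruby
# Step 1: deploy code that works with BOTH versions of the schema,
# so an auto rollback (or a slow migration) is always safe.
class User < ApplicationRecord
  def display_name
    if self.class.column_names.include?("full_name")
      self[:full_name]
    else
      self[:name] # old-schema fallback; deleted in step 3
    end
  end
end

# Step 2: once everything is stable on the step 1 code, migrate.
class RenameUsersNameToFullName < ActiveRecord::Migration[5.2]
  def change
    rename_column :users, :name, :full_name
  end
end

# Step 3 (a later deploy): push code that only knows about full_name.
```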
13:44
Second section: user-focused tests. These are things that test the user experience, and you're like, I'm a developer, that's not testing in prod. It totally counts; you're just testing something other than the underlying stability and correctness of your code. Who's done A/B testing? Hey, people test stuff in prod, it's fantastic. A/B
14:03
testing is just an experiment. You have a control group and you have some number of experimental groups, and you run the users through different experiences. When you have enough data to be statistically valid (law of big numbers and all), you figure out if there are significant behavioral differences between the groups, and you decide which one you're gonna go with. It's different than
14:22
blue-green because both are live at the same time. Blue-green, remember, one is always live and one is always idle, but in an A/B test they're both live at the same time, which means you have some interesting things with data integrity.
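(A sketch of one way to split live traffic deterministically, so a given user always lands in the same group; the experiment name is made up.)

```ruby
# Deterministic A/B assignment: hash the user id so a user sees the
# same variation on every request while both are live.
require "digest"

VARIATIONS = %w[control experiment_a experiment_b].freeze

def variation_for(user_id, experiment: "checkout_copy")
  bucket = Digest::SHA256.hexdigest("#{experiment}:#{user_id}").to_i(16)
  VARIATIONS[bucket % VARIATIONS.size]
end

variation_for(42) # same answer every time for user 42
```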
14:41
Another way of doing user-focused testing is betas and EAPs. For those who haven't heard the term (because I hadn't before I started working at Google), an EAP is an early access program; it's like a beta, but usually before a beta, though not an alpha. These give you an ability to test the stability, and more specifically the usability, of something you're about to push, because nothing finds edge cases the way users find edge cases. But it's
15:01
important, if you're gonna run one of these programs, that you give users enough time. I know folks are like, we had a beta for like eight whole hours. No, not a beta. You need to give people multiple weeks in many cases, so that they can use your product over time and make sure that it works for all of the scenarios they do, not just kind of glance at it and say, hey, I like the new colors. And you need to make sure that your expectations are clear: if
15:22
there's an expectation that someone who participates in your EAP is going to give a specific amount of feedback, you need to make sure that's clear up front. You also need to tell them where the known issues are; every beta has got some edges, some places where we know stuff's broke. Tell them about that ahead of time, because you don't want to have 19 bug reports that are all the same. Third section: reusing tests. There was a fantastic talk on
15:44
Monday about checkups, and this is similar to that content, but I have stories that are different because I've done it as well. The thing I really like about this is that each and every one of you can do it. Running a usability test or a beta is going to require the cooperation of many other people; changing your deployment process, unless you hold the keys to deployment,
16:02
is going to require the cooperation of many other people. You can do everything in this section without talking to anyone; it's awesome. So the big thing is to run smoke tests against production. Another story: I was working at a relatively large company that was not my current employer, and I had been doing manual testing, but I got permission to start doing some basic
16:25
automated testing with a really, really clunky record-playback tool. Record-playback tools make really, really brittle tests, but, you know, better than nothing; I'd rather not run that same test manually 15 times a day. And I was sitting there one day and I'm like, hey, I've got these extra servers in my
16:41
office, because I was running the test lab, and because I used to get cold they kept it warm. And I'm like, hey, I could run these smoke tests against production, right? You know, I would hope that they never fail. So I set it up: set them to run every four hours on a cron, set it up to email me if it failed, and then, you know, let it go. And it worked for a couple days and I was really excited, and then I promptly mostly
17:01
forgot that this was happening. I come back from lunch one day about two months later and I have an email saying it failed. I'm like, there's no way it failed; if this was actually down for 30 minutes, someone would have noticed. So I go run the test manually, and actually it had failed: one of our suppliers was not sending all the information we needed to us
17:22
when we made a request. Normally monitoring would catch this, but it was something along the lines of: they were sending back a response, it's just that the response body was empty. So we were getting 200s, not failures, which meant that it didn't get caught by normal monitoring, but it did get caught by this test. So I'm like, hey, this is broken, and we managed to contact the third party that
17:41
we were using, have them fix their thing, and make sure our stuff was still working, and we managed to do it all in a couple hours, before anyone noticed. We wouldn't have caught this bug without a user noticing unless I'd been running these smoke tests, the ones I normally used for releases and day-to-day testing, against production. And no one knew that I had set that up; I just, you know, had a server, might as well. I've been using the term smoke test:
18:02
for folks who don't know, a smoke test is a super simple test of the core functionality of your product. It comes from the idea of "where there's smoke, there's fire," or, "if this fails, something is on fire." And I personally believe that even really big, complicated products will have relatively few smoke tests. Everywhere I've worked we've kept it under six; I would imagine
18:20
almost everyone can keep it under a dozen, because you're just testing the very basics. So if you're gonna do this: pick a subset of your existing tests; you probably have something that you would consider a smoke test in your integration suite already, so just reuse it. Set it on a schedule, every n hours, once a day, once a week, whatever makes sense. Focus on things like your third-party integrations and the absolute core functionality of your product.
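(A sketch of what a handful of production smoke tests might look like with minitest; the host and paths are placeholders, and the empty-body check echoes the supplier story above.)

```ruby
# A few smoke tests pointed at production; run them on a cron
# (say, every four hours) and have failures email or page you.
require "minitest/autorun"
require "net/http"
require "uri"

class ProductionSmokeTest < Minitest::Test
  BASE = "https://www.example.com" # placeholder: your production host

  def get(path)
    Net::HTTP.get_response(URI.join(BASE, path))
  end

  def test_homepage_is_up
    assert_equal "200", get("/").code
  end

  def test_supplier_integration_returns_data
    response = get("/search?q=dinosaur") # placeholder path
    assert_equal "200", response.code
    # A 200 with an empty body sails right past status-code monitoring.
    refute_empty response.body
  end
end
```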
18:44
And I'm gonna point out here that if you use something like the VCR gem when you normally run your tests, so that the tests run faster and you don't make requests against a third party, consider not doing that when you're running these tests against production, because you're not actually testing your integrations if you're faking out the integration part of it. And the big thing is: leave no trace. Your tests must absolutely leave no trace; ideally you want them to clean up after themselves.
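(On the VCR point: a sketch of flipping cassettes off when the suite targets production, behind a hypothetical SMOKE_PROD flag.)

```ruby
# test/test_helper.rb (sketch). Normally VCR replays cassettes so tests
# are fast and offline; against production we want the real requests.
require "vcr"
require "webmock"

VCR.configure do |config|
  config.cassette_library_dir = "test/cassettes"
  config.hook_into :webmock
end

if ENV["SMOKE_PROD"] # hypothetical flag for production smoke runs
  VCR.turn_off!(ignore_cassettes: true) # really hit the third parties
  WebMock.allow_net_connect!
end
```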
19:04
Because of this, most of my smoke tests don't do purchases. Everywhere I've worked where I've been involved in doing the database schema, our purchase database has not allowed updates or deletes, only subsequent writes, so if I did a purchase I couldn't delete it. If you have a system like that, make sure that you have a way of not doing
19:22
purchases, or if you do purchases, a way to flag them, because you don't want "I've been running this test every minute and all of a sudden we are making tons of money" to show up in your reports. On to my last section, and I'm gonna call this next section controlled breakage. So basically, in controlled breakage you
19:41
want to purposefully and deliberately break various parts of your system: take servers down, pretend that the disk went bad, pretend that your network pipe got really, really small. And what are you testing in this case? You're testing your ability to respond and recover. Is your system supposed to be self-healing? Does it? Or, if the person carrying the pager is supposed
20:04
to detect these types of errors and address them, do they? I really like this testing. I did start my career in test; I fundamentally love breaking things. It is fantastic and wonderful and is one of my favoritest things.
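(A deliberately tiny sketch of the idea; INSTANCES and `terminate` are stand-ins, and the interesting part is what you watch afterwards.)

```ruby
# Controlled breakage, smallest possible version: pick one in-scope
# instance at random and terminate it, then watch the recovery.
INSTANCES = %w[worker-1 worker-2 worker-3].freeze # stay in agreed scope

def terminate(instance)
  puts "terminating #{instance}" # stub: call your orchestration API
end

victim = INSTANCES.sample
terminate(victim)

# Now the real test: did the system self-heal, or did the on-call
# person detect and address it the way the runbook says they should?
```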
20:21
So the first time I got permission to do this, I went nuts. I found all sorts of stuff that was completely and utterly busted, writing up all these bugs, and then they all started coming back as won't fix, won't fix, won't fix. Because, just like security, durability is something where you can never be absolutely durable; you can be more durable or less durable, but it's always a trade-off between durability and the amount of engineering time you want to dedicate to it, which is a proxy for cost. And it doesn't make sense to be
20:43
durable against, you know, four lightning strikes in a row that hit your server directly, because it's not a realistic scenario for most people most of the time. So if you're going to do this, stay in scope: stay in the scope of stuff that makes sense, stay in the scope of stuff that you and your
21:00
team have agreed you should be able to respond to. I can't talk about controlled breakage without mentioning Netflix's Simian Army and Chaos Monkey; they're open source, go check them out, they're cool. And we actually do this kind of testing, this controlled breakage testing, at Google; we call it DiRT, disaster recovery testing. I have not participated
21:21
as an engineer in that process, but I found a fantastic talk that, if everything worked correctly, should be tweeted under my Twitter handle already, and you should go watch it. It's by one of the SREs who started the DiRT process, and it has some fantastic stories about things that they accidentally and on purpose did to test the disaster recovery and durability of Google.
21:41
Related is penetration testing. Who's been able to do some pen testing? I got to do some about six months ago and it was awesome. I got a week of time and I was just hacking against stuff; it was fantastic. It's really, really fun to put yourself in the mind of an evil adversary, you've got like your curly mustache, you know, horrible hat thing
22:01
going on, doing some rockin' villainous stuff there. And it's totally another form of controlled breakage: you try to figure out the kinds of mistakes that you likely have made and figure out if you've patched against them. But since we're talking about this, I'm going to talk about the fact that controlled breakage needs to be ethical breakage. DHH touched on this in his keynote:
22:20
we have the power for both good and evil, so make sure you're using your power for good. Think carefully about the potential impacts of your choices on your users, on your company, on your job. You want to make sure that the choices you are making are reasonable and ethical. Every time I've worked on penetration testing or talked to folks about it, there's always rules of play; frequently, for big exercises, there's also a proctor who can make sure that
22:44
you are playing fair and you are being ethical in what you're doing. My last form of testing in production is disaster recovery and verification. Who has a DR plan? Who's tested it in the last year? So, congratulations, you guys
23:00
have successfully tested in production, and you're doing better than the vast majority of the audience. Disaster recovery is when you make a plan for your data center catching on fire in a way that you can't predict. I did a talk at RubyConf in Cincinnati about the time that they were replacing pieces of the power conditioners and the power conditioners at the data center caught on fire. We were down for an hour, and then we ran on diesel for 11
23:22
days. It was an error that the supplier of the power system had never, ever seen before; it was not supposed to be able to happen. So, disaster recovery is how you're planning on dealing with things like that, and this for real needs to happen in production. If you haven't tested this plan in production, you haven't tested it, because by its very
23:41
nature, your disaster recovery plan is for when something bad happens in production. You need to move traffic to another cluster, maybe you need to move data between data centers, maybe you need to restore databases from a backup. I accidentally deleted, well, accidentally corrupted a production database at 11 p.m. at night once, because I ran the feature branch
24:01
migrations instead of the trunk migrations against it. Yeah, that was fun. Luckily I had taken a database backup right before I did that, so I was able to restore from the backup, and I knew how to restore from the backup because I had actually been practicing that on a regular basis. So I was able to do it without thinking, because I was freaked out; I was like, oh god, they're gonna fire me, they're gonna fire me, they're gonna fire me. But it
24:21
all ended up being okay. As part of DR you want to make sure you're testing scripts, all the scripts that do network migrations, database restores, all that, but you're also testing your people. And everyone's like, testing people isn't testing. Ha, there's my minitest test for testing people right there.
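(A sketch of a restore drill you could practice on a schedule, assuming Postgres tooling and a hypothetical backups directory; the point is the practice, so the restore is muscle memory at 11 p.m.)

```ruby
# Restore the most recent backup into a scratch database and sanity-
# check it. Paths, names, and the choice of Postgres are assumptions.
backup  = Dir.glob("/backups/*.dump").max # latest, given timestamped names
scratch = "restore_drill"

abort "no backups found" unless backup
system("createdb", scratch) or abort "could not create scratch db"
system("pg_restore", "--dbname", scratch, backup) or abort "restore failed"

# The restored data should actually be there.
rows = `psql -d #{scratch} -t -c "SELECT count(*) FROM users;"`.strip
abort "restore looks empty" if rows.to_i.zero?

puts "drill ok: #{rows} users restored"
system("dropdb", scratch)
```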
24:46
So, implicit testing. This last section I'm calling implicit testing; I originally was gonna call it passive testing, but I didn't like the way that sounded. This is the testing that you're all already doing but you don't actually think of as testing. That's a Stackdriver monitoring graph of memory usage on an internal app that I work on at Google. That's a spike, and if this
25:02
was actually a mission-critical app I would have hoped to be alerted to that spike; luckily, if it goes down for a couple days we don't actually care that much. But I'm talking about monitoring. What does monitoring have to do with testing? So, who's got monitoring? Please, most of you, thank you. Who has alerts on their monitoring? Turns out alerts are tests. Think about that for a minute. We
25:23
think of alerts as the thing that tells us that something is wrong, but if we, you know, massage the English a little bit, they tell us that the system isn't meeting expectations. And back at the beginning of this talk I defined testing as verifying that your expectations are met; so, by definition, alerts are testing. Still don't believe me? Say I have an alert if latency is
25:44
greater than 500 milliseconds: ha, there's my test. And if you're gonna be doing your monitoring, too many folks I know just look at how the system looks for a couple weeks, assume that's how it's supposed to look, and set up their alerts based on that. I encourage you to take a step back and think about how you want your system to be working. Think about the kinds of behavior that you
26:02
need. Maybe you have an endpoint that 90% of your traffic goes through; that one should probably respond pretty fast, huh? Maybe you want your error rate to be less than 5%: set that test up. Or maybe you think that your disk should never be more than 80% full: set that test up. We just call these tests alerts.
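(Those alert-shaped tests, written down as the assertions they are; the `metric` lookup is a stub for whatever your monitoring system exposes.)

```ruby
# Alerts are tests: each one asserts an expectation about the system.
ALERTS = [
  { name: "hot endpoint stays fast", metric: :p95_latency_ms, max: 500  },
  { name: "error rate under 5%",     metric: :error_rate,     max: 0.05 },
  { name: "disk under 80% full",     metric: :disk_used_pct,  max: 80   },
].freeze

# Stub: in real life, query your monitoring system here.
def metric(name)
  { p95_latency_ms: 230, error_rate: 0.004, disk_used_pct: 61 }.fetch(name)
end

ALERTS.each do |alert|
  value = metric(alert[:metric])
  if value > alert[:max]
    puts "ALERT (failed test): #{alert[:name]}: got #{value}, want <= #{alert[:max]}"
  end
end
```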
26:24
A variation on this is looking at month-over-month or year-over-year trends, so that you can actually answer questions and make assertions like "our error rate should not get larger" and "our site should not get slower." Here's a screenshot from Stackdriver Trace of the same app doing, I believe this is a month-over-month, actually a year-over-year comparison, I
26:41
believe, yeah, it is. And you can see that it's bimodal, and depending on whether the blue is new or old, it maybe got a little bit slower on the far ends where responses are slower, but it mostly looks the same. So I feel pretty okay that my assertion that behavior has not actually changed, my
27:00
expectation that behavior hasn't changed, is actually a valid expectation. Again, if you're having fun with this, you want to assert that your error rates, your old one and your new one, are the same, or hopefully that your new one is less. So, I've thrown a whole bunch of thoughts at you, ideas, words. I'm gonna give you some basic dos and don'ts, and at the end there's a cheat sheet, so you don't have to take pictures of every slide, and I will
27:21
publish the slides. Do have clear goals: you should go into this intentionally, figure out what your goals are, figure out what your expectations are, and start from there when you're picking what you want to monitor and test in production. Don't DDoS yourself. So, I was doing a disaster recovery test; we took down a server that was holding a bunch of WebSockets. We're like, okay, the clients are supposed to reconnect. So they
27:42
reconnect to, you know, the fallback server, and the fallback server promptly falls over, because it wasn't capable of handling that many simultaneous reconnections. And so it falls over, and so the clients start trying to reconnect; as we bring it back up, it falls over again. We got into the cycle of fail. So we learned things, but in the process of our disaster
28:03
recovery testing we accidentally DDoSed ourselves with our own app. So don't do that; it is bad. Think carefully about the possible impacts of the tests you're about to do before you do them. We talked about it before, but: test your seams. Test where your stuff integrates with the people who sit, you know, down the hall, or the people who sit on the other end of Slack if you work remote. Don't
28:24
mess with user data. No, no, no, no, no. We do not mess with user data; we do not view user data unless we have a really good reason. If your company does not have a user data access policy, you should do that thing; it's the right thing to do. Keep your tests as walled off as possible, and make sure that they aren't considered user data as well, because you don't want that data
28:41
corrupting any of your other reports. And do clean up after yourself: do the Girl Scout thing, leave no trace, be a good citizen. This is one of my soapboxes: alerts should be actionable. So if you're using alerts as a form of test, awesome, but make sure that a test that isn't urgent does not page someone
29:00
at 3 a.m. If they get a page and there is nothing that they can actually do other than go back to bed and deal with it in the morning, they shouldn't have been paged in the first place. It's the way we get ops burnout; just don't do it. Verify your integrations. After that experience at my first job, where I found a bug that we wouldn't have found otherwise just by running some tests against production, I now trust but verify all of my third-party integrations on a
29:24
regular basis, because they can fail. And actually more common than the third party failing is: they updated their API, and you didn't actually get the email, and you were using VCR, so you're getting the old responses. So make sure you're testing. And the big one is: whatever you
29:41
choose to do, act methodically. Make sure you are doing stuff with a purpose and a plan, so that if something goes completely wrong, you know what you've done and you know how to undo it. Here's your cheat sheet: have clear goals, test your seams, verify your integrations, clean up after yourself, don't DDoS yourself, leave user data alone, and keep alerts actionable.
30:03
I want to say thank you, and get off my lawn. The question is: how do you handle auth against production servers? So, the way I've always done it is I've created the magical test account, and the nice thing about that is that everything that's associated with the magical test account I know to ignore.
30:23
I worked at a place where we used a specific last name for magical test accounts; it started with five X's, so we hopefully wouldn't pick up anyone's real name in the SQL queries. And that's because part of our smoke test was testing sign-up, so we had to create new accounts.
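(A sketch of that convention in ActiveRecord terms; the marker and model names are illustrative.)

```ruby
# Magical test accounts carry a reserved marker in the last name so
# production test data can be excluded from reports and cleaned up.
TEST_MARKER = "XXXXX" # five X's, unlikely to match a real name

class Account < ApplicationRecord
  scope :real,      -> { where("last_name NOT LIKE ?", "#{TEST_MARKER}%") }
  scope :test_only, -> { where("last_name LIKE ?", "#{TEST_MARKER}%") }
end

Account.real.count           # reports and metrics should use this
Account.test_only.delete_all # cleanup: leave no trace
```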
30:42
There are other ways to do it; there are tools, and there are companies that actually offer production testing services. Do the right thing for you. You're already testing in production, so you might as well do it on purpose. Okay, thank you all.