Formal Metadata

Number of Parts: 133
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/48791 (DOI)
Transcript: English (auto-generated)
00:07
Any questions? Shout them out as we go. That's more fun. This is a talk about running fake load in production every night, basically to prove that we can take it, and about some of the consequences and some of the benefits of that.
00:21
Just Eat are an online sales channel for takeaway food. Any customers in the room? Excellent. Okay. So, we started in Denmark in 2001. We launched in the UK in 2005. We're now in about 15 countries. Nine of those are our own homegrown platform and the rest are platforms that we've acquired
00:42
along the way. And the business model is essentially we put the takeaway menus online on our site. We SEO that, we make it findable, we advertise it, we market it and we farm out orders that the customers want to the restaurants and then we take a commission.
01:03
We're a classic middleman model. I joined in 2010. My first day was figuring out how we did on the busiest day of the year. That was January the 1st in Denmark. The other bit of the first day was spent going to the pound shop to buy some space heaters because we didn't have any central heating in the office.
01:22
So, seven people in tech, bit of an auspicious start, and we've come a long way since then. We're now 250 people across three different technology hubs, one in Bristol, one in London, one in Ukraine. We're entirely hosted in AWS and we operate at a much higher scale than before. And we're recruiting; everyone's recruiting.
01:43
In terms of performance, what does this mean? We do about 70 searches per second at the top end of the funnel where we capture the customers, where our customers land, I should say. And then at the tip of the funnel where we're converting, about 1,000, 1,500 orders
02:02
per minute depending on the time of the week, the time of the year. About 25,000 restaurant partners in the UK, about the same again globally. And then several million customers in the UK. And by contrast, six years ago, that busiest day of the year I mentioned,
02:24
we fell over at two to three orders per minute in our busiest country for the third year running. So yeah, we've come a long way. Who can tell me when our peak time is? Anyone? Shout it out.
02:41
Lunchtime? No? Just after work finishes. Do you fancy takeaway? I don't really fancy cooking. I'm just leaving work. It's exactly that. So starts around five, ramps up, finishes around nine, 10, that sort of thing. Bit of another kick as people are finishing their night out, that sort of thing.
03:07
So it's also the same time we dose ourselves with fake load. So that red line is the aggregate. Oh, it doesn't come out very well. Sorry about that. The red line at the top is the aggregate: real orders plus fake orders.
03:23
The blue line that you see ramping up and then being fairly solid is us running fake load against the platform and the green line is the other load agents, the customers. That's a chart taken from the 14th of December last year. That's us running about 1,200 fake orders per minute against ourselves.
03:46
We've got cyclic demand. I mean, that's like most businesses. Our daily peak time is between five and nine, give or take. The weekly peak is a Saturday night.
04:00
The monthly peak is the payday weekend. And winter is busier than summer because, you know, you don't really want to go to the trouble of cooking when you have to go and shop for food when it's raining or snowing. That's no fun. So use us. Use us. And who can tell me the busiest day of the year in the UK at least?
04:21
Anyone? Any guesses? New Year's Day? No? No? And there's nobody that does roast dinners, so it's not Christmas Day either. It's actually Valentine's Day. Yeah, I don't know.
04:40
So we practice continuous delivery. And to put that in perspective, in 2010, we did maybe 30 deployments because we didn't know anything about the system and we just rebooted the tech team. We didn't know where the thing was deployed to. We didn't know how to deploy. We had different servers with different file paths to deploy onto and we didn't know any of them.
05:02
We didn't have permission to go and actually do the deployment. We didn't know, well, anything. We didn't know which developer was the one with the blessed Visual Studio that could compile and actually produce something that might run in production, if we knew where production was, which we didn't. So it was a bit of a seat-of-the-pants operation at that point.
05:21
In 2014, by September, we had already done more than 800 deployments. So that's the sort of tempo that we were operating at then. And we stopped counting at that point because it's kind of a vanity metric. You know, we've ticked the box. We deploy when we want. We don't deploy after 3 p.m. on a Thursday because we like our weekends.
05:41
That's just sensible. We don't deploy on Friday, Saturday, Sunday. And yeah, we sort of opened the gates from midnight, but obviously nobody's working at midnight. And yeah, so the problem with continuous delivery, everyone wants to change everything all the time.
06:01
It's horrible. It's chaotic. It's brilliant. So the traditional approach that we had for performance testing was, well, let's have an environment. Let's make it like production, and let's run load through that. There's a few problems with this. Nobody owns it, so it's always out of date.
06:22
You know, the people that own it are the people that want to run the test at the time. So they have to go and fix it from the last people that didn't clean up after themselves or, you know, that sort of thing. Everyone wants it, so it's a bottleneck, and it's never available because, you know, you don't want just five minutes of time. You probably want an afternoon or an hour or actually probably a couple of days because we weren't very good at it.
06:46
And duplicating production, well, it costs too much. So at that point, we weren't in the cloud yet. We had hardware, so we bought hardware to be a performance test environment, and it wasn't the same. It was actually better than production, and so we had to make an adjustment, a compensation in our charts, what few charts we had at the time.
07:07
And so, yeah, it's just not the same. It doesn't have the same software. It doesn't have the same hardware. Yeah, cost is an issue as well. Hardware, you buy it, and then it just sits there because nobody wants to run a performance test all the time when we don't have automation to do it.
07:23
When we're doing 30, you know, we did maybe 30 of these in the first year, whereas now we do it every day. So the hardware just sat there, and sometimes people would steal it for their other projects. And so that was kind of bad as well. Individual tests, they take too long.
07:43
Who here has done some performance testing? Hands up, yeah. So about an hour is a reasonable, I mean, you know, going-in proposition. How many of you have kicked off a test and, yep, looks good, looks good, it's running, and
08:00
gone away and done something else for the hour, coming back, and it failed in the first five minutes? Anyone? Yeah? Done that every time. So you've just wasted an hour. You've wasted someone else in the queue, so you've pushed them back. You've been a blocker, and that's horrible. And it's boring waiting for an hour, watching a progress bar, isn't it?
08:20
Who does that? The worst thing about being in the cloud is I can get a pristine new server in 20 minutes, whereas before it took five weeks. For five weeks, I'm going to do something useful. For 20 minutes, I'm going to make coffee, maybe, twice. So the other thing about performance tests is you're trying to break things.
08:41
So they're going to be broken, and so you're going to have to invest some time fixing them again, because you're running load at stuff, and maybe your drives will fill up. Maybe memory stuff will leak, and, you know, the intention here is to try and break things and find the breaking point to see that nominal operations is below it. So, yeah, dirty process.
09:02
But, of course, you don't want to slow down. We've got continuous delivery. We don't want to stop doing that. That's great. We don't want to block people from releasing. And I guess the other kind of obvious bit is we don't want to break production. So you don't want those embarrassing performance regressions that come up in the coffee queue.
09:23
Yeah, yeah, the homepage is now taking 30 seconds to load. But we tested that in QA, and it worked, right? You don't want that. We've had that a lot. Well, not so much now. So some of these things you can fix by testing all the time. I mean, it's just like continuous integration.
09:42
It's just another type of testing. So you could continuously run some load, some trickle of load through, and you're probably doing this with functional tests already, maybe not in production, but in QA. You could continuously have your end-to-end test running. You can leverage monitoring and alerting to prevent getting worse.
10:04
So we treat monitoring and alerting as just another class of tests. They happen to be checks that continuously run in production and, if we wanted to, in QA. But we don't in QA because we have them in production, so why do it twice?
10:20
And this is okay, but it doesn't really solve the works-in-my-environment syndrome that you might have if you don't do it in production. Because your customers are not in your QA environment, are they? And the customers do really weird and wonderful things. They buy stuff, and then they interact with your system, and they do things that you couldn't have imagined in a requirements session.
10:45
And they're great. We love them because they're the reason that we do this. So, yeah, test in production. And this was a bit new for us, and I swear there were a couple of quotes, so that's one of them.
11:07
And this is another one. Let's just do it in production. You know, it'll be fine. What could possibly go wrong? And it turns out, happily, in our context, that, well, that's not such a crazy idea.
11:23
So, mostly because of the culture that we have. We have tight feedback loops. We've got quite good test coverage for functional stuff, quite good alerting coverage for production stuff, monitoring stuff. And crucially, every engineer knows that they're operationally responsible for the code that they commit to source control.
11:44
Because it will end up in production, and they're on pager duty. Not all at once, on a rota, but still. So, you have engineers now, or we have engineers now, that are highly invested in not regressing the non-functionals. Because, well, you're going to get your personal time interrupted if you or someone on your team does that.
12:03
And that's not something that you really want. I have an obnoxious PagerDuty alert that I don't ever want to hear. We have teams that are cross-functional. So, essentially mini startups for the feature delivery teams.
12:25
And dev and ops, dev ops, if you want to call it that, in the component ownership groups. We're organized along two lines. Everyone sits in two teams. One team focused on delivering features to meet some objectives and key results.
12:41
One team focused on operational health. So, keeping the lights on, making sure we're scalable, making sure the platform is sustainable, and securing it. So, as I mentioned, engineers own their code and the operations thereof.
13:00
If a feature team wants to push out a feature that isn't necessarily ready for production, doesn't have monitoring, doesn't have logging, then the operational group is going to say, no, you're not done yet, and be totally justified to say that because they're on the hook for it. It's a very open source model.
13:21
No, we don't want your feature yet because we're going to be supporting it. Or, yes, we'll absolutely take it, but here's the PagerDuty schedule while you're building it. That's fair too. That happens. It's quite autonomous. The other thing that makes this work is we've got, well, we've invested in real-time monitoring and alerting.
13:43
We've built that, you know, we run our own stuff for that. We don't use a third-party provider. We use the StatsD, Graphite, Grafana, Seyren suite of tools. That's served us well for the last four years. And engineers realize that, well, you know, you write your functional test coverage.
14:01
You also want your production alert test coverage. It's just test coverage in a different place, in a different context. We also have centralized logs with the Elasticsearch, Logstash, Kibana stack. Again, not a third-party thing. That's, again, served us really well in the last two, three years. And that's for figuring out why something has fallen over
14:23
or correlating that, hey, the menu API is slow because this dependency of it is slow. And we can do that really easily using that suite of things. Engineers get paged while they're on call. And, well, you know, it's fake load.
14:40
It's a test. It's contrived. So we can turn it off when we want. That's crucial because, you know, you're running 50% extra load, but you can always turn it off. You can always stop. And so that's the first operational response because you have a pressure release valve built into the system that way. And, of course, we can also, in a continuous delivery world,
15:02
continuously deliver backwards if we want to. So we can roll back real easily. So, yeah, it turns out this isn't that crazy for us. That said, it was a bit, you know, it had a rocky introduction. There were some technology aspects. You know, how do we actually do this?
15:22
And there were some people aspects. How do we actually do this? You know, that sounds like a terrible idea. Everything's going to break. No, no, actually, it's going to be okay. So the first thing is have the idea. And that's really what I'm trying to get across here. Unfortunately, you can't really see it,
15:41
but that's a link to a blog post that somebody read. As with all great ideas, we didn't have it. In fact, our director of engineering found it, mentioned it to our CTO. The CTO really liked the idea. Nothing happened for a bit. And then we decided to prioritize it and do it.
16:00
It was actually our engineers that were the most skeptical about the whole idea. You want to do what? No! That's roughly the reaction. You know, that's going to be terrible. I'm going to get paged all the time. No, no, no, it'll be fine. A, you won't get paged all the time. B, I'm on the hook with you because I'm the director of engineering.
16:22
And C, it's going to be, you know, we're going to use it to optimize the platform because, well, it's the sensible thing to do. We don't want problems in production at peak time. We want to know about problems before they happen. It's just good sense. It still took a while.
16:41
The next thing is we needed to choose the scenarios we care about. You know, you still have to prioritize the effort. You're not just going to blanketly throw load across the whole platform. That would be ambitious. So for us, that's the customer must be able to buy food.
17:01
You know, that's where we make our money, so let's start there. So that means the customer lands on the homepage or opens the mobile app and gives us the postcode that they want to get food in. We show them a list of restaurants that service that area. They pick one of the restaurants or end up on one of the restaurants.
17:22
We show them the menu, they put some stuff in the basket. They give us some more information maybe because, you know, the delivery address might be different from the one we've got, blah, blah, blah. And they pay for it and it arrives magically. So probably you've already got a functional test scenario that covers the end-to-end case
17:41
because it's really important to cover the full flow, right? So you can start off with, well, let's run that in a loop. That's good enough. You know, create a fake user, create a fake restaurant in our case, create a fake set of food on that. You know, you can buy a margherita pizza from that restaurant
18:03
and keep doing that. And we didn't even get as far as the payment because of, well, reasons. But, you know, we were exercising all of the bits up until there. And that was good. We didn't do any data setup or teardown. We literally, you know, created a restaurant called "do not change this"
18:20
so that we could always find it in the search results, and a menu item similarly called "do not change this", because we're going to buy it over and over again. Crude, effective, quick. And then, well, running something in a loop means writing software to do that and hosting it and so on.
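As a concrete illustration, here is a minimal sketch of that crude loop in C#. All endpoint paths and payloads are hypothetical; the real scenario, URLs, and basket API aren't described in the talk.

```csharp
// Minimal sketch of the crude "run the functional scenario in a loop" approach.
// All endpoint paths and payloads here are hypothetical, not Just Eat's real API.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class FakeLoadLoop
{
    static async Task Main()
    {
        var client = new HttpClient { BaseAddress = new Uri("https://www.example.com/") };
        while (true)
        {
            // 1. Land on the homepage and search by postcode.
            await client.GetAsync("search?postcode=EC1A1AA");

            // 2. Open the menu of the well-known fake restaurant.
            await client.GetAsync("restaurants/do-not-change-this/menu");

            // 3. Put the well-known fake menu item in the basket.
            await client.PostAsync("basket", new StringContent(
                "{\"item\":\"do not change this\"}", Encoding.UTF8, "application/json"));

            // 4. Check out; the early version described here stopped short of payment.
            await client.PostAsync("checkout", new StringContent(
                "{}", Encoding.UTF8, "application/json"));

            await Task.Delay(TimeSpan.FromSeconds(1)); // crude pacing
        }
    }
}
```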
18:40
So there are better options. Choose a load agent. I think this in-depth evaluation process was, well, it was very rigorous. We basically tried out JMeter and it worked and so we went with JMeter. It's free. It's got a reasonable user interface so you can record scenarios
19:02
if you're that way inclined. You can script it. You can source control it. It's friendly for automating so you can provision it onto machines and you can run it and you can get reports out of it although we tend to use our monitoring and alerting for that. And it worked and we kept running with it. And to this day, two and a bit years later, it's still there.
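For a flavour of what "friendly for automating" means, here is a sketch of an automation job kicking off JMeter in non-GUI mode from C#. The -n (non-GUI), -t (test plan), and -l (results log) flags are standard JMeter; the test plan name and property are placeholders, and jmeter is assumed to be on the PATH.

```csharp
// Sketch of running a JMeter scenario in non-GUI mode from an automation job.
using System.Diagnostics;

class RunJMeter
{
    static void Main()
    {
        var psi = new ProcessStartInfo
        {
            FileName = "jmeter", // assumed to be on the PATH
            Arguments = "-n -t checkout-scenario.jmx -l results.jtl -Jorders_per_minute=1200",
            UseShellExecute = false,
        };
        using var process = Process.Start(psi);
        process.WaitForExit(); // reports come from monitoring, not from the .jtl file
    }
}
```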
19:21
It's got its warts, yes.
19:42
Sorry, the question is: in an autoscaling world, if we scale based on reacting to load, aren't we going to keep scaling and keep spending money and basically have a runaway effect? On the other hand, if we waited for that load to appear naturally, then we'd still have to scale up the servers and we wouldn't know in advance what the load would be.
20:02
Well, yes, if we did it like that. We're quite lucky at Just Eat in that we have very predictable load. So we know exactly what time of day it starts and we know roughly what time of day it finishes and so we don't actually reactively scale. We scale on a schedule. It's the best cron job that ever got written.
20:20
So at about 3.30, just in time for the 4 o'clock fake load start, everyone scales up and then at about 10 o'clock when we can be reasonably sure customers are going away, we scale down. And yes, that's not the most efficient way to do it. Our utilization is not tracking our demand curve, but it's still better than being up here all the time.
20:42
And this also, you know, we can source control and all of our cluster sizes and that makes it very predictable as opposed to reacting to load as it comes in and then, you know, you have to smooth away spikes to... It's just a bit more complicated and so we haven't bothered to do that.
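As a sketch of what "the best cron job that ever got written" could look like with EC2 Auto Scaling scheduled actions: the PutScheduledUpdateGroupAction API is real, but the group names, sizes, and exact times here are hypothetical.

```csharp
// Sketch of schedule-based scaling: up ahead of the fake load start, down once
// customers have gone away. Group names and capacities are hypothetical.
using Amazon.AutoScaling;
using Amazon.AutoScaling.Model;
using System.Threading.Tasks;

class ScheduledScaling
{
    static async Task Main()
    {
        var autoScaling = new AmazonAutoScalingClient();

        // Scale up at 15:30, just in time for the 16:00 fake load start.
        await autoScaling.PutScheduledUpdateGroupActionAsync(new PutScheduledUpdateGroupActionRequest
        {
            AutoScalingGroupName = "consumer-api",
            ScheduledActionName = "peak-scale-up",
            Recurrence = "30 15 * * *", // cron, UTC
            DesiredCapacity = 20,
        });

        // Scale back down at 22:00, when customers have gone away.
        await autoScaling.PutScheduledUpdateGroupActionAsync(new PutScheduledUpdateGroupActionRequest
        {
            AutoScalingGroupName = "consumer-api",
            ScheduledActionName = "off-peak-scale-down",
            Recurrence = "0 22 * * *",
            DesiredCapacity = 6,
        });
    }
}
```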
21:01
Does that answer the question? Yeah. We try and find, in a very roundabout, convoluted, complicated way, the simplest thing to do stuff. But yeah, back to JMeter. Totally arbitrary choice. There are other ones out there. You could use Gatling, Tsung, Grinder, find excuses to use Scala.
21:23
That's all cool. There are various commercial ones that you have to pay lots of money for. We didn't fancy that. Yeah, so pick a load agent. And then start dipping your toes, I guess. You know, you maybe already have that if you're running your functional test continuously.
21:41
But if you haven't done that yet, now's the time. For us, start early in the working day. Everyone's in the office. Customers haven't started to get that yearning for curry in the morning. I hope. And, you know, see how that goes. Figure out how many servers you're going to need to scale
22:02
for however many threads JMeter is throwing at you. And it's a coarse-grained thing. So, you know, try stuff is the best way to put it because, you know, your platform will be different from how ours reacts. Everyone is their own unique snowflake. Try stuff. Experiment. Scientific method.
22:24
We then moved on to graduating to later in the day. So there's some real traffic happening now. But still, you know, we're looking at charts. We're monitoring. And we're noticing that actually our alert coverage is maybe not as good as we thought it was.
22:41
So let's add some more alerts. Good table stakes for this is to have alerts around resource metrics, like the classic four: CPU, memory, disk, and network. Those are really your safety net. If you're tripping those, then you're probably either missing alerts earlier or your capacity is under pressure.
23:02
So that's the sort of last-gasp scale-up kind of trigger. I guess what we also found is it's generally better to measure people rather than computers. So rather than having alerts on our capacity,
23:24
we really want alerts on the work that our system is doing in response to customers. So for example, we might have alerts around the response times of operations. The login operation might have an alert that says it will fire if the upper 90th percentile
23:42
is more than 250 milliseconds. That's an aspiration. It probably has an alert of one second, but, you know, we'll get better. But the point being that you want, ideally, to be able to figure out which bit of your system is under pressure as opposed to the thing that handles login and other stuff.
24:04
So we have a consumer API that handles various operations, and the login operation might be slow, but the rest might be fine. And if you average out the response time for the API, then you might lose the fact that your logins are slow. So be a bit more granular.
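A minimal sketch of that per-operation granularity, assuming a plain StatsD timing datagram and hypothetical metric names and host: each operation gets its own timer, so Graphite can alert on the login percentiles alone rather than an API-wide average.

```csharp
// Sketch of per-operation timing. Sends a raw StatsD timing datagram
// ("name:value|ms"); metric names and the StatsD host are hypothetical.
using System;
using System.Diagnostics;
using System.Net.Sockets;
using System.Text;

class OperationTimer
{
    static readonly UdpClient StatsD = new UdpClient("statsd.internal.example", 8125);

    static void Timed(string operation, Action work)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            work();
        }
        finally
        {
            stopwatch.Stop();
            // One metric per operation: Graphite can then compute the upper-90th
            // percentile for "consumer-api.login" alone, not the API-wide average.
            var payload = Encoding.UTF8.GetBytes(
                $"consumer-api.{operation}.response-time:{stopwatch.ElapsedMilliseconds}|ms");
            StatsD.Send(payload, payload.Length);
        }
    }

    static void Main() => Timed("login", () => { /* handle the login request */ });
}
```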
24:24
Later, start adding data variety. So, you know, we started out with one fake user, one fake restaurant, fake stuff on the menu, and he bought a lot of fake pizza.
24:41
Later, we introduced tens of thousands of fake load users because, you know, you want to probably cache stuff. So you need data variety to defeat your own caching because you want to include your persistence layer in the load. Then later on, we got to the point of being able to pay online
25:01
by cash, not by card. So we're covering that part of our system. And then later, we added lots more fake areas and fake restaurants. There's no point looking for them in production because you can't actually buy any food. Yes?
25:20
Well, we canceled them. And we sent an apologetic note to the restaurant whose box has basically just exploded. That doesn't tend to happen now, but, yeah, like I said, there was a rocky start to it. There are some war stories at the end in a similar vein.
25:44
But, yeah, data variety. And then the most important bit is get the computer to do this every day. So this is a screenshot of TeamCity. Again, apologies for the resolution. But basically, that's our Monday to Wednesday load because Monday to Wednesday is a lower level,
26:01
and then Thursday through Sunday is a higher level. So we split them up like that. Just like any other tests, and I'm going to hammer this home through the talk, if your tests go red, then fix your tests because otherwise they're pointless. You know, don't ignore them. Don't let them be red. Either delete them or cause the system to make them pass.
26:24
That's just like any other tests, right? There's no difference here. We got more elaborate as we went on. So we started to introduce a header like this one. I won't actually share the actual one
26:41
because for a while it was possible to get free food. But yeah, it's similar to this, but the idea is very consistent. The fake load agent adds this header to the requests that it generates, and this is what distinguishes them from customer requests, unless you know the header.
27:04
And so, well, we did this because, well, our business model works best when we don't have to go to our restaurants and do debt collection. So that means if the customer pays with cash, the restaurant has the money, and we have to go to them to get our cut.
27:22
So that's work. Whereas if the customer pays online, then the money comes to us first, we keep our cut, and then we give the restaurant the balance. So that's easier for us. So everyone please pay online. But of course, our payment processors are external dependencies,
27:45
and, well, they don't really sign up to the idea of being load tested at all, let alone every night. And, you know, you might question, well, don't you want to know that they can take it? And we did.
28:00
But, you know, regardless, they weren't up for it, so we faked them away, and we used this header to be able to do that. We have an API that is a facade across our payment providers, and so that's the only bit that we needed to touch to fake away the real one to a fake one that we wrote that has the same contract as the real one.
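A sketch of that facade idea, with hypothetical type names and header: the only routing decision is whether the request carries the fake-load marker, and both implementations honour the same contract.

```csharp
// Sketch of a payment facade that routes fake-load traffic to an emulator.
// The header name, types and shapes here are hypothetical.
using Microsoft.AspNetCore.Http;
using System.Threading.Tasks;

public class PaymentRequest { }
public class PaymentResult { public bool Succeeded { get; set; } }

public interface IPaymentProvider
{
    Task<PaymentResult> TakePaymentAsync(PaymentRequest request);
}

public class PaymentProviderFacade
{
    private readonly IPaymentProvider _real;     // the actual PSP integration
    private readonly IPaymentProvider _emulator; // our fake with the same contract

    public PaymentProviderFacade(IPaymentProvider real, IPaymentProvider emulator)
    {
        _real = real;
        _emulator = emulator;
    }

    public Task<PaymentResult> TakePaymentAsync(HttpContext httpContext, PaymentRequest request)
    {
        // Requests generated by the fake load agent carry a marker header;
        // real customers never send it (the real header name is kept secret).
        var isFakeLoad = httpContext.Request.Headers.ContainsKey("X-Fake-Load");
        return (isFakeLoad ? _emulator : _real).TakePaymentAsync(request);
    }
}
```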
28:22
This PSP emulator can be configured so that X percent are going to be successful payments and Y percent are going to be failed and Z percent are going to be timeouts or, you know, random nonsense that happens on the Internet. I don't think we've got that elaborate. I think everything succeeds, but, again, work in progress.
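And a sketch of what such a configurable emulator might look like, reusing the hypothetical types from the facade sketch above; the percentages and timeout are illustrative, and as the talk notes, in practice everything currently succeeds.

```csharp
// Sketch of the configurable PSP emulator: X% succeed, Y% fail,
// and the remainder simulate a timeout.
using System;
using System.Threading.Tasks;

public class PspEmulator : IPaymentProvider
{
    private static readonly Random Rng = new Random();

    public int SuccessPercent { get; set; } = 90;
    public int FailurePercent { get; set; } = 8; // remainder simulates a timeout

    public async Task<PaymentResult> TakePaymentAsync(PaymentRequest request)
    {
        var roll = Rng.Next(100); // 0..99
        if (roll < SuccessPercent)
            return new PaymentResult { Succeeded = true };
        if (roll < SuccessPercent + FailurePercent)
            return new PaymentResult { Succeeded = false };

        // Simulate the random nonsense that happens on the Internet.
        await Task.Delay(TimeSpan.FromSeconds(30));
        throw new TimeoutException("PSP emulator: simulated timeout");
    }
}
```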
28:41
So, yeah, get more ambitious. Cover more of your system. And even more elaborate still, we haven't done this yet, but fake away the restaurants themselves. We have a magic box in the restaurant that we communicate to over a TCP socket.
29:00
It doesn't understand DNS. It doesn't understand various things, but so it's quite difficult to interact with it other than for real. We do have a box emulator for functional tests, but what we don't have, because our system is message-based in this part anyway, we don't have a way to propagate that header
29:20
to the messages that the box handler receives, and so we can't yet send fake orders to real restaurants but then tell the box to not process them. So that's coming. So I've mentioned, but, you know,
29:41
there were some hiccups along the way, and so how have we kept doing this? What have we learned from our experiences? I've mentioned this. I'll mention it again because it's worth repeating. Don't allow tests to be red for long. You know, we'll have failed tests, but that's great because it's telling us
30:00
a part of our system needs improvement, and it's probably telling us which bit of our system needs improvement, because we have good logging, we have good monitoring, we can triangulate very quickly and at quite a fine grain. So when we get failures, we pause fake load, or the person on call or perhaps the person in the office if it's early in the working day,
30:21
we pause fake load, we fix the problem, and then we re-enable fake load after deploying to fix the problem, just like any other tests. Sorry, was that a question or a head scratch? Okay. We found that we needed to tune the levels that we're running fake load at over time
30:40
because we're happily a growing business, and if our fake load that we were running two years ago was what it is now, it would not be 50% extra. I mean, I think two years ago, it was something like 600 orders per minute. Now it's in excess of 1,000 easily. So, yes? So we're using StatsD, Graphite, and Grafana
31:06
to be the sort of metrics ingestion and display and maths stuff, so you can smooth lines and do predictions and that sort of thing. That could be a whole talk by itself.
31:21
Monitoring sucks. And then we use a thing called Seyren to define checks that run in a loop. Each check is a Graphite expression, so a query to Graphite, and then a warn threshold and an error threshold, and then a set of subscriptions to go and tell someone, this is a warning or this is an error.
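To make that concrete, here is an illustrative model of such a check, mirroring the description just given rather than Seyren's actual schema; the metric path assumes StatsD's default timer aggregates (upper_90 is the 90th percentile), and the names are hypothetical.

```csharp
// Illustrative shape of a check: a Graphite query plus warn/error thresholds
// and subscriptions. Mirrors the talk's description, not Seyren's real schema.
public class Check
{
    public string Name { get; set; }
    public string GraphiteTarget { get; set; }  // the Graphite query to evaluate
    public double WarnThreshold { get; set; }
    public double ErrorThreshold { get; set; }
    public string[] Subscriptions { get; set; } // who gets told, and how
}

public static class ExampleChecks
{
    // The login budget from earlier: 250 ms is the aspiration,
    // one second pages someone.
    public static readonly Check LoginLatency = new Check
    {
        Name = "consumer-api login p90",
        GraphiteTarget = "stats.timers.consumer-api.login.response-time.upper_90",
        WarnThreshold = 250,
        ErrorThreshold = 1000,
        Subscriptions = new[] { "email:oncall", "pagerduty:consumer-api" },
    };
}
```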
31:44
And then we look at that when we make changes, and we realize that looking at that is really dull, so we write alerts to look at that for us. We make the computer do that work. Like I say, we don't use a third-party product. We use a set of open-source tools that, well, we found,
32:01
and we hacked together in a hackathon, and then we made real. But, yeah, we need to tune the levels over time. Otherwise, our safety net diminishes. Our safety net, if load is here, customer load is here, and fake load is here, then this is our safety net. This is the pressure release that we can have
32:21
by turning off fake load. So that gives us time to horizontally scale up because instances take a few minutes to come online. We can buy time to think is basically what that is buying us. We need to get smarter,
32:42
or we needed to get smarter about data management. So, hmm, tens of thousands of fake load users, a thousand orders a minute at peak time. That's quite a lot of fake orders in two years, right? I think we haven't deleted any of them. So, you know, now our disk consumption on the SQL server,
33:03
at the very least, where it might have been doing that, is now doing that. So we also needed to change the back-office tools that our staff use to support the system, like, I don't know, the call center tool that handles when an order is bad. So, for example, where's my food is a common complaint.
33:24
A fake order is obviously not going to arrive, but if the call center person can't tell the difference between a real order and a fake order, and there are many fake orders happening, then the real ones will be lost in the noise. So let's not even show them the fake orders. They don't need to see them. That's our test, not theirs.
33:42
Back to the SQL usage, or the, well, whatever data store usage. We haven't written archival tools. We don't do things like file groups yet. Actually, that might not be true. We might have done that before Christmas. But, yeah, again, we can defer writing those tools
34:03
because we can just go delete from orders table where type of order equals fake. And so we've now bought ourselves some time. We don't, yeah, it's a common theme. We're buying ourselves time to do real engineering work with this safety valve.
34:23
We also learned to embrace the fact that things just break. Sometimes that's because we weren't as smart as we thought we were, and sometimes that's because, well, actually, it's because we weren't as smart as we thought we were when we were writing code. And, yeah, we've bought ourselves time again.
34:42
This is a fun point. For all that we invest a lot of effort into keeping this going, we slipped a bit in the last year, in the last two quarters of the last year, and we kind of noticed that we slipped a bit and started to do something about it in Q4.
35:02
We tried to continuously improve, and we have retrospectives, and this was one of the things that came up in a retrospective. Hey, we've been having a lot more incidents lately. Yeah, maybe we should focus on some, you know, that's roughly how the conversation went. It's all fine. You know, we changed our priorities. We pivoted, and we did that instead of that.
35:23
So we started to try and predict how much load we'd need to survive the winter period. We do a slightly odd thing at Just Eat where the holiday year for everyone runs January to January,
35:40
and the busiest time of year is when the fewest number of people are in the office. So, you know, you want to not have problems at Christmas time just because it's Christmas time, but you also want to not have problems when you don't have any people to fix the problems. It just kind of stands to reason. Maybe we should change the holiday year, but that's the current state of play.
36:02
So we had a large effort cross-team to make sure that we were ready to handle the load. So that involved figuring out what the target was, ramping up our load agents, because, well, just add some more, and making sure that our alerts were sound
36:21
and our scaling was sound, and, well, we had a few problems on this journey. So this comes back to the point that we don't delete fake orders. When you introduce, let's say, the customer wants to see his order history story to your platform,
36:41
and you have a whole ton of fake orders, and then later on you enlist this operation in fake load, so a user will come along and try and see their order history, and you don't limit the number of orders that come back, that's going to be a bit of a problem because you're suddenly returning tens of thousands of orders.
37:02
And we'd love customers to have tens of thousands of orders, but none of them do yet. I think the highest is about 4,000, and those are the outliers. So, well, that was a bit embarrassing. Who's released an operation that doesn't have paging on it? Yep. Who's had this sort of problem with the customer?
37:22
And who's had to then turn off the feature to fix the problem rather than just stop fake load for a bit and then add paging, and that's roughly what we did. We turned off fake load for the duration that it took to fix the problem, added paging, and now we have a reasonably robust and standards-compliant paging implementation for an HTTP API
37:40
that we can share to all of the other places where we've forgotten to do this. So, yeah, learning.
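A minimal sketch of such a paging implementation in ASP.NET Core, with hypothetical route, names, and page-size cap; the Link header approach follows RFC 5988.

```csharp
// Sketch of the paging fix: clamp the page size so that no caller (fake-load
// users with tens of thousands of orders included) can ask for everything at once.
using Microsoft.AspNetCore.Mvc;
using System;
using System.Linq;

public class Order { }

[ApiController]
[Route("orders")]
public class OrderHistoryController : ControllerBase
{
    private const int MaxPageSize = 50; // hypothetical cap

    [HttpGet]
    public IActionResult GetOrderHistory(int page = 1, int pageSize = 20)
    {
        pageSize = Math.Clamp(pageSize, 1, MaxPageSize);
        page = Math.Max(page, 1);

        var orders = LoadOrdersForCurrentUser()
            .Skip((page - 1) * pageSize)
            .Take(pageSize)
            .ToList();

        // Standards-compliant paging: expose navigation via Link headers (RFC 5988).
        Response.Headers["Link"] =
            $"</orders?page={page + 1}&pageSize={pageSize}>; rel=\"next\"";
        return Ok(orders);
    }

    private IQueryable<Order> LoadOrdersForCurrentUser() =>
        throw new NotImplementedException(); // stub for the sketch
}
```

Monitoring needs to be solid, so I'll paint you a picture. We're coming up to Christmas, and we're running at our normal Saturday night levels, and the bright idea comes in to try and run at the higher level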
38:04
that we're going to need, and it's a stab in the dark. Let's just double our stuff, because that's the simplest, easiest thing to do. We're in the happy position of cost not being a particular issue, because we're a quite successful business. So, yeah, okay, double it and call it done,
38:21
and then, guess what, that great monitoring stack that we've got, it can handle millions of metrics per second, but we're now throwing many millions more per second at it just because we didn't do any thinking about this and took a shot in the dark. So, yeah, monitoring needs to be more resilient
38:41
and scaling ahead of the thing that it is monitoring because otherwise you're flying blind. So we turned off fake load, we fixed monitoring, we turned back on fake load. Sorry to be repetitive, but that's the process. That's the game. We're in the cloud, so, yeah, let's just double everything.
39:04
We can do that, multiply by two. And then we realized that, well, the cloud isn't our cloud, it's their cloud, and they put limits on stuff, and we didn't notice that doubling would push us past those limits, and so Amazon said no.
39:20
And so we stopped fake load, we raised the support ticket, you know, enterprise support, and they go, oh, yeah, okay, we'll raise those, thanks for letting us know, and crisis averted. Obviously, had that happened on the busiest day of the year, that would have been at least a 15-minute turnaround because that's Amazon's SLA for an ultra-high priority,
39:42
the world is on fire ticket. And then a bit more because, you know, the latency of the web chat is not all that. And then, you know, talking about what the problem is: yes, the M4 large instance type, we need to have 200 of them, not 100, and, yes, all of the others as well, them too, do them too, please, quick.
40:02
That would have been roughly how that would have gone. But, well, it didn't go that way because we can just turn off the pressure. We actually separately noticed that our DynamoDB throughput, it operates on a provisioned throughput basis. Our DynamoDB throughput across the whole production account
40:21
also had a limit in place that we had to ask Amazon to, would you please raise this? So, you know, we'd like to give you some more money, could you please let us do that? But, yeah, now we have automated alerts around how close we are to the Amazon account limits. So that's, again, another piece of learning that we have.
40:41
Amazon Trusted Advisor and CloudWatch are great for this. Now we're getting into the sort of, wow, that really shouldn't have happened. Who uses HAProxy? It's a load balancer, a software load balancer. It does some reverse proxying as well.
41:00
And it turned out it was down to how we'd configured it. So, sorry, for context: we run in one region of Amazon, across all three of the availability zones that that region has. And so this means, in our context, that our website sessions are sticky to an AZ, and all of the web servers within there,
41:22
but not sticky to a single server. So it's better than sticky to a single server, but still not non-sticky. It's a compromise. We realized that our HAProxy was actually magnifying our load down onto the servers in one AZ. So that's bad from a resilience point of view,
41:41
because if we lose the AZ, everyone loses their session, and that would be embarrassing. And it's also bad from a load point of view, because, well, we've got two AZs that are just, could you send me some traffic? You know, I'm bored, I'm just sat here. Yes. So we use the ASP.NET session state service for our sessions.
42:02
We did at the time, we're now doing something different, as a consequence of this. And if the state service goes away, for whatever reason, then all of the web servers that were using that to store their sessions now have to go, oh, there's no session server anymore,
42:21
and so I don't have your session, and so you're not logged in anymore, and that sort of thing. That's what I mean by that. One per AZ. One per AZ, exactly. It was not the most resilient configuration that we could possibly have come up with, but it was pragmatically the one that worked and has served us well for two years. So, you know, a problem we were not interested in solving
42:43
until we were prompted to solve it. Sticky to one AZ, yeah, yeah. So at the Christmas load levels, this very quickly blew away one session state service and so we turned off fake load, we did something different,
43:01
and now we don't have that problem anymore and we've got fake load running again. Yeah, so, you know, going from the we forgot stuff to the we were actively dumb case here. I can say that because this was me.
43:23
One scenario that we created back in spring, we shipped Apple Pay and with Apple Pay, it's possible to check out without first having a user account. So the registration process happens after the payment.
43:41
And so for this fake load scenario, a new user would be created in our database every time and that user, because of, well, an oversight, had a unique, sorry, had a non-unique email address. So it was the same email address that was being reused every time this scenario was run.
44:02
And it's a reasonably low traffic scenario because in this country, we've already got a decent contactless payment method and so people don't pull out their 600-pound phone to pay for stuff. They pay for stuff however they already do. But regardless, we didn't notice this. Slipped through code review, slipped through monitoring, oops.
44:23
And in the springtime of the year, we're not that busy, so, you know, didn't really notice. Forgot the fact that our system allows you to have non-unique email addresses because of other reasons. And before you know it, 350,000 users of this email address
44:41
were in our users table. Okay, great, still not really a problem. But as we come into this sort of winter period, we're under more pressure and we have SQL Server being under memory pressure because we need to do some work optimizing that as well. And this manifests in this particular case
45:01
as SQL Server ejecting the query plan for the login user operation, which is indexed. You know, it's quite well done. But of course, SQL Server, when it's got 5% of its users that are the same user, is going to quite reasonably think, oh, I'm going to use a table scan now
45:21
to look up that user and to look up all users because, you know, the chances of me hitting it are quite high. I won't bother with the index. So 40 minutes of downtime later, we realized what the problem was and we turn off fake load and we delete those users. And we shamefacedly go and inject some GUID
45:43
at the end of the email address name part and turn it on again and everything's good.
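The fix itself is tiny; a sketch, with a placeholder domain:

```csharp
// Sketch of the fix: give every generated fake-load user a unique email address
// by injecting a GUID into the name part, so SQL Server's statistics never see
// 350,000 identical values again. The domain is a placeholder.
using System;

static class FakeUsers
{
    public static string UniqueEmail()
    {
        // Before: the same address for every run of the Apple Pay scenario.
        // After: a GUID in the name part makes each registration distinct.
        return $"fake-load-user-{Guid.NewGuid():N}@example.com";
    }
}
```

And the bit that particularly makes me laugh about this is I was at the WinOps conference giving a session on how great we are at operations and monitoring, and my PagerDuty goes off for this.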
46:05
And so I have to leave early. So yeah, that's totally impossible to read. But that's me asking my team, hey, I noticed that, you know, we fixed this some weeks ago. Let's turn on the fake load for this again
46:21
and let's get that green. Yeah. So yeah, we try and do good things, but we're just people. The whole point of this is we're discovering problems during peacetime, not peak time, not wartime. And so that's a much better environment
46:40
to come up with good solutions to problems, rather than hacking in bits and Band-Aids and leaving them there until they pop off at the worst possible time. It's just a better way of solving this sort of class of problem. So what did we gain from all of this effort?
47:00
It's mostly about peace of mind. We've got continuous early warning about getting slower and running out of capacity. We want to aim for getting no worse without noticing. That's the key part, without noticing. We want to know continuously on a Thursday
47:21
that we're going to be able to withstand the Saturday night pressure, or maybe knowing that on a different day of the week would be a better idea. We want to know that we're able to withstand the load six months from now. So ongoing capacity testing, ongoing performance testing.
47:41
Yeah, just basically have that headroom and remember it's not about just our servers and their performance but it's also within the environment that they operate in, the AWS cloud. As we learned, it's not our cloud. We have to monitor what AWS allow us to launch and they do that to save ourselves from ourselves.
48:02
They basically have the soft limits in place so that you can't accidentally spend all of your money. I can see why they do that. We also have a really good simple clear operational response for most things that go wrong in production as a sort of going in proposition before you get the page,
48:22
open the laptop, I'm not thinking, I'm not thinking, do this. Is fake load running? Stop it. Scale up. Now start thinking. You can basically muscle memory that stuff and at some point we'll probably have a Lambda that will do this stuff for us but not yet.
48:41
So that start-to-think part is: have a look at the runbook for the component whose alert just went off. Probably in the alert itself there are a few links to dashboards. Click on those so that they're loading in the background while your brain is booting. So it's great for dealing with unexpected spikes like, for example, a competitor going down.
49:03
Great, come to us but we have to be able to handle that. We can. And by stopping fake load immediately we have less stress on the platform, that pressure release valve, we're buying ourselves time to think.
49:22
We're also, basically, respectful of the personal time of our engineers, of our employees. If we have a problem on a Thursday night that's performance related then, you know, don't have any heroics over the weekend. We don't want that, we don't want you to burn out. Turn off fake load, don't run it.
49:43
Enjoy your weekend just like normal healthy people and then fix it next week when you get back to the office and the rest of your team is there and the rest of the team is there, remember there's 250 of us now, and, you know, fix it. It's business as usual.
50:01
We promote the fact and have promoted successfully within the culture that performance and operability are first class concerns. No engineer, no anybody wants their personal time interrupted. You know, for the big incidents, executives get interrupted. You don't want that, nobody wants that.
50:20
But, you know, I also don't want to be interrupted. My personal time is my personal time. We realized that, and again, baked into the culture, we realized that ready for production is not really the same as ready for customers and perhaps not the same as ready for all customers at the same time, plus fake load. So, you know, we know what the bar is
50:41
and it's a very predictable bar because, well, it's in source control and so you're done when you can handle this and when the system knows that you're healthy handling this, not just when you're in production and it works. And, of course, it also, because there's a powerful incentive
51:02
to ship code that does work and works fast, it stimulates that performance profiling stuff, and metrics and logging being added as part of the development process, not the "it's over the wall now and we need to do that" type of arrangement.
51:25
So, I guess what I'm saying is we're a bit more mature about what it takes to ship software that works well. Not just works, but works well. We have this notion that alerts are automated tests in production. This is something that a lot of new starters that are developers
51:42
rather than necessarily engineers take a little bit of time to sort of onboard, but one of the ways I've had a bit of success with is explaining that it is literally this: you write tests for the functional aspects of your system, right? So, this is just that, but they run all the time in production.
52:04
So, write both sets of tests. At Just Eat, you're not done if you haven't written both sets of tests. That said, we have quite a lot of autonomy in these little mini startups that we run at Just Eat. You can absolutely ship stuff, but if you ship stuff
52:21
and it has problems that affect outside of your team and even inside of your team, then you're probably going to have conversations, just like anywhere else, I suspect. So, promote the idea of performance budgets. I mentioned this earlier. Say your login operation. You have a target that it should be 250 milliseconds or less.
52:41
If it breaches that, you now have a data-driven conversation with the person that is prioritizing work to say, well, we agreed that it's got this target and it's not meeting that, so we need to do some prioritization to figure out where to put the optimization work, or we need to change the target, just like a set of tests.
53:02
It's a much shorter conversation. You don't have to go and justify because this is a pre-agreed target and you're no longer meeting it. So, it's, you know, fewer meetings. That's great. And we also have a more extensible system, a more decoupled system where we can put in hooks
53:21
to change the behavior based on the type of traffic that we're throwing at the system now. So, Git push production is one step closer. I can imagine a world where I commit some code. Maybe I remember to compile it. That would be great. Every once in a while, I don't. I commit some code.
53:40
Continuous integration runs the tests, deploys it to a QA environment, runs functional tests against the integration environment, deploys it to production, doesn't open it up for customer load yet because we have traffic routing maybe, runs the tests there, cranks it open to the employees perhaps
54:00
because they're a nice, safe user base to try stuff out with, lets them know that there's a new feature and where to go and find it, and then cranks it open to maybe a beta test group of customers, and then at each stage of that process, the computer will decide whether to pass or fail and go to the next, promote the change to the next part of the chain.
54:21
That's a pipeline I want. We don't have that yet, but it's one step closer with this sort of activity. So, yeah, hopefully there's some stuff you can take away from this and apply to your own environment, your own culture, your own context.
54:41
If this is interesting, then we're always recruiting. We've got like 50 open spots for talented people across the whole gamut of the job range. Have a look at our jobs page if you're interested. Ping me, and that's it. Are there any questions?
55:01
Yes. Yeah, that's right. We tried it out. We did a reorg towards it in spring of last year,
55:21
and it's loosely based around the Spotify model of guilds and tribes, but we haven't called it that internally because we don't want to open ourselves up to, well, we read the blog post and we're doing it wrong. We don't want the dogma of that. So I sit in two teams. I sit in the feature team whose objective is making the restaurants happy,
55:44
doing things to help them be happy with their partnership with us. And so I'm currently working on exposing some data so that they can see their metrics of how many orders they're doing a night and that sort of thing. And I also sit in the component ownership group,
56:01
the sort of operational focused group that owns the consumer context of the platform. So log in, registration, send push notification to consumers on order accepted, show the order history. Yeah, that paging thing was me as well. That sort of thing. And so I lead that group, which is I think now six other engineers.
56:24
There are no UX or designer type people or product type people in the component ownership group because it's a very technically focused group. It's engineer led. Whereas the feature team is the UXer and the product manager and the roadmap-y stuff
56:41
and cross-functional engineers. So there are still engineers in this group that know how to operate in production. They just don't own the components in this group. And so everyone wears two hats. And there are some pros and there are some cons to this and that's probably a talk in its own right.
57:01
Cool. Yes? In terms of our hosting, we could do that, but we haven't done that. We're basically very happy with AWS. We are a Windows platform.
57:22
We write .NET. All of the money-making bits of our system are C# .NET, with a few exceptions. And then all of the infrastructure type stuff is cross-platform open source: StatsD, Graphite, that sort of stuff. And we have one team that is responsible for the infrastructure services
57:40
that underpin the e-commerce platform. But no, we haven't tried hosting in a different cloud. We haven't tried hosting in a data center or anything like that. Yeah. We use AWS CloudFormation for, basically, I guess, the unit of encapsulation
58:02
for a feature within our environment. And Azure, I think, only very recently has an equivalent to that. So CloudFormation allows us to get away with writing a lot less automation because you basically create a JSON file that declares what resources you want, parameterize that, and then tell AWS,
58:23
please give me this. Azure doesn't seem to have an analog for that, or at least it didn't until very recently, I think. Terraform is another option, but we're happy, basically.
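For context, a minimal sketch of the kind of parameterized JSON template meant here; the resources and values are illustrative, not Just Eat's actual stacks.

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Illustrative feature stack: declare what you want, parameterize it",
  "Parameters": {
    "InstanceType": { "Type": "String", "Default": "m4.large" },
    "DesiredCapacity": { "Type": "Number", "Default": "6" }
  },
  "Resources": {
    "ApiServerGroup": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "MinSize": "2",
        "MaxSize": "200",
        "DesiredCapacity": { "Ref": "DesiredCapacity" },
        "LaunchConfigurationName": { "Ref": "ApiLaunchConfig" },
        "AvailabilityZones": { "Fn::GetAZs": "" }
      }
    },
    "ApiLaunchConfig": {
      "Type": "AWS::AutoScaling::LaunchConfiguration",
      "Properties": {
        "ImageId": "ami-12345678",
        "InstanceType": { "Ref": "InstanceType" }
      }
    }
  }
}
```

Any other questions?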
58:41
We've got a couple of minutes. Yes? So we don't run profilers in production. We run profilers as a... If we have a problem in production, we'll deploy the same code into a QA environment, having turned off fake load and scaled up,
59:02
into a QA environment and then hook a profiler on, or do it on a workstation. Our system is reasonably decoupled, and so usually it ties back to something being slow, one thing being slow and then that transitive dependency chain slowing down as a consequence.
59:21
We're not so great at graceful failure. We still do have quite hard dependencies on bits within our system, but they are separate bits, so they are smaller, easier to optimize in isolation. Did that answer the question? I'm not sure. Yep.
59:43
So we... Yes and no. I mean, we can still run load at an individual component because we can tell it to fake away the dependencies that it has with a recording proxy that we have, so we can get quite a long way with that sort of approach. And then it's...
01:00:00
usually reasonably evident where we've written a tight loop or when we're not resilient to the database being slow or something like that. It's quite rare that we need to do some profiling because we have decent monitoring in place. So we're able to use that to get a sort of general feel for where the area is.
01:00:21
And then that's usually enough that we can take that and we can profile that without actually running load per se. Yeah? You mentioned you weren't used to that. Because we operate with beautiful chaos.
01:00:44
Every team is autonomous about choosing their dependencies. We have a monolithic SQL server as a sort of legacy concern because that's what we inherited over time. As we moved to the cloud, we started to use more of the cloud's capabilities. It's actually easier to list the services that we
01:01:01
don't use from Amazon compared to the ones that we do. So we're really bought in, I guess. And various people decided, oh, DynamoDB, great. Guaranteed sub-10 millisecond latency, structure-less. I can have a hash key and a range key.
01:01:20
What else do I need? I'll just shove data in that. Yeah? If you need to query it, you'll probably have some problems because you have to then index it and pay for that. But really, the cost of Dynamo is so eclipsed by the cost of our EC2 spend, it doesn't actually make that much difference.
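A sketch of that hash-key-plus-range-key usage with the AWS SDK for .NET (PutItemAsync is the real call; the table and attribute names are hypothetical):

```csharp
// Sketch of "hash key plus range key, shove data in" with DynamoDB.
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class OrderEvents
{
    static async Task Main()
    {
        var dynamo = new AmazonDynamoDBClient();

        await dynamo.PutItemAsync(new PutItemRequest
        {
            TableName = "order-events", // hypothetical table
            Item = new Dictionary<string, AttributeValue>
            {
                ["OrderId"] = new AttributeValue { S = "ORD-12345" }, // hash key
                ["Timestamp"] = new AttributeValue { S = DateTime.UtcNow.ToString("o") }, // range key
                ["Status"] = new AttributeValue { S = "accepted" }, // structure-less payload
            },
        });
    }
}
```

We're also using Elasticsearch in a few places.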
01:01:41
Our restaurant search is being powered by that in some aspect. Our centralized logging is also totally Elasticsearch. And then we're sort of gradually engaged in splitting up that big monolithic SQL server mirror pair database so that each API is
01:02:01
the owner of the data that it exposes. So then we can have a lot of smaller databases that are inherently easier to scale than a single large dependency. But basically, you ship it, you support it. And so you can infer from that that you can pick what you
01:02:21
ship and how you ship it. But you're supporting it. So any other questions? OK. Thank you very much.