
You’re in production. Now what?

Formal Metadata

Title
You’re in production. Now what?
Number of Parts
110
Author
Tatham Oddie
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
A tiny subset of your users can’t log in: they get no error message yet have both cookies and JavaScript enabled. They’ve phoned up to report the problem and aren’t capable of getting a Fiddler trace. You’re serving a million hits a day. How do you trace their requests and determine the problem without drowning in logs? Marketing have requested that the new site section your team has built goes live at the same time as a radio campaign kicks off. This needs to happen simultaneously across all 40 front-end web servers, and you don’t want to break your regular deployment cadence while the campaign gets perpetually delayed. How do you do it? Users are experiencing 500 errors for a few underlying reasons, some with workarounds and some without. The customer service call centre need to be able to rapidly triage incoming calls and provide the appropriate workaround where possible, without displaying sensitive exception detail to end users or requiring synchronous logging. At the same time, your team needs to prioritize which bugs to fix first. What’s the right balance of logging, error numbers and correlation IDs? These are all real scenarios that Tatham Oddie and his fellow consultants have solved on large scale, public websites. The lessons though are applicable to websites of all sizes and audiences. The talk will be divided between each of the three scenarios described here, and then stuffed full of other tidbits of information throughout.
Transcript: English (auto-generated)
All right, everyone, we'll get going. So my name's Tatham Oddie. I work for a consulting company down in Australia. And my focus is on web applications. And what this talk is about is a number of the lessons I've had over the years of working on large public websites, and how, as soon as you put something on the internet, it doesn't matter how much you've tested it,
it will break straight away. So you've made it into production, and it's kind of a question of, now what do we do? So it's a bit of a kind of DevOps-focused talk, if you're familiar with that term, except very much talking about application code. I'm not talking about deployments or PowerShell scripting or any of that sort of stuff. A lot of the techniques I'm going to show you,
well, there's going to be sort of three main areas I'll walk through. They're quite simple in how they actually work. They're just very simple gets, posts, things like that. But I find them very powerful. They're also not specific to any particular web framework. So I'm going to be showing you some stuff in ASP.net. You can go and apply this in Ruby on Rails, Node.js, whatever you want to use.
And they're also equally applicable to sites of all sizes, small ones, large ones, and also both internal and public websites as well. So what I'm going to kick off with is I've got a bit of an application that I'm running here, which is mysite.localtest.me. And this is a bit of an auction website
that I've been building. Here's the story to it. So we can see the time's ticking over. I've got a login button, an about button, that sort of stuff. Now the time that's ticking over here, we're showing that clock because it's an auction site and we want people to know what our official time is. And that's obviously driven by JavaScript. So I always like to approach JavaScript with building everything first of all
in a way that it works without JavaScript and then adding JavaScript as your progressive enhancement on top because even if users do have JavaScript, they don't have it while your JavaScript's still loading. And I also find it makes things easier when you're building it. Now in this case here, if we were running in production, if I go down and we look at the bottom,
I'm going and combining all of my JavaScript into a single file. All works well enough. The problem is if I have a problem that's happening only in production or in an environment where I'm running combined script mode, what do we do? We usually end up going and changing some form of config flag or something like that and we have to then let that environment spin back up again.
It's not that nice. So one of the ideas that we developed on one project was the idea of actually being able to pass up just a query string flag where we can just go, the JavaScript mode is off. Simple as that. And I have no JavaScript. Or I can say the mode is dev for development where I do get JavaScript but then if I look at my page source,
you'll see I'm getting the three different files exploded out separately. Incredibly simple, means it can work in any environment. What this also opened up for us was the ability for us to actually have functional tests that targeted our non-JavaScript scenarios as well which was quite easy to do rather than going and trying to automate disabling and enabling JavaScript in the browser.
In terms of the impact, one of the lessons that I kind of personally got out of this was starting to think about these things in that you can go and have this flag on a big public website and there's no impact. Like even if a user finds it and knows it's there, all they're doing is getting the same JavaScript file split up. So it's actually quite safe to do.
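A minimal sketch of that query-string flag in ASP.NET MVC terms; the jsmode name, the enum and the combined bundle path are illustrative assumptions rather than the actual project code:

```csharp
using System;
using System.Linq;
using System.Web;

public enum JsMode { Combined, Dev, Off }

public static class ScriptModeHelper
{
    public static JsMode GetJsMode(HttpRequestBase request)
    {
        JsMode mode;
        // e.g. ?jsmode=off or ?jsmode=dev; anything else falls back to the combined bundle
        return Enum.TryParse(request.QueryString["jsmode"], true, out mode) ? mode : JsMode.Combined;
    }

    public static IHtmlString RenderScripts(HttpRequestBase request, params string[] scriptPaths)
    {
        switch (GetJsMode(request))
        {
            case JsMode.Off:
                return new HtmlString(string.Empty);    // no JavaScript at all
            case JsMode.Dev:
                // each file exploded out separately, so you can debug unminified scripts
                return new HtmlString(string.Join("\n", scriptPaths.Select(p =>
                    "<script src=\"" + HttpUtility.HtmlAttributeEncode(p) + "\"></script>")));
            default:
                return new HtmlString("<script src=\"/scripts/combined.js\"></script>");
        }
    }
}
```

A view then calls something like ScriptModeHelper.RenderScripts(Request, ...) instead of emitting a hard-coded bundle tag.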
So that's simple enough. But as soon as I go to another page, I go and I lose that query string key obviously. So this doesn't work for everything. So the next approach that we had, if I switch over into Firefox for this, is I can actually jump down here into my cookies and I can go and create a cookie
where I'll say something like config.js mode, set the host to dot my site local test, make it a session cookie, and then I can just say the js mode is, we'll make it off, dev so we can see it. And then when I go out of that, now every time I go and load any page on the site,
I'm getting my JavaScript getting split out. So I can go and walk across multiple parts of the site. Now JavaScript is only really one part of the story. You could obviously do all of this by just enabling and disabling JavaScript. But what we're starting to do here is actually generally just keep all of our config settings
in a way that can be overridden. Now we actually, so this here is about a specific project I worked on. We moved all of our config settings as much as we could out of the web config and actually into a database. And we just had a database table where we had the config key and the value and a TTL for how long it could be cached for across all the different servers.
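A minimal sketch of that override-able config idea; the config.<key> cookie convention matches the demo, but the ConfigSettings table, SQL and caching here are illustrative assumptions rather than the real implementation:

```csharp
using System;
using System.Collections.Concurrent;
using System.Data.SqlClient;
using System.Web;

public class ConfigService
{
    private class Cached { public string Value; public DateTime ExpiresAtUtc; }

    private readonly string connectionString;
    private readonly ConcurrentDictionary<string, Cached> cache = new ConcurrentDictionary<string, Cached>();

    public ConfigService(string connectionString) { this.connectionString = connectionString; }

    // Resolution order: query string (this request only), then a "config.<key>" cookie
    // scoped to the parent domain (follows the user across pages), then the database.
    public string Get(string key, HttpRequestBase request)
    {
        var fromQuery = request.QueryString[key];
        if (!string.IsNullOrEmpty(fromQuery)) return fromQuery;

        var cookie = request.Cookies["config." + key];
        if (cookie != null && !string.IsNullOrEmpty(cookie.Value)) return cookie.Value;

        return GetFromDatabase(key);
    }

    // Each server caches values in memory and re-reads them once the row's TTL expires,
    // assuming a table like ConfigSettings(ConfigKey, ConfigValue, CacheSeconds).
    private string GetFromDatabase(string key)
    {
        Cached cached;
        if (cache.TryGetValue(key, out cached) && cached.ExpiresAtUtc > DateTime.UtcNow)
            return cached.Value;

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT ConfigValue, CacheSeconds FROM ConfigSettings WHERE ConfigKey = @key", connection))
        {
            command.Parameters.AddWithValue("@key", key);
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                if (!reader.Read()) return null;
                var value = reader.GetString(0);
                cache[key] = new Cached { Value = value, ExpiresAtUtc = DateTime.UtcNow.AddSeconds(reader.GetInt32(1)) };
                return value;
            }
        }
    }
}
```

The whitelist of keys allowed to be overridden from the client, and the signature check that comes up shortly, would sit in front of the query-string and cookie steps.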
And the advantage that this gave us was that we could go and turn things on and off across different servers very easily. We also had a fourth column where we could target config to a particular server. So then what we did was we whitelisted config settings to be able to say, okay, these ones are allowed to be overridden from the client. So the next scenario that this creates is, well, what happens if we've been working
on a whole new section of the website and marketing wants it to launch at a particular time like exactly 10 a.m. on Monday morning right in the middle of peak traffic because that's when an ad campaign's kicking off. We wrapped all of this type of stuff in feature flags. So if I have something like config secret product launch
and then I can go and set that to true, then when I go and load my page, I've got a new link showing up at the top, new and the page becomes available. So rather than actually going and deploying features out
on the date and time, what we do is just turn the config flag on. Now from a server consistency point of view, we wouldn't have actually used a Boolean like I just did here. What we actually used was always dates and times so that the servers could re-cache their config information every couple of minutes and the server times are all consistent. So what we'd have in the config setting is we'd say this page or this feature
becomes available at 10 a.m. so that they'd all turn on simultaneously. So we turned the links and everything on. From an operational perspective, it also meant that we could then turn features off again later. So there was a couple of times where we ended up on national TV where there were segments about the site and everybody would go and jump on and if you'd watch the firewall and we'd literally see 50,000 extra sessions
get added in less than a five minute period. So during that time to control load on the site, we'd actually go and turn features off like stop you from being able to go to the second page of search results. You just see one page and then we'd turn off the ability for you to change the number of items per page. So you just got 100 results no matter what to increase the effectiveness of our caches
and we could do all of that through these database-based config settings that also allows us to then have client-side overrides. So let's say we've then gone and deployed this out to production, the new features there. Now marketing wanna be able to get in and actually check that the feature is good to go and that all the content's ready because even if it's functionally right,
they wanna see that the data in the production database is showing up properly before they launch it. Now they're not gonna go in and create cookies and you're gonna have security impacts around that as well. So the other thing we built was this tool which I called Configurator. So we were running on mysite.localtest.me. This tool runs on configurator.mysite.localtest.me.
So under the same parent kind of domain. And what this tool does here is I can actually, you can see it's showing, I've got a JS mode of dev is what it's overridden to. I could say turn that off and I can turn my secret product on. I go and reload the page and those settings have taken effect. Now this isn't changing anything on the server.
All it's doing because it's in the same root domain is it's just creating those cookies locally and going and putting them in. So then we could actually open this tool up to internal users to be able to go and try out different configurations on the site. And then one of the other interesting scenarios we get into, you'll see one of the keys here, UTC Now.
Because we also relied on this very heavily for our functional tests, and we'd have things like you'd have, we'd open an auction, we'd put a series of bids on them and then we'd have to close the auction. Except our auctions, if people keep bidding like a real auction, they keep going. So they only close once nobody's bid for 10 minutes. Now you don't want a functional test
that just delays for 10 minutes. So what we could actually do is I could go in here and I can even make this tomorrow. As soon as I blur out of that field, now when I go and reload the page, you'll see that the server time is now tomorrow because I can actually override that even via a cookie as well. So our functional tests were able to turn different features on and off for what they wanted to test
and move the time backwards and forwards, which was fairly powerful. Obviously some security questions around this now then. Having users turn different pages on and off, a little bit scary. Having them go and change the clock on a bidding website really isn't gonna work. So the other advantage of having this tool is every time our application got deployed,
one of the things that our database upgrade scripts would do is they'd generate a new random key and they'd store that in the database. Now whenever you send a cookie override, what we actually do in production, where we have a cookie that's something like config.js mode equals off, there then also needs to be another one,
config.js mode.signature equals, and then a digitally signed value based off the key that's in the database and whatever value you're sending up for the override. And then that key would roll over every time we deploy. So obviously not something you can go and compute in your head, whereas using this tool, it could go and compute all of those signatures
and then be able to send them up. So whoever gave access to this tool and it had access to the underlying key, they could do it. So that way there we could even have the CIO could be sitting at home and he could just open it up and turn features on and off because he had access to the tool, which was quite powerful. Any questions around any of that?
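A minimal sketch of how those signed overrides might hang together, assuming an HMAC over the key and value with the secret that each deployment writes to the database; the helper names, hash choice, utcnow handling and feature-flag check are illustrative, not the production code:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Web;

public static class SignedOverrides
{
    public static string Sign(string key, string value, byte[] secret)
    {
        using (var hmac = new HMACSHA256(secret))
            return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(key + "=" + value)));
    }

    public static string GetOverride(HttpRequestBase request, string key, byte[] secret)
    {
        var value = request.Cookies["config." + key];
        var signature = request.Cookies["config." + key + ".signature"];
        if (value == null || signature == null)
            return null;

        // The secret is regenerated by the database upgrade scripts on every deploy,
        // so previously signed values stop working automatically.
        return signature.Value == Sign(key, value.Value, secret) ? value.Value : null;
    }

    // An overridable clock: functional tests (or the Configurator tool) can move it
    // forwards and backwards, falling back to the real UTC time otherwise.
    public static DateTime GetUtcNow(HttpRequestBase request, byte[] secret)
    {
        DateTime overridden;
        var raw = GetOverride(request, "utcnow", secret);
        return DateTime.TryParse(raw, out overridden)
            ? DateTime.SpecifyKind(overridden, DateTimeKind.Utc)
            : DateTime.UtcNow;
    }

    // Date-based feature flags: store the moment the feature becomes available rather
    // than a boolean, so every server flips it on simultaneously.
    public static bool IsFeatureOn(DateTime availableFromUtc, DateTime utcNow)
    {
        return utcNow >= availableFromUtc;
    }
}
```

The Configurator tool is then just a convenient, access-controlled way of computing those signature cookies for the people who are allowed to use them.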
So yeah, incredibly simple but opened up a whole bunch of scenarios for us. Before I go on from this little sidetrack, the domain name I'm using here, you'll see localtest.me. Now in coding this up as a solution to make this work, I needed to actually have host names because of the way the cookie security works obviously.
What you traditionally do with this is you'd go and open up your host file and you'd make that up and you'd point it at 127.0.0.1. A bunch of us got sick of doing that. So this is actually a domain name. It's a real domain name on the internet, localtest.me. And anything.localtest.me just points to your local machine. So you can just go and make up host names.
The only request that goes off your machine is the DNS hit and then we just point all the traffic straight back to your machine. And then the other advantage of it, so this is actually all, the only domain name it doesn't redirect is readme.localtest.me. So that's got all the instructions on how to,
or what it does. But we also actually give you a completely legitimate SSL certificate. So if you're doing client-side SSL development at all and you get sick of those self-signed certificates, we pay for and ship an SSL cert that you can just use yourself that works against those host names. It's a wildcard cert.
All right, so we've got our website into production and we're now starting to get some errors back. The first thing that we did was build a really simple endpoint, which is just debug/throwexception. The number of projects I get to where, whenever they want to work
on their custom exception page, they go in and they break some line of code to make it throw an exception. Just add an endpoint that all it does is throw an exception. And that means you can actually go and test that in every environment. But then what we did, you'll see here because I'm running locally, I'm going and listing out the full text. On this particular project,
and it's something I've done subsequently since, we also write out this kind of error number here. This was a company who had a customer service line and people would phone up and they'd have issues and things like that. Now the way this error number is structured, you'll see every time I go and refresh this page, the last part of the number is changing. The first half of the number is what we refer to as the error ID
and the second half is the instance ID. So what we do when there's an error, reinitializing ReSharper, and this is all lovely demo quality code, so please don't judge me too much. What we go and do is we get the exception text
which is basically anything that's gonna be consistent every time this error occurs. And we calculate an event-log-safe integer hash. I'll explain what that is in a second. And that gives us the error number. We basically just take that text and hash it.
And in the instance specific text, we go and add all sorts of information about what was the IP address they came from and what were all the headers they sent and their form data and all that sort of stuff. And then we go and calculate another hash off that. And then we just concatenate the two numbers together. Now the advantage of this is from an operational perspective, when we get the error reports coming out of this,
we can actually go and group our logs by the error ID at the start and then just say okay, tell me which error happens most often. And we could focus our efforts there rather than having to go troll through all of the different instances. It also meant from a customer support perspective that when we had something like this come in, we could work out what the work around was,
let the customer support people know or help desk or whoever you've got and just say whenever you get a 54172 come in, that's a known issue. The expected resolution's about a week from now and here's the work around. So rather than just having that really generic page that just says something broke, we could actually have a story to go back to our users and give them some information about what they could or couldn't do at the time.
And if it was a case where we really didn't know what was happening or is specific to one particular user, they could obviously give customer support that whole number and then using the last part of it, that's guaranteed unique to every single instance every time an exception gets thrown across our site. So we could just go and look up that particular correlation ID.
Now from a reporting perspective, you'll see there's two different ways that I calculate the hashes in here. So for the error number, we calculate this event log safe integer hash. Zoom out a little bit.
So I'm a firm believer it doesn't matter what your application is or what logging you have in place, you can do all of that as well as long as you write to the event log. The event log is where admins expect to find stuff on the system. So every time we have one of these errors come out, we go and write it into the event log. Now we also wanted to be able to,
in this particular case, we were using SCOM, System Center Operations Manager, but there's heaps of tools that'll go and surface information out of your event log. So using that error ID, we can actually write that into the event log as structured data, and then with all of these other tools you get all of that reporting and analysis for free, basically. The problem with that is there's a very limited set of IDs you can use in event codes
without going and adding custom fields. So basically we just go and take the error text, get an MD5 hash, which gives us 128 bits, fold that down to 32, and then we wrap it into a range of 10,000 that we can use. So there's a risk there that our error IDs can overlap, but it's fairly rare and it hasn't actually impacted us at all.
That'll obviously depend more on how you wanna go and log things in your particular scenario though. The other thing we did in showing the instance ID, basically we build up this massive string of text that has the date, time, the request headers, the form, all that sort of stuff and then we just call string.getHashCode except from just a raw usability perspective.
One of the problems with getHashCode is it returns a negative number lots of times and users aren't very good at reading out negative numbers. They'll just ignore the dash because they think it's some formatting thing which would make it hard for us to go and find stuff. So what we do in that case is we get the hash code and it's a 32-bit integer and we just wrap it around into a 64 to make it always positive.
So just a little approach there. And then the information that we'd get coming out would allow us to go and actually go and search for all these different instances. So this is the type of information that end up in the event log. There are 50085 is a favicon not found error.
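A minimal sketch of that two-part error number, following the scheme just described (MD5 of the stable exception text folded into a 10,000-wide range, plus a GetHashCode of the instance detail shifted so it is never negative); the helper names and the separator are illustrative:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ErrorNumbers
{
    // Fold an MD5 hash of the stable exception text down to a small positive range
    // that fits comfortably in an event log event ID. Collisions are possible but rare.
    public static int EventLogSafeHash(string exceptionText, int range = 10000)
    {
        using (var md5 = MD5.Create())
        {
            var hash = md5.ComputeHash(Encoding.UTF8.GetBytes(exceptionText)); // 128 bits
            var folded = BitConverter.ToInt32(hash, 0);                        // down to 32 bits
            return (int)(Math.Abs((long)folded) % range);                      // wrapped into 0..range-1
        }
    }

    // GetHashCode returns a 32-bit value that is often negative; widen it to 64 bits and
    // shift it so support staff never have to read out a minus sign over the phone.
    public static long PositiveInstanceHash(string instanceText)
    {
        return (long)instanceText.GetHashCode() + (long)int.MaxValue + 1;
    }

    public static string Build(string exceptionText, string instanceText)
    {
        return EventLogSafeHash(exceptionText) + "-" + PositiveInstanceHash(instanceText);
    }

    // The /debug/throwexception endpoint mentioned earlier is just an action whose entire
    // body is: throw new Exception("Deliberate test exception");
}
```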
That's another one. Yeah, so fairly simple logging. So this is stuff which is really a development responsibility to get out there. Makes your own life a lot easier. And then the last part, while we're talking about event logs
because they're the most interesting thing in the world: ASP.NET actually has a really good infrastructure under it which a lot of people aren't aware of, the System.Web.Management namespace, where it has these audit events that you can go and use. And it actually has a predefined set of web event codes for a whole bunch of different scenarios.
And then anytime you're going and raising your own events it also has a base code which you then go and add on to for your different types of events. So doing that you get a whole bunch of native support for out of SCOM and things like that for monitoring applications.
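A minimal sketch of raising a custom health-monitoring event through System.Web.Management; the subclass name is illustrative, and routing it to the event log is left to the healthMonitoring rules in web.config (not shown):

```csharp
using System.Web.Management;

// Custom event codes must sit at or above WebEventCodes.WebExtendedBase.
public class ApplicationErrorEvent : WebBaseEvent
{
    public ApplicationErrorEvent(string message, object eventSource, int errorId)
        : base(message, eventSource, WebEventCodes.WebExtendedBase + errorId)
    {
    }
}

// Usage, e.g. from a global error handler:
//   new ApplicationErrorEvent("Error " + errorNumber + ": " + ex.Message, this, errorId).Raise();
```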
So in talking about going and exposing out these little end points that were useful, one of the other things that we found was useful to us was actually just going and dropping a text file on the root of the application which has the specific build that went and generated whatever it is that's running in that environment.
Because it made environment traceability a whole lot easier and it meant we didn't have to go and write it out to the footer or something like that which was ugly and users could see it. And it also meant that our application doesn't even have to be running to serve the file because it's just a dumb static file. So really easy form of build stamping. In this case you can see that obviously they pushed a hotfix about eight days ago
and that's how it got out there. And then if we look at, there's another one we stamped as well, version.txt, which has the change set that that build came from. Now in this case it's coming out of TFS so it's just numeric change sets. But two different concepts. One, what was the code that we used to actually get to here? And then what was the build process we went through
so we can always go and download those exact artifacts. Just made environment traceability a whole lot easier. Chrome just forgot to render my tabs for a second. Cool.
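A minimal sketch of that build stamping as a deployment step; the file names match the talk, but how the build number and change set reach this code is an assumption about your pipeline:

```csharp
using System.IO;

public static class BuildStamp
{
    public static void Write(string webRoot, string buildNumber, string changeSet)
    {
        // Two dumb static files in the web root: they can be fetched even when the
        // application itself isn't running.
        File.WriteAllText(Path.Combine(webRoot, "build.txt"), buildNumber);
        File.WriteAllText(Path.Combine(webRoot, "version.txt"), changeSet);
    }
}
```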
All right, so the next thing I wanted to talk about was a scenario that I actually encountered just last week, which I found rather painful, and how we actually went and debugged that. And what this was, was we launched a website publicly and it had a fairly small number of users. We're talking about three to four thousand users, but also incredibly high value users
because it's an investment management site and they're all worth a lot of money to the organization. And there was a very small subset of users couldn't log in at all. They'd go to log in, they'd type in their credentials and then they wouldn't get an error message that said your credentials are wrong or anything like that. They'd just end up back on the log in page.
And we couldn't repro this at all and we spoke to a number of them while they were phoning up customer service and we ended up getting to some of them directly. And they were all using a consistent browser which was the good old wonderful IE8. Except even so, we couldn't repro it in IE8 and there was lots and lots of users in the logs who could log in just fine.
So we started to really wonder about what to do next. And the scenario here, if I go and log in, if I put in bad credentials, I get something that tells me my credentials are wrong. If I go and type in good credentials, I just end up back on the log in page. I didn't go anywhere. So how do we go and debug this?
Is this a scenario type of thing that people have run into? You can't reproduce something like this? Any nods? Do you have a great solution for it? Cool. So what we ended up doing was we wanted to be able to go and basically say, okay, what headers are going backwards and forwards over the wire, right? We wanted to see Fiddler level detail.
But the one guy that we were talking to, who we'd kind of established direct contact with past customer support, is this 80-something-year-old guy who lives another state away, and we really couldn't ask him to even run up Fiddler. Getting him through the process of, are you on the login page? Yes, okay. Was excruciating as it was.
So we needed a better solution than that. So what we ended up building was basically like a server side version of Fiddler where we also didn't want to go and record every request flowing through because we'd just generate way too much data. So what we ended up with was this endpoint
that we put up on the public website slash debug slash start trace because we didn't know where this guy's computer was. We didn't want to gather everyone's logs. And we made it as basic as it possibly can be. There's no cookies, there's no nothing. And no ReSharper running on this machine so I don't know how to use it.
There we go. So what start trace does is there's literally a get and then when you hit the start button it posts back. We have this trace provider. So when I go and do that, it says, okay, your session ID is big long number. Then we can get him to read that back out to us over the phone. What we've done on the server is gone and pushed into server state
and said, hey, anybody that comes from this IP address, because we've just gone and captured his IP, just log everything and record it against this session ID. So this is the experience that he'd have. And on the previous page we had a nice little message that says, hey, you've reached this page. You're gonna help us diagnose an issue. In doing this we will see everything, like our operations people will see everything.
You type in, are you okay with this? Yes, I agree, next. So we're in and then he could go into test, test, click log in, we've got the problem. And he goes, okay, yep, I've just had the problem again. What we were able to do in our server side
is we had a file that had been written out that we were actually producing using log4net. And if I go and open this up, here I've got, where are we? So started trace from that host address and then basically it just puts the session ID behind all of them. And we can even see here the post request
where they started that trace. So I can see, okay, well, he went and posted to debug start trace. We sent back a 200 response and those were our headers. And then he got the Home/About page, got a 401, got redirected to the login page, so did a get there, got a 200, and then here we go.
So we've got raw URL, it's the login page. We've done a post. We can see the request headers coming up. So we can see that he's in IE8. We can see the form, we can see the form body here. So we actually had enough information we could replay this entirely locally.
But even so, we went and replayed it in Fiddler and nothing broke. So we kept going down. We've got the response headers here. We're going and issuing an authentication token to him. And that's got a, where are we, 1643, that's 42. It's got a 20 minute expiry period on the token, which is what we're using.
Then we get to the next bit where he gets the about page and we get a 401. But with this level of information, what we're able to see is if you look at the Home/About GET request there, there's no cookie. We sent him a cookie and the cookie never came back. And we had other cookies that were reliably coming back every time, like the Google Analytics cookies.
Has anybody picked the problem yet? Right. So IE8 doesn't do this: what most browsers do now is the server sends back a Date header and the Set-Cookie header, which has the expiry date, and they look at the difference between them and then apply that difference to the current machine's time.
It might say 4.40 p.m. down here, but if I go and adjust my date time, that's on Sunday, I've rolled my date forward. What this guy had done is he'd bought his computer in Sydney and then he'd moved to Adelaide and he'd changed the time but not the time zone. And there's a half hour time zone difference between Sydney and Adelaide.
And we have a 20 minute cookie expiry for the authentication token. So as soon as he logged in, he was 10 minutes past his expiry time. So quite a complex scenario to go and work out without any information from the person. But what we've basically got there is what I'd describe as a server-side version of Fiddler,
where we can just give them a URL to go to, they can just click it in an email, they hit start recording, and it gives us a correlation ID where we can actually go and do it. To be fair, we probably didn't even need the correlation ID, because you'd only really hopefully have at most one of these scenarios you're working through at a time. You're not really gonna have five people on the phone working through five different scenarios. But it just gave us a bit more traceability there,
which we thought was powerful. Any questions about any of that? Everyone's quite sedate here. I got told that you go to conferences in America and everybody asks questions all the time
because they want to give their opinions the whole time. And you come to Norway and everyone just sits there and listens quietly. So the way we go and actually spin this tracing up
is actually incredibly simple. When you go to start the trace, we just look up the user's host address of whatever machine or network they're on, and we go and generate a trace session ID out of that. And then all that's doing is, just in memory, we've got a dictionary from the user host address
to that trace session ID, which is just derived from their IP address as a long. Now, this solution here is demo quality code and only has to run on one machine. You could do a very similar thing by persisting it back to AppFabric or a database or something like that and letting that propagate out. And in our Global.asax, ASP.NET has a really nice method,
LogRequest, which surprisingly a lot of people don't seem to know about. They jump straight to EndRequest. LogRequest is the perfect point, where you have every single piece of information about the request, to go and log it, which is quite nice.
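A minimal, single-server sketch of that trace provider and the LogRequest hook, assuming log4net as in the demo; the names are illustrative, the session ID here is just a unique-enough number, and the real tool also captured the form body:

```csharp
using System;
using System.Collections.Concurrent;
using System.Text;
using System.Web;
using log4net;

public static class TraceProvider
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(TraceProvider));
    private static readonly ConcurrentDictionary<string, long> TracedHosts =
        new ConcurrentDictionary<string, long>();

    public static long StartTrace(string userHostAddress)
    {
        var sessionId = DateTime.UtcNow.Ticks;   // read back to us over the phone
        TracedHosts[userHostAddress] = sessionId;
        return sessionId;
    }

    public static void LogRequestIfTraced(HttpContext context)
    {
        long sessionId;
        if (!TracedHosts.TryGetValue(context.Request.UserHostAddress, out sessionId))
            return;

        var detail = new StringBuilder();
        detail.AppendLine("Trace session " + sessionId);
        detail.AppendLine(context.Request.HttpMethod + " " + context.Request.RawUrl);
        foreach (string header in context.Request.Headers)
            detail.AppendLine("> " + header + ": " + context.Request.Headers[header]);
        detail.AppendLine("< " + context.Response.StatusCode);
        // Response.Headers requires the IIS integrated pipeline.
        foreach (string header in context.Response.Headers)
            detail.AppendLine("< " + header + ": " + context.Response.Headers[header]);
        Log.Info(detail.ToString());
    }
}

// In Global.asax.cs: LogRequest fires once everything about the request is known.
//   protected void Application_LogRequest(object sender, EventArgs e)
//   {
//       TraceProvider.LogRequestIfTraced(((HttpApplication)sender).Context);
//   }
```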
Okay, any questions about any of that? I was expecting more questions. I'm running nice and ahead of time.
The scenarios we were talking about, there's a couple of people who nodded that you'd had a similar sort of scenario. Would that be sufficient information for you to solve that? You were nodding.
Right, right, so server-side things dying. We had a similar thing on this same project where we pushed a build out to the farm and all of the app pools just started recycling
on every server that the new build was in. The problem is we didn't detect this until we'd actually rolled out the entire farm, because the way our deployment strategy worked was we had 40 web servers. We'd take two out of the pool. We'd upgrade them. We'd put them into the pool for an hour. We'd monitor error rates. And then as long as they looked fine,
it would then automatically roll out to the rest of the servers. All nice and automated and everything just ticks along. The problem was an hour wasn't long enough to surface this problem. And we were in a roll forward only strategy. So we could push out new fixes very quickly, but it was very hard for us to go back
because we were mutating data and there's lots of real-time data. And what it turned out to be was actually a particular piece of data on one auction lot that people could look at would cause a stack overflow and just tear down the entire farm. So the way that we actually ended up diagnosing that was first of all, we didn't really have an easy way
to go and hook up something like WinDbg to generate a dump very easily. We didn't want to go and do that across our whole production environment in an automated way every time the process crashed. So what we actually did was hijack our deployment process at the time, and we created a new build in it,
because our deployment process, which was Go, it's a ThoughtWorks product, already had access to all the boxes. And what that allows us to do is actually just run the script on any particular box and it would just go and snapshot all of the processes at that point in time. And they were crashing often enough that we could then just pick a box that was slow and go, right, go and snapshot that one
because it was in the middle of a stack overflow. So yeah, it was just another way that we were gathering server-side information and dumping it out to disk. And then another scenario that we had was actually a performance-related problem where we were struggling to understand
why something was only slow in production and not in any other environment. And we couldn't run a profiler in production because we're dealing with millions of requests coming in and we couldn't generate enough traffic on a single box to cause the problem ourselves. We tried doing that in UAT
and we couldn't run a profiler on a production box just for the impact and what it was gonna do to our production environment. So what we actually ended up doing for that is there's some really low-level tracing in Windows called ETW, Event Tracing for Windows,
which actually has a whole bunch of ASP.NET and IIS providers in it, and it's an incredibly efficient tracing framework that basically spits out a binary file of every single event that's happening for the providers you set it up for, which you're then able to convert into CSV,
which we could then pull into Excel to go and group, and you basically get every single event in the ASP.NET pipeline, which allowed us to work out which module, in our case, was causing the slowdown. So that's another tool a lot of people aren't aware of, ETW, because that gives you a mix of native and .NET tracing and everything as well.
In the end, the lesson out of that one was not to use the network layer for what was basically an application level decision. What we had was, and this is where we completely came undone. How's this gonna work?
What we had was a series of 40 front-end web servers and a load balancer on the front here where we'd bring requests in, and then we had two search index machines that we would go and query, and on each of these, we'd run two copies of the index, one on port 16100 and one on port 16101
because we didn't have, so I'll explain the reason for that and then we had another kind of load balancer in here. So the web boxes had hit this and then they'd go out and hit the search boxes. Now the reason we were running dual indexes on each box was that we didn't have live updates at all. So in order to go and update the index,
we'd built the new one. This was an old version of Endeca. We'd build the new one and then we'd sort of spin it up and then we'd have to turn the old one off, except there was a startup time to swapping over the indexes. So what we'd do is we'd kind of alternate backwards and forwards in the one box between port 100 and then as soon as it went down, we'd go to port 101.
We'd query that. Five minutes later, they'd toggle backwards and forwards and we were doing this at a network layer where the application code at the time, we thought, oh, this is brilliant. We'll just, if the index is down, we swap to the other one because it also gives us failover and it just means we can just toggle stuff on the box. We don't have to go and notify all the web servers what's going on because we didn't want to have to have some way
to message all 40 web servers and tell them to swap index. The problem was, and this was our performance issue, when the port went down, TCP tries for a while to go and set up the connection unless it gets an active refusal. So it sends a packet, it waits three seconds. It sends another one and then it waits nine seconds.
Then it waits another 12 seconds, which takes you out to 21. And the way that this particular load balancer had been configured, it had also been set up as a firewall, which meant that it was never returning the reset packets back. So our box didn't know that the connection had been actively refused, so it got into this three, nine,
and then 21 second retry. And by the time that period had actually, it spent so long retrying that it always actually ended up staying on the same index because that index would come back up within that period of time, which we didn't realize until afterwards. We looked at our logs and went, wow, we were never actually using the second index. And this is why it was so slow. We were waiting for everything to kick on or boot up again.
So the lesson out of that was we were using a, we were making an application level decision of we were toggling resources on and off, except we deferred that decision basically to the network layer, which was the wrong layer. So the solution that we ended up doing was we built a software based proxy
on the front of these boxes, which was port 100. And then we had the indexes behind it, 101 and 102. So then what the actual indexes would do is they only had to send a message to that software kind of router, which was just a switch and say, hey, by the way, we're now moving to index 102.
And then it would just flick all the traffic that way. So that way the web boxes were actually able to also maintain persistent connections through to kind of different instances of this software based router, which kept us nice and performing. Out of this, though, when we got the ETW data, the way that this got identified as the problem
was, before we actually got into all the detail of drilling through all the logs and trying to get profiling and everything, it was just looking at the actual performance. What we did was we had a histogram of, I don't know, three, five, nine, 21, 25 seconds, and we plotted all of the request speeds
and there was kind of a big cluster around here and then a big cluster here and then a big cluster there. And the particular, the way the numbers went up of three, nine and 21, as soon as one of our IT guys saw it, he just went, that's a TCP issue, because they're just the common retry times. So if you have something that takes three, nine and 21, it's a network issue.
Just go straight to the network. It took us something like four weeks to actually diagnose that problem. When you're dealing with a million requests that are coming in in production, it's very slow going and profiling these things out.
One of the other things that we do in another project that I'm on is we use a bit of JavaScript to go and monitor our points, sorry, yep.
So what you're talking about is request validation.
And where request validation gets triggered is as soon as you go and touch the query strings or the form collection or anything for the first time, it runs through and it validates it. What they've actually done is they've done a lot of work on that in kind of recent versions of ASP.NET.
Can't remember which, okay. I'm not gonna go, actually, one last check. The httpRuntime element or node. There's a requestValidationMode in here, which, if you've got an existing app
you've been upgrading in ASP.NET, it'll be set to 2.0. You can also go and update that number there, which is better around MVC applications in particular. Because what that'll allow you to do then is under Request, there's the validate input.
Okay, don't know where it is off the top of my head because I don't use it and I'll explain why in a second. There's two extra collections. There's an unsafe form and an unsafe query string, which you can then go and query directly if you want. Alternatively, you can, in this request validation mode,
you can actually go and turn it off at a server-wide level, or sorry, at an app-wide level, which you haven't been able to do previously. You had to do it on kind of a page-by-page basis. And that's actually what I do on most projects now, anywhere where you've got rich content coming up from users.
Because if you're using Razor and HtmlString, you're encoding by default anyway; you have to explicitly go and not encode something onto the page. So I think that's fairly safe. If you're in a Web Forms world, some of the other updates they've done around this allow you to turn off request validation for individual controls as well. So it'd be worth looking into that.
And there's a heap of different options in here about, you can actually put in your own request validator. So there's a request validation type in there, which you can go and put in your own validator, and that way you could wrap the existing one if you'd wanted to keep its behavior, but go and audit stuff out. Yeah. So there's a few different options there.
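A minimal sketch of plugging in your own validator that wraps the built-in behaviour and just audits what would have been rejected; the class name, the logging call and the web.config wiring shown in the comment are illustrative:

```csharp
using System.Web;
using System.Web.Util;

// Registered in web.config, roughly:
//   <httpRuntime requestValidationMode="4.5"
//                requestValidationType="MyApp.AuditingRequestValidator, MyApp" />
public class AuditingRequestValidator : RequestValidator
{
    protected override bool IsValidRequestString(
        HttpContext context, string value, RequestValidationSource requestValidationSource,
        string collectionKey, out int validationFailureIndex)
    {
        var isValid = base.IsValidRequestString(
            context, value, requestValidationSource, collectionKey, out validationFailureIndex);

        if (!isValid)
        {
            // Record the field that tripped validation, so you can see what real users
            // are actually sending before deciding to relax anything.
            System.Diagnostics.Trace.TraceWarning(
                "Request validation rejected {0} ({1})", collectionKey, requestValidationSource);
        }

        return isValid;
    }
}
```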
Any other questions? Cool. Yeah, so moving on to the client side, one of the other things that we've been doing on one of the projects I'm on is there's a nice little bit that showed up in,
one second, as soon as I get the right API. I'd blame jet lag and the 21 hours of flying from Australia for my JavaScript capabilities.
Don't worry, I'm not Googling for it. I'm just getting a screen.
This is the object I want. Cool. So in the browser, there's a new object that most of the browsers are starting to expose, which is window.performance.timing. I was missing the dot performance part. And this is a really interesting object with all of the raw data the browser captures
around lots of different client side events. So where you'd normally get this information coming out in your kind of waterfall diagram in Fiddler or PageSpeed or something like that, you can go and capture a lot of interesting things in here around how long did the DNS lookup take, how long did it take to establish the TCP connection, which is information that you can't gather yourself
in your own testing, because it all depends on where the users are and what connections they're on and how crappy their ISP is. So we actually go and aggregate a lot of this data because we're really interested in performance. So we can understand stuff like, you know what, most of our page load time is actually getting lost because the DNS is horribly slow. Maybe we should go and actually add an extra DNS server
in that particular region or something like that. And look at kind of tuning our TTLs on the DNS and all those types of things. Recently as well, Google Analytics have actually added support for you to, if you want, you can just go into your Google Analytics account, tick a box, and they'll start recording all of these client side timings for you as well.
And this is the API that they're using to do that. Unfortunately, they did that after we built all of our own aggregation framework for it. One of the things with the Google Analytics stuff, if you go in to use that though, they only serve it out to a percentage of your users, which is normally 5% by default. And they won't give you the reports
until you have a useful amount of data in there. So if you have a lower traffic site, sub 10, or I think something like sub 20,000 a day is their recommendation, you wanna go in and actually turn that percentage up, which you can do in your client side script.
I'll bring that up. And then they'll actually go and aggregate all that and put it into a nice report for you. Awesome. Broke it properly, twice.
So particularly if you're in any sort of public website where you've got that diverse set of users, it's incredibly useful information to have to be able to make informed decisions about it.
There we go. So in our Google Analytics script, what we do is we push a setting or a config value where we say that our sample rate is, we actually wanna sample every single user who hits this. So the impact of that is once the page is fully loaded, it sends up an extra AJAX request back to Google which says here's the information for it. It's just a little bit of extra traffic on the user's browser.
But that's after the page is loaded, so it doesn't really have any impact on them other than bandwidth cost, which we deem acceptable. Funny story about bandwidth cost actually though. There was one site I had once where it went live and we ended up with an infinite redirect problem.
And the way we learnt about this was we had a customer who phoned us up very angrily because he'd gone away for a week and he'd left our site open. And our site had sensitive data on it. So what it would do is after your session timed out, the JavaScript would actually be monitoring the cookies and go, oop, that cookie's not valid anymore. And it'd reload the page so that you were no longer on your authenticated page.
And there was a problem in this in his particular browser and he'd gone away for a week and had left it open. And he was on a 100 megabit cable connection which then just reloaded nonstop for a week. And we'd pushed 24 gigabytes of HTTP traffic down to his connection out of our server farm
and exceeded his quota. And then also given him something like a $300 broadband bill. So we had to send him a gift voucher for that one. Unfortunately, we had enough server capacity we didn't notice. Like, oh, that's where that 24 gig went. Not recommended.
All right, so that was actually all the different scenarios I wanted to go through. It's a little bit disconnected there, but just a number of different tips and stories and things like that. So does anybody have questions or anything else they want to discuss? No? Everyone's here. Yep, Thomas.
Any plans of open sourcing server-side Fiddler? Well, actually these, so this demo app here complete with the configurator and the server-side Fiddler and everything is at hg.tath.am slash somewhere.
Yeah, DevOps hooks demos. Okay, it must be a private repository. So I'll expose that. So that has the code we've been looking at today.
So that's that if anybody wants to download it. I'll make sure that that's open after the talk. It's horrible. Demo quality code shows how something works. Don't copy and paste it. Or if you do, use it to help you diagnose why your production just crashed and died. The implementation of what it sends over the wire is right
and it doesn't have any of the signing stuff in there for the config approaches. I would like to, I don't know, write a blog post more so than probably do an open source library of how we wired all that together because I keep redoing it on every project now.
Yep, so intermittent performance issues are what we had here, because this only happened every five minutes. So the first thing I've found that is just so valuable: I used to just jump straight into all the detail of saying give me every log.
I want to know what method's running and whatever. But I actually find that rather useless now as a starting point. And the first thing I actually do is just get a histogram over time of your response times. Because then with this issue, what we saw was that. And we could actually then kind of note and go, that there is a five minute cycle. And as soon as we identified five minutes,
we went, you know what, it has to do with the search indexes. And as soon as the network guy went and saw that, he went, you know what, that's TCP. And at that point, it's fairly specific. We only had one TCP related search thing. That would have saved me four weeks if I'd started at that diagram. So getting into there and just looking for the, there has to be a pattern to it.
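A minimal sketch of that histogram-first step, with the input format assumed to be response times already pulled out of your logs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResponseTimeHistogram
{
    // Bucket response times into one-second bins so clusters (like the 3, 9 and 21
    // second TCP retry spikes) stand out before you dive into detailed traces.
    public static void Print(IEnumerable<TimeSpan> responseTimes)
    {
        var buckets = responseTimes
            .GroupBy(t => (int)Math.Floor(t.TotalSeconds))
            .OrderBy(g => g.Key);

        foreach (var bucket in buckets)
            Console.WriteLine("{0,4}s | {1}", bucket.Key, new string('#', bucket.Count()));
    }
}
```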
If not, if you're on IIS 7, it's got the failed request tracing stuff, which is really good. You can actually go and set a threshold in there of saying if this page takes longer than three seconds to render, go and just dump all the information about it. And that'll tell you exactly which modules are loaded and everything, where it's at in the pipeline. So you'd at least be able to tell
is it in the request processing or is it in a pre-module or a post-module or something? And then after that, you basically get into, one of the things we did try and do on this when we had this problem but we weren't very effective in making it happen was actually having a script that monitored our performance and as soon as it got slow in one of these periods,
it took a snapshot every 30 seconds with WinDbg. That had a production impact, because we'd lock up the worker process while we wrote the dump to disk, except we had a production impact anyway in that the site was slow. That's why the script was triggering. But when we were doing that, after the fact, we were able to find all the relevant information
back in those dumps but we just drowned ourselves in information unfortunately. Cool, okay, thanks for coming along. You did it. Thank you.