Sherlock Homepage - A detective story about running large web services
Formal Metadata

Title: Sherlock Homepage - A detective story about running large web services
Title of Series: NDC Oslo 2016
Part Number: 77
Number of Parts: 96
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/51843 (DOI)
Transcript: English (auto-generated)
00:04
All right. Good morning, everyone. Can everyone hear me? Awesome. Thank you. I was a little bit worried when I saw no one sitting in the middle, but it sort of explains itself because the screens are over there, so thank you. It's not because I smell, which is awesome after a couple of days of NDC. Welcome to this session, Sherlock Homepage.
00:23
It's a big detective story about running large-scale services, and we'll see in a bit what that means, but first let's dive in. So on one good day last year, I think it was around April last year, we had an issue with our live sites. We were completely unavailable, nothing was available, none of our customers could
00:42
use our website. So what we were seeing was that our primary website location was just gone off the internet. Nothing available. If you used our domain name, you could not reach our website, and nothing was visible there, so that was an issue. Now the good thing is we are smart people, and of course we have a secondary location, and we have Traffic Manager in front, which
01:03
is a DNS service that does failover in case the primary location goes down. So awesome, we thought we had fixed it. Problem was, when that failover happened, we found out that the secondary location was also unavailable, and you would think we would be smart enough to have a third location, but unfortunately no, so we were completely gone off the internet. So that's a
01:23
big issue. Website down, our customers were complaining, Twitter was on fire, everyone was complaining that we had an issue. So we went on an initial investigation, so we got some monitoring alerts and stuff like that, and we started looking into what was happening there. And of course we saw
01:41
that the website was unavailable from the internet, but we did see that all the server instances in both the primary and secondary location were all available. If we hit the direct endpoints of each individual machine, we could reach the individual machine and just look at the website and do whatever we wanted there. The only thing was that Traffic Manager, so the
02:01
DNS didn't work, we couldn't reach the website. And if we hit the load balancer endpoint that was in front of those web servers, we would also not hit the website. So either we had a DNS issue, or the load balancer had an issue. So we went to look, and we found, running on Azure, that it actually was the load balancer causing a
02:22
problem. So I don't know if you know this, but if you run cloud services on Azure, you can actually instruct the load balancer to check your website every couple of minutes, to see if it's still available. By default, it just checks the root of your web URL, sees if a 200 OK comes back, and if that's okay, your machines are in the load
02:40
balancer pool and they will get traffic. Now, in our case, we had customized this, and we had a custom internal API endpoint where we went to api/status. And if that one didn't give back a 200 OK, then that machine would go out of the load balancer rotation and would become unavailable through the load balancer. Now, the funny thing there was that eventually all the machines were
03:03
going out of the load balancer because this endpoint was returning a non-successful status code, which meant the load balancer was there, the web servers were there throwing 500 errors, but the load balancer would just take them out of rotation. And that happened on both locations. So primary location and secondary location, which meant we were just completely
03:22
gone because all traffic would stop at the load balancer, because the load balancer had nothing to serve traffic to. In the code of this status service, we found out that the root cause of this issue could be one of many services. This status endpoint internally checks if our REST
03:41
APIs are available, if our SQL database is available, if our storage is available, if our search API is available, and if we could post statistics. So one of these services was probably down. And because of that, this status endpoint was returning an error, which in turn would remove all the instances from the load balancer and just make us
04:00
disappear from the face of the earth. So we solved it. And we basically removed the load balancer probe so that all the machines would be back in business. And yeah, we started digging in.
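To make the setup concrete, here is a minimal sketch of the kind of custom probe endpoint being described. This is not the actual NuGet.org code (which is open source); the names and checks here are hypothetical, written as an ASP.NET Web API controller:

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using System.Web.Http;

// Hypothetical load balancer probe endpoint, reachable at api/status.
public class StatusController : ApiController
{
    [HttpGet]
    public async Task<HttpResponseMessage> Get()
    {
        // Probe every dependency this front end needs to serve traffic.
        var checks = new Func<Task<bool>>[]
        {
            CheckSqlDatabaseAsync,
            CheckStorageAsync,
            CheckSearchServiceAsync,
            CheckMetricsServiceAsync
        };

        foreach (var check in checks)
        {
            if (!await check())
            {
                // Any non-200 answer makes the load balancer take this
                // machine out of rotation until a later probe succeeds.
                return Request.CreateResponse(HttpStatusCode.ServiceUnavailable);
            }
        }

        return Request.CreateResponse(HttpStatusCode.OK);
    }

    // Placeholder checks; the real ones would ping SQL, storage, search, metrics.
    private static Task<bool> CheckSqlDatabaseAsync() => Task.FromResult(true);
    private static Task<bool> CheckStorageAsync() => Task.FromResult(true);
    private static Task<bool> CheckSearchServiceAsync() => Task.FromResult(true);
    private static Task<bool> CheckMetricsServiceAsync() => Task.FromResult(true);
}
```

Note the failure mode this design enables: the checks cover dependencies shared by every machine, so when one shared dependency fails, every machine fails the probe at the same time and the entire pool empties, which is exactly what happens in this story.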
04:23
Now before we dig in, I do want to mention this one. I actually like the website highscalability.com very, very much because they have interesting topics on all the big companies like Twitter, like LinkedIn, on their architectures and stuff like that. And every now and then they have an insightful post. And I especially like this one, where they mentioned the book called Mindhunter, which is written by an FBI profiler who does investigations on serial killers and psychopaths and so on. And it's funny that
04:42
hunting bugs and issues in your code and your applications is quite similar to how those guys work. They try to put their minds into how the system works, into what the bug, or the psychopath in this case, would be doing, and why he would do that and what
05:02
the approach would be to cause havoc and things like that. But the bottom line of this book is you have to know what it's like to be in the shoes of the psychopath or serial killer, or in this case, the bug or the issue in your system. Another nice quote in there is that you have to have sympathy for the
05:22
machine, you have to really think like the machine, think like your code, know your code, know whatever is happening, and so on in your system. So that's a really important thing. And maybe you don't know your entire system completely. Imagine you're new on the job and you're tasked to investigate an issue like this, but you'll probably
05:41
have done something similar in a previous job. So you take all that experience to your next job and you get to know this type of distributed systems and how typically these architectures are built. So we had that knowledge, we had empathy for our system because our system was down. And we were thinking, what could be the cause of our system being down? And all
06:02
that combined, we basically had a hunch thinking maybe it's the load balancer. And because of that experience, we proved that it actually was the load balancer taking all the machines out of rotation. One nice quote in this book as well, which I don't have on the slides, is that bugs are like serial killers. If you put
06:22
them in society and they are in the right circumstances, nothing will happen. They will just remain quiet and do nothing until at a given point the right circumstances are there and then they take action. And they can take one action and then they can take the next one and maybe the next one until they get caught. And the same with bugs. And that's
06:41
a thing we are going to see in this story. We fix a bug, then a new bug turns up and well, we basically kept on rolling there. So thank you. Welcome to Sherlock Homepage, a detective story about running large-scale web services. I'm Maarten Balliauw. I work at Microsoft on the NuGet team. So you can imagine what this large web service could
07:01
have been. You can also imagine who would have shot me on the stage last year if I told them I was from the NuGet team. I would have shot myself. All right. Thank you. So you may have noticed in the past year that things have improved on the server side and that we did a lot of work. And that's what this talk
07:21
is going to be about. And there's a couple of nice things in there. But before we dive in there, I want to give you some history on how NuGet came about and how the services were built and what our architecture looked like last year, around, yeah, I think the March, April timeframe. So our site serves, in case you don't know, NuGet packages. We serve packages to all .NET developers
07:42
in the world. And with .NET Core coming about, I think we are only going to see more traffic on NuGet than what we currently already have. On average, we are good for some 10 million requests per day on our compute nodes. We also have a lot of stuff in a static CDN type of approach where we don't have to have compute to
08:01
serve requests, which is awesome because we can offload a lot of things from our dynamic servers, but still we have 10 million requests per day on that dynamic website. It was built four or five years ago. I still feel that it was built as a proof of concept, saying, okay, we have OData, which is awesome.
08:22
We want to do this packaging thing, which everyone is doing. So let's maybe create a database, put a packages table in there, have storage where we can serve the packages, then slam OData on top, basically expose the SQL data over HTTP. And yay, NuGet. So that was how it was built. Unfortunately, after a
08:41
while, the team found out that those things were not the right approach. So there were some optimizations, like the OData feed would not always go to the database. It would hijack some commonly used queries like package restores, for example, and move them to a different API internally. That said, last year, we had some rough times at
09:00
this time. I still remember, running around here at NDC, talking to some of the other speakers, I know Paul Stack was continuously bugging me like, hey, Maarten, is NuGet up? And I was like, yes, it is up right now, but maybe in an hour or not. So we had some rough times. So our architecture at the time was we had front end
09:21
servers in two regions, so two data center locations in Azure, and the front end servers run MVC and OData. MVC for the website, if you go to NuGet.org, and you register and you search for packages and stuff like that, that's the MVC site. We also had WCF OData to serve the packages to all the clients like Visual Studio, NuGet.exe, and so on. That
09:44
said, we have, of course, a SQL database in the back end to serve package metadata, download counts and things like that. And those two communicate with one another. So this was the initial architecture that we had, which didn't really scale because everyone would hit the database continuously because the OData
10:01
layer was a really thin layer on top of that SQL database. So we did some improvements. And the biggest improvement there was that we started using Lucene for search, because we have a lot of searches like in Visual Studio, if you search for a package, that would be a search and it would search pretty much the entire database if you would hit SQL. So we basically hijacked that type of
10:23
request and put it on a search service that runs Lucene. Also for package restores, if you search a package by ID and version, which is like an ID lookup, we go to that search service, which has a cache of this package, has that metadata and lives at this URL so we can quickly serve your data. Now, of course, this thing has to have data as well. And we
10:42
store our index and download counts, documents and things like that on Azure storage. But of course, we need that storage to be populated with data from the SQL database. So we have some job VM runners that basically do things like creating the index every couple of minutes, updating statistics if you download a package so that you can see an
11:00
increment in the download count, or things like the download counts from the last six weeks and so on. That was the architecture. Now, did we solve the crime of this load balancer going awry already? I don't think so. We solved the immediate danger where we were offline. We basically told the load balancer to not care about the status endpoint and
11:21
still have the website up and running. But did we solve the crime? No, we still didn't know what caused this load balancer probe to go awry. So one of our services caused this: SQL, storage, our search service, or our metrics service. We have a check on the gallery service there as well, but because that
11:41
is hosted on the same system, we are pretty confident that that was not the issue because it would, yeah, sort of, it's a redundant check. So one of these four services was actually acting up and we had to find out what was the cause of this issue. So what we did was log spelunking. You dive into the logs, you try and search for clues on what went
12:01
wrong and why the service was actually acting up. So the first thing we did was check some SQL database logs and we logged into the Azure portal and we wanted to see the logs and we found out we had none. That's been fixed, of course, but that was quite inconvenient at the time because we didn't have those logs. So we couldn't check the database if that one had
12:22
been down at that moment in time or not. So for the database it was maybe yes, maybe no, whether that was the cause of the issue. But we couldn't say for sure because we had no logs of that one at the time. We looked into storage logs. On Azure you can basically enable logging on storage and you can see which requests pass through, what your
12:41
error rate is and things like that. And storage was healthy as a fish in the water. Nothing was wrong with storage. So definitely not storage, maybe SQL. We looked at search. We have some extensive logging on there as well, but we didn't see anything out of the ordinary there either. All 200 OKs coming back from search. And yeah,
13:02
nothing wrong there. We looked into our metrics service, also nothing wrong there. It's basically something that has been deployed, has been stable for months in a row, actually two years in a row, where we didn't make any changes to the process and it just kept on working perfectly fine. So yeah, we still were none the wiser and we started
13:22
to look around at the crime scene. We started to put those little yellow markers of evidence on the VMs. So we logged into our Azure VMs and we started looking at IIS logs, at Event Viewer logs, and things like that. We did some profiling as well. And yeah, I still have those traces with me. Well, most of the traces,
13:41
I didn't keep them all, which is unfortunate, but I do have a couple. So one of the things we found on our gallery database was this trace, sorry, on our gallery VMs, we did a trace and we got back this data. So there's a couple of things to note here. First of all, the CPU diagram looks quite
14:00
healthy. It's not out of the ordinary, no excessive garbage collections happening or something like that. So actually the service looked quite well when this issue was happening. Another thing we looked at was lock contention: no locks, no waiting for locks, and things like that. There is quite some time spent waiting, but we found out that wait time was basically
14:21
just because the service didn't get enough traffic and the CPU was idling because yeah, there's almost no traffic. What we did find out, though, was the top called methods. And I would like to zoom in to the right. Okay. You'll have to take my word for it or try to zoom in yourself. The top method
14:42
we had was this hijack method. So that was the method called most in our gallery server, which is the method that actually takes the OData query coming in, translates it to the REST API of our search service, and then makes the call to the search service and fetches that data and returns it to the client. So that
15:00
was the most called method at that time. So we were like, Hmm, so maybe it is search, even though we didn't find anything on search. So we went on and we went into profiling the search service. I have those traces as well. And this looks a bit more problematic. You see
15:22
CPU is quite okay. But garbage collection is crazy. 65% of the work this machine was doing was garbage collection, which is insane. Another thing we found on search was lock contention. 16% of the time spent in our search service, it was
15:40
waiting for locks to be freed. So we had lock statements in our code. And the code was just waiting for those to be freed again, and to continue to work. Damn, that's two bad things. So lots of garbage collection and locks. Now the fun thing is, if we filter on where we are waiting for locks, those are the moments where it's not garbage collecting. So
16:01
we really had an issue there at the moment. So we went and we looked for what could be the cause of this lock statement. And we found that, can I zoom in? Nope. We found that our Lucene reload method was actually waiting for a lock to be freed. And that's not bad because
16:20
the way Lucene works is it loads an index into memory, so you can search through it. But before it has been loaded completely, you cannot search in that index because things would break down and crash. So Lucene has a lock statement in there. We had a lock statement around that code to wait until the index was loaded before serving traffic on
16:40
that index. The one issue we had there though, was the following. Of course it was search. This looked really bad. And we started looking into the code and we found, in our SearcherManager (by the way, we are open source, so if you look into those commits, you can actually see these things), we had a start reopen method in there
17:00
and that method would basically lock until the entire index was loaded into memory and we could do search on that. Now this is not very problematic because it's how you want to work with Lucene. The moment where it starts to become problematic is when this is in your HTTP request path, because
17:20
that means all your users are waiting at that moment. And that was what was happening. We were basically blocking HTTP requests because the index wasn't loaded yet or refreshed completely yet. And we had a lock statement in there. So we removed that one. We refactored a little bit and found that it worked perfectly fine.
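To illustrate the pattern, here is a simplified sketch of the problem and of the direction of the fix. This is not the actual NuGetGallery code (the real commits are in the open source repository); the types and names here are hypothetical:

```csharp
using System.Threading;

// Stand-in for a loaded Lucene index searcher (hypothetical).
public class IndexSearcher
{
    public string Search(string query) => "results for " + query;
}

// The problem: reload and search share one lock, so while the index
// reloads (which can take a while), every HTTP request sits waiting.
public class BlockingSearcherManager
{
    private readonly object _reloadLock = new object();
    private IndexSearcher _searcher = new IndexSearcher();

    public void Reload() // runs every few minutes
    {
        lock (_reloadLock)
        {
            _searcher = new IndexSearcher(); // slow: loads the index
        }
    }

    public string Search(string query) // on the HTTP request path
    {
        lock (_reloadLock) { return _searcher.Search(query); }
    }
}

// The direction of the fix: build the new searcher off the request path
// and swap the reference atomically, so searches never wait on a reload.
public class NonBlockingSearcherManager
{
    private IndexSearcher _searcher = new IndexSearcher();

    public void Reload()
    {
        var fresh = new IndexSearcher();            // slow part, no lock held
        Interlocked.Exchange(ref _searcher, fresh); // atomic swap
    }

    public string Search(string query) => Volatile.Read(ref _searcher).Search(query);
}
```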
17:41
Another thing we found in this profiling session was this excessive garbage collection. So I don't have the traces anymore, but we also did a memory profiling session there. And we found out that this garbage collection was caused by excessive allocation of strings and JObjects and dictionaries. Turns out
18:01
when we serve traffic on our search service, we try to correlate the search query with the download count of the packages. So packages with higher download counts in the past few weeks rank a little bit higher than packages with lower download counts. Now, to load those download counts, because they continuously update, we
18:21
don't want to write them into the index because Lucene, the way it works is you write a document in the index. And if you want to update the documents, you actually have to delete and re-add the documents. And that causes a lot of churn and a lot of waiting, merging, things like that. So we don't want to do that. What we do is we have a blob in storage containing all these download counts. We load
18:41
it as a JSON blob and we use that as a source of this data. Turns out that we had some nice code in there. Remember we have, I think, around eight or nine hundred thousand packages right now. And every couple of minutes we would do this load download counts into a
19:01
dictionary of package ID, package version, and the download counts. Anyone see what's wrong with this code? There's nothing wrong with this code as such, but there is something wrong with this code: if you are loading 800,000 entries into these dictionaries and you reallocate the entire dictionary every five minutes, that's the problem here. Because
19:21
we were doing that, the garbage collector would have to garbage collect every single entry in these dictionaries, recreate the dictionary. And this was happening every five minutes. So every five minutes, this would happen. Garbage collection would take some time. Then we would hit our lock. So we would wait some more time. And then after five minutes, we would start this cycle again and again and again. And this was why
19:40
search was misbehaving. So what we did was we basically kept a single dictionary and we just update the values. If the package is already in the dictionary, we just update the download count in the dictionary instead of recreating the entire dictionary with all its entries, reducing memory traffic and reducing garbage collection a lot.
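In code, the difference looks roughly like this. A simplified sketch, assuming a flat string key made of package ID and version (the real data is richer):

```csharp
using System.Collections.Generic;

public class DownloadCountStore
{
    private readonly Dictionary<string, int> _downloadCounts =
        new Dictionary<string, int>();

    // Before: build a brand new dictionary every five minutes.
    // ~800,000 old entries become garbage each cycle, so the GC runs hot.
    public Dictionary<string, int> RebuildFromBlob(
        IEnumerable<KeyValuePair<string, int>> blob)
    {
        var fresh = new Dictionary<string, int>();
        foreach (var entry in blob)
            fresh[entry.Key] = entry.Value;
        return fresh; // the previous dictionary is now all garbage
    }

    // After: reuse one dictionary and update values in place.
    // Existing keys and the dictionary's internal storage survive,
    // so almost nothing is allocated per refresh.
    public void UpdateFromBlob(IEnumerable<KeyValuePair<string, int>> blob)
    {
        foreach (var entry in blob)
            _downloadCounts[entry.Key] = entry.Value; // add or update
    }
}
```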
20:01
So that was what we did there. Case closed. Awesome. We found out why we were down and what the issue was and yeah, happy times. Now, remember when I said that bugs are like serial killers in society: when given the right circumstances, they do something until you catch
20:20
them. Well, we caught this bug, but the circumstances changed by fixing this bug. So it had a new opportunity. Before we dive into that, we did some things. So we found out we had no SQL logging at the time. So we enabled those things. This was also the time when App Insights came about on Azure and we started looking into that one.
20:40
I'm not sure if you know this thing, but I'm just going to quickly show you so you have an idea. App Insights basically correlates your server response times, all your different performance counters, and so on with HTTP requests. So we can see which request is slowest. We can correlate it with a time. We can correlate it with which dependencies
21:00
have been called, like storage or SQL and so on. So it's basically a search engine for monitoring data and you can see whatever is happening there. If there's an exception, you can see the stack trace, things like that. So really useful tool to do some quick debugging. It's like a continuous mini profiler running in Azure and collecting all that data. So quite nice. So we did that.
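If you have never used it, the SDK side is tiny. A minimal sketch using the Application Insights SDK's TelemetryClient (the instrumentation key below is a placeholder; it would normally come from configuration):

```csharp
using System;
using Microsoft.ApplicationInsights;

class TelemetryDemo
{
    static void Main()
    {
        var client = new TelemetryClient();
        client.InstrumentationKey = "00000000-0000-0000-0000-000000000000"; // placeholder

        // A custom event, searchable and chartable in the portal.
        client.TrackEvent("IndexReloaded");

        try
        {
            throw new InvalidOperationException("demo failure");
        }
        catch (Exception ex)
        {
            // Shows up in the portal with the full stack trace.
            client.TrackException(ex);
        }

        client.Flush(); // push telemetry before the process exits
    }
}
```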
21:21
We added some more logging. App Insights was introduced in all the different services and we thought, happy times, we can continue with our regular work. Now, unfortunately, one of those days we had internal server errors on all our machines during package restores. So the thing is, when you open up a new project and
21:41
you build the project for the first time and the packages haven't been downloaded yet, what Visual Studio does is restore those packages from our services. But of course, if we say internal server error at that time, people get no packages and people get mad and want to kill us. So that's what we were seeing. What we were seeing in App Insights this time was that of
22:01
course a lot of internal server errors were happening. We saw that the response times went up while the number of requests actually went down. If we looked in Event Viewer and in the different views in App Insights, we saw that IIS was crashing on us on the machines themselves. Well, actually in the log location, we found a
22:21
lot of crash dumps from IIS as well. So something was fishy here. IIS was crashing every couple of minutes. That's not good. So IIS crashes. What do you do to investigate? Well, there's Event Viewer, which tells you some things. But in case you don't know, IIS is basically a fancy wrapper with additional things
22:41
around HTTP.sys, which is a kernel module in Windows that actually serves HTTP traffic. So IIS is just a fancy thing around HTTP.sys. Now IIS does logging, but this HTTP.sys kernel module also does logging. So if on Azure you go into this spot in System32, and actually on any
23:00
IIS web server, you can go there, you will find different logs that give you different information about connection states and so on. And what we saw in that log was a lot of entries saying connection abandoned by request queue. That's a nice error. And if you search for it on Google, not on Bing, because Bing sucks in Belgium, where I live.
23:22
It's actually quite nice in the US whenever I'm there. But we searched for it and you cannot find many results on any of the popular search engines. You can find some information, some vague information on MSDN. But somewhere we found that this error message means that a worker process from the application
23:40
pool has quit unexpectedly or orphaned a pending request by closing its handle. So this basically means IIS gave up, we closed the connection, and see you next time. That's basically what the error was. So we knew something was happening, IIS was crashing, HTTP.sys was giving up on users. So we went to the gift of the gods, which is crash
24:01
dumps. And we pulled those from the machines and we looked into that. Now, if there are any hardcore debuggers in here, you will probably look at me and say, why are you not using WinDbg? Because WinDbg is quite complex and not the best demo. So that's why. But fortunately, you can also
24:20
open crash dumps in plain old Visual Studio. And it's actually quite good. You get some information about the crash dump, and you can see that the thread tried to read or write to a virtual address for which it doesn't have appropriate access, an exception. So this exception was the last thing recorded in this entire dump before IIS
24:40
gave up. So this is probably why IIS crashed. Now this error, if you search for it, you don't find a lot of useful information. You find lots of information, like this can be why it crashes, but still you get 100,000 results, with every result actually being a different result, telling something else about the
25:01
cause of this error. So what you can do is there's actually a couple of nice windows as well in Visual Studio where you can look into the threads and states at that time in the application. But the fun thing is you can also debug a crash dump. So let's do that. Let's debug with managed only. And you actually get the state of the machine at
25:22
the moment this crash dump was taken. And you can look into variables if you have a full dump; this is a mini dump, so you only get stack traces. So we see this exception and it's the same exception we already saw: access violation reading location, some number. Okay, let's break like you would do in a
25:41
regular debugging session. And you see, ah, there are no symbols. I'll cover that in a bit, but we can also go hardcore and just view the disassembly. And as you can see, this mov statement failed and that's why IIS crashed. Any questions? No. So yeah, that's not
26:04
very helpful. This statement, you could go through the IL, and there's actually a couple of guys within Microsoft who would go through the IL and just debug it from there. But they are insane. We took the other approach, which was finding the symbols for this code. Now we found out that our
26:21
build server would not publish symbols anywhere. And we found out that our retention policy on the build server had already deleted the symbols for this specific deployment of our service, which meant we had no symbols and we could not correlate with whatever was happening at the time. Good thing there are some awesome tools available where you can
26:40
just decompile whatever is happening. So thank you JetBrains for dotPeek. There's a couple of others as well. But these guys have a symbol server built in, which means you just open any assembly in there. If Visual Studio asks for symbols, they decompile it and give you source code that is whatever was compiled at the moment. It's not the
27:00
actual source code. You get all the compiler optimizations and things like that, but you do get a good view of where in your code this thing is actually crashing. So let's do that. Let's configure our symbol server. Let's enable that one. Disable Just My Code. Yes. And then we should be able to load the symbols.
27:21
Huh? There we are. Sources from dotPeek, decompiled from our assembly for which we didn't have the code anymore. Where was this failing? In a Task.Run where we would post download statistics to our download service. So whenever you download a package, we increment the number and we save the header: was this an install or a restore or an update, for
27:40
example, and the version of the NuGet client. So we know when we can deprecate a specific version of the NuGet client, for example. So this line was crashing. Now this line, if we went into our code, wasn't that special. The only thing it would do is read some headers and post the number into some service somewhere.
28:01
So, yeah, not very helpful. It is helpful in that we knew where in our code this was crashing, but not why this was crashing. So, an exception crashes IIS. Well, yes, this was the actual code. So the decompiler actually did a pretty good job of bringing that code
28:20
into Visual Studio. So this was the actual code, and our Task.Run posting download statistics was actually failing. So what we did was search Google, Bing, whatever search engine we could find. And we didn't find anything useful about the exception that we were seeing in relation to HTTP requests
28:40
and async await and tasks and things like that. Until at a given point in time, I think after seven hours of searching (that's all we did that day), we found one guy who said, well, unobserved exceptions in a Task.Run can crash your process. Who knew this? I didn't
29:00
know. So apparently, whenever there's an exception happening in a piece of code that you are running inside a task that you are no longer observing, which is what we do here. This is basically a fire-and-forget Task.Run: you know it will run eventually to post the download statistics. If there's an exception in this post download statistics code, the task will
29:20
crash, but because no one is listening to the exception, the process will crash. Damn. So how do you solve this? Well, apparently you can tell the task scheduler that you want to attach an event handler to the UnobservedTaskException event, and you can just do your logging there and exit the method with
29:40
SetObserved. And then the exception is observed and IIS will no longer crash in this case. Reason for the crash, by the way, was that our HTTP request would fail from time to time. So if we couldn't post download statistics, our front end would go down because this task would throw. So we added these few lines of code, added some logging, and we were happy again. Case closed.
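The handler itself really is just a few lines. A minimal sketch of hooking TaskScheduler.UnobservedTaskException (the logging call is a placeholder):

```csharp
using System;
using System.Threading.Tasks;

static class TaskExceptionGuard
{
    // Call once at application startup.
    public static void Install()
    {
        TaskScheduler.UnobservedTaskException += (sender, e) =>
        {
            // Log the exception somewhere useful (placeholder).
            Console.Error.WriteLine("Unobserved task exception: " + e.Exception);

            // Mark the exception as observed so it is not escalated
            // when the faulted task gets finalized.
            e.SetObserved();
        };
    }
}
```

For context: on .NET 4.0, an unobserved task exception escalated at finalization and tore down the process; .NET 4.5 relaxed that default, but the handler remains a good safety net for fire-and-forget tasks.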
30:01
Awesome. This was funny. That was a colleague of mine. Anyway. So the next thing we were seeing after we fixed this one was high response times on our server. We were
30:20
seeing really high response times on people doing restores and installs on all the different packages. So what we were seeing was very high CPU usage, long response times, Azure would auto scale. We had auto scaling at the time, adding more instances, but basically what it would do was, Oh, CPU's high, add more instances. Those new instances would get
30:41
high CPU and we would add more instances and basically spend our budget for the year. The profiler, when we attached it, also showed lots of wait times in there as well. So that was definitely an issue, but what was the cause of that one? So we went to eating donuts again, the
31:01
parallel with profilers and so on. We basically sat down, thought about what could be the cause of this, and did some searching and so on. And we found that there's a couple of possible causes for this issue. This one was actually what we were seeing. When IIS launches your application,
31:21
it will assign a couple of worker threads to the application to basically run things like HTTP requests running in the back and so on, or reading from the database, for example. And by default, I think it's two threads that IIS starts per CPU or per logical CPU in your machine. And it automatically adds new
31:41
threads whenever they are required to serve traffic or serve background tasks or whatever. The fun thing is that IIS, and this is actually documented, if IIS gets burst loads, like a lot of requests in a short time, the thread number doesn't scale up fast enough to cope with all that incoming traffic. So if you have a new
32:01
machine and you have one user and then two and then three and then four, in a reasonable amount of time, IIS will cope just fine. If all of a sudden your machine comes into the load balancer and there's a thousand people hitting that machine, it will probably start waiting for things to happen because it doesn't have enough background threads to do database and HTTP work and
32:20
so on. So we went in there and we changed this setting in the IIS settings. And this is what we saw. This was actually during NDC last year. I fixed this from the speaker room. Seriously. So that was funny. We saw in App Insights, when we rolled out this change on all
32:41
the different servers, the response times, where they were really flaky before, would just stabilize in a nice, smooth, flat line. And in this case, that's actually a good thing. Compared to the last week, you can see the graphs there as well from the last week in App Insights. This was really much, much, much better than what we had at the time. So case closed, I think.
33:01
The only thing is we were running on cloud services, so we had to automatically change this setting whenever the machine would boot. Luckily you can add a startup script in there and do some tweaking when the machine starts up. And basically what you do is call into IIS's command line tool and set the right sections in your applicationHost.config. That's what we did there.
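We set this through IIS configuration, but for illustration, the managed thread pool exposes the same kind of knob through ThreadPool.SetMinThreads. A sketch (the values are examples, not our production numbers):

```csharp
using System;
using System.Threading;

class ThreadPoolTuning
{
    static void Main()
    {
        // The thread pool grows slowly under burst load. Raising the minimum
        // lets it jump straight to this many threads before throttling growth.
        ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);

        int worker, io;
        ThreadPool.GetMinThreads(out worker, out io);
        Console.WriteLine("Minimum threads: {0} worker / {1} I/O", worker, io);
    }
}
```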
33:22
Case closed, but we were not there yet. How much time do I have left? Still a bit. The next thing we were seeing were package restore timeouts. So people doing restores were not getting high response times. Their response times were normal, except for a few requests that would just never come
33:41
back and time out eventually. So we were seeing this on our V2-based feeds, which go through compute. In Visual Studio that's no longer the case: in Visual Studio 2015, we actually made an improvement with a new API that is served from static files on a CDN somewhere. But with our previous versions, the V2 feeds would go through
34:01
compute and would show lots of timeouts. These timeouts would not happen for a couple of hours. Then every seven to 15 hours, we would see this massive increase in response times, like extreme load times. You can see one get-by-ID taking one minute. It's insane. But it would
34:21
fix itself as well. So after about 15 minutes, response times would drop again to a normal level. That's an interesting one to find out what is going on there. So there was no easy way to reproduce this. We tried reproducing the issue with the latest code we had deployed and so on on local machines. We
34:40
were unable to reproduce it on our local machines or our dev environments; it was only happening on production. So what we had to do was find a way to troubleshoot this issue. We normally don't do this, but in this case, since it was only happening on production, we sort of isolated one machine. We replayed, well, we let
35:00
traffic through to it, but didn't let it serve responses to our end users. So we could see the traffic coming in and observe whatever was happening. What we did was we RDPed in, and in Task Manager we saw a hundred percent CPU usage, which is awesome and which was the issue we were seeing. But the problem with that is if you want to launch a
35:21
profiler, when your machine is at a hundred percent CPU, the profiler adds even more overhead. You start swapping and doing process interchanges and so on. And yeah, you basically don't get a reliable result from that. So, damn. So we sat down and started thinking and we thought, okay, what did we change last week? And
35:41
the thing we changed was this minimum I/O thread setting in IIS. In App Insights, we could learn that the slow code path was FindPackagesById. That was the only call that was being slow. And we thought, okay, what does this call do? This call actually does an HTTP request to our search service to get back a result and to serve the
36:01
package to your end user. And that was the slow call path. So these things seemed to correlate somehow. We had enough threads because we changed the setting, but our HTTP requests would probably cause this issue, because that code path does not do anything other than make HTTP requests and return the
36:20
response. So we started searching and we found out that when you are using HttpClient in a .NET application, that thing actually doesn't have a limit on the number of concurrent requests it will make. So if you write a console application, fire up 32,000 threads on your machine and hit
36:41
another machine with it, you will suffer TCP port exhaustion. All your ports are stuck because there's no additional traffic flowing over the line, because everything is stuck at the OS level. So we started thinking, okay, HttpClient is doing this, nice, but that can be an issue because you have a
37:00
limited number of TCP ports in a machine. So we read up on TCP/IP a little bit more, and we found out that by default on Windows, whenever a connection is closed, the OS actually holds on to the TCP port for about four minutes before it frees it, because there may still be traffic coming in, a load balancer may be acting up or whatever. Some packets may
37:21
still come in related to this connection. And that's why Windows says, okay, I'm not freeing this one for reuse yet, because whenever something comes in, I want to basically ignore it and not confuse whoever's reusing this TCP port. And after four minutes, it frees the port and you can use it again. Now, a fun thing there is that for each FindPackagesById, which
37:41
is our package restore path, we would make an outgoing HTTP request, one port, but the user would also come in and ask us for that restore. That's two ports, lots of restores happening, four minute wait time in Windows. So basically what we were seeing was that we had TCP port exhaustion
38:01
and everything was waiting for Windows to free up those ports. After reading up some more, we found out that you can actually block HttpClient from showing this behavior by setting ServicePointManager.DefaultConnectionLimit to a reasonable value. So if you change that, HttpClient will actually
38:20
adhere to those settings. And if you set it to, for example, 500, it will only do 500 concurrent outgoing requests from your machines. So that's what we did. We changed the code and we did some other optimizations there as well. Well, of course the default connection limit is one we did: we didn't want to have TCP port
38:40
exhaustion by just calling into our backend service. So we fixed that, but we also disabled nagling at the time. I'm not sure if you're familiar with what nagling is. Basically the TCP/IP protocol has an optimization where you have a certain packet size which can contain data if you make an outbound request or
39:01
an incoming request. And if you have small requests, the TCP/IP protocol will try to bundle those requests into as few packets as possible to not saturate your network connection and things like that, and make things appear more smooth and more fast. Thing was, our traffic was really small because we
39:21
just fetch by ID from the search service. So actually nagling was working against us as well. So we disabled it and had smaller packets flowing across the wire there. So both optimizations were enabled and our issue was gone.
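Both fixes are one-liners on System.Net.ServicePointManager, set once at application startup. A minimal sketch (the limit of 500 is just an example value):

```csharp
using System.Net;

static class HttpTuning
{
    // Call once at startup, before making any outgoing HTTP requests.
    public static void Apply()
    {
        // Cap concurrent outgoing connections per endpoint so a burst of
        // backend calls cannot exhaust the machine's TCP ports.
        ServicePointManager.DefaultConnectionLimit = 500; // example value

        // Disable Nagle's algorithm: with tiny payloads, bundling small
        // writes into bigger packets only added latency here.
        ServicePointManager.UseNagleAlgorithm = false;
    }
}
```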
39:41
Awesome. Some charts: the response time dropped from one minute to three seconds. I think that was still during the flighting period where we just patched some servers. And after that, response times dropped to a normal level at the time. So awesome. Case closed again. How much time do I have left? 20 minutes. All right. There's more, unfortunately. What
40:04
we were seeing. So we did all these optimizations and we followed this entire path of finding a bug, squashing the bug, and then hitting a new bug. So what we were seeing was 100% memory usage on our medium Azure instances that we were running the service on. Okay. Scale up.
40:23
Bigger machines, happy times, everything fine until two or three minutes after deployment, 100% memory usage. We were like, Hmm, okay. Nice. This was the moment where ASP.NET said they were going to do the ASP.NET vNext thing. We were seeing a lot of new
40:40
users coming in and so on. So we thought, okay, maybe we just have a lot of new users coming on. Maybe we need just another bigger machine. So we did another upgrade; two minutes in, 100% memory usage. If at that point you don't realize there's an issue, then yeah, you can just keep on spending and spending, but you will
41:00
see the same thing on even bigger machines as well. So what do you do? You start analyzing memory on the servers. You can do this on the server itself, which we don't like doing. So what we tried to do was reproduce the issue on our local machines. And that's what we did. So the way we
41:20
did it is we basically fetched a number of logs from our... where's my code? Where's my... there it is. Okay. So we fetched a number of logs from our servers, IIS logs, which are a typical standard logging system.
41:41
And we downloaded a tool called JMeter. And JMeter is a Java-based tool which allows you to do things like load testing and so on. Nice thing about JMeter is that on high-DPI screens, the UI is really, really small. But apart from that, it's actually a really cool tool because it can do things like replaying
42:03
traffic. Let's open my test plan. It is as small for me as it is for you. Let's go. Come on. Where did I put this document? There it is. We
42:21
basically configured a test plan which would read our access logs from IIS. So you basically tell it: this is my access log, you can read it from this specific path, replay it against this server. And then you can give it things like the number of users that you want to run against it, how you want to ramp up the number of users, and so
42:40
on. So you can really reproduce the traffic your servers were seeing at a given time. So what we did was we looked into the logging. We decided on the window of the logs that we needed to replay this issue on our local machines and our dev environments. And that's what we did. So thank you, JMeter, for that. Your UI sucks, but the tool is really, really
43:00
awesome. So we did that and we went into the memory. Now, who of you would say they are really familiar with .NET memory management? Nobody? Good, because I think nobody really is. Not even the
43:20
guys who wrote the thing. But I'll give you a quick introduction to how .NET memory management works, so that you understand the problem we were seeing afterwards. So we replayed the traffic, but before we look at the results, there are a couple of things you have to know. In .NET, as opposed to languages like C++, there's
43:40
actually a managed runtime that allocates memory for you when you need it. In C++, what do you do? You tell the OS: I need memory. It gives you back an address in memory where you can store your stuff, and if you don't free it, you will get an out-of-memory situation, because nobody is freeing up the memory for you. You have to take care of
44:00
that yourself. The way .NET works is that whenever your application starts, the runtime asks the OS: can you please allocate a big block of memory for me? I'll then manage my own memory within that block. That big block of memory is the managed heap you sometimes read about.
44:21
Whenever you instantiate a new object in .NET, it gets allocated on this managed heap. That's really fast, because you don't have to wait for the OS to give you access to a certain address; .NET can basically say: okay, here's memory, because it already has the memory available. Of course, some unmanaged memory is also consumed. Like if you write a UI-
44:42
based application, WinForms or XAML or whatever, you also get things like graphics buffers, which are still unmanaged. But most of the memory you consume in your .NET applications is this big chunk of data, the managed heap.
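To see that distinction for yourself, a tiny sketch: GC.GetTotalMemory reports only the managed heap, while the process working set includes everything else (the CLR itself, native buffers, and so on), so the two numbers never match:

```csharp
using System;
using System.Diagnostics;

// Managed heap vs. whole-process memory: GC.GetTotalMemory covers only what
// the runtime allocated on the managed heap; WorkingSet64 covers the entire
// process, including unmanaged memory.
class MemoryProbe
{
    static void Main()
    {
        Console.WriteLine("Managed heap: " + GC.GetTotalMemory(false) + " bytes");
        Console.WriteLine("Working set:  " + Process.GetCurrentProcess().WorkingSet64 + " bytes");
    }
}
```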
45:01
Now, whenever a variable goes out of scope and is no longer needed, you don't have to tell the OS to free the memory again. .NET manages this for you. It will notice that the variable is out of scope, and it will ignore that fact for a while, because it still has a big chunk of memory available, so it will not free the memory immediately. Instead, at a given point in time, it will run the garbage collector. The garbage
45:21
collector basically scans this entire block of memory, searches for objects that are no longer being used, and removes those from the managed heap so that the memory can be reused later. Now, I said it scans the entire block of memory, and that's quite an expensive operation. That's
45:42
why .NET, well, actually a lot of managed languages, came up with the concept of generations in the garbage collector. Basically, when you instantiate a new object, the memory for it is allocated in Gen0, as it's called. Gen0 is where short-lived
46:01
objects go. Say in a method you create a string, and two lines later the method goes out of scope. That's Gen0, and it can be reclaimed really, really fast, because .NET knows that a lot of the things in there will probably be out of scope and can be recycled anyway, except for the objects that cannot be
46:20
recycled. Those are moved into Gen1. Gen1 is where medium-lived objects live in .NET, and .NET keeps them around until they are no longer needed in there as well. The garbage collector mostly runs on Gen0. Every once in a while, it runs on Gen1 to free the objects in there. If an object
46:40
survives Gen1, it moves to Gen2: the really long-lived objects. If we think back to our download counts example, where we had this big dictionary of strings with package IDs, versions, and download counts, that thing would probably live in either Gen2 or the large object heap, because it's a long-lived
47:02
object: it lives for the entire application lifetime and never gets recycled by the garbage collector. So that's basically how .NET works. It has these generations of memory, and it mostly frees up memory in Gen0. Every once in a while it does Gen1 as well, and even less often it does
47:21
Gen2. The reason is that if you have a long-lived object, there's no use in scanning that entire part of memory to free up memory, because it's probably still in use. So that's how it works. Visually: your object starts in Gen0, then moves to Gen1 if it's still alive. If not, it's
47:41
reclaimed and the memory is freed. If your object lives on in Gen1, it will go to Gen2; if not, it is freed from Gen1. In Gen2, same thing: if it keeps being alive, it will just remain in Gen2 through future collections, and if it can be freed, it will be freed eventually.
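You can watch this promotion happen with GC.GetGeneration; a small sketch (the forced GC.Collect calls are for demonstration only, not something to do in production code):

```csharp
using System;

// An object that stays referenced is promoted one generation per collection
// it survives: Gen0 -> Gen1 -> Gen2.
class GenerationsDemo
{
    static void Main()
    {
        var survivor = new byte[1024];
        Console.WriteLine(GC.GetGeneration(survivor)); // 0: freshly allocated

        GC.Collect();
        Console.WriteLine(GC.GetGeneration(survivor)); // 1: survived one collection

        GC.Collect();
        Console.WriteLine(GC.GetGeneration(survivor)); // 2: now long-lived
    }
}
```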
48:01
Then there's another special case in this managed heap: the large object heap. This is where large objects go. If you have an object of more than 85 kilobytes in memory size, it goes into the large object heap, because .NET figures it's going to take a while to scan this entire object and make sure it can be reclaimed during a garbage collection cycle. So
48:21
it basically puts it there. Ideally things don't end up there, but they do. One issue with the large object heap is that, by default, not much optimization happens there, which means that if you have a lot of objects in it, you get
48:41
fragmentation, and you will eventually get out-of-memory exceptions.
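A short sketch of the threshold and one mitigation: allocations over 85,000 bytes land on the large object heap (reported as Gen2), and since .NET 4.5.1 you can opt in to a one-off LOH compaction to counter the fragmentation just described:

```csharp
using System;
using System.Runtime;

// The LOH is collected together with Gen2 and is not compacted by default.
class LohDemo
{
    static void Main()
    {
        var small = new byte[80000];                 // below the 85,000-byte threshold
        var large = new byte[90000];                 // above it: goes to the LOH
        Console.WriteLine(GC.GetGeneration(small));  // 0
        Console.WriteLine(GC.GetGeneration(large));  // 2

        // Opt-in, one-time LOH compaction on the next full blocking collection:
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();
    }
}
```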
49:01
All right, almost there. The .NET garbage collector is actually a really nice thing, because to your application it looks like you have an infinite amount of memory. You just create new variables, new objects, whatever you want to create in memory. .NET will handle the memory for you, make sure there's always some memory available to do new things, and your application doesn't have to care about what's happening in there. Now, when does this garbage collector run? That's actually not really well documented. There are a couple of documents
49:21
describing a lot of things, but I don't think everything is documented. Since .NET Core, the garbage collector is actually open source, so you can look at the sources; but last time I checked, it's massive, around 37,000 lines of code doing the memory management. So if you can go in there and
49:41
find out when the garbage collector runs, please tell me; I would like to know. But from observation, most people say it happens when there's an out-of-memory condition: when no memory is available in the managed heap, the garbage collector will run and try to free up some memory. It also runs after a significant allocation;
50:01
take our download counts again: when we allocated a new dictionary and populated it with 800,000 entries, we saw garbage collection happen there, even in our profiler. Also, when it fails to allocate resources, it thinks: okay, maybe there's a memory issue here, let's optimize
50:20
this. It also runs when you have a profiler attached, when you force a garbage collection, or when your application moves to the background. But the most important thing from all these observations is that the garbage collector is not guaranteed to run. If you have enough memory available, for example, it will run fewer times, or maybe not run at all.
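One way to observe this for yourself (not shown in the talk) is GC.CollectionCount, which reports how many collections each generation has seen; on a machine with plenty of free memory, the counters can stay flat for a long time:

```csharp
using System;

// GC.CollectionCount(gen) returns the number of collections of that
// generation so far. Flat counters over time illustrate that the GC only
// runs when it decides it has a reason to.
class GcCounters
{
    static void Main()
    {
        Console.WriteLine("Gen0 collections: " + GC.CollectionCount(0));
        Console.WriteLine("Gen1 collections: " + GC.CollectionCount(1));
        Console.WriteLine("Gen2 collections: " + GC.CollectionCount(2));
    }
}
```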
50:42
Interesting. Keep that in mind. Let's analyze our memory usage. So we replayed the traffic from our production instances and took memory snapshots. This is what we got. The gray part is the unmanaged memory being used. The blue part is Gen0. The
51:01
red part is Gen1, and the green part is Gen2. What we can see in this diagram is actually quite interesting: Gen2 is growing continuously. That's one thing, and it's not ideal, to be honest. What we can also see is that every time we take a memory snapshot, memory pressure drops and we get
51:21
some memory back. So some garbage collection is happening, which is a good thing. Let's dive into one of our snapshots and see what was in memory at that time. I'm going to take snapshot six, and I'm taking that one for a reason. This is the reason: 2.2 million
51:41
objects were removed from memory when I took this snapshot, which means the garbage collector did its work. What I will be able to see in this snapshot is what the garbage collector didn't collect, and that's what we're interested in. So let's open this snapshot and wait until dotMemory shows us the
52:02
things. It's actually a nice tool, because it shows you some of the most common memory issues you can have, like duplicate strings or event handlers that you don't clean up correctly, and so on. You can dive into whatever is in there. At the bottom, we can see the heaps, and as I already said about the big chart, we were seeing Gen2 growing and
52:21
growing. So let's see what's in Gen2. We can go in there and see which objects it contains. By default, the tool sorts the objects by the number of retained bytes, which is basically the amount of bytes an object holds on to in other objects. And this
52:41
object is Ninject.Activation.Caching.Cache. Okay, nice. I cannot zoom in there, but I'll tell you: there's one object of those, which sort of speaks for itself; with a cache, you have one cache holding on to a number of objects. The cache itself is only 48 bytes, so
53:00
nothing wrong there. But it's holding on to 90 megabytes of objects. Could be good, could be bad. So let's dive in. What's in this object? We see that it has some entries, and these entries are an array that contains keys and values and items. It's basically easy to really
53:20
get lost in this entire tree, so that's not really useful information. Let's take one step back and look at the entire list one more time. Another object we see here is this guy. Of course, there are a number of collections and things like that in our memory as well. But what's interesting is that our Entity Framework context
53:40
shows up as the sixth biggest object held in memory in this trace. Fishy, especially since it's in Gen2. Because I would expect that if I serve requests that have to make database calls, we open an Entity Framework context, we handle the request, we close the Entity Framework context,
54:00
and it gets garbage collected in Gen0, or at the latest in Gen1, not in Gen2. What's even more interesting is that there are almost 80 megabytes of Entity Framework contexts floating around in memory. Fishy. So let's go in there. We
54:21
see that there are a number of these Entity Framework context objects. Let's group them by similar retention; basically, look at who is keeping these Entity Framework contexts in memory. What's the reason this thing has not been garbage collected? Who is keeping it in memory? And the interesting thing is that we see our
54:42
Ninject cache showing up as the object holding on to our Entity Framework contexts. In fact, it's holding on to 532 Entity Framework contexts in our Gen2 heap. Hmm. All right. Interesting. So
55:00
our DI container, Ninject at the time, seemed to be the reason we had a lot of Entity Framework contexts still floating around in memory. Since Ninject is open source, we decided to go through the code and see if we could find what was happening and why this was still being held in memory. And we found this garbage collection cache
55:21
pruner. It was kept in memory by the cache, and we found this cache pruner. So let's look at the code in there. Let me zoom in a little bit. Oh, that's fast. So fast. We live in the golden age of CPU cycles and this is still
55:40
so fast. Anyway, this GarbageCollectionCachePruner makes sure that the cache of objects instantiated through the DI container is cleaned up every once in a while, and that the objects it holds on to get garbage collected. The way it does that is with a weak reference. A weak reference is a
56:00
type of object where, if you instantiate it and the garbage collector runs, the reference is basically cleared. So you know something has happened to it, someone has recycled this object, and therefore a garbage collection has happened. The pruner has a list of caches, and it has a thread pool timer. Now, the interesting thing is that whenever
56:21
the timer runs, which was every 30 seconds by default, I think, it would call PruneCacheIfGarbageCollectorHasRun. Okay, let's go there. PruneCacheIfGarbageCollectorHasRun: if you are on a PCL, you are unlucky. If not, we take a lock and check if
56:42
our indicator, our weak reference, has been collected or not. If it hasn't been collected, we just return and don't clear the cache. Okay. Remember that the garbage collector is not guaranteed to run. That's the reason this variable was not collected, and
57:00
that's the reason this timer callback would just return every single time, keeping all the objects in its internal cache floating around and continuously exhausting our memory. So yeah, Ninject was not very friendly to us. And we found other people with exactly the same issue, and no fix yet.
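In simplified form (this is a sketch of the pattern, not the actual Ninject source), the pruner's logic looked roughly like this, and the flaw is visible in the early return:

```csharp
using System;

// Sketch of the "has the GC run?" pattern: a WeakReference to a throwaway
// object serves as the indicator. If no garbage collection ever happens
// (which is legal: the GC is not guaranteed to run), the indicator stays
// alive and the cache is never pruned.
class GarbageCollectionCachePrunerSketch
{
    private WeakReference indicator = new WeakReference(new object());

    public void PruneCacheIfGarbageCollectorHasRun()
    {
        if (indicator.IsAlive)
            return; // no GC observed, so skip pruning; possibly forever

        // ... prune the cached instances here ...

        indicator = new WeakReference(new object()); // re-arm for the next check
    }
}
```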
57:22
So what we did was change to a different DI container, Autofac in this case. We ran the analysis again and saw, first of all, that response times dropped tremendously, so it was faster than Ninject as well. And our memory issue was also gone; we are now
57:41
stable at around four gigabytes of memory usage overall on our compute nodes. So that's a really, really nice improvement. Case closed.
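For completeness, a hedged sketch of the kind of registration that avoids the problem by scoping contexts to the web request; EntitiesContext is a placeholder for the application's DbContext, and this is not the actual code from the talk:

```csharp
using System;
using Autofac;

// Placeholder for the application's Entity Framework DbContext.
public class EntitiesContext : IDisposable
{
    public void Dispose() { /* release the underlying connection */ }
}

public static class ContainerConfig
{
    public static IContainer Build()
    {
        var builder = new ContainerBuilder();

        // One context per HTTP request; Autofac disposes it when the
        // request lifetime scope ends, so nothing piles up in Gen2.
        builder.RegisterType<EntitiesContext>()
               .AsSelf()
               .InstancePerRequest();

        return builder.Build();
    }
}
```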
58:01
So, in conclusion: when you start debugging and spelunking through an entire chain of possible bugs that follow one after another, you will have to build some familiarity with the system you are trying to debug and investigate. Try to, in your mind, think like the compiler, think like .NET's garbage collector, think like whatever can happen in your own code. That will really help you find these issues.
58:20
Bugs hide. We found that whenever we fixed a bug, a new bug would pop up, because fixing the original one just opened the floodgates to the next. So keep that in mind: if you fix this type of issue, you may run into the next one the next day, or even the next hour. What I did learn from this experience, and I've been
58:41
doing this entire thing for three or four months, is that I learned a lot about how .NET works. And I learned that I actually like spelunking through memory traces, performance snapshots, and so on, because you learn even better how your code behaves. You learn how the .NET Framework behaves in its different
59:00
versions, and you build overall experience for the next bug that comes along. So if you have an issue like this, don't say "it's an ops problem"; dive in and try to investigate. It's really, really useful to do that. Thank you for being here, and enjoy the rest of NDC.