FastFlood: The Story of a Massive Memory Leak in
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 24 | |
Author | ||
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/46567 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
Transcript: English (auto-generated)
00:03
It was a really exciting... But first, let me tell you that the story, all names, characters and incidents portrayed in this production are fictitious.
00:31
No identification with actual persons, living or deceased, places, buildings and products is intended or should be inferred. Just as a curiosity, this disclaimer started in the film industry after the MGM production Rasputin and the Empress.
00:48
In that production, which was a film, Rasputin raped Natasha. Natasha was the character portraying Princess Irina.
01:01
And Princess Irina sued MGM, and she won what would be today over 2 million dollars in court and 90 million in an out-of-court settlement. If you are not a native English speaker, like I am, and I mean I'm not a native English speaker, this is legalese for "do not sue me, please".
01:35
As I was saying, it was a really exciting day. After months and months of work, we were releasing a new feature.
01:44
The team was really really excited and thriving because of this. But suddenly, our project manager came with bad news. We were having a leak in the server. And I'm not talking about a small leak, I'm talking like this kind of leak.
02:09
Welcome to FastFlood, a story of a massive memory leak in FastBoot land. My name is Sergio Arbeo; I'm also known as Serabe on Twitter and GitHub.
02:23
I work for DockYard. DockYard is a digital product consultancy, from idea to final product. We have QA, designers, project managers, engineers. If you're looking for some people to work with you, just drop us a line and we'll see what we can do.
02:45
If any of those present are not familiar with what FastBoot is, FastBoot is the server-side rendering engine of Ember. This means that we can render the pages in our server, so we can serve already rendered pages and let Ember take it from there.
03:04
If you're not familiar with what a memory leak is, it's basically that. It's a piece of data that should have been garbage collected, but it's not for some reason.
03:20
Let me give a clear example of this. Let's say we have variables 1, 2, 3, 4, 5 and 6 in our memory. Let's say we drop 1, 2 and 3.
03:41
If after this we still have 1, 2, 3, 4, 5 and 6 in memory, we say we have a leak. It could be good that whenever we drop 1, 2 and 3, we find ourselves with 1, 2, 3, 4, 5 and 6 in our memory. This means it's reproducible. That's great because it would be easier to debug it.
04:06
But it's still bad that we have dropped 1, 2 and 3 and still find ourselves with 1, 2, 3, 4, 5 and 6. And not just because we cannot get rid of 1, 2 and 3.
04:21
It's also because we don't know what else we can have in our memory. Like, oops, an accessibility soy. That said, I hope this makes things much clearer. But why do memory leaks happen?
04:45
The main reason is that something else is keeping a reference. This is almost 100% of cases. There's a tiny, tiny chance, really tiny, that it's a garbage collector bug. It usually happens in frameworks themselves.
05:03
It's really rare to see one of these in our applications. As we are talking about references, we can easily create an object memory graph. This is the object memory graph tool in Firefox. We are not going to see much of Firefox here because, as we are in FastBoot and FastBoot is Node,
05:22
it's much easier to work with Chrome developer tools in here. I don't know about you, but this tool reminds me a lot of the file directory tool in Jurassic Park. And a colleague told me that that was an actual thing in Solaris.
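To make the "something else is keeping a reference" cause concrete, here is a minimal sketch in plain JavaScript. The handler names and the `seenRequests` array are hypothetical and not taken from the talk; the only point is that a long-lived structure holding on to per-request objects keeps them reachable, so the garbage collector cannot reclaim them.

```javascript
// Hypothetical module-level array: it lives as long as the process does.
const seenRequests = [];

// Leaky handler: every request object is pushed into the long-lived array,
// so it stays reachable and can never be garbage collected.
function leakyHandler(request) {
  seenRequests.push(request);
  return `Hello, ${request.user}`;
}

// Non-leaky handler: nothing outlives the call, so each request object
// becomes unreachable (and collectable) once the function returns.
function cleanHandler(request) {
  return `Hello, ${request.user}`;
}

for (let i = 0; i < 3; i++) {
  leakyHandler({ user: `user-${i}`, payload: new Array(1000).fill(i) });
  cleanHandler({ user: `user-${i}`, payload: new Array(1000).fill(i) });
}

// All three requests passed to leakyHandler are still reachable.
console.log(seenRequests.length); // 3
```

In a heap snapshot, the three request objects would show `seenRequests` among their retainers, which is the kind of reference the rest of the talk goes hunting for.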
05:41
But let's see a tool that's much more useful for us. And that's the heap profiler. In here, we have two panels. In the panel above, we have all the objects in the memory. And then we have information about them. The first piece of information we have is what people call the distance.
06:01
That's the distance from the GC root. It's a little hard to explain. It's much easier to read the written documentation about this. But the general idea behind this is that the bigger the memory leak, the smaller this number tends to be. It's not a real correlation, but it's highly likely.
06:22
And then we have the shallow size. That's the size of the object itself. Finally, we have the retained size. The retained size is the size we would free should we free that object. Let's see an example with, for example, this object.
06:43
In this case, we have 400,000, like 3% of memory, in shallow size. But should we free this object, we would be freeing other values as well. And those would free almost 30% of the memory.
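The difference between shallow and retained size can be sketched on a toy object graph. This is only an illustration under stated assumptions: the node names and sizes are invented, and real profilers compute retained sizes with a dominator tree rather than by repeated traversal. The definition used here is the one above: the retained size of an object is the memory that would become unreachable from the GC root if that object were removed.

```javascript
// Toy heap: each node has a shallow size and outgoing references.
// All names and sizes are made up for illustration.
const heap = {
  root: { size: 0, refs: ['app', 'cache'] },
  app: { size: 10, refs: ['config', 'shared'] },
  config: { size: 40, refs: [] },
  cache: { size: 5, refs: ['shared'] },
  shared: { size: 100, refs: [] },
};

// Collect every node reachable from `start`, optionally pretending that
// one node (`removed`) has been deleted from the graph.
function reachable(graph, start, removed = null) {
  const seen = new Set();
  const stack = [start];
  while (stack.length > 0) {
    const id = stack.pop();
    if (id === removed || seen.has(id)) continue;
    seen.add(id);
    stack.push(...graph[id].refs);
  }
  return seen;
}

// Retained size: total shallow size of everything that becomes unreachable
// from the root when `id` is removed (including `id` itself).
function retainedSize(graph, id, rootId = 'root') {
  const before = reachable(graph, rootId);
  const after = reachable(graph, rootId, id);
  let total = 0;
  for (const node of before) {
    if (!after.has(node)) total += graph[node].size;
  }
  return total;
}

console.log(retainedSize(heap, 'app')); // 50: app (10) plus config (40)
console.log(retainedSize(heap, 'cache')); // 5: shared is still held by app
```

Note that `shared` is not counted for `app` or for `cache`, because each of them alone is not enough to keep it alive or to free it.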
07:01
Below this first panel, we have the retainers panel. We can send objects from this panel to the panel above and vice versa. This is really useful because we can look for an object in the panel above and send it to the retainers panel. And we can see which objects are retaining that one.
07:23
Really, really useful. As I said, this is the heap profiler. We can do really good, really cool things with this. Basically, we capture the memory state at one point in time. And the tool lets us compare several different profiles.
07:44
For example, if you work mostly in the browser, we can do things like this, what we call the three snapshot technique. For doing this technique, the first step is to warm up our application. Let's say, just starting it, or starting it and logging in, would be warming up.
08:03
This would create a few objects in our memory. After this, we create the first snapshot. After the first snapshot, we do the action we suspect is leaking memory. And we do a second snapshot.
08:22
As we can see, after this action, a few objects have been marked to be collected. For example, the one in the bottom left corner is marked for collection. Then we repeat the action and we do a third snapshot.
08:46
Okay, now we have three snapshots. You might have suspected we would do so because it's called the three snapshot technique. But what can we do with this thing? We can do the following. We want the objects that are in the third snapshot.
09:04
That removes all the objects marked for collection or collected already. Then we want the objects created after the first snapshot. We are not interested in the objects created during the warm up.
09:20
Maybe if they move now. And finally, we want the objects created before the second snapshot. We are not interested in the objects created after doing the action for the first time. While this does not pinpoint an object that is leaking,
09:42
it does reduce a lot the memory we need to inspect. But this is not really useful for us. Because in FastBoot, the memory is more atomic. We don't have leaking between requests. For that, the timeline tool is much more useful.
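The three snapshot filter described above boils down to set logic: keep what is alive in the third snapshot, was not yet present in the first one, and was already present in the second one. A minimal sketch follows, assuming each snapshot is modelled as a `Set` of object ids; that model and the ids are illustrative only, since real heap snapshots are much richer and the comparison is done by the developer tools.

```javascript
// Each snapshot is modelled as a Set of ids of the objects alive at that
// moment. This captures only the set logic of the technique.
function threeSnapshotCandidates(snapshot1, snapshot2, snapshot3) {
  const candidates = [];
  for (const id of snapshot3) {
    const createdAfterFirst = !snapshot1.has(id); // not part of the warm up
    const createdBeforeSecond = snapshot2.has(id); // made by the first action
    if (createdAfterFirst && createdBeforeSecond) candidates.push(id);
  }
  return candidates;
}

const afterWarmUp = new Set(['app', 'router']);
const afterFirstAction = new Set(['app', 'router', 'leaked-1', 'temp-1']);
const afterSecondAction = new Set(['app', 'router', 'leaked-1', 'leaked-2', 'temp-2']);

console.log(threeSnapshotCandidates(afterWarmUp, afterFirstAction, afterSecondAction));
// [ 'leaked-1' ]
```

Here `temp-1` was collected before the third snapshot and `leaked-2` was created by the second run of the action, so only `leaked-1` survives the filter.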
10:01
The timeline tool looks exactly like the heap profiler we saw before. But with the timeline above. Let's inspect that timeline. In that timeline, we have a blue bar that represents the memory we are consuming.
10:23
If some part of that memory is being collected by the garbage collector, that part is displayed as a gray bar. More about the memory in FastBoot: usually in FastBoot the warm up action involves more memory being consumed.
10:45
But subsequent requests do not consume that much. Usually after a few requests, a new application is created, because there is the application: if you remember when we introduced FastBoot, there were application initializers and instance initializers.
11:01
It's mostly the same here. We create the application and we create the instances. A new application is created and the other one is dropped. In this scenario, all the requests are leaking almost 90% of the memory. The ideal situation would be something like this.
11:24
We see all the requests gray. Okay, you'd be wondering: now we have the tools, what? Okay, I'll tell you the process we followed and what we found during that story.
11:41
Step zero is that we need to reproduce it locally. Some of you might be thinking about using git bisect. That's a really useful tool if you can use it. In our case, since we were using feature flags extensively and we had been working on that for months, it was not useful for us.
12:01
In any case, this is useful for anybody: a production-like build. Why? Because we want to have the build as close as possible to production. That means that we might need to remove some loggers or some services,
12:21
but if we were building the FastBoot application and moving it to another project, we would be doing that in here. We want to be as close as possible. One big change that really needs to be done is no minification. That's just because in the panel we saw before in the heap snapshot profiler,
12:45
the names of the objects would be there. But what if your object had no name, like a simple POJO you were passing? Well, we have a snippet for that later.
13:01
The next step would be look for the leak in our code or look for changes between versions. We can approach this like, okay, we have just received the project in one state, let's inspect the project as it is now, or look for the changes that happened in those months.
13:25
For finding the leak, we followed this process. The first step is running the server. Don't forget to use --inspect or --inspect-brk, so you can use the Chrome developer tools with your Node instance.
13:41
Then we do one request. This idea was taken directly from the three snapshot technique. And also we do this first request manually. This is important because sometimes you don't solve the memory leak, but break the build, and that would let you see if you are still returning a website.
14:06
Then you start the timeline and finally make a few requests so you can inspect the code. For making those few requests, we usually use Apache Benchmark, the ab tool, with concurrency one,
14:20
so you can see each of those requests more clearly. This is the snippet, so you can see the name of some POJOs while inspecting the memory. You can use this snippet, which would let you see that POJO as LeakDetect in the inspector and look for it.
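The snippet shown on the slide is not part of the transcript. As a hedged sketch of the idea, the Chrome developer tools group heap objects by constructor name, so giving a plain object a named constructor (or attaching a named marker to it) makes it searchable in a snapshot. The class name `LeakDetect` and both helper functions below are assumptions for illustration, not the speaker's code.

```javascript
// Hypothetical marker class; its name is what shows up in the inspector.
class LeakDetect {}

// Variant 1: copy the POJO's fields onto an instance of the named class,
// so the object itself is listed under "LeakDetect" in the snapshot.
function tagAsInstance(pojo) {
  return Object.assign(new LeakDetect(), pojo);
}

// Variant 2: keep the same object and attach a named marker to it.
// Finding the marker in the snapshot leads back to the POJO via its retainers.
function tagWithMarker(pojo) {
  pojo.__leakDetect = new LeakDetect();
  return pojo;
}

const tagged = tagAsInstance({ id: 1, title: 'suspect' });
console.log(tagged.constructor.name); // LeakDetect
```

Variant 1 creates a new object, so it only fits where the POJO is created; variant 2 keeps object identity intact at the cost of an extra property.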
14:44
Or this other one. They have the same effect. If you need several names, just change LeakDetect for the names you want. Foo, bar, macarena, whatever. Then we have step two. We need to find the dominator.
15:00
Dominator is the term in the industry. I haven't found another one. If you know of a better one, let me know and I'll change the presentation. But the dominator is basically the retainer we need to remove so the leak is gone. Or we can also find the dependency because the leak can be
15:21
in one of our dependencies updated during this time. Step three. Remove the dominator or change the dependency version and win. Thank you so much. Wait.
15:42
This was not that simple in our case. We were dealing with two big problems. First, we were a fully remote team. There were four people on our team and I think there were even four time zones.
16:01
And we were leaking the container. If you are new to Ember, container is basically the registry Ember is using for everything. Everything is in there. So that's the reason we were leaking almost 90% of our memory. So what do we do? Well, after confirming we were leaking the container.
16:20
That was on the very first day. We had two approaches. The first one is to look for the owner leaking. Owner is basically the public API of the container. So we might be leaking the container somewhere. It might be our code or some of our dependencies. And also update to the latest Ember.
16:41
We were not on the latest Ember because of reasons I cannot disclose. But that's the other approach. Maybe, hopefully, sorry for the Ember core team, but hopefully the leak was there and it was not our fault. Spoiler alert, we don't know.
17:01
Then we assigned tasks based on people's knowledge. For example, there was one person on our team who had updated a similar application. So we asked him to start working on that: update our Ember.js. The other person was the main person behind the changes,
17:22
behind this new feature. So we charged him with going through the changes and seeing what could be wrong. And two of us had more experience finding leaks and inspecting memory. So we charged those people with doing a general investigation, approaching this as if they were new to the project.
17:44
That done, I cannot stress enough that communication is key. Communicate early and communicate often. In a remote environment, communication is really, really important.
18:01
In times of crisis, even more. Early and often lets us prevent duplicate effort on different tasks, and also use your colleagues as rubber ducks. Even if you think you might be wasting the time of your colleague,
18:22
this is not the case because this is a time-consuming task that consumes also a lot of morale. You really need that human contact as well. Take small victories before winning the war is one of the key concepts I want you to take from this talk.
18:41
First, finding the leak won't be done by one individual. As we were splitting the task, the responsibility should not be split. Why? Because the only reason one person in that team is finding the leak is because the rest of the team is trying other approaches.
19:02
This is really important. This is not a competition, this is a team effort. But why take small victories before finding the leak? First, and most important, morale. While going through this process, even if it is just a few days,
19:20
there will be really intensive days that will take a toll on your morale. But why do these small victories lift your morale? Well, they decrease the pressure. If you consume less memory, you need to restart the server less and you get less pressure from the external services.
19:44
Also, it improves your code base. Less memory consumption? Snappier apps. And less memory consumption? You need to inspect less memory to find the leak. And that's nice, that affects morale as well. If you need to inspect less memory, it's easier to find it.
20:03
At least in theory. But please don't take weak victories at any price. Some improvements are not worth it. Think that you might make a change that will need to be taken into account for the foreseeable future, every time you do something.
20:23
Those changes need to be easy to drop, in case you want to drop them. And they must not be hard to maintain. For example, one of the small victories we took is that we were using presenters in our templates. And we stopped caching those presenters in FastBoot land.
20:44
It was four or five lines of code for that and they were easy to remove, in case we wanted to. And that reduced the memory consumption by 30%. And that's nice. But four days later, we were still at the same point.
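The presenter change mentioned above is not shown in the transcript. A framework-free sketch of the general idea, under the assumption that presenters were memoized per model and that the app can tell whether it is rendering on the server, might look like this. The function `createPresenterLookup`, the `isFastBoot` flag and the `build` callback are all hypothetical names.

```javascript
// Hypothetical presenter lookup with an optional cache. `isFastBoot` stands
// in for however the app detects that it is rendering on the server.
function createPresenterLookup({ isFastBoot, build }) {
  const cache = new Map();
  return function presenterFor(model) {
    // On the server each request is short-lived, so skip the cache and
    // avoid keeping presenters (and their models) reachable.
    if (isFastBoot) return build(model);
    if (!cache.has(model)) cache.set(model, build(model));
    return cache.get(model);
  };
}

const build = (model) => ({ label: model.name.toUpperCase() });
const model = { name: 'post' };

const inBrowser = createPresenterLookup({ isFastBoot: false, build });
const onServer = createPresenterLookup({ isFastBoot: true, build });

console.log(inBrowser(model) === inBrowser(model)); // true: cached
console.log(onServer(model) === onServer(model)); // false: rebuilt each time
```

A change of this shape matches the criteria from the talk: it is a few lines, and deleting the `isFastBoot` branch restores the old behaviour.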
21:03
We were consuming much less memory, almost half of it. That's nice, of course. But we were still leaking like 40-50% of our original memory. What can we do now? This is hard to describe because we were out of options.
21:24
Okay, then we thought. This is basically the request in Ember land. A request, if you're not familiar with this in FastBoot land: you just get a request. It goes through several middlewares, because FastBoot is basically an Express middleware.
21:44
Then it hits FastBoot, FastBoot goes to the router, the router creates the routes, the route loads the data from the data store. Then it initializes the controller and the controller renders the template that uses the right components to be rendered. This is a simplified version and really inaccurate
22:04
but I think it's useful for our purpose. So the first thing we did, and we did this early, like the first day or the second one, is to check if it's something in our other middlewares, because we were using several of them. What we did is substitute FastBoot with a static response
22:24
and the leak was gone. So that means that it's actually in our Ember application. After that, we thought about the weakest point we could attack and easily change for a static response. That's simple: the template.
22:42
What we would do is we would remove the template and just use a static HTML received from the server. We did this, the leak was still there. That means the leak was not in our templates or any of the components below it.
23:02
Next place, we would replace the model in the route and we would return a plain old JavaScript object. We did that and bingo! The memory leak was gone. So we knew the problem was in the store.
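The elimination strategy described above (swap one layer for a static stand-in, measure again, move one layer down) can be sketched as a request handler with a switch. This is a minimal illustration, not the team's setup: `createHandler`, `useStatic` and `realRender` are hypothetical names, and `realRender` merely stands in for the real rendering layer (FastBoot in the talk).

```javascript
// Hypothetical stand-in for the real rendering layer under suspicion.
function realRender(req) {
  return `<html><body>Rendered ${req.url}</body></html>`;
}

// Build a Node-style request handler. With `useStatic` set, the suspect
// layer is bypassed entirely and a fixed response is returned instead.
function createHandler({ useStatic, render }) {
  return function handler(req, res) {
    const body = useStatic
      ? '<html><body>static</body></html>'
      : render(req);
    res.statusCode = 200;
    res.setHeader('Content-Type', 'text/html');
    res.end(body);
  };
}

// If memory stops growing with the static handler, the leak lives in the
// layer that was bypassed; if it keeps growing, look elsewhere.
const staticHandler = createHandler({ useStatic: true, render: realRender });
const realHandler = createHandler({ useStatic: false, render: realRender });
```

Either handler can be passed to `http.createServer` from Node's standard library; the same substitution is then repeated one level down, as the talk does with the template and the route's model.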
23:21
We had really, really custom store adapters and serializers. So that was bad news. The good news is that we had been using those customized adapters and serializers for a really long time. So we were fairly confident about it not being there.
23:42
Our memory leak. What we did at this point in time is spend a couple of days replacing parts of Ember Data and our adapters with static responses. This is not as simple as it sounds because, depending on the point, we might need to tweak different things.
24:03
After a couple of days we found the problem was in our adapter. Do you want to see the problem? The leak was here. In our adapter we have a computed property for headers.
24:23
This is using the old syntax because this happened almost a year ago. In these headers we were returning an authorization with a token injected from one add-on. Do you want to see the fix? Because this is going to be really nice.
24:42
The fix was this one. Headers was just a getter. But why was that happening? We suspect that something was happening in the request because all the properties in the request are being lazy
25:02
like computed at the last moment. And we think that's a combination of that and how the value in the bearer was injected. But we don't really know. So my last advice for this would be
25:21
let go. If it's hard to reproduce you won't be able to send a reproduction to the Ember team so they can find out. And maybe it's over your level of knowledge. Maybe it's over any of your team's level of knowledge and you cannot really find it.
25:41
You can spend some time on it but don't sweat over it. Thank you all for attending my talk on this remote version of EmberConf. It's been a pleasure talking to you, at least virtually. If you have any questions I don't know if there will be any system in place for doing that live
26:00
but you can reach me on Twitter at Serabe. Thank you.