Using Skylight to Solve Real-World Performance Problems
Formal Metadata

Title: Using Skylight to Solve Real-World Performance Problems
Number of Parts: 88
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/37353 (DOI)
Production Year: 2018
Production Place: Pittsburgh
RailsConf 2018, Part 7 of 88
Transcript: English (auto-generated)
00:12
All right, so 2:40 on the dot. Let's do it. I'm Godfrey. No one is introducing me today, so I have to do the work myself.
00:20
But before we start, a very important question for you. Did you file your taxes? In case you didn't know, the deadline was yesterday. But if you missed it, don't worry. There will be another one next year. But more importantly, are you in the right room? Because this is a sponsored session, in case you didn't know what it means. Just making sure everyone is here at the right room
00:42
for the right talk. No regrets. Yeah. All right, so sponsored talk. So if we all know what we're getting into, let's do it. This video is brought to you by Blue Apron. You're busy, we get it. We're busy, too. But that doesn't mean you can't have a nutritious and delicious.
01:04
Yeah, I figure I should burn the evidence before I get lawyers calling me. I actually work for a small company called Skylight. I don't know if anyone actually believes us when we say we're a small company, so we decided to put the entire company on stage this year. So hopefully people will believe us.
01:22
But even though we're not sponsored by Blue Apron, if you do want to sign up for Blue Apron, I have a personal promo code that you can use. Speaking of personal referral code, this is my personal referral code on Skylight. If you sign up with this URL, we'll
01:40
both be swimming in Skylight credits. So I'll give you a few minutes to copy that down. Got it? OK. If you don't want to remember the long one, you can also do skylight.io slash r slash railsconf, in which case I won't get any money, but you will. Anyway, if you like free money,
02:01
you probably are into cryptocurrency. So we're announcing the Skylight ICO as a token of appreciation. You can find the details at skylight.io slash pepicon.ico. So we have been very busy working on a lot of things in the past year.
02:21
But instead of telling you about all of them, today we decided to focus on just one. And that's Skylight for Open Source. And Vaidehi will be here to tell you all about it. Your call is very important to us.
02:42
Can you hear me? Oh, how about this? All right. Hello, my name is Vaidehi. And as Godfrey mentioned, one of the things we've been working on is the Skylight for Open Source program. And it was born from our own experiences
03:01
as open source contributors. So many of us on the Skylight team actively work on open source. And from our own experiences, we've seen that it can sometimes be pretty hard to get existing contributors to actually work on application performance issues. So only a few contributors actually consider working on performance problems to begin with.
03:22
And even the ones that are already interested can find it hard to know where to start. And that's why we created the Skylight for Open Source program. This is, of course, the same Skylight that you know and love, except that open source apps will have their public dashboards accessible to everyone.
03:40
That way, all your potential contributors can access this performance data easily, even if they don't have a Skylight account. We started this program because we hope that by making performance data and information more accessible, we could inspire potential contributors to tackle those really slow parts of their favorite open source apps.
04:01
And by leveraging Skylight, they could finally have a good way to see if their own contributions actually helped. Because at the end of the day, every single app, large or small, open source or not, has room for improvement when it comes to performance. So after spreading the word about the Skylight
04:21
for Open Source program, we decided to try contributing back ourselves. We had a company-wide awesome-thon. Yes, awesome-thon. It's like a hackathon, but it's for open source projects, as you might have guessed. And we paired up to help make performance contributions and improvements to some of the apps
04:41
that were participating in the Skylight for Open Source program. So today, we're going to show you three different open source apps that are running on Skylight, each of them with their own unique performance challenges, varying when it comes to complexity. So to start, I'm going to hand it off to Yehuda, Lee, and Peter, who will introduce us
05:03
to the very first app. Hey, Peter, want to see something awesome? Sure, what's awesome? Well, I'm glad you asked. Awesome stands for Open Source Event Manager.
05:22
I have no idea if it's actually pronounced that way, but I'm going with it anyway. Awesome is an event management tool tailored to free and open source software conferences. It's a one-stop shop to set up a website for your conference, handling everything from program, registration, sponsorship, as well as the admin interface for editing all this content. Since Awesome is designed to be a self-hosted solution,
05:42
working on performance issues could be tricky, since you'll need real data and traffic to understand how the code is performing in the real world. Luckily, one of the maintainers of Awesome is running a conference later this month, which is LinuxFest Northwest 2018. So you can see Awesome working in action on the website and on Skylight too.
06:02
This is the Skylight dashboard for the LinuxFest website, which is powered by Awesome. On the top, you'll see a graph showing some key performance metrics over time. The blue line represents the app's typical response time, the purple line is the problem response time, and the gray part at the bottom is the number of requests per minute.
06:20
Now you're probably wondering, what are the typical and problem response times? The technical names for these metrics are the median and 95th percentile, which are a pretty good proxy for what a typical user of your site would experience on a typical day versus a rainy day. However, why do we have to show you two numbers? Can't we just show you the average response time here?
06:40
That is a very good question. Hang on to that thought for now, as Yehuda will explain in a minute, not yet. For now, let's move our attention to the second half of the page, the endpoints list. This shows you all the endpoints in your app, their typical and problem response times, as well as their popularity. By default, they're ranked by agony, which helps you prioritize which performance issues
07:01
are worth spending your time fixing. It takes into account both how slow the endpoint is as well as how popular it is. So you won't waste your time fixing a slow endpoint that's only accessed by a handful of people every day. Those of you with good eyes might also notice the red heads-up icon next to the endpoint name. This indicates potential database issues.
07:22
But what kind of database issues? Yehuda will now tell you all about it. So to find out, let's click on the first row to see the detailed endpoint page.
07:42
So the first thing you're probably gonna notice here is the histogram on top. What this shows you is a distribution of the response times for this endpoint. This is a pretty good time to revisit Lee's earlier question about typical and problem response time versus averages. If there's one thing you remember from your statistics class, it's probably the normal distribution, AKA the bell curve. And you might expect your response time distribution
08:02
to look something like that, matching the bell curve you learned in school. The normal distribution has some pretty nice properties, which is why they teach it in school. For example, if your response times actually did look like this, the average would be right in the middle, making it a pretty representative number. And if they did look like this, you would expect a lot of your visitors
08:20
to experience response times clustered around this number, and that would also be pretty nice. However, response times in the real world do not tend to follow this distribution. They tend to follow a distribution more like this, which is skewed to one side. A lot of the requests in this distribution will be fast, but there is also a long tail of slow requests. In this case, an average just doesn't tell you anything
08:42
about your fast requests. And even worse, it misleads you about your slow requests, which are usually around 10 times slower than the fast ones. Now if you ask me to tell you where the average of this particular distribution is, I actually have no idea where to point to, which is why instead of showing you the average and misleading you into thinking about the bell curve, we show you the typical response,
09:01
which covers about half of all of your requests, and the problem response time, which covers about 95% of your requests. Now it's important to remember that one in 20 requests falls outside of this range, so basically all of your users are going to experience slowness on the right side of that bar pretty regularly, and so we show you both numbers to remind you of what your users experience
09:21
in reality in the real world. So that's the first half of the page. The second half of the page shows you what your app actually spends time doing during the request. For example, as you would expect, most of the time in the request is spent here processing the controller action, which is indicated by the blue bar.
09:41
However, a lot of time is spent doing other things within the action, such as database queries and rendering views, which is why parts of the blue bar are painted in light blue. The child tasks inside the light blue are also color-coded, green is for database-related stuff and purple is for view rendering. You can click on the event to view more details about it, such as the actual SQL query being executed,
10:02
and if you remember, the endpoint page we saw earlier had a heads-up icon referring to a potential database issue in this endpoint. It corresponds to this red icon over here, which refers to a pretty common kind of database issue in Rails apps called N+1 queries. And Peter is going to explain what an N+1 query is and how you can fix it.
10:22
Thanks, Yehuda. I imagine that probably many of you have heard of N+1 queries before, but if you haven't, don't worry, because I'm about to explain them. So let's say you built a blog in Rails. You'll probably have some code like this in the controller: an index action that selects our posts and sends them off to the template.
10:42
Since this is a blog, you probably also have authors, where each post belongs to an author. Finally, you might have something like this in your templates: you loop over each post and render the author's name and probably some other data about the post. This works. However,
11:02
this fairly benign-looking code is actually hiding some important work from you. When you access post.author, Rails doesn't yet have the author data. It has only loaded the post itself, so it has to do an additional query to fetch the author from the database. Since we have to do this for each post that we render, we'll end up issuing 10 separate author queries
11:22
to the database, in addition to the one query we already did to fetch the posts. This is why it's called the N+1 query problem. Because we know we are going to need the author information anyway, we can optimize this by fetching the authors ahead of time using just a single additional query.
11:41
Rails provides an API to do this for us automatically. You just change the controller code to add this includes call. Probably many of you have done this, but if you haven't, it's a really useful thing to know. Now, when we select the posts, Rails does that first query, and then a second one that fetches all the authors at once.
12:02
And admittedly, this is only the simplest possible scenario. In the real world, you'll often find your N+1 queries hidden deep in the code. They may follow the same pattern, but somewhere that's not quite as obvious and straightforward as this. Fortunately, Skylight does a pretty good job of detecting and highlighting these problems, pointing you in the right direction of where to investigate further.
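As a rough illustration of the pattern Peter describes, here is a plain-Ruby sketch (no Rails required; the FakeDB class and all data are hypothetical) that counts simulated queries, mimicking what happens with and without eager loading:

```ruby
# A fake database that counts how many queries we issue, simulating posts
# that each belong to an author.
class FakeDB
  attr_reader :query_count

  def initialize
    @authors = { 1 => "Ada", 2 => "Grace" }
    @posts   = (1..10).map { |i| { id: i, author_id: (i % 2) + 1 } }
    @query_count = 0
  end

  def all_posts
    @query_count += 1 # SELECT * FROM posts
    @posts
  end

  def find_author(id)
    @query_count += 1 # SELECT * FROM authors WHERE id = ?
    @authors[id]
  end

  def authors_for(ids)
    @query_count += 1 # SELECT * FROM authors WHERE id IN (...)
    @authors.slice(*ids.uniq)
  end
end

# N+1: one query for the posts, then one more per post for its author.
db = FakeDB.new
db.all_posts.each { |post| db.find_author(post[:author_id]) }
puts db.query_count # 11 queries for 10 posts

# Eager loading (roughly what Post.includes(:author) does): two queries total.
db = FakeDB.new
posts   = db.all_posts
authors = db.authors_for(posts.map { |p| p[:author_id] })
posts.each { |post| authors[post[:author_id]] }
puts db.query_count # 2 queries
```

The counts are the whole point: the naive loop scales with the number of posts, while the eager-loaded version stays at two queries no matter how many posts you render.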
12:22
Going back to Awesome, we were able to use the information from Skylight as a starting point. We found a couple of places where N+1 queries were an issue and submitted a PR to address them. As our time is pretty limited here, I won't get into any more of the nitty-gritty details. However, if you're interested in learning more, feel free to stop by our booth in the exhibition hall. We'll be around later today, and we'll be around tomorrow.
12:42
And we'd be happy to show you and talk to you a bit more about how we found and fixed some of these problems. But for now, I'm gonna hand this over to Kristin and Zach to talk about the work they did on another open source app, the Odin Project.
13:03
Thank you, Peter. We're gonna talk about the Odin Project. The Odin Project is an open source community and curriculum for learning web development, like a code school, basically. Students of the Odin Project build portfolio projects and complete lessons that are constantly curated and updated with the latest resources.
13:23
They offer free courses like Ruby, Rails, JavaScript, HTML, and CSS. Once you're done climbing the technical ladder, they even have a course on how to go about getting a job in the industry, walking you through things like searching, interviews, and much more. One challenge of running an app like the Oden Project comes from the size of the community.
13:41
As you can imagine, there's a lot of information to keep track of for each user, such as which courses they took and their progress in each course. With over 100,000 users, this adds up very quickly. For example, let's take a look at the work involved in rendering these beautiful badges.
14:01
As you might imagine, you need to know two things to be able to render a badge. The lessons belonging to the course and which of those lessons the student has completed. The first part is pretty straightforward, just a simple select query. The second part involves a join table
14:20
called lesson completions. It's pretty standard too: there is a lesson ID column and a student ID column. Since we are showing badges for every course on the page, in order to avoid the N+1 query problem we do a single query to fetch all the completion records for the current student. However, according to Skylight, this second query takes up a noticeable amount of time
14:40
on each request. Not terribly slow, but definitely noticeable. So the question is, can we make it faster? Now, as you may know, databases rely on indexes to keep these kinds of queries fast. Essentially, you do a little more work when you're inserting the data in exchange for a significant performance boost when querying.
15:01
In a lot of cases, the answer is simply to add the missing index. But this case is a little more nuanced than that. Because a student shouldn't be able to complete the same lesson twice, there's already a compound index on lesson ID and student ID that guarantees uniqueness.
15:20
Since we're only querying on the second field, is this compound index sufficient for our query? For a long time, and for most databases, the answer would have been no. However, starting in version 8.1, Postgres is sometimes able to take advantage of the compound index for queries
15:41
involving only the second column in that index. Unfortunately, even when that is possible, it's still not as efficient as having a dedicated index for that second column. We can confirm this by running an explain analyze on the queries. If you haven't used this before, it's a great tool in Postgres to understand what the database is doing under the hood in order to help you optimize your queries.
16:04
As you can see, if we have no indexes at all, Postgres will be forced to do a sequential scan, which could be fairly slow on a large table. Now let's add the compound index and try again. As you can see, Postgres is using the compound index here, but this is still not as fast as it should be.
16:21
Finally, we'll add the dedicated index for the student ID column. As you can see, Postgres is now able to fully take advantage of the index and improve the query performance noticeably. Note that none of this applies to the first column of a compound index. If we were to query on lesson ID alone, Postgres would be able to use the compound index
16:41
just as efficiently, so there is no need to add a dedicated index for that case. Here's a summary of what we have learned. First of all, Postgres is generally very smart about using multiple indexes for a single query. So you should always just start with individual indexes.
17:02
However, if you already need a compound index for other reasons, it gets a little more complicated. You would wanna make sure that you prioritize the field you want to query separately and put that first in the compound index. Alternatively, you can maintain a dedicated index for any additional columns by which you would like to query.
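To make that summary concrete, here is a toy plain-Ruby analogy (the data is made up, and in-memory arrays and hashes only loosely model real B-tree behavior) of why a compound index on (lesson_id, student_id) doesn't speed up a lookup by student_id alone:

```ruby
# A handful of fake completion records.
completions = [
  { lesson_id: 1, student_id: 7 },
  { lesson_id: 1, student_id: 9 },
  { lesson_id: 2, student_id: 7 },
  { lesson_id: 3, student_id: 9 },
]

# "Compound index": entries sorted by [lesson_id, student_id]. A lookup by
# lesson_id can exploit the ordering of the leading column, like a B-tree does.
compound  = completions.sort_by { |c| [c[:lesson_id], c[:student_id]] }
by_lesson = compound.select { |c| c[:lesson_id] == 2 }

# ...but a lookup by student_id alone gains nothing from that ordering:
# every entry still has to be examined, just like a scan.
by_student_scan = compound.select { |c| c[:student_id] == 7 }

# "Dedicated index" on student_id: direct lookup, no scan.
student_index      = completions.group_by { |c| c[:student_id] }
by_student_indexed = student_index[7]

puts by_student_scan == by_student_indexed # same rows, very different work
```

Both approaches return the same rows; the difference is how much work it takes to find them, which is exactly what EXPLAIN ANALYZE revealed in the talk.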
17:21
We were able to apply what we learned here, submit a PR to the Odin Project, and improve this query's performance. While we were working on this, we noticed a similar case, this time on the lessons table.
17:44
Here we have a compound index for slug and section ID and we wanted to query on section ID. So we thought we could add a dedicated index for section ID as well. However, when we ran explain analyze, we noticed that Postgres wasn't using the new index. As it turns out, the lessons table was fairly small,
18:01
less than 200 rows in total. When dealing with such a small table, Postgres can do a sequential scan much faster than it could use the index. So if we had added the index, it would have just been more work at insertion time without any benefits.
18:21
The moral of the story here is: don't optimize prematurely, and always check your assumptions. It's often better to just do the straightforward thing, let Skylight tell you what's slow, and then spend your time investigating and optimizing that code. Ultimately, we were able to figure out why this second query was slow and send a second PR to fix it.
18:42
However, that's a story for another time. If you're curious, come ask us about it at our booth after the talk. For now, let's hand it over to Vaidehi and Godfrey to talk about the open source app that they helped improve. Are you ready?
19:02
Click when ready. Click when ready. Let's do it. Are you ready? I can't click until you're ready. Let's do it. It's ready enough. All right. Thank you, Kristen. So we decided to work on one of our favorite open source apps called Code Triage. What makes Code Triage special
19:20
is that it's an open source app that's built to help other open source projects. So as you might know, popular open source projects receive a lot of bug reports, feature requests, and pull requests every single day. And just reading through all of them could be a huge time sink for all of these projects' maintainers. And we're involved in a number
19:40
of popular open source projects ourselves, so we understand this problem pretty well. Code Triage lets you help out your favorite project maintainers by signing up to help triage GitHub issues and pull requests. Once you sign up, the app will send a random ticket to your inbox every day. That way, you can help split the workload
20:01
so that everyone only has to do a little bit of work every day. You can even sign up to help triage and improve documentation on the project, receiving one method in your inbox every day. As you can see, Code Triage is actually a pretty cool app, and there are already tens of thousands
20:21
of users helping out thousands of repositories this way. So that's pretty cool. Wow, that's a lot of projects and users. That's right, buddy. And as we can see, running an app at this scale will create some pretty unique and interesting challenges.
20:41
Take a look at this home page, for example. This is what you see when you go to codetriage.com, right? It lets you browse through some of the most popular projects and shows you how many open issues need to be triaged for each one. That's a lot of information to render, so it can get a little bit slow at times. At the same time, this is also by far one of the most popular pages in the app,
21:02
because this is what everyone sees when they go to the main website, right? So Skylight is understandably marking it as a high-agony endpoint, meaning that we'll probably get some pretty good bang for our buck optimizing this page. Notably, even though there's a lot of information on the home page, most of those things
21:22
actually don't change all that often, which makes this page a prime candidate for caching. But as it turns out, a lot of performance-minded people have already worked on Code Triage, and they've done extensive work fine-tuning everything that needs to be done. As a result, a lot of the things that we found
21:42
that should be cached already were. However, when we looked at this page on Skylight, we noticed that in order to populate the meta tag, we needed to run two queries to fetch the counts of the users and the projects on the site. Since we probably don't need these numbers
22:00
to be super up-to-date, we can get away with caching them. So we submitted a PR to cache this information for up to an hour, and as expected, when we deployed this change, we were actually able to shave off a little bit of time for the home page. So other than the header and the footer
22:22
and the meta tag, the bulk of the time is spent rendering these cards on this page, which is understandable. These might look pretty simple, but actually a lot of work goes into rendering these cards. However, even the most popular open source projects only receive new issues a handful of times per day,
22:40
maybe up to a handful of times per hour. So most of these cards don't actually change that much. And you're probably thinking the solution is fragment caching, because we're using Rails, right? And as we mentioned before, a lot of the obvious performance work on this app has already been done, and of course, these cards are fragment-cached,
23:02
so that seems good. But in fact, they are better than fragment-cached. They are actually cached in a pretty smart way. If you just learned about fragment caching, you would probably do something like this: you loop through each of the cards, and you cache each card inside the loop. This works, but it has a problem.
23:22
This would result in Rails going to your cache store, sending a network request to your cache store to get the fragment for each of these cards every time it renders them. We're looping through about 50 cards here, so that would end up making 50 separate requests to your cache store,
23:41
which can add up to a lot of overhead. This is kind of like the N+1 query problem for caching, right? So to solve this problem, the Code Triage maintainers took advantage of a Rails 5 feature called collection caching. Instead of the loop, you move the content into a partial, render it with the collection,
24:02
and pass cached: true. By doing this, Rails is able to fetch all the fragments it needs ahead of time in a single request to the cache, cutting down significantly on network overhead. And you can see this reflected on Skylight as well: the cache read_multi there. That's the single request that Rails is sending
24:22
to the cache store to get all of those fragments. So admittedly, this is pretty cool. But as we looked more closely at this, we realized that something was wrong. If you look at the logs, you will see something that looks like this.
24:40
And as you can see, for each request to the homepage, we're missing more than half of the 50 cards in the cache. If this is a one-time occurrence, it probably wouldn't be a really huge deal, Rails would simply generate the missing content and write them to the cache. Then any subsequent requests would be able to take advantage of that cached content.
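To recap the round-trip difference described above, here is a plain-Ruby sketch; the ToyStore class is made up, standing in for a real cache store and counting how many network round trips each strategy would make:

```ruby
# A toy cache store that counts round trips, illustrating per-fragment reads
# versus the single batched read_multi that collection caching issues.
class ToyStore
  attr_reader :round_trips

  def initialize
    @data = {}
    @round_trips = 0
  end

  def read(key)
    @round_trips += 1 # one request per read
    @data[key]
  end

  def read_multi(*keys)
    @round_trips += 1 # one batched request for all keys
    @data.slice(*keys)
  end

  def write(key, value)
    @data[key] = value
  end
end

keys = (1..50).map { |i| "views/card-#{i}" }

# Naive loop: one round trip per card.
store = ToyStore.new
keys.each { |k| store.write(k, "<div>card</div>") }
keys.each { |k| store.read(k) }
puts store.round_trips # 50

# Collection caching: all fragments fetched in one round trip.
store = ToyStore.new
keys.each { |k| store.write(k, "<div>card</div>") }
store.read_multi(*keys)
puts store.round_trips # 1
```

The rendered output is identical either way; what changes is the number of trips over the network to the cache store, which is where the latency lives.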
25:03
However, the strange thing is that we were able to consistently reproduce those cache misses, even when processing multiple consecutive requests to the homepage. So perhaps the cache was way too small. However, the math doesn't really check out.
25:22
We had provisioned a 100 megabyte memcached store for this app. Sure, it would have been nice to have a bigger cache, but each of these cards is only around a few hundred bytes. So even at this size, it would still take at least a couple hundred thousand of these cards to fill up this entire cache.
25:40
If this square is the 100 megabyte cache space, the 50 cards would probably occupy a fraction of a pixel in this one square. So this was definitely not our issue. It is possible that there's something very big in the cache that's taking up all this space.
26:01
However, our mental model is that memcached is an LRU (least recently used) store, which means that new cache data pushes out whatever was used least recently. But since we're not caching anything big on the homepage itself, our mental model would suggest
26:22
that memcached will evict older items in the cache as needed in order to make room for our cards. At this point, it's pretty obvious that our mental model is wrong. And we need to get a better understanding of how memcached actually manages its data.
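That LRU mental model can be sketched in a few lines of Ruby, using the fact that Ruby hashes preserve insertion order. To be clear, this is the model we had in our heads, not how memcached actually behaves:

```ruby
# A toy LRU cache: reads and writes move a key to the back of the
# hash; when full, the front (least recently used) key is evicted.
class ToyLRU
  def initialize(max_items)
    @max = max_items
    @data = {}  # Ruby hashes preserve insertion order
  end

  def write(key, value)
    @data.delete(key)             # re-inserting moves the key to the back
    @data[key] = value
    @data.delete(@data.keys.first) while @data.size > @max
  end

  def read(key)
    return nil unless @data.key?(key)
    value = @data.delete(key)     # touch: move the key to the back
    @data[key] = value
  end

  def keys
    @data.keys
  end
end

lru = ToyLRU.new(3)
lru.write(:a, 1); lru.write(:b, 2); lru.write(:c, 3)
lru.read(:a)        # :a is now the most recently used
lru.write(:d, 4)    # evicts :b, the least recently used
p lru.keys          # => [:c, :a, :d]
```

Under this model, old uncached cards would always get evicted to make room for new writes, which is why the consistent misses were so surprising.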
26:41
Now, before we get into this, I should probably mention that we're not actually running memcached ourselves. We're using a SaaS provider that implemented a proprietary memcached-compatible store. So some of these details are specific to that implementation and may not apply to memcached itself. But that's mostly in the nitty-gritty.
27:01
Most of the things here apply to memcached in some way. So the first thing we learned is that our cache store is not just 100 megabytes of free space that is available for everything. Memcached actually groups the data you're trying to store into different tiers based on size, and manages the space for each tier separately. You can see this in action
27:22
by running the stats items command in memcached. The output looks something like this. There's a lot of stuff here, but it's basically telling you a few things. You can see the sizes for the different tiers. You can see the number of items currently stored in a particular tier, as well as how many items
27:42
have been evicted from that particular tier. Since all the card fragments that we're looking at on the homepage are generally around the same size, they will end up being cached in the same bucket, specifically bucket number four. We can see that bucket number four is responsible for storing things up to a kilobyte.
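Here's a small sketch of reading those per-tier counters out of stats items output. The sample lines are fabricated for illustration, but number and evicted are real counters in that output, in the shape STAT items:&lt;slab_id&gt;:&lt;counter&gt; &lt;value&gt;:

```ruby
# Tally per-slab counters from memcached's `stats items` output.
# The sample values below are made up for illustration.
sample = <<~STATS
  STAT items:3:number 120
  STAT items:3:evicted 0
  STAT items:4:number 5000
  STAT items:4:evicted 1972543
STATS

slabs = Hash.new { |h, k| h[k] = {} }
sample.each_line do |line|
  next unless line =~ /\ASTAT items:(\d+):(\w+) (\d+)/
  slabs[$1.to_i][$2] = $3.to_i
end

# Slab 4, which stores items up to 1 KB, is where our cards live:
p slabs[4]  # => {"number"=>5000, "evicted"=>1972543}

# Roughly 5,000 items at up to 1 KB each: about 5 MB of usable space.
p slabs[4]["number"] * 1024  # => 5120000 bytes, ~5 MB
```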
28:04
And the tier before it is responsible for storing things up to 512 bytes. So this ends up being the tier for anything between 512 bytes and one kilobyte, which fits all of our cards. And you can see that we currently have around 5,000 items in this tier,
28:23
which works out to around five megabytes of storage, which is much smaller than the 100 megabytes we had in our heads, that we assumed was available to us. So the implication here is that similarly sized items in memcached might be competing for a much smaller amount of space
28:41
than you expected. Okay, so sure, we only have five megabytes to actually work with, which is a lot smaller than we originally thought. But still, five megabytes is enough space to hold 5,000 cards. And if we look at our homepage,
29:02
we are only trying to cache 50 cards. So once again, the math doesn't really add up. So this isn't quite the end of our story. However, from the same stats command that Godfrey mentioned before, you can see that the eviction count is indeed very high on this bucket.
29:21
We've evicted almost two million items from this one bucket alone. So this supports our observation that our cache is overflowing. It's spilling out things as fast as we put things into them. At this point, the only explanation is that there must be something other than our cards in this bucket that's taking up all this space,
29:42
and somehow they're not getting evicted. It would be helpful, of course, if we could just open up this bucket and see what those things are. Unfortunately, the SaaS provider that we were working with didn't offer that ability. But based on our research, we had an educated guess.
30:02
There's a feature in memcached that automatically expires cached entries, which is also known as TTL, or time to live. Our theory was that when mixing TTL data and non-TTL data within the same bucket, this cache store would always choose to evict non-TTL data
30:23
before any of the TTL data was ever considered for eviction. And since our cards didn't have a TTL, and our cache was already pretty full, they would always be the first ones to be evicted. So what we ended up doing was flushing the entire cache manually,
30:43
and sure enough, it freed up space for our cards. And this allowed us to achieve 100% cache hit rate on the homepage, and that resulted in a pretty dramatic performance improvement. So another good thing that came of this investigation
31:02
was that we actually discovered some bugs in other open source projects, too. If you look at the Skylight trace before we flushed the cache, you will see that, yes, we are doing one read multi, so we avoided the N+1 problem for caching, but we are immediately spending a lot more time in cache read and cache write. So what's going on?
31:23
Looking at the logs, we'll see that this is the read multi. It's a very long line, so it didn't fit entirely on the screen, but it's basically fetching all 50 fragments in one request. But then, immediately after that, when we're rendering the missing rows, we end up doing another read to the cache store,
31:42
even though we already know there's nothing in the cache store, because the cache store didn't return anything in the original read multi, right? That's the whole point of that request. So in other words, the read time here, which, by the way, is longer than the original read multi, is completely unnecessary,
32:01
and we should be able to eliminate this bar entirely, which is taking up a noticeable amount of time on the page, even when we don't have a 100% hit rate, which is what happens in the real world. And if you keep reading the log, you will also see a second problem. At the top, basically,
32:22
we do the unnecessary read for the non-existent data in the cache, we generate the content, and we write it back, which seems fine. But then, after the loop, we immediately write all of the missing rows that we just wrote to the cache again, for no apparent reason, in the same request,
32:41
which means we should also be able to cut this time roughly in half. So finally, as we were working on this, we also discovered a bug in Rails 5.2 that causes each cache item to take up twice as much space as necessary. We believe all of these are bugs in either Rails or the Dalli gem, and before we took our flight to RailsConf,
33:04
we were working on patches to fix these issues, and hopefully we'll finish them when we get back to Portland. But notably, the particular bug about Rails 5.2 compression has already been fixed and merged on master, and you should expect to find it in a Rails 5.2.1 release very soon.
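The flow those patches aim for — one read multi for all the keys, rendering only the misses, then one bulk write — can be sketched with a plain Hash standing in for the cache store. The store contents, key names, and render_card helper here are made up for illustration; this is the shape of the fix, not Rails internals:

```ruby
# Sketch of the intended caching flow: one bulk read, one bulk write,
# and no per-item reads or duplicate writes for the misses.
store = { "card/1" => "<li>cached card 1</li>" }  # stand-in cache store

def render_card(key)
  "<li>freshly rendered #{key}</li>"
end

keys = ["card/1", "card/2", "card/3"]

# One bulk read for every fragment we need (the read multi).
hits = keys.to_h { |k| [k, store[k]] }.compact

# Render only the misses, without re-reading the cache per item.
misses = (keys - hits.keys).to_h { |k| [k, render_card(k)] }

# One bulk write for everything we rendered (the write multi).
store.merge!(misses)

fragments = keys.map { |k| hits[k] || misses[k] }
p fragments.size   # => 3
p store.keys.sort  # => ["card/1", "card/2", "card/3"]
```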
33:23
So we initially started by looking for ways to help our favorite open source apps using Skylight, and I think we did accomplish that, but we also ended up accomplishing something kind of bigger than that too. Through this journey, we ended up finding ways
33:40
to contribute back to upstream projects like Rails. And we hope that the Skylight for Open Source program, this journey of ours, and our OSS-a-thon will help inspire more contributors to do the same. So speaking of journeys,
34:00
I'm actually gonna be giving a talk tomorrow about the Rails router's internals, and it's going to be a wonderful journey. Am I really, I'm trying to drive it home. I don't know if it's working. But I hope to see you all there. All right, so that's something to look forward to for tomorrow, but for now, since we promised to bring the entire company,
34:22
here is our CEO, Leah, with a baby. I usually bring him on stage with me, but he was so quiet and wonderful, and I don't wanna mess with it.
34:42
I chose this emoji, because that's what CEOs do, is show up at the end after you've done all the work and just take all the credit. So thank you all, that was fabulous. Yeah, this is our team. We are a small, I wanna say little family, but I feel like all companies call themselves family, and it has no weight anymore.
35:01
But we do also invite our actual families to work with us. For example, we have a Babies at Work program, and Jonas, which is my son over here, and Kristen's daughter, hang out with us at the office all the time, because why not? Turns out it's not actually impractical, and it makes everybody happy, because you walk down the hall, and there's babies babbling and smiling and stuff.
35:22
And I did write a blog post about it, about our Babies at Work program. So if you are trying to attract talent, parents, certainly women specifically, mention it to your HR person, who will look at you like you've lost your mind. But then point them at my blog post, because I talk about, I start out basically like,
35:41
here's all the reasons I thought it was a terrible idea. I was wrong. So we build Skylight, we work really hard on it, we like giving back to the community. We are very particularly focused on making performance a thing that everybody on your team can be involved in. A lot of people, especially towards the beginning of their programming careers,
36:01
have this conception that this is the sort of thing that happens in dark basements with the super ops people that they don't know, or could never be. And I guess in some cases, or using some tools it could be true, but one of our missions is to make that not the case, to make it so anybody on your team can have a spare cycle, 15, 20 minutes, drop into Skylight,
36:21
figure out what they can do, find some low-hanging fruit. We have a lot of that low-hanging fruit outlined in our documentation site. For example, if some of these things that you saw here are not things that you know how to do, many of them are in our doc site, just sort of pointing in the right direction to be able to fix it in your own app. So yeah, we have a booth. We're downstairs, we're hard to miss.
36:42
I was supposed to, I was supposed to make a joke about how professional we are. We're very professional. We make a lot of jokes. We make a lot of puns. If your pun game is not up to snuff, I don't know if we're a good place for you, but you can work on that.
37:01
This is a skill that can be built. And yeah, this is us. We're excited to meet you. We hope you use Skylight on your paid projects, on your free open-source projects, and we'd love to talk to you about it, so come on by. Thank you all.