
Let's (D)Trace Postgres


Formal Metadata

Title
Let's (D)Trace Postgres
Subtitle
tracing the madness
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
There are situations where an existing database mechanism doesn't provide sufficient methods to determine the exact work being performed. Examples of such situations are GIN index rebuilds triggered by the fastupdate mechanism, and slow computation of query plans due to a wrongly selected system timer. With the help of DTrace, we will take a look at the Postgres internals and show how a system administrator can clearly identify the root cause of a hanging insert and of slow query planning. DTrace is a dynamic tracing framework created by Sun Microsystems and available in FreeBSD, NetBSD, and macOS. DTrace is a powerful tool which allows software developers and administrators to measure performance and trace their applications using an advanced scripting language, D. Before entering the beautiful world of Postgres, we will discuss the basics of it during this talk. GIN in Postgres stands for Generalized Inverted Index and should be considered as a genie, not a drink. Generalized means that the index does not know which operation it accelerates. Having to do multiple searches/insertions when the same key appears in multiple heap rows becomes costly during bulk insertions, hence a fast update feature has been developed that maintains a pending list of entries, delaying the index rebuilds. Usually the pending list is merged with the main B-tree by the vacuum process, but it can also happen in other situations. Currently, the PostgreSQL database lacks a mechanism allowing a DBA to know when exactly this situation occurs and how long it took. This can lead to a seemingly hanging INSERT and a DBA scratching his head, not knowing what causes the stall. Using DTrace we can show exactly when this rebuild happens and how long it took, and how a DBA can script this in D, expanding his toolbox for the future, all without rebuilding the database.
Transcript: English(auto-generated)
So hello, everyone, and thank you for coming to our talk. We will talk about DTracing Postgres and tracing the madness of GIN index pending list flushes that we found in production.
We have a case study, and we hope you will all like our findings and see how great DTrace actually is at diagnosing problems in production without altering production code or your deployment. So my name is Adam Wołk. I am an OpenBSD developer and software developer
at Fudo Security. I have a database background. I worked for seven years with Oracle databases, unfortunately. And after that, I spent four years with Postgres, and now I'm working at Fudo Security, where I'm mostly doing back-end C stuff
and taking care of the database.
Unfortunately, we have only one mic, so sorry for that. But we will pass the mic three times, so I hope you will bear with us. So my name is Mariusz Zaborski. I'm also working at Fudo Security. I'm a manager and software developer there. I'm also a FreeBSD committer.
Together with Adam, we also run a BSD user group, if you are interested. And sometimes I blog about interesting stuff related to FreeBSD and computer science in general. So if you are interested, you can look into that blog as well. So first, I will describe to you a little bit what DTrace is and what we can really do with it.
But has anyone here ever used DTrace before? Like, OK, good. But most of you haven't, so I hope the users will bear with me and most of you can learn some new stuff about DTrace. So DTrace is a dynamic tracing framework.
We can trace our program without recompiling it, without adding any additional probes. We can just trace it. We can add probes in whatever places we want. We will see in a moment which things we can really trace.
And we program DTrace using D scripts. D is a very simple language. It's a subset of the C language. Some people say that it's inspired by awk as well, so the languages are quite similar. And the subset is designed so that you cannot hurt yourself.
So for example, you will not find loops in DTrace, because if you had a loop, you could loop forever in the kernel and never leave it. So things like loops are not allowed in the kernel. When we have DTrace in our operating system, there is no performance penalty.
It's just there. It doesn't run. It doesn't hurt us. If we enable DTrace, there will be some performance penalty for our application, not for the whole system, and only for the things that we are really tracing, only in the places where the probes are set.
So what can we trace with DTrace? We can create probes on which function we are calling, which arguments were passed to the function, how frequently the calls were made. We can also trace the returns
of functions or of syscalls. And also, which is very useful, we can track the whole function call stack. So we can see exactly what the path was, where we are right now, and much more. But we will focus on those few things. So a little bit about semantics.
A probe is something that we're tracing. This is something that we want to see. And a probe is built from a provider. A provider is some DTrace module. So we can, for example, have a syscall provider, which will inform us about the syscalls in our operating system. We have a PID provider, which
informs us about our processes. We have VFS, which is for the file system, and so on and so on. Some of the providers must be built into the operating system. So you will see that not all providers are exactly the same on all operating systems. So Solaris can have some other providers
than FreeBSD and Linux. So this is something that you need to be aware of. Then the module is basically a software module. It can be libc, it can be Postgres, it can be just the kernel. So this is just the module that we're working on. Then there is a function, which basically is the function
that we want to trace. So we can trace, for example, the open syscall or some other function with DTrace. Then we have a predicate. So we can say when the probe should be fired, what conditions have to be met
to trigger the probe. And then we have an action, which is basically some D code describing what we want to do with the information that we are gathering. So here we have some example DTrace probes.
So here we have a provider, which is syscall. Then we have the module, which we don't care about, so we just leave it empty; we don't need to provide it. Then we have the function, write, and the name, entry. So this probe will be fired every time the write syscall is executed.
And here we have another example, also with the syscall provider and the write function, but the probe is set on return. So every time we are returning from the syscall, it will be fired. And we also have a predicate: arg1 is basically the return value of the syscall.
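(A rough reconstruction of the two probes being described on the slide; the exact slide contents may differ slightly.)

    /* Fires on every entry to the write(2) syscall. */
    syscall::write:entry
    {
        printf("%s called write\n", execname);
    }

    /* Fires on every return from write(2), but only when more than 10 bytes
     * were written. As described in the talk, arg1 holds the return value
     * here; on some systems you will also find it in arg0. */
    syscall::write:return
    /arg1 > 10/
    {
        printf("write returned %d\n", arg1);
    }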
So every time you write more than 10 bytes, we will fire this probe. So, on to FreeBSD. I'm a FreeBSD guy, so you need to bear with me; I will talk a little bit about FreeBSD. With over 15,000 probes, we can very easily
list the probes using dtrace -l. So if your operating system supports DTrace, you can list all the probes with their provider, module, function and name. We can also filter our search a little bit.
So for example, if you are looking only for the syscall provider, we can also do that with dtrace. So here, of course, this is truncated: we have a list of all the syscalls that are enabled in FreeBSD, under the syscall provider.
We can also be more precise. We can ask what a probe really is. So here is a dtrace invocation with the flags that give us this description. This is also truncated, but we get, of course, provider, module, function and name, but also a small description of what this probe is.
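(Hedged examples of the listing commands being described; the exact flags used on the slides may have differed.)

    dtrace -l                             # list every probe the system offers
    dtrace -l -P syscall                  # only probes from the syscall provider
    dtrace -lv -n syscall::read:entry     # verbose output, including the argument types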
But what is also very important and very useful, we also get the information about which arguments are passed to the function. So if we, for example, look at the entry probe, then argument zero will be an int. That's true, because the first argument of read is the file descriptor that we want to read from.
So this will be the file descriptor. So let's do some tracing of a very simple program, with a few examples. So we have a very simple program with main, which basically generates a random number. If the number is odd, then we execute the foo function.
If the number is even, we execute the bar function. So what can we do with DTrace? We can basically set our DTrace probe on the function entries that we're interested in. So we just skip the module,
we skip the function, and we will get all the function entries in our script. So every time rand is run, we will get the information about that. Then foo is run, and so on and so on. And this also is truncated because, of course, we start with the main function and so on,
but this is not important. So we can also count how many times each function was called. So we can basically pass probefunc, the name of the function that we are interested in, to an aggregation function.
So we are counting how many times each function was executed. So here we see that foo was executed so many times, bar was executed so many times, and so on and so on. So like I mentioned, we can also track the user stack, so we can also aggregate how many times each path was taken.
So we pass the ustack() function, which gives us the current stack, to our aggregation, and we basically count how many times each stack was taken. Of course, I didn't mention that, but it's very useful to have a binary which isn't stripped. That means that we have symbols in our binary,
so we are able to see the names of the functions. Here, for example, one of the functions was stripped, so we don't see the name of the function, we just see its address. Of course, if the stack were deeper, then we would see the whole stack of the functions which were called.
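(A minimal sketch of the kind of one-liners described here; the program name a.out and the functions foo and bar come from the toy example above and are assumptions.)

    /* Count how many times each function in the program itself is entered.
     * The talk skipped the module as well, which also pulls in libc and friends. */
    pid$target:a.out::entry
    {
        @calls[probefunc] = count();
    }

    /* Count how many times each distinct user stack leads into foo or bar. */
    pid$target:a.out::entry
    /probefunc == "foo" || probefunc == "bar"/
    {
        @stacks[ustack()] = count();
    }

Run it with something like dtrace -s count.d -c ./a.out; the aggregations are printed when the program exits.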
So one of the most powerful things in tracing is generating flame graphs. Here we have an example of a flame graph. This is basically a visualization of the stacks, either the user stack or the kernel stack. Here we have a visualization of Postgres.
So we can basically see how long our process was spending time in which function. So for example, we have at the top standard_ExecutorRun, which is a Postgres function, and we see that all the time was spent in this function, but we also see the smaller functions lower in the stack, like the parsing ones,
and how much time our Postgres spent in them. So this allows us to profile our applications and see where we are lacking performance. Flame graphs were maybe not invented, but were mostly popularized by Brendan Gregg.
This is the guy when it comes to tracing. So he prepared two very simple scripts which you can run on your DTrace output to generate the flame graphs. To be fair, flame graphs are not specific to DTrace. You can also generate flame graphs
from perf or whatever tracing tool you are using, but it's very simple to do it with DTrace. Fair warning: you need to remember to adjust stackframes, which is the option for how many frames should be reported when
we call the stack() function, so the stack for the kernel, or ustackframes and jstackframes if you are tracing user programs. Because otherwise, you can get something like this, which looks like a flame graph, but it's a slightly broken flame graph,
because the stacks were truncated at some point. And you can see that this function call is basically the same as this one: we have PostmasterMain here, and here we have PostmasterMain and ClosePostmasterPorts again. But the output of our stacks was truncated,
so there wasn't enough information, and the script generating the flame graph thought that these were basically separate call paths, separate information. So please remember to adjust those values.
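(A hedged sketch of the usual DTrace flame graph pipeline, using stackcollapse.pl and flamegraph.pl from Brendan Gregg's FlameGraph repository; the process name, sampling rate and duration are just examples.)

    # Sample user stacks of postgres processes at 99 Hz for 60 seconds,
    # raising the frame limit so deep stacks are not truncated.
    dtrace -x ustackframes=100 \
        -n 'profile-99 /execname == "postgres"/ { @[ustack()] = count(); } tick-60s { exit(0); }' \
        -o out.stacks

    # Fold the stacks and render the SVG.
    ./stackcollapse.pl out.stacks > out.folded
    ./flamegraph.pl out.folded > flamegraph.svg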
So what is also very interesting: we can even cast our memory to a structure, so we are able to read variables from a struct if we want to. So here we have some structure, we assign some values to this structure,
and we call the foo function. So we create a probe on the foo entry, we copy in the memory from arg0, which is a pointer to our structure, we cast it to the test struct, and we print the values like in normal C.
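(A rough sketch of the toy example being described; struct test, its field names and foo are assumptions reconstructed from the description, not the exact slide contents.)

    /* This struct mirrors the toy program's structure; foo() receives a
     * pointer to it as its first argument. */
    struct test {
        int a;
        int b;
    };

    pid$target:a.out:foo:entry
    {
        /* Copy the struct out of the traced process and print its fields. */
        this->t = (struct test *)copyin(arg0, sizeof (struct test));
        printf("a=%d b=%d\n", this->t->a, this->t->b);
    }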
And what is the most interesting thing? We can basically use the preprocessor, so we can just include the header that we want. We don't need to do anything fancy and copy, paste and translate it into a DTrace struct or whatever. We can just include the header. So where can you use it? Of course, you can use it on FreeBSD.
You can use it on macOS. You can kind of use it on NetBSD; we heard that there are some problems, so maybe we don't recommend that, but you can. I didn't mention it, but DTrace was originally designed at Sun Microsystems, so of course you can use it on Solaris. There is also some work on porting DTrace to Linux.
We didn't test it, so we cannot really recommend it. But what is the most interesting thing? Maybe you will be able to use DTrace on Windows. There is some work from Microsoft to port DTrace to Windows. So if that happens, then you will basically be able to use DTrace on every modern operating system.
So what about DTrace and Postgres? Between DTrace and Postgres there is some kind of love, because Postgres already has a lot of probes for DTrace. Unfortunately, to be able to use those probes, you need to compile Postgres with the --enable-dtrace flag,
which nobody uses. And in our scenario, we also didn't have this flag compiled in, so we weren't able to use them. So this is something that I would like to mention, that you already have some probes, for example, for transaction start and transaction commit. But in this talk, we will not focus on those probes,
because we didn't have them. And you probably also won't have them, because you will not build Postgres with this support. So here is an example of how you can use them. Basically, we are counting how many transactions were aborted while tracing Postgres.
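(A small sketch using the documented static probes you get with --enable-dtrace; attach with -p to a backend, or replace $target with * to match every backend.)

    /* Count started and aborted transactions while tracing. */
    postgresql$target:::transaction-start
    {
        @started = count();
    }

    postgresql$target:::transaction-abort
    {
        @aborted = count();
    }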
So now let's go a little bit into the madness and talk about GIN.
OK, and we are back. So to understand why we even started with DTrace, we have to know the use case and what trouble we had.
So we observed a hanging insert in production. Something that usually took a second started taking 30 minutes, like really minutes. And there was nothing in the operating system indicating what could be the cause of that behavior.
So I started to take a look and made some guesses. And I thought that the most likely cause was the GIN pending list being merged with the main B-tree of the index. But I had no real way of confirming that.
And we will show you how you can do that with DTrace. But to start, we will go over a bit of what GIN indexes are and what the pending list is. I think most people here know what a GIN index is, but we will still try to rush through it. And I hope I won't bore you or say something stupid.
So what are indexes of type GIN? They are generalized inverted indexes. They are used mostly for full text search, but also for JSON and JSONB indexes. In our case, we use it for full text search. The index items are composite values that contain zero or more keys, so like integer arrays
where the keys would be integers, or text, where the keys would be lexemes. And they are optimized for cases where many keys have the same values and appear in many different items. So I shamelessly stole this slide from PGConf AU 2012
by Oleg Bartunov and Alexander Korotkov. Those are the, I think you all know, but those are the main GIN implementation authors. And I really love this slide because it shows very well how a GIN index works. It's like the indexing at the back of your book
that lists all the words and all the pages those words appear on. So for example, if you would like to find in a book which page talks about accelerometer compensation, you would find, is there a pointer here? Yeah, so you would look at accelerometers,
look at all the pages, look at compensation, look at all the pages, find the common page 30. So it's most likely that page 30 talks about accelerometer compensation. And that's how you do also full text search.
And this slide is also great because it shows the actual structure of the index. It doesn't show the meta page and we will need to talk about the meta page because we will be accessing it. So the meta page contains control information, the index version and statistics and points to the entry tree.
And that part of the structure is included on the slides. The entry tree is a B-tree of entries potentially containing a pending list... a posting list, sorry. And that posting list, since Postgres 9.3, can be compressed. And if there is no posting list,
most likely it didn't fit along with the key, so instead there's a pointer to a posting tree containing the items. So the pending list itself is attached to the meta page, and it's a linked list of pending key entries
that were not yet merged with the main B-tree. This is a performance optimization to allow for faster inserts. I will talk about that a bit in a moment. And it's only there if the fast update option
is enabled, and it's enabled by default. So that's why it's important. And that was my assumption. So before we did this talk, I ran a very quick poll, but that was back a few months ago, so you weren't able to take part, but I'm very curious what you all think.
So the question is: there are four parameters listed here, work_mem, maintenance_work_mem, gin_pending_list_limit and autovacuum_work_mem. And the question is, which of those parameters has no direct impact on how much memory is used
while the pending list is merged back into the B-tree? So let's do a show of hands. So again, I want to make sure the question is understood: which one won't cause more memory usage if I raise it, when the list is being merged back?
So who thinks that work_mem has no impact? Okay, so like one person. maintenance_work_mem? No hands. gin_pending_list_limit? Okay, one. autovacuum_work_mem? Okay, so everyone thinks
autovacuum_work_mem has no impact. We will see in a moment. Okay, so why are we even holding a pending list? Because index rebuilds are costly, and when you are inserting items into a GIN index,
due to its structure, for example if you are indexing a sentence, it might touch a lot of leaves of the tree. So you are potentially locking and changing a large part of the whole index, and that takes time.
And in order to optimize that, you can hold a pending list, and that's actually what it's doing. It's a linked list of items that you keep along with the meta page. Instead of directly adding them to the index, you just hold them there. And when someone tries to search inside the index,
you first do an O(N) scan of the list before going into the B-tree. That's obviously okay, as long as the list doesn't grow too big. Another thing that's very important
is that there's some predicate locking happening with fast update on. Because we have a pending list, we actually don't know where in the index the entries will land. So we can't lock just a single part of the tree when an insert happens; we have to take a lock on the whole thing. So this potentially impacts every query:
every query that makes a search will block on the whole tree. And the cost of each query is increased, because we are doing an O(N) scan of the pending list. So we have to dump it before it grows too big. And where does this dumping happen?
It happens during vacuum. Whenever we vacuum, the pending list will be merged back with the B-tree, regardless of its size. It can have one item and it will be merged back; it can have 30 items and they will be merged back. It will happen with autovacuum.
It will, what I found surprising, but it is documented, also happen during autovacuum analyze. So you might suddenly see that there's actually a lot of action going on in the B-tree, because it's merging a pending list, and you are just thinking that it's analyzing data,
but it's doing much more, right? And it won't be done on a direct ANALYZE. So we can't force it by doing ANALYZE on the table or something like that. It's only with autovacuum analyze. It also happens post insert. And that was my guess that this was our production issue.
And whether it happens post insert depends on your Postgres version and what happened. So it can happen when it gets over working memory, the work_mem setting; that's before Postgres 9.5.
It can happen if it goes over gin_pending_list_limit; that's post 9.5. And it can happen when it's triggered by you, by a function call. You can call the function gin_clean_pending_list, and this will cause the pending list to be merged back
with the main B-tree. So, answering our question about work_mem: the flushing will use at most work_mem of working memory if the pending list flush
(ginInsertCleanup, to use the function name from the Postgres internals) happens post insert. So we are inserting just regular INSERT statements, and Postgres decides that it's time to flush. In that case the code will use at most work_mem of memory
for flushing the whole list, even if the list is larger than the working memory we have. If we are triggered from autovacuum and autovacuum_work_mem is set, it will use at most autovacuum_work_mem of memory. If we are triggered by gin_clean_pending_list,
or triggered by autovacuum with autovacuum_work_mem not set, it will use at most maintenance_work_mem. So when you are choosing a gin_pending_list_limit, you have to know that it has absolutely no impact on how much memory will be used
during the actual flushing. It only decides whether the list is big enough that it requires an immediate flush into the B-tree right now, during inserts. So yes, gin_pending_list_limit has no impact on memory use.
So our test, because we wanted to reproduce the issue first before doing it on production and we wanted something easy that all of you can try, I stole it from an article linked at the end of the blog post and at the end of the slides and we will make the slides available
so you will find the source, and it will be properly credited. So we create a test table with just an ID and a tsvector column. We turn off autovacuum; we turn it off because we want to know exactly where our flushes happen,
and without that, Postgres might just dump the pending list without us noticing it. And we create a simple GIN index on that column. We insert some data. That's not really data for doing proper text search, we are just filling up the index, so we generate a lot of MD5 checksums
of some plain text. And here is our configuration for these slides, because we were slightly derailed by checking fun stuff with DTrace and Postgres: we were tweaking work_mem, we were tweaking gin_pending_list_limit.
We did set max_worker_processes to one. We did that because we are lazy and didn't want to trace which processes Postgres spawns. Well, we did that, but it just made the testing easier, not having to look those up. But it's not a requirement for DTrace.
You can have full parallelism enabled in Postgres and trace everything and that's not a problem. We were just lazy. Okay, so let's look inside. What can we do in this specific case if we don't have DTrace, right?
What can we do without DTrace? If you're lucky enough not to have an ancient Postgres like we did in that use case, or lucky enough that you are allowed to enable extensions on your production database (maybe you can't), or if you are lucky enough that your feature
has an extension written for it (because you might have hit a different feature, and there might not be an extension for a new feature), you can use, for example, pageinspect and pgstattuple to diagnose this issue. So you could check this test index and see the pending pages and pending tuples,
so how large actually the pending list is, right? And by calling that constantly, you could observe it growing and going down and you could make guesses or confirm your guesses
that yeah, it's likely rebuilding the pending list, because that insert hung for 30 minutes and after those 30 minutes the pending list shrank, okay? But in our case, we didn't have that option. And it also has some drawbacks. So the extensions have drawbacks.
There is a lot of code, right? They need to be loaded into the database. They don't answer the question of when exactly that pending list cleanup, for example, was triggered. We have to manually probe and sample the database
to check if that happened. And with dtrace, you can get like a callback immediately when that actual code gets executed. And those extensions might not be compatible with your version and the feature might be too new like I mentioned before. So going back to Mariusz.
Okay, sorry for that. So we wrote a very simple D script. And here we have ginInsertCleanup.
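(A reconstruction of roughly what that script could look like, assuming an unstripped postgres binary traced with -p <backend pid>; the original script from the slides may differ.)

    /* Time every pending list flush and every executed statement. */
    pid$target:postgres:ginInsertCleanup:entry
    {
        self->flush = timestamp;
    }

    pid$target:postgres:ginInsertCleanup:return
    /self->flush/
    {
        printf("ginInsertCleanup: %d ns\n", timestamp - self->flush);
        self->flush = 0;
    }

    pid$target:postgres:standard_ExecutorRun:entry
    {
        self->run = timestamp;
    }

    pid$target:postgres:standard_ExecutorRun:return
    /self->run/
    {
        printf("standard_ExecutorRun: %d ns\n", timestamp - self->run);
        self->run = 0;
    }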
This is the Postgres function responsible for flushing the pending list. So when the flush occurs, we just save the current timestamp, and when ginInsertCleanup finishes, on the return probe, we basically print the difference between the two times. We did the same thing for standard_ExecutorRun,
which is a Postgres function too, the one responsible for executing our SQL statement. So this is what the output from DTrace looked like. We see when exactly the flushing is happening and what the duration of the flush was. So we can even see that when we are inserting a lot of data,
the flush times change. And we can also see the whole execution time. All of this data is in nanoseconds; you can convert it however you like in DTrace. So we did some testing. We poked around a little with the pending list memory. So here is the setting where the pending list limit
is set to four megabytes and working memory is set to four megabytes. All those colors are when the flushing is happening. So if we have many colors, that means a lot of flushing was happening, and we see how long the flushing really takes.
So this is our graph. We can see that for a pending list with a small amount of memory and with a small work_mem, we are flushing quite often, and the time keeps increasing with every next flush. So when we still had the pending list set to four megabytes,
but we increased work_mem, we can see that there was exactly the same number of flushes, but they were shorter. So Postgres did something so that the real flushing into the tree was much shorter. And we also tried to poke a little at the pending list,
with a very big pending list but small work_mem. We had only three flushes, but the last one was very, very long. And if we change the pending list to very big, to 32 megabytes, and work_mem to 64, we have three short flushes.
So theoretically, you would like to set a large pending list and a large work_mem, but you need to remember that we were only inserting data and we measured only inserts. If you would also like to measure the selects: while the pending list was being built,
it wasn't flushed into the tree, so every time you are trying to select something, Postgres needs to go over the pending list and look at the data manually, linearly. So it may be slow. So depending on your workload, you can poke a little bit at those settings.
So it was very interesting what happens here; basically, why, when the pending list is very large but work_mem is small, the time of the third flush here was very, very long. So we decided to generate a flame graph.
So what we did is we just print the aggregated user stacks and kernel stacks. profile-5000 is a probe that allows you to sample at a given frequency.
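(A sketch of the sampling clause being described; the process name check is an assumption.)

    /* 5000 samples per second; whenever a postgres process is on CPU,
     * record both its user stack and the kernel stack. */
    profile-5000
    /execname == "postgres"/
    {
        @user[ustack()] = count();
        @kernel[stack()] = count();
    }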
So in this case, 5000 times a second, we are poking our operating system, asking, hey, what is happening? And if it's working on our process, it will just give us the user stack of our process. So this is the graph that we get. We are most interested in the function
standard_ExecutorRun... so we just, sorry, in ginInsertCleanup. So we can truncate all the rest. So this is the flame graph when the pending list is quite large, 42 megabytes, and with a small work_mem.
And then this is the pending list with our best case scenario, so when the pending list is very large and work_mem is really large. We can see that the left side executes exactly the same functions. So what is happening? Why, when work_mem is smaller,
does it spend less time in this part of the run? So the processPendingPage function, together with ginInsertBAEntries, is responsible for preparing our data, preparing our pending list to be flushed into the tree.
Because we don't have a lot of work_mem, we cannot do much with the data, so we had only 600 samples in these functions.
When we have a lot of work_mem and the pending list is quite big, then we can work on our data a bit more. We can prepare it to be inserted into the tree, so we can save time on actually inserting the data into the tree. So here, the right side of our flame graph is where the actual GIN entry insert happens.
And this is basically the function which is responsible for flushing all that data into the tree. So as we can see here, basically when we don't have work_mem, our Postgres cannot prepare our data to be flushed. It basically needs to do it in smaller chunks,
more often. So our real case was that we had some inserts into our database, and we had no idea what was really happening in our Postgres. We would like to know what is happening. So here is the insert that we are doing.
Adam is a fan of this tool. So what we really did is basically select from pg_stat_activity the actual backend process that is executing our query, and we wrote, again,
a very simple D script using the already familiar profile probe, this time sampling at 99 hertz. And this is very interesting: this is why DTrace is more powerful than, for example, attaching GDB.
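(Roughly what that looks like; 12345 stands for the backend PID taken from pg_stat_activity.)

    # Sample the user stack of one specific backend at 99 Hz,
    # without stopping or modifying the process.
    dtrace -p 12345 -x ustackframes=100 -n '
    profile-99
    /pid == $target/
    {
        @[ustack()] = count();
    }'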
In this case, we didn't modify Postgres, DTrace didn't even modify the memory of Postgres or anything. It only set a probe: every so often, fire and see what is happening in the kernel itself; if you are executing my program, just tell me what the stack trace of the program is.
If you would attach GDB to Postgres or try to use some other technique, it would basically somehow need to stop Postgres and look at what is happening. Maybe you would crash your Postgres. In this case, we are basically asking the kernel: when you execute this process,
tell me what is happening there. So this is the backtrace we got. And here we can find ginInsertCleanup, which was the function that was running, the one Postgres was working on. This is not ideal, of course. When you get such a stack,
you need to look into the code, guess, read a little bit of code, understand what is happening, but it's better than nothing, right? So what about the length of the pending list? When we were preparing this presentation and we saw those extensions, we wanted to do the same, you know, with DTrace, and it should be easy, right?
To be fair, it was kind of interesting, a little bit hackish, but I hope you will bear with me on how we were able to get the pending list length. So in Postgres, we have a structure called GinMetaPageData,
which has exactly the fields that we are interested in. So nPendingPages and nPendingHeapTuples are the two values that describe how big our pending list is. So we went to ginInsertCleanup and we wanted to find out where the metadata is.
So we look into the code and we see that there is a variable, metadata, and we wanted to get at it. But unfortunately, metadata is calculated by GinPageGetMeta, which is a macro, so we cannot attach to that, because the compiler just did its thing, so we don't have access to it.
It's using BufferGetPage, which I think is also a macro, so we also cannot attach to the return value of that code, and it is using ReadBuffer, so it's pretty complicated to get the metadata in this function. So what we did is we disassembled the function, and we see that here is InvalidBlockNumber,
which is basically a minus one. So we see that this comparison is happening here, and the value being compared comes from this register, rax. So that means that the pointer to the structure is basically in this place. If we hooked in right here in the code,
we would have exactly the pointer that we want. So we did that. We wrote a small D script which hooks into ginInsertCleanup. Here, we don't hook at the entry; we hook at an offset inside the function,
the one we really want. We read that register (rax in our build, but the exact one doesn't really matter), we copy the structure out, and we print the number of pending pages. A rough sketch of the idea is shown below.
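(A hypothetical sketch only: the instruction offset, the register and the struct layout all have to be checked against your own postgres binary and Postgres version; none of them are stable.)

    /* Local mirror of the beginning of GinMetaPageData; verify the layout
     * against your Postgres sources before trusting the numbers. */
    typedef struct gin_meta {
        uint32_t head;                 /* BlockNumber */
        uint32_t tail;                 /* BlockNumber */
        uint32_t tailFreeSize;
        uint32_t nPendingPages;        /* BlockNumber */
        int64_t  nPendingHeapTuples;
    } gin_meta_t;

    /* Hook at an offset inside ginInsertCleanup (14d is a placeholder; take the
     * real offset from the disassembly of your binary) where the metapage
     * pointer happens to sit in rax on amd64, and dump the pending list counters. */
    pid$target:postgres:ginInsertCleanup:14d
    {
        this->meta = (gin_meta_t *)copyin(uregs[R_RAX], sizeof (gin_meta_t));
        printf("nPendingPages=%d nPendingHeapTuples=%d\n",
            this->meta->nPendingPages, this->meta->nPendingHeapTuples);
    }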
So it's doable. I don't think it really changed anything in the end, but you can see that DTrace offers you a lot more: you can really poke around your process and see what is happening. In this case we unfortunately had to go much deeper to get the pending page counts, but it was an interesting journey. So here are some useful materials.
The presentation will be shared so you can see them later. And thank you very much. And maybe there are some questions that we could answer.