
PostgreSQL Entangled in Locks: Attempts to free it

Formal Metadata

Title
PostgreSQL Entangled in Locks: Attempts to free it
Title of Series
Number of Parts
19
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Lock contention in PostgreSQL is one of the major woes for scalability on many-core machines. In the past, several attempts were made to fix different lock contention issues, and some of the contentions were solved to an extent, e.g. WALInsertLock, ProcArrayLock and BufFreelistLock. Still, many lock bottlenecks remain unsolved, e.g. CLogControlLock, WALWriteLock and BufferMappingLock. In this talk, we will present our wait event experiments to show the contention on various locks, and we will analyse the effects of these locks on various workloads. We will discuss the experiments done to reduce lock contention of various locks along with their performance results. The discussion will focus mainly on BufferMappingLock, WALWriteLock and CLogControlLock. First, we will describe our experiments wherein we split WALWriteLock into two separate locks, one for WAL write and another for WAL flush. Next, a lockless implementation of the hash table for BufferMappingLock will be discussed. Finally, the presentation will conclude with a discussion of the group clog update concept for reducing lock contention around CLogControlLock.
Transcript: English (auto-generated)
Hello everyone, I am Amit Kapila, working for EnterpriseDB. I work on both the PostgreSQL server and Advanced Server, which is EDB's proprietary product, and today I am here along with my colleague Dilip Kumar to present on PostgreSQL locking. Basically, the talk is about the bottlenecks in various kinds of workloads, the work done to resolve some of them, and the possible approaches for solving the remaining bottlenecks. So I will let Dilip start.
Hello all, my name is Dilip. Today we are going to talk about PostgreSQL locking, as Amit said. This is the outline of the presentation: first we will briefly touch upon the work done in the past to resolve these bottlenecks, and then we will move to the main part of the presentation, where we will talk about the problems that are yet to be solved and which bottlenecks show up next in the current version of PostgreSQL. Broadly, we have divided it into two parts, read-only bottlenecks and read-write bottlenecks. I am going to cover the read-only part and Amit is going to cover the read-write part. A set of locks has been the major bottleneck for scalability in PostgreSQL for the past many versions, and the good news is that every time it is a different lock. We solve one lock, scaling gets better, we move a level ahead, and then we hit the next lock. All the credit goes to the work done in the past. There is a long list of this work and I could not fit all of it into one slide.
So I have mentioned a few of them, like the fast-path relation lock work done in 9.2 by Robert Haas; the locking regimen change around buffer replacement, the lockless StrategyGetBuffer clock sweep, I think done by Andres; the ProcArrayLock contention reduction by group clearing of the XID, work done by Amit; some work related to buffer header spinlock removal, which I think Andres and Alexander did; and another piece of work related to hash header lock contention, where instead of maintaining a common freelist shared between multiple backends for allocating free elements, the freelist was separated per partition, and we have seen a lot of contention reduced on this hash header lock. So, continuing with the same motivation, we are again benchmarking the latest version, v10, to see what the latest bottlenecks are.
We will also discuss some of the patches and work done in the community, the benefits they have shown, and the problems because of which they could not fly, or could not get into PostgreSQL so far. So, as a first step, we have taken performance data with various scale factors and different configurations. We especially targeted two cases: one where the data does not fit into shared buffers and one where the data fits into shared buffers. The commit used is a recent one from the v10 development branch. When the data does not fit into shared buffers we used a very high scale factor of 5000, and we can see the graph falling after 64 clients; it cannot scale beyond 64 clients. Whereas in the other case, where the data fits into shared buffers, we tested with scale factor 300, and there it can scale up to 128 clients, and even after that it does not fall immediately, it stays flat. To find the reason for the bottleneck we ran a wait event test. With the same scale factor of 5000, a test duration of 2 minutes, and 128 clients on the same machine, you can see the buffer mapping lock at the top, and after the buffer mapping lock no other lock comes anywhere close. So it is clear that, at least when the data does not fit into shared buffers, the buffer mapping lock is the main contention. We ran the same wait event test with scale factor 300 as well, but there no lock showed up as a bottleneck. So, to identify what other problems there can be when the data fits into shared buffers, we ran a perf test, and perf shows that GetSnapshotData takes a huge share of the instructions, at least 14 to 15 percent, while no lock shows up as a bottleneck. I think that is because the LWLock shared-to-shared contention problem was resolved by Andres in 9.5.
So that is a summary of what we have seen: the buffer mapping lock is the major bottleneck when the data does not fit into shared buffers, and when it fits, GetSnapshotData takes a noticeable percentage. Now, the buffer mapping lock. This lock is used to protect the shared buffer mapping hash table. We acquire it in exclusive mode when we need to bring a new block into the hash table during buffer replacement, and we take it in shared mode to find an existing buffer entry, so there is exclusive-versus-exclusive contention as well as exclusive-versus-shared contention. This lock is already partitioned for concurrency, but I think that is not good enough.
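To make the partitioning idea concrete, here is a minimal standalone sketch of how a partitioned buffer mapping table is typically consulted. It is an illustration only, not PostgreSQL source: the names are invented, the per-partition table is a toy, and pthread read-write locks stand in for PostgreSQL's LWLocks.

```c
/*
 * Standalone sketch of the partitioning idea behind the buffer mapping
 * lock (illustration only, not PostgreSQL code): the buffer tag is
 * hashed, the hash selects one of N partition locks, lookups take that
 * partition's lock in shared mode, and buffer replacement takes it in
 * exclusive mode.  pthread rwlocks stand in for PostgreSQL's LWLocks.
 */
#include <pthread.h>
#include <stdint.h>

#define NUM_PARTITIONS 128
#define PART_SLOTS     1024   /* toy per-partition table, populated elsewhere */

typedef struct { uint32_t rel, block; int buf_id; } MappingEntry;

static pthread_rwlock_t part_lock[NUM_PARTITIONS];
static MappingEntry     mapping[NUM_PARTITIONS][PART_SLOTS];

void init_mapping_locks(void)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        pthread_rwlock_init(&part_lock[i], NULL);
}

static uint32_t buftag_hash(uint32_t rel, uint32_t block)
{
    return rel * 2654435761u ^ block * 40503u;
}

/* Shared-mode lookup: lookups on different partitions never contend,
 * but lookups on the same partition still serialize against the
 * exclusive acquisition done during buffer replacement, which is the
 * contention described in the talk. */
int lookup_buffer(uint32_t rel, uint32_t block)
{
    uint32_t h    = buftag_hash(rel, block);
    int      part = h % NUM_PARTITIONS;
    MappingEntry *e = &mapping[part][h % PART_SLOTS];
    int buf_id = -1;

    pthread_rwlock_rdlock(&part_lock[part]);
    if (e->rel == rel && e->block == block)
        buf_id = e->buf_id;          /* block is already in shared buffers */
    pthread_rwlock_unlock(&part_lock[part]);
    return buf_id;                   /* -1 means not mapped */
}
```

Even with many partitions, a hot partition still serializes its readers against buffer replacement, which is why a lock-free table is attractive.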
Now, GetSnapshotData. In read committed isolation mode we need to take a snapshot for every query, and in prepared/execute mode, which we generally use for benchmarks, we need to take a snapshot for planning as well as for execution. Calculating the snapshot costs extra work because we need to read memory from each PGPROC, get the current XID values, and build the snapshot. In read-only mode we did not see any contention on the proc array itself, but we suspect that the time spent computing the snapshot, which is done under the ProcArrayLock, will contend in read-write mode, where we take the ProcArrayLock or the proc array group lock in exclusive mode.
I am going to discuss two approaches, two patches that got submitted to the community, along with the performance benefits shown by that work and some problems because of which they still could not fly. The first is C hash, a lock-free hash table. It is not completely lock-free, as we will see in the upcoming slides, but most of it is lock-free and is done with memory barriers and atomic operations. Lookups are done entirely without locks, using only memory barriers, and inserts and deletes are done with atomic operations. A delete will just mark the node as deleted by setting a marked bit; it will not immediately move the node to the free list, because of course some scan may be in progress and needs to complete, so we cannot immediately put the node into a reusable state. And as I said, it is not completely lock-free: the remaining lock is there because, if an insert needs to allocate an entry from the garbage list, it has to wait for any scans that are still in progress over that garbage list, and that waiting uses a spinlock.
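The following is a minimal standalone sketch of the two ideas just described, lock-free lookup and delete-by-marking, written with C11 atomics. It is not the actual chash patch; the garbage list, its spinlock, and safe memory reclamation are deliberately omitted, and all names are invented for illustration.

```c
/*
 * Minimal sketch of lock-free lookup and delete-by-marking, as used in
 * the C hash proposal described above.  Not the real patch: garbage
 * list handling and memory reclamation are omitted.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct Node {
    uint64_t           key;
    uint64_t           value;
    _Atomic(uintptr_t) next;        /* low bit = "deleted" mark for this node */
} Node;

#define MARK_BIT   ((uintptr_t) 1)
#define PTR(x)     ((Node *) ((x) & ~MARK_BIT))
#define IS_DEAD(x) (((x) & MARK_BIT) != 0)

/* Lock-free lookup: walk the bucket chain with acquire loads only;
 * nodes whose own next pointer carries the mark bit are logically
 * deleted and are simply skipped. */
bool chash_lookup(_Atomic(uintptr_t) *bucket, uint64_t key, uint64_t *value)
{
    uintptr_t cur = atomic_load_explicit(bucket, memory_order_acquire);

    while (PTR(cur) != NULL)
    {
        Node     *n    = PTR(cur);
        uintptr_t next = atomic_load_explicit(&n->next, memory_order_acquire);

        if (n->key == key && !IS_DEAD(next))
        {
            *value = n->value;
            return true;
        }
        cur = next;
    }
    return false;
}

/* Delete-by-marking: atomically set the mark bit on the node's next
 * pointer.  The node stays on the chain until it is later unlinked and
 * moved to a garbage list, once no scan can still see it. */
bool chash_mark_deleted(Node *n)
{
    uintptr_t next = atomic_load_explicit(&n->next, memory_order_acquire);

    while (!IS_DEAD(next))
    {
        if (atomic_compare_exchange_weak_explicit(&n->next, &next,
                                                  next | MARK_BIT,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire))
            return true;            /* we performed the logical delete */
    }
    return false;                   /* someone else deleted it first */
}
```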
This is the performance test for the same case where we showed the problem, where the data does not fit into shared buffers, run on head as well as with the patch. You can see that on head it was falling after 64 clients, whereas with the patch it can go up to 128 clients, and after that it is not falling immediately either, it stays flat; that is maybe because the hardware has 128 cores and it may scale further. We also ran the wait event test again with the patch and the same configuration, and you can see that the buffer mapping lock, which was at the top on head, is completely gone with the patch. So we no longer see any major contention at the top after this patch. With the C hash, or concurrent hash, patch it scales up to 128 clients, and the maximum gain, at 256 or 128 clients, is more than 150 percent. The wait event test also shows the contention is completely gone. The main problem because of which the patch could not get in is the regression at 1 client: an 8 to 10 percent regression at 1 client. So the motivation is that either we identify and solve those problems, if possible, or we find some similar approach that removes the lock without introducing the problem at one core. But the buffer mapping lock problem is definitely there and we need to resolve it.
The next part is about resolving the GetSnapshotData problem: calculate the snapshot only once and reuse it across backends until it is valid. Once the snapshot is calculated we put it into a cache, and the next time any backend needs a snapshot it checks the cache. If a valid snapshot is there in the cache we can reuse it; otherwise we need to recalculate the snapshot. If the cached snapshot is invalid and multiple backends are trying to get a snapshot, only one backend will store into the cache; the other backends will continue with the regular snapshot calculation, as they do now. And in ProcArrayEndTransaction, when we clear the XID, we invalidate the snapshot in the cache.
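Below is a standalone sketch of that caching idea under assumed names; it is not the actual patch. It glosses over which backend publishes its result and over the locking that, in a real implementation, would keep a reader from copying a half-updated cache.

```c
/*
 * Sketch of the snapshot-caching idea: a copy of the most recently
 * computed snapshot lives in shared memory with a validity flag.
 * Backends reuse it while it is valid; clearing an XID at transaction
 * end invalidates it.  Illustration only; names are invented.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_XIP 1024

typedef struct Snapshot {
    uint32_t xmin, xmax;
    uint32_t xip[MAX_XIP];          /* in-progress XIDs */
    int      xcnt;
} Snapshot;

typedef struct SharedSnapshotCache {
    _Atomic bool valid;             /* set by the backend that filled it */
    Snapshot     snap;
} SharedSnapshotCache;

extern SharedSnapshotCache *snap_cache;        /* lives in shared memory */
extern void compute_snapshot(Snapshot *out);   /* the regular slow path,
                                                * scanning the proc array */

/* Get a snapshot, reusing the cached one when possible.  In a real
 * implementation the validity check and the copy-out would happen under
 * the same lock that protects snapshot computation, so a reader can
 * never copy a half-updated cache; that locking is elided here. */
void get_snapshot_cached(Snapshot *out)
{
    if (atomic_load_explicit(&snap_cache->valid, memory_order_acquire))
    {
        *out = snap_cache->snap;                /* fast path: reuse */
        return;
    }

    compute_snapshot(out);                      /* slow path */

    /* The proposal lets only one backend publish; for brevity every
     * slow-path backend publishes here. */
    snap_cache->snap = *out;
    atomic_store_explicit(&snap_cache->valid, true, memory_order_release);
}

/* Called when a transaction clears its XID: the set of running
 * transactions changed, so the cached snapshot is stale. */
void invalidate_snapshot_cache(void)
{
    atomic_store_explicit(&snap_cache->valid, false, memory_order_release);
}
```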
So for a read-only workload, or a workload with 80 or 90 percent reads, I think we are going to see a huge improvement in this case. This is the experiment with the snapshot cache. As you have seen, in this case there was no bottleneck on any lock, but there was a performance problem with GetSnapshotData. We have seen that it still scales up to 128 clients, but there is a huge gap in performance between head and the patch. So with the patch it can scale up to 128, the same as head, but we observed a 40 percent gain at higher client counts, and perf also shows that GetSnapshotData is completely gone; it is no longer at the top as we saw with head. Now I will hand over to Amit for the read-write part.
So now I will talk about the read-write workload. Similar to the read-only workloads, we have done some performance evaluation to see where we stand as of now. For the read-write workload we evaluated two modes, synchronous commit on and off. For synchronous commit on we see that the scaling is decent enough and there is no performance fall up to 128 clients. Whereas with synchronous commit off we can see a sharp fall after 64 clients, and not much performance increase after 32 clients. From these graphs it is also clear that when synchronous commit is on, the performance does not scale as well as with synchronous commit off, which is due to the WAL writing done by the backends. So we ran some wait event tests to see the bottlenecks for both modes.
I think even though quite some work has been done to reduce contention on the ProcArrayLock, we can still see some contention on the proc array group update lock. (No, no, this one fits in shared buffers.) After that, for synchronous commit on the main contention is on the WALWriteLock, and for synchronous commit off the main contention is on the CLogControlLock, because with synchronous commit off the WALWriteLock contention is basically not there. For the proc array group update, quite some work has already been done; as further progress, I think the idea of a CSN-based snapshot could reduce the contention on it further, or maybe caching the snapshot, which Dilip has covered, could mitigate some more of the ProcArrayLock contention. But for the rest of the presentation we are going to focus on the CLogControlLock and the WALWriteLock, and on what we have done to try to mitigate them.
So, the WALWriteLock. This lock is acquired to write and flush the WAL buffer data to disk, and this operation is performed during commit, or while writing dirty buffers if the WAL has not been flushed yet; the WAL is also written and flushed by the WAL writer. These are the three main places where we write the WAL. Generally we observe contention on this lock during the read-write workload. We did some basic experiments to see how much performance overhead this lock actually causes: we tried commenting out the code for the WAL write and flush to see how much performance improves, and we could see the performance increase from around 27000 TPS to 45000 TPS. Similarly we tried with fsync off, and there also the performance improves drastically. This is just to show that there is a very large performance overhead for writing the WAL.
Some of the experiments we have tried, and also shared with the community, to reduce the WALWriteLock contention: first, we tried to split the WALWriteLock so that the writes to the OS buffers are done under one lock, the WALWriteLock, and the flush part, which is the costly part, is done under a separate lock called the WALFlushLock. What this kind of optimization attempts is to allow simultaneous OS writes while an fsync is in progress, because we consider fsync a costly operation; while it is happening, we try to accumulate more and more OS writes. For the WALFlushLock we tried to use LWLockAcquireOrWait semantics so that we can combine the flush calls as well. The basic approach is the same as we have tried with the other grouping techniques.
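Here is a standalone sketch of that split, for illustration only: the names and helper functions are invented, and pthread mutexes stand in for LWLocks, so the "acquire or wait" behaviour is approximated by acquiring the flush lock and then re-checking whether the needed flush already happened.

```c
/*
 * Sketch of the "split WALWriteLock" experiment: writing WAL into the
 * OS buffers and fsyncing it are protected by two different locks, so
 * new writes can proceed while an fsync is in progress, and a backend
 * that waited on the flush lock skips its fsync if a concurrent flush
 * already covered its target.  Illustration only.
 */
#include <pthread.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static pthread_mutex_t wal_write_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t wal_flush_lock = PTHREAD_MUTEX_INITIALIZER;

static XLogRecPtr written_upto;   /* highest LSN written to the OS   */
static XLogRecPtr flushed_upto;   /* highest LSN known to be durable */

extern void write_wal_to_os(XLogRecPtr upto);   /* hypothetical helpers */
extern void fsync_wal(void);

void flush_wal_upto(XLogRecPtr target)
{
    /* Step 1: push the WAL into the OS buffers under the write lock.
     * This part no longer has to wait for a concurrent fsync. */
    pthread_mutex_lock(&wal_write_lock);
    if (written_upto < target)
    {
        write_wal_to_os(target);
        written_upto = target;
    }
    pthread_mutex_unlock(&wal_write_lock);

    /* Step 2: make it durable under the separate flush lock.  While we
     * waited for this lock another backend may have fsynced past our
     * target, in which case we are done without another fsync. */
    pthread_mutex_lock(&wal_flush_lock);
    if (flushed_upto < target)
    {
        XLogRecPtr flush_to;

        pthread_mutex_lock(&wal_write_lock);
        flush_to = written_upto;      /* flush everything written so far */
        pthread_mutex_unlock(&wal_write_lock);

        fsync_wal();
        flushed_upto = flush_to;
    }
    pthread_mutex_unlock(&wal_flush_lock);
}
```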
That was the first approach we tried to reduce the contention. The second is to group-flush the WAL, which means that only one process is allowed to flush the WAL, acting as a representative for the other writers. In this method, each process advertises its write location and adds itself to a pending-flush list. The first backend that sees the list as empty becomes the leader; it scans the whole list, finds the highest write position up to which writes have been done, acquires the lock, and flushes the WAL. After the flush it wakes up all the other writer processes so that they can proceed.
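The sketch below shows one way that leader election and batching can look with a lock-free pending list, again as an illustration under invented names rather than the submitted patch; waiting is reduced to polling a per-request flag, whereas real code would sleep on a semaphore or latch.

```c
/*
 * Sketch of the group WAL flush idea: each backend pushes its request
 * onto a pending list; the backend that pushed onto an empty list
 * becomes the leader, flushes up to the highest advertised LSN, and
 * wakes the others.  Illustration only.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct FlushRequest {
    XLogRecPtr            write_lsn;   /* advertised write location */
    _Atomic bool          done;
    struct FlushRequest  *next;
} FlushRequest;

static _Atomic(FlushRequest *) pending_head;

extern void flush_wal_upto(XLogRecPtr target);   /* e.g. the routine sketched above */

void group_flush(FlushRequest *req)
{
    FlushRequest *head;

    /* Advertise our request by pushing it onto the pending list. */
    atomic_store_explicit(&req->done, false, memory_order_relaxed);
    head = atomic_load_explicit(&pending_head, memory_order_acquire);
    do {
        req->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&pending_head, &head, req,
                                                    memory_order_acq_rel,
                                                    memory_order_acquire));

    if (head != NULL)
    {
        /* Someone else is (or will become) the leader: wait for our flush. */
        while (!atomic_load_explicit(&req->done, memory_order_acquire))
            ;                           /* real code would sleep on a latch */
        return;
    }

    /* We saw an empty list, so we are the leader: detach the whole list,
     * find the highest advertised write position, and flush once for all. */
    FlushRequest *list =
        atomic_exchange_explicit(&pending_head, NULL, memory_order_acq_rel);
    XLogRecPtr    max_lsn = 0;

    for (FlushRequest *r = list; r != NULL; r = r->next)
        if (r->write_lsn > max_lsn)
            max_lsn = r->write_lsn;

    flush_wal_upto(max_lsn);

    /* Wake every member of the group; save next before publishing done,
     * since the owner may reuse its request immediately afterwards. */
    for (FlushRequest *r = list; r != NULL; )
    {
        FlushRequest *next = r->next;
        atomic_store_explicit(&r->done, true, memory_order_release);
        r = next;
    }
}
```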
So after both these patches we took some performance data at scale factor 1000 with synchronous commit off on an Intel 2-socket machine. Here we don't see a very high performance improvement, but at higher client counts we could see up to 15 to 20 percent improvement. These tests were done with the normal pgbench workload, where each WAL record, each commit record, is very small, and these ideas can probably help only when there are a lot of OS writes to combine. So some of the pending tests are with larger WAL records; we might want to try some more tests like that to see how the patches impact the performance.
The next lock is the CLogControlLock. As we have seen, this lock is very heavily contended; we see the contention mostly at 64 or more clients. This lock is acquired in exclusive mode to update the transaction status in the clog, and it is also acquired in exclusive mode to load a clog page into memory. We acquire it in shared mode to read the transaction status from the clog. Sorry, actually I have progressed quite fast; if you have any questions on the previous slides please feel free to ask.
What is the default lock mechanism we have in Postgres, how has it changed across versions, and who decides which lock to use? So, if I understand, your question is how the locking changes from version to version in Postgres. Actually, all this work has already been proposed in previous versions, but it didn't get committed to Postgres because of some of the problems. For example, for the C hash patch Dilip has shown that we see an 8 to 10 percent regression. The basic point is that even if we can see a performance or scalability improvement at higher client counts, we don't want a 10 percent regression at lower client counts. Similarly for the WALWriteLock work: even though we could see a large reduction in contention, the performance improvement is only around 20 percent, and that too at higher client counts. Maybe some more tests, as I mentioned in the last point, could reveal a higher gain, and that way the community could be convinced that this locking could be changed. But yes, the main criterion is that only if we see bigger improvements without any regression do we decide to go ahead with changing the locking strategy. So if it improves performance for all kinds of workloads, or at least the most commonly used workloads, then we change the whole mechanism directly; it is not controlled by a configurable parameter.
Actually, this is done with a script given by Robert. Each wait is recorded when it is released, so you can see what the average wait time is and things like that. Oh, I see, thanks for the pointers on that. So, moving further on the CLogControlLock.
Basically, as Takashi pointed out in yesterday's presentation, this lock gets contended in multiple ways. One type of contention is when multiple processes try to update the clog status at the same time. Another is when some processes try to update the clog status while other processes try to read the transaction status from the clog. For the read-write workload, both of these contentions together play a big role and lead to heavy contention on the CLogControlLock. We have tried various approaches to mitigate this contention, and we have seen that a grouping technique similar to the proc array group update gives visible benefits without any regression. The basic idea is that one backend becomes the group leader, and that process is responsible for updating the transaction status for the rest of the group members, as sketched below. This reduces contention on the lock significantly.
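Leader election works the same way as the pending-list push in the group WAL flush sketch earlier; this compact sketch, again with invented names rather than the actual patch, focuses on the leader applying every member's status update under a single exclusive acquisition.

```c
/*
 * Compact sketch of the group clog update idea.  The leader takes the
 * CLogControlLock once in exclusive mode and applies the status update
 * for every member of its group.  Illustration only.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct ClogUpdateRequest {
    TransactionId              xid;
    int                        status;      /* e.g. committed or aborted */
    _Atomic bool               done;
    struct ClogUpdateRequest  *next;
} ClogUpdateRequest;

extern ClogUpdateRequest *detach_pending_list(void);  /* as in the WAL sketch */
extern void clog_control_lock_exclusive(void);        /* stand-ins for the    */
extern void clog_control_lock_release(void);          /* real lock calls      */
extern void clog_set_status(TransactionId xid, int status);

/* Called by the backend that won the leader election. */
void clog_group_update_leader(void)
{
    ClogUpdateRequest *list = detach_pending_list();

    clog_control_lock_exclusive();
    for (ClogUpdateRequest *r = list; r != NULL; r = r->next)
        clog_set_status(r->xid, r->status);   /* one lock, many updates */
    clog_control_lock_release();

    /* Wake the group members; save next before publishing done. */
    for (ClogUpdateRequest *r = list; r != NULL; )
    {
        ClogUpdateRequest *next = r->next;
        atomic_store_explicit(&r->done, true, memory_order_release);
        r = next;
    }
}
```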
So after this patch on head, the performance improves quite significantly. This workload is again the one for which we have shown the wait events: scale factor 300 and shared buffers of 8GB with synchronous commit off, so basically the data fits in shared buffers. After this patch we again looked at the wait event records, and you can see that on head there is huge CLogControlLock contention for this workload, whereas after the patch...
The problem with that is, I think you probably haven't added a wait event for the new wait, for waiting for the clog group update to happen, because when I did that and measured it, yes, the CLogControlLock went down, but this new wait event for waiting on the clog group update was very, very high, and it doesn't show up on your list at all. So I bet you didn't have it in that test. Yeah, I think we didn't have that in the test, but it can't be more than the existing CLogControlLock contention, because of the performance. But it wasn't that much less either; most of the CLogControlLock waits just shifted over to the clog group update. And the other thing that you had is... Yeah, that's true, but I think even if that is in some considerable range, for the readers who take the CLogControlLock, I mean, it's obvious there is some improvement there, because the contention has shifted over to the ProcArrayLock; you see proc array group update going way up. Something else has to be contended less in order for that to be more. But I'm just saying there is a big piece of the picture that is missing from your graph. Yeah, I think we have not added the wait event for that part; for that patch it should be added. But apart from the wait events, for raw performance we could see that there is approximately a 50 percent or so gain at 64 and higher client counts. And the wait events also show that there is some reduction in the CLogControlLock contention. So that's it for both the read-only and read-write workloads.
The overall conclusion on these locks, the current bottlenecks we see, is that for read-only workloads the C hash patch gives up to roughly a 150 percent performance improvement at higher client counts, and snapshot caching shows up to a 40 percent gain. For the read-write tests we could see three major contenders: the proc array group update lock, the WALWriteLock, and the CLogControlLock. Group flushing, along with separating the WAL flush out into its own lock, shows up to a 20 percent performance increase, and the clog group update shows approximately up to a 50 percent gain with synchronous commit off.
So we should commit all these patches right now? Yeah, I think during the presentation we have shared that some problems are there and we should try to resolve those. No, XidGenLock I think is much lower; this transaction ID one, I think, is that lock. The transaction ID lock is actually not a problem: the transaction ID lock indicates two pgbench sessions simultaneously updating the same row at the same time, and if you want to make that go down you just increase the scale factor so that you're less likely to pick the same branch number twice at the same time. That and ClientRead aren't really problems, right? It's the other things on that list. I think XidGenLock is surely there in the list, but here we have shown only the four or five top locks; I guess where ProcArrayLock has, you know, 150, the others have 90 or something like that. The last but one slide. This one?
Yeah, you said it shows more than a 40 percent gain when the data fits into shared buffers. How does that number change when the data doesn't fit? Actually it will reduce, because the contention goes somewhere else as well. So if you did C hash first, then the snapshot caching would show more benefit on the workload that doesn't fit into shared buffers. Because, sort of, once your throughput starts to be heavily limited by a single lock, right? Then anything that doesn't become a big issue until someplace higher, you don't even notice. You remove the lower limit and then you pop up to wherever the next thing you hit is. And I think one of the things that's really challenging about the write workloads is that for a long time we've had ProcArrayLock, which is now partly ProcArrayLock and proc array group update, and we've had WALWriteLock and we've had CLogControlLock. So we've got those three locks and also WALInsertLock. With those four locks around writing, the contention becomes severe very close to the same place for all of them. So, like, Heikki did the work on WALInsertLock that reduced the contention on that to some degree, so then that one was at a higher point and the others were at a lower point; but then when you fix the others and lift them up a little bit, well, then you run right back into WALInsertLock. So when I did the initial work on the fast-path locking stuff in 9.1, the improvement turned out to be very large because there was only one lock, maybe two, that was a significant problem on that workload. But on the write workloads you've got, like, four different locks that all run into trouble right around the same place. So until you've addressed them all, you don't see the throughput.
The tests that represent a more or less typical workload: the first is just to run pgbench, and the second is to create lots of inherited tables or a partitioned table, say with 10,000 partitions, and try to read from it. My question is, maybe you know some other tests, some other examples, that represent something users typically have problems with? Actually, if I understand the question correctly, it is what other typical benchmarks can show these performance bottlenecks. We have not shown them here, but the other two benchmarks we run are sysbench and HammerDB, which are OLTP-specific and show the contention on these kinds of locks, and for read-only or data-warehouse kinds of benchmarks we use the TPC-H benchmark. These are the primary benchmarks we use for the performance work.
Okay, I think if there are no more questions, thanks. Thanks. This patch is with you only, right? For the WAL patch. No, I think mostly we have done it with pgbench. COPY could be the main case, but you are saying it has to be a parallel COPY. That's what I said for one test that we are missing, where the WAL data could be bigger. Okay. Yeah, I have done tests like that a few times, where you set up pgbench with a custom script and the custom script is a COPY or a CREATE TABLE AS SELECT or something of that kind. We also want to make the contention very heavy on the WALWriteLock; with these short transactions everybody comes to the WALWriteLock quickly, but if we just try with an INSERT INTO ... SELECT, there we have to be quite cautious that it is 5 or 10 records, not a very high number of records. You want a high number of records. But if you have a high number of records, then the number of processes reaching the WALWriteLock together will reduce, and we want both things to happen: we want contention, with backends reaching the lock quickly, and we want larger records. Why do we want contention? Why don't we want good performance? Because otherwise this patch won't show a benefit. We are also not just theorizing: if you try to fetch more records, or copy, let's say, 100 records, then when all the sessions are trying to copy 100 records they have to do a lot of other work, so probably...