PostgreSQL at low level: stay curious!
Formal Metadata
Title: PostgreSQL at low level: stay curious!
Series: All Systems Go! 2019 (13 / 44)
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/46128 (DOI)
Transcript: English (auto-generated)
00:05
So, thank you everyone. Thanks for coming, and let's start right away, because unfortunately we don't have that much time. Today we're going to talk about quite an interesting topic: why it is important to be aware of, and pay attention to, some interesting low-level details when you're working with, not necessarily
00:25
Postgres, but pretty much every database in general. First of all, a little bit of background. My name is Dimitri, I work for Zalando, for a couple of years already. I've been with the Postgres community even longer, and from my contributions you have probably used the JSONB functions or, more recently, things like pluggable storage support in psql and pg_dump, and so on.
00:49
Yeah, and at Zalando we're doing something like that. Unfortunately, due to various reasons, we have to run Postgres in a lot of different environments: legacy data centers, pure cloud on AWS,
01:05
and of course inside Docker and inside Kubernetes; a lot of different environments. We're doing this, I must say, successfully, thanks to two open source projects that you can check out on GitHub. They're really popular: Patroni, for example, already has about two and a half thousand stars on GitHub and is used by a lot of different companies, and
01:24
the Postgres Operator, our own Kubernetes operator, which this year was accepted into and, as far as I remember, successfully completed Google Summer of Code. So yeah, we're doing really interesting stuff, and this situation was actually the basis for why I decided to prepare this talk:
01:43
the situation that we have to run Postgres in different environments. It's not exactly a challenge every time, but quite frequently we see interesting problems or issues that happen at the interaction between Postgres and something else. So let me show this in a little more detail.
02:01
I'm not sure how many of you are actually using Postgres. Can you raise your hand if you're using Postgres? Cool. And who is just interested, curious? Okay, cool. So normally, when you have a database, Postgres or pretty much any other database too, and you want to know what happens inside, the database of course provides you with a lot of different views and information. For Postgres
02:24
these are usually called pg_stat_something: pg_stat_activity, pg_stat_statements and so on. And usually, in, I don't know, 99% of cases, well, maybe not 99, somewhat less, this information is enough to figure out what's going on. But this information has one drawback: it basically displays either the state of your data or the intention of your database.
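As a minimal sketch of what that first stop usually looks like (the column names assume a reasonably recent Postgres; pg_stat_statements additionally requires the extension to be installed):

    # currently running backends and what they are waiting on
    psql -c "SELECT pid, state, wait_event_type, wait_event, query FROM pg_stat_activity WHERE state <> 'idle';"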
02:45
Obviously this information cannot tell you more than that; it cannot provide you with anything from outside of Postgres itself. And now we suddenly realize: aha, but we don't run Postgres in a vacuum. We run Postgres on top of some operating system, and then people start thinking: aha,
03:02
we should monitor something here too, because this is the boundary where Postgres interacts with the operating system, so things could go wrong there. Okay, usually people say: let's monitor something global, like CPU utilization or I/O utilization. But then suddenly people realize: we're not only inside some particular operating system,
03:22
we're also inside some Docker container, inside a cgroup. And already at this point people start to scratch their heads and think: what should we monitor in this case? Because obviously it's another level of complexity, and it introduces other interesting tricks and situations, and they're lost about what to do here. They start to read some strange blog posts and outdated documents, and of course sometimes they make wrong
03:45
decisions. And then they suddenly realize that things are even worse: they're running this container on top of some virtual machine at a cloud provider, and they're totally confused. And then, on top of all of that, this virtual machine is part of a Kubernetes cluster, just one of the nodes, and those people are completely lost.
04:02
They have no idea what to monitor, and instead of just one nice Postgres you have a lot of different layers. Usually at this point people say: okay, we're a serious business, we don't have time to think about this low-level stuff, let's just reboot our servers, restart our database and hope that everything goes away. Of course, usually it doesn't happen like that, and this
04:25
problem, well, in the near future, most of the time, this error, this issue, appears once again. And I don't agree with this approach. First of all because it's not really a fix, it's just a mitigation of the symptoms; the problem will appear again and again and again.
04:42
But second, people who go this way are losing really important knowledge about how their system works, and without this knowledge it can be hard to reason about performance and about your system in general. So, a little bit of agenda. Unfortunately, or if you want, fortunately, there is no single plan for these slides;
05:05
it's basically a collection of different use cases that I found interesting or useful, where the issue could not be troubleshooted from within Postgres itself and you have to apply a different approach: step back a little bit, think outside of the box, and apply different techniques
05:23
and different information sources. So what are we talking about? And if you desperately need some plan, here is approximately the plan for these slides. First of all, and this is quite often underestimated, there is the source code of the product you're using. I really,
05:45
really encourage you to read the source code, at least for PostgreSQL, because it's an amazing code base, it's amazingly documented, and sometimes you can get even more information from the PostgreSQL source code itself than from the documentation. We did that a few times. Personally, for example, when there was some
06:03
behaviour difference between versions, I just tracked the change down in the source code, back through Git, and only afterwards found out that there was some documentation for it. For Linux, for the Linux kernel, it's a bit of a different beast, but still, all the new stuff, like the Kyber I/O scheduler or cgroup v2, is also decently documented and decently written; it's nice to read that too.
06:24
The second section is about tools that administrators usually know about, like strace, maybe not so much gdb, but nevertheless, and perf. Another source is the virtual file systems that the Linux kernel provides us: procfs, sysfs and some others. And at the end we're going to devote a
06:44
significant amount of time to BPF, extended BPF, and BCC. So, the first example. Let's imagine you have a PostgreSQL of a relatively recent version, 11 or 12 or so, you run some analytical query, and suddenly, instead of a result,
07:02
you get this error: could not resize shared memory segment, and so on and so forth. Usually this error leads to a bit of a panic for people who don't know what to do, but in fact we can troubleshoot it in a relatively easy, straightforward way. This is exactly the right place for strace. For those of you who don't know it, I hope everyone does, but just to remind you: strace is basically a tool that shows you
07:24
all the system calls your application is doing. Not everyone is aware that modern versions of strace have a nice -k key that also shows the full stack trace from the application, if available, for each system call. So how do we troubleshoot this problem?
07:41
We have our Postgres, we attach to the backend and start tracing, and we see: aha, we open this shared memory segment, we try to allocate something, and we fail. And where did we reach this system call from? From ExecInitParallelPlan. So obviously Postgres is trying to do something in parallel, and then everything falls into place.
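A rough sketch of what that looks like; the backend PID below is just a placeholder, and the exact shared memory path will differ:

    # attach to a Postgres backend and print a userspace stack trace for each syscall
    strace -k -p 12345
    # in the failing case the allocation shows up as a syscall on something like
    # /dev/shm/PostgreSQL.<id>, and the -k stack points back at ExecInitParallelPlan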
08:03
In relatively modern versions of Postgres, parallel workers finally started to work properly, but every parallel worker requires a separate shared memory segment. And then there is another catch that we see quite frequently: unfortunately Docker by default limits /dev/shm to 64 megabytes, and of course for huge analytical queries that is sometimes not enough. Here we are.
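The straightforward fix is to give the container a bigger /dev/shm; a minimal sketch, where the size and the image are only examples:

    # raise /dev/shm above the 64 MB Docker default
    docker run --shm-size=1g postgres:12
    # in Kubernetes a common equivalent is an emptyDir volume with medium: Memory mounted at /dev/shm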
08:23
You may say that this is kind of a hack, a cheat, because obviously you cannot really analyze an error from within the application itself while the error is happening. So here is another example where strace can be pretty useful, and this one is more performance-related. There is an interesting feature called
08:42
vDSO, virtual dynamic shared objects. It's basically a feature that allows us to perform some particular system calls, most of the time time-related ones like gettimeofday or clock_gettime, without switching to kernel space, which is nice because then we don't pay for the switch, and of course it gives us some performance boost.
09:02
But the problem, and not everyone is aware of this, is that not all hypervisors support this feature; most notoriously the Xen hypervisor. And if you're working with database infrastructure in the cloud, you know that instances come in different generations: the m5 generation is KVM-based, where this feature is supported, while an older generation, which is still in use,
09:22
does not support it because it uses Xen. So let's imagine this situation: you have Postgres on different instance types and you want to figure out how much of a performance hit you get from this. For example, you have two different nodes, you see that the database performs a little slower on one than on the other, and the only difference is the instance type. Again, it's super straightforward: we
09:43
attach strace, and whenever we see real system calls being made for the time lookups, that means we are really switching to kernel space and paying this performance hit.
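A quick way to see this is to count syscalls for a short while; on a host where the vDSO fast path works, clock_gettime and gettimeofday should barely show up. The PID is a placeholder:

    # summarize syscalls made by one backend for about 10 seconds
    timeout 10 strace -c -p 12345
    # a flood of clock_gettime/gettimeofday lines in the summary means the vDSO is not being used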
10:00
Okay. Normally, I have to admit I probably give this presentation too often, at this point I talk about scheduling and CPU migration, but that example is pretty artificial, especially when we have something more interesting and real to talk about: the vulnerability called Microarchitectural Data Sampling, if I remember correctly. It appeared this year, a few months ago already.
10:22
It's one of those CPU-related hardware vulnerabilities, similar to Spectre from last year. And of course there is a mitigation for pretty much every decent operating system, and obviously this mitigation involves some performance hit. So now let's imagine the situation: you have your PostgreSQL running on top of some hypervisor,
10:41
and then this hypervisor gets patched, and you want to figure out how much performance degradation you will get. Normally it's pretty hard to answer that from Postgres itself, and that's why we deploy something more powerful: perf. And when we start with just simple CPU sampling, we see that suddenly one
11:01
function from the kernel, do_syscall_64, takes much more time than in previous snapshots. If we zoom in, we find something really interesting: most of it is spent on the verw instruction. And everything falls into place after we read what exactly this mitigation in the Linux kernel was about: this instruction was effectively repurposed. Before, it was doing almost nothing,
11:25
but now it also flushes all the CPU buffers, and that's why we see almost 30% of the time being spent in this function. That's basically where all of the performance hit is coming from. But here is an interesting thing: sometimes you can notice this in a different way.
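A minimal sketch of that kind of system-wide CPU sampling, plus the kernel's own report of whether the MDS mitigation is active; the paths and options are standard on recent kernels:

    # sample all CPUs for 30 seconds with call graphs, then inspect where kernel time goes
    perf record -F 99 -a -g -- sleep 30
    perf report
    # check whether the MDS mitigation is enabled on this kernel
    cat /sys/devices/system/cpu/vulnerabilities/mds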
11:42
For example, we were running Ubuntu on top of a patched hypervisor, but the guest itself was not patched yet, and we saw the very same situation via a relatively high share of sampled time spent in native_safe_halt, which usually just means that your system is idle. But it wasn't idle, which was super strange. We spent a few days trying to figure out what was wrong, and it turned out that indeed
12:02
the mitigation had also been inserted into this native_safe_halt function. And that is basically how we indirectly figured out, on the very day this vulnerability was disclosed, that something was happening. Another really nice example of why perf can be useful outside of Postgres, and in fact not only for Postgres but for many other
12:23
applications, is something called the lock holder preemption problem. This problem is so severe that even CPU vendors provide one solution or another for it; for Intel, for example, it's called Pause-Loop Exiting. Let me show a nice diagram to explain
12:41
how it works. Let's imagine we have a hypervisor, and this hypervisor has four virtual CPUs: two of them are active and two of them are preempted. Now let's imagine we're running Postgres on top of it, and something happens: for example, virtual CPU one, with a backend on it, is doing some work, some SELECT or whatever,
13:01
and then it happens that vCPU two is waiting for vCPU one, on a lock or something. Normally, of course, that's not a problem, because when we're talking about locks, especially spinlocks, they should be held for a really short amount of time; but that's true on real hardware, on bare metal. Here, what can happen is that the hypervisor says: okay, vCPU one got enough time,
13:21
let's preempt it and give some time to vCPU four, which runs a different Postgres backend doing something else. And now we have this interesting situation: what was supposed to be a really short wait for vCPU two is now an unpredictable amount of time, whatever the hypervisor decides. And what's even more interesting:
13:40
in this situation, what does Pause-Loop Exiting actually do? It tries to prevent such idle spinning loops, but it does so by triggering an exit from the guest to the hypervisor, which is also a kind of switch from user space to kernel space, which is also a performance hit. Now, with this in mind, we want to figure out how much it affects our performance.
14:03
Here are two examples that I performed, in fact on my own machine, but nevertheless you can see that the results are different. What I'm doing here is the very same setup, the very same database, wiped clean from scratch, with the only difference that the database was running inside a KVM virtual machine, and on this KVM machine, in the second case,
14:25
PLE was disabled completely. Then we ran just a pgbench workload, a normal read-write workload, against the database. And we can see that with PLE enabled we got an average latency even higher than without it, which means that in this particular situation
14:41
pause-loop exiting was interrupting genuine waiting in our case and actually making things worse for our performance; it was a negative impact. Of course, it's not always like that, and I'm not just saying that this feature is bad; there are pros and cons to it, so you have to measure it for yourself.
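For reference, on a KVM host the PLE knob is a kvm_intel module parameter; a sketch of checking and disabling it for an experiment (reloading the module affects every guest on the host, so this is only for test machines):

    # current pause-loop-exiting gap; 0 means PLE is disabled
    cat /sys/module/kvm_intel/parameters/ple_gap
    # disable PLE for a test run
    modprobe -r kvm_intel && modprobe kvm_intel ple_gap=0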
15:02
Here we have to make a little bit of a detour: for the next sections I have to explain some basics of how Postgres works, and not only Postgres, in fact pretty much any storage-based database. Normally we have some processes running: backends that are doing the actual work, and then background
15:23
writers and, for example, the checkpointer. Then we have some memory, organized in pages, in the middle; on the side we have the transaction log writer and the write-ahead log; and then we have the operating system cache and the storage. The point is that Postgres does all its writes, all its I/O, buffered, which means we rely a lot on the Linux kernel itself. So what happens when we work with this database?
15:45
Let's imagine we decided to update something, and let's say our cache was warm; we update a few pages. First of all, we have to write the write-ahead log; that's how all of this works.
16:04
Now kick a background writer kicks in background writer tries with some particular configuration Synchronized from time to time those dirty buffers with operation system cache. So not the storage itself just with the purchase in cash Then from time to time kicks in the component
16:21
Well, basically no external which is not exactly a part of our postgres, which is not exactly under our control So what does it do it from time to time tries to also synchronize operation system cache with the real storage Depends also on some configuration dirty background right here to radio and so on and so forth And eventually when we're doing checkpoint, we synchronized everything to the storage eventually to preserve the data, of course
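Those kernel-side knobs are ordinary sysctls; a quick way to see the current writeback behaviour (the names are the standard Linux ones):

    # when background writeback starts, and when processes dirtying pages get throttled
    sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs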
16:43
So what does this mean? It means that even in this simple schema we rely significantly on several parts of the kernel: for example on kernel memory management, on buffered I/O, and in general a lot on kernel internals. And one more, I think the last, but really nice, example of how to use perf with Postgres
17:01
is checking how much performance you can get from huge pages. Somehow it happens, I'm not sure if it's true of this audience, that people running databases are not always aware of what huge pages are for and how they work, and there is some air of mystery around them. And to be clear, here we're not talking about transparent huge pages, only about classic huge pages.
17:26
To figure this out we can use the very same scheme as before. First of all, read the documentation: the Linux documentation says that huge pages are good because they make translation lookaside buffer (TLB) misses both cheaper and less frequent.
17:41
So we have this information; it's a theory we have to prove or disprove for our own purposes. So here, again, we just do an experiment; everything is basically about experimenting with your own setup, because that's what matters, and no one can answer this for you in advance. So here again is a simple example, a simple database on bare metal, with only one difference.
18:03
The first run is using huge pages, the second is not, and, since we're smart now, we record with perf, specifically the dTLB load and store misses. And we see that in the first case, with huge pages, we have about 19%, almost 20%, fewer load misses and almost 30% fewer store misses, which is quite nice.
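A sketch of that measurement; the events are the generic perf dTLB events (their availability depends on the CPU), and the PID is a placeholder for a Postgres backend:

    # count data-TLB load/store misses for one backend while the benchmark runs
    perf stat -e dTLB-loads,dTLB-load-misses,dTLB-stores,dTLB-store-misses -p 12345 -- sleep 60
    # the huge pages side of the experiment: reserve pages on the host and ask Postgres to use them
    #   sysctl vm.nr_hugepages=...         (enough to cover shared_buffers)
    #   postgresql.conf: huge_pages = on   (Postgres refuses to start if they cannot be allocated)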
18:25
And it's nice not only because we checked something, but because we checked exactly one component. Normally when people do some benchmark they try to benchmark the whole pipeline, the whole chain from the client to the storage, and of course there can be other influences in there.
18:45
Here we checked one particular component, we know the effect is there, and from this point we can derive some conclusions about latencies. Okay, so what we were doing before can be described as stateless measurement: we had some events, we attached to those events, something happened,
19:04
we got some information, and we simply forgot about it afterwards. But then extended BPF was introduced. Originally we had BPF, Berkeley Packet Filter, which has been with us since the nineties or so, originally just a bytecode that we could execute within the kernel,
19:24
normally for TCP packet filtering or something like that, so also pretty stateless. But then extended BPF was introduced, thanks to Alexei Starovoitov and others, and now we have totally amazing powers: now we have stateful measurement, so we can respond to events in the kernel or in the application itself.
19:43
We can attach to pretty much any function within the kernel or within an application, which is important and opens up a lot of possibilities; we can use registers, stacks, maps and everything. And to not make this just words, I prepared a demo. Let's see if
20:02
something goes wrong this time. So what do we have here? I hope you can see it, I hope it's big enough. We have several panes here, several windows: in the first we just have postgres running, and psql attached to this Postgres, nothing in particular. Now here, in this window, we have all the BCC tools available.
20:22
First of all I have to tell you that normally, when we work with extended BPF directly, it's pretty complicated; the BPF programs themselves are pretty complicated to write. That's why there are different tools, the most important and most famous of them being BCC, the BPF Compiler Collection, which allows us to write just a few lines of Python code to generate a BPF program.
20:44
So here in this window we have exactly such a program, and what can we do? The simplest thing we can do is, for example, trace Postgres's exec_simple_query. So what we're doing right now is tracing queries, and when we execute something like SELECT 1,
21:06
we see this query happen, so we have some feedback; and when we do it a second time, we see it a second time.
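A sketch of that kind of one-liner with BCC's trace tool; the path to the postgres binary is an assumption about this particular setup:

    # print every query string passed to exec_simple_query
    /usr/share/bcc/tools/trace 'p:/usr/lib/postgresql/11/bin/postgres:exec_simple_query "%s", arg1'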
21:24
This can be, for example, super useful for measuring latencies between queries. Of course Postgres can provide this information for you, but there are different pros and cons: for example, that information comes as part of the log output, subject to certain thresholds, and so on. So sometimes it's really nice to get it exactly in this format. Plus, and I'll explain this later, the point is that we can also process this information within the kernel itself, which is much more performant, and
21:43
we can, for example, filter it based on a lot of different conditions. But what happens in the background when we do this? It's just one of those Python scripts; I can show you, it's basically in /usr/share/bcc/tools, here they are, just Python scripts that do all of this. And that's what I'm going to explain right now: when we run this trace, what happens?
22:06
So the trace tool, sorry, BCC, is just a Python script. It generates some C code, which is then compiled on the fly by the LLVM backend into BPF bytecode.
22:21
Then BCC itself, again via the perf API, creates a performance event, a user probe, attaches this BPF program to that uprobe, and also creates a map to store some information. So here we can see, for example, the BPF program that was created, and here we can see the map created by this program.
22:46
But as we've figured out, the problem, well, not exactly a problem, is that this involves a lot of stuff: it involves code generation on the fly, it involves the Python interpreter, and we have seen situations where this overhead can be too much for us, because, for example,
23:03
a pod in Kubernetes is so overloaded that we cannot even run the Python interpreter itself. But the point is that it's not even necessary: in the end all we need is the BPF program itself. So what can we do for that? We can reproduce by hand the very same
23:22
list of actions that BCC performs. First of all we can create the user probe, say with perf probe on the postgres binary for exec_simple_query.
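A sketch of creating that uprobe manually; the path to the postgres binary is, again, just an example for this setup:

    # create a uprobe on exec_simple_query in the postgres binary
    perf probe -x /usr/lib/postgresql/11/bin/postgres exec_simple_query
    # list the probes that are now defined
    perf probe -l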
23:43
If we now check tracefs, we can see the relevant information there and confirm that those events were created. And then, I already prepared it, I think, here: something called pg_latency.
24:00
So here we basically have an already compiled BPF program, built from C source code by the LLVM backend, with only one catch that I will explain later; but we already have it compiled. So what we can do is literally just run this pg_latency. Now it starts, and we can check again:
24:21
there is a BPF program, there is a BPF map, so everything is pretty much the same, except for some small details like the name and so on. And this program is now waiting for some event: if we go and run a query, we get output; the program is basically just printing this single element in a loop. But what's really nice about this, what's really curious, is the following:
24:46
as I showed you before, we can see the BPF maps, for example under /sys/fs/bpf. This map was created by me, and here is the latency map; this is exactly the map the program created, and normally it's not visible there.
25:01
But with BPF we can pin this map, and that's really convenient, because then we can read the map separately. That means we can keep, say, the most recently used values in the map and update them quite frequently, but only read them when we need them and just discard them otherwise, which means we're not going to overload our servers with this information. And then we can, for example, just execute a separate pg_latency
25:24
reader and get the values, and that's pretty much it, which is nice. Via this pinning of maps we can separate these concerns, and that's quite handy for monitoring purposes.
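A minimal sketch of what pinning and reading back looks like with standard tooling; the pin path is just an example:

    # maps pinned to the BPF filesystem show up as regular paths
    ls /sys/fs/bpf
    # read a pinned map from a completely separate process, for example a monitoring job
    bpftool map dump pinned /sys/fs/bpf/pg_latency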
25:46
But there is one catch, I think I have it here. The problem, not exactly a problem, but an interesting situation, is that to be able to load this BPF program you have to execute the bpf() system call, which requires, as one of its arguments, the exact kernel version you're going to run this program on. And of course if you have different setups, a varied infrastructure with a variety of different Linux kernels, then
26:02
compiling on one machine can be pretty problematic, because you have to specify literally one version there. Originally I tried to use an ELF parser to change this version, but there was a problem: the parser ended up generating a slightly different binary, a different hash and so on. So eventually I just ended up replacing it manually with xxd,
26:24
which works nicely: you just replace this section in your program and it works everywhere. Yeah, it may sound silly, but if it works, it's not silly. That was a small trick.
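Roughly, the round trip looks like this: dump the object to hex, patch the bytes that hold the embedded kernel version, and convert it back. Which bytes to edit is specific to your program, so treat this purely as an illustration of using xxd for the job:

    # dump the compiled BPF object as an editable hex listing
    xxd pg_latency.o > pg_latency.hex
    # ...edit the bytes holding the embedded kernel version for the target kernel...
    # convert the edited listing back into a binary object
    xxd -r pg_latency.hex > pg_latency.o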
26:43
The other thing I want to mention is: be careful when you're running this on Ubuntu, because starting from some point, I have no idea exactly when, Ubuntu reports a misleading kernel version via uname: the last patch level is always shown as zero. I could not understand why my program wasn't running, why it was complaining, and it turns out you have to read the correct version from /proc/version_signature. Just so you know.
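On an affected Ubuntu kernel the difference is easy to see side by side:

    # what uname reports (the patch level may be shown as 0 on Ubuntu kernels)
    uname -r
    # the real upstream version Ubuntu built from
    cat /proc/version_signature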
27:03
Okay, that was the short demo, the short example, and now let's return to the presentation. So, as I said before, we have BCC, which is really nice and really popular, but kind of generic. So for our purposes, for my own purposes, I've created something more PostgreSQL-specific; you can check it out if you're interested. Here are a few examples: we can check last-level cache misses and loads per backend, per query, which is really mind-blowing,
27:25
mind-blowing because normally we measure this kind of thing globally, which is of course not that convenient when, for example, you have to share your database host with other clients with different data access patterns.
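Even without the Postgres-specific tooling you can get a rough per-backend view of the same counters with plain perf; this is per process rather than per query, and the PID is a placeholder:

    # last-level cache loads and misses for a single backend while a query runs
    perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses -p 12345 -- sleep 30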
27:44
This can be really useful, for example, when you're trying to figure out last-level cache access patterns in order to use cache partitioning technologies like Intel Resource Director Technology. Yep. I mentioned before, and we've already seen, that memory management is super important; another component that's super important is buffered I/O.
28:01
And one component involved in buffered I/O is of course writeback. So now, forgetting about BPF for a second, let's imagine we're smart: we know there is a performance event from the Linux kernel, writeback_written, we know it's important, and we start monitoring it with perf. And then we see: aha, from time to time the kernel kicks in and tries to synchronize everything with the storage, which is not exactly nice for a database,
28:24
because it's basically saturating our I/O; for writing the write-ahead log, for example, that's already a problem. For modern devices like NVMe SSDs, which have several queues, it's not that much of a problem, because we still have some capacity left to write.
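That event is an ordinary tracepoint, so watching it is a one-liner; this is system-wide:

    # count kernel writeback completions for 10 seconds
    perf stat -e writeback:writeback_written -a -- sleep 10
    # or record them with timestamps to correlate with latency spikes
    perf record -e writeback:writeback_written -a -- sleep 60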
28:40
But things can get worse. The point is that sometimes the Linux kernel can inject delays, timeouts, into your process when writeback is not keeping up with the amount of dirty pages your application creates. And it's really nasty if you think about it: the kernel injects delays into your business-critical application when you want to return results within
29:03
milliseconds, or even less. And to monitor this, unfortunately, there is no perf event, at least for now; that's why I also created a small script for these timeouts. Here is just an example: a pgbench insert workload with a relatively large amount of memory, and in this particular case
29:22
we got just four situations where the Linux kernel injected these delays, but you can imagine that on a huge server with 120 gigabytes of memory or more it could be much worse. So, probably the last section: it's about Kubernetes. It's still questionable whether it's a good idea or not to run a database in Kubernetes,
29:42
but at least part of the answer is that sometimes it's convenient, and that's why we're doing it. As you may know, in Kubernetes you manage resources via the manifest, where you specify resource requests and resource limits. And as I said before, memory management is super important for us, so let's imagine we run our database inside Kubernetes,
30:01
and we're talking about memory right now. My first reaction when I was going through this was: aha, probably the memory request corresponds to the cgroup soft limit in bytes and the limit corresponds to the hard limit in bytes. That was my first reaction, but of course with Kubernetes you should be prepared for obvious answers not being the right ones, and
30:22
first of all, the memory request has nothing to do with the soft limit at all. In Kubernetes the memory request is basically used only for internal scheduling purposes: to figure out, for example, the quality-of-service class, or to calculate the OOM score adjustment, things like that. Okay, so I figured this out.
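You can see this directly from inside the pod: the hard limit is what resources.limits.memory becomes, while the soft limit file is left alone. The paths below assume cgroup v1, which is what most clusters ran at the time:

    # inside the container, with cgroup v1
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes        # set from resources.limits.memory
    cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes   # not derived from resources.requests.memory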
30:42
From this I had another theory. I thought: okay, then it's cool, it means we don't have a soft limit, which means we won't get memory reclaim, because we're never going over a soft memory limit, so the kernel is not going to try to reclaim our memory. And then, of course, I was wrong, because of memory pressure, especially for containers.
31:02
Containers are usually sized so that they already have an amount of memory relatively close to what your application needs, which means we're already close to the hard memory limit just by default, and that means the memory pressure is quite high by default. And due to this memory pressure
31:20
we do get memory reclaim quite frequently nevertheless, and we have already seen situations where, on badly overloaded pods with databases, we were doing those reclaims quite often, especially when we were on the edge of, for example, an out-of-memory kill. And here's a nice thing: all this stuff, at least most of those scripts,
31:40
you can actually use against a particular Docker container ID. Nowadays, I think starting from kernel 4.18, you can also use a cgroup ID, but it doesn't map directly to a Docker container ID, so I still keep the container ID option. Yep. And the last section.
32:01
Basically, I've said that all of this is super powerful; the problem is that in every different infrastructure it's still somewhat painful to run. For example, if you want to try it on your own laptop only, it's pretty straightforward: you just have to check that debugfs is mounted and that the relevant Linux kernel options are there, but normally they are on more modern distributions. There is no
32:21
magic here. A little bit of magic happens when you want to run this against Docker. First of all, of course, there is the machinery around debug symbols: don't forget to copy them over, or use them from where they are, because they're kind of separated. And then, of course, you have to run the Docker container with extra elevated privileges, because all this
32:41
BPF and tracing stuff requires privileges. But then, as a nice perk, you can create a separate monitoring container and attach to your application from there, without pulling all of this tooling into the original container.
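A sketch of such a separate monitoring container; the image name is a placeholder and the exact mounts depend on the distribution:

    # run BCC tooling in a privileged container that can see the host's processes,
    # kernel headers and debugfs
    docker run -it --rm --privileged \
      --pid=host \
      -v /lib/modules:/lib/modules:ro \
      -v /usr/src:/usr/src:ro \
      -v /sys/kernel/debug:/sys/kernel/debug \
      my-bcc-tools-image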
33:02
Another trick here: Docker uses overlayfs, of course, and overlayfs and the other overlay file systems all had one problem: until kernel 4.17, I believe, they did not support uprobes, so you have to be aware of that too. And then, probably, most of the time I spent was on getting all this BPF stuff running on Kubernetes. Of course we need some elevated privileges there as well, for the service account,
33:25
but the thing I spent the most time on was figuring out how to deal with different Linux kernel versions. Fortunately, for BPF and BCC there is a variable called BCC_LINUX_VERSION_CODE, which you can override at runtime; but only do that when you know the kernel you built against is close enough to the one you're actually going to run on.
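The override is an environment variable that BCC reads; a sketch of setting it for a 4.15.18 target kernel, assuming your BCC build honours the variable (the version is only an example):

    # LINUX_VERSION_CODE is (major << 16) + (minor << 8) + patch
    export BCC_LINUX_VERSION_CODE=$(( (4 << 16) + (15 << 8) + 18 ))
    # any BCC tool started from this shell will now use the overridden version
    /usr/share/bcc/tools/trace 'p:c:malloc'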
33:45
Otherwise there can be side effects, and that's of course not nice. And the last section, really quickly, is about how to break things, because all this stuff is really powerful, and with all that power comes, of course, great responsibility. In this particular example I was using a rather outdated perf,
34:02
so it doesn't really matter now, but there was a really interesting situation: this outdated perf somehow could not handle the case when I was trying to extract some arguments, in this case some information from a trigger, and that information was NULL. I was just trying to do some sampling, and I really did crash a production backend, which is of course not that nice.
34:24
Another example: unfortunately, software is not without bugs. This was also on a somewhat dated perf version. I was trying to figure out how many full-page writes we were doing from Postgres, so I tried to create a probe for this, and when I executed it, perf got stuck while trying to create the uprobe, in an uninterruptible
34:45
sleep in kernel mode, which means nothing could stop it, nothing could kill it. And not only that: it basically meant that all of Docker got stuck, all the machinery got stuck, pretty much everything was stuck, and the only thing we could do was restart the whole node. In this case it was fortunately a replica, so not a big deal, but still.
35:02
And the last part: these are already outdated slides, but they still show how powerful and how scary this stuff can be. They're from 2018, on Linux kernel 4.4, which is quite ancient at this point, but nevertheless: they show that with a few lines of Python you could get a kernel panic. We're not used to the idea that Python can give you a kernel panic, right? So it's kind of scary.
35:24
Of course people are working on these bugs, but still, if you want to use this in production you have to be aware, you have to be careful, and you have to check things multiple times. So yeah, that's pretty much it. I hope you have a lot of questions.
35:45
Yep, any questions? Come on, people. No, I don't believe there are no questions; you're just too shy to ask. But please be aware that there are no stupid questions. Well, there are stupid questions, so I'm asking at my own risk here.
36:05
You showed, mostly at the very beginning, a lot of benchmarks and things like that regarding changes between kernel versions, different patches for CPU problems and so on. Do you have something that continuously runs in your systems and benchmarks this stuff as kernel updates are released, as new
36:24
bug fixes are released, to get some kind of overview of what is good for you and what is not, so that maybe you want to hold off on deploying some kernel or some bug fix? Or do you just do it when you get a call from someone saying that the database is not working the way it used to, and then you
36:43
retrospectively try to benchmark what may have changed? Well, unfortunately we're not doing this continuously; we're doing it ad hoc. Sometimes we dig in on our own, because we want to know, for example, that there's a new kernel version, or because our colleagues produced a new distribution; and sometimes we do it, of course, when people complain that something is wrong.
37:02
But continuous performance benchmarking is, unfortunately, a totally different level of complexity. We're kind of working on it right now: I've spent a few months already and have successfully prepared a Kubernetes setup for benchmarking; it's more or less continuous, it writes all the results, with all the plots, into S3 and so on,
37:22
but it's at the very beginning right now, unfortunately. Yep, any other questions? Okay then. Thank you. Thank you.