Inside PostgreSQL Shared Memory

Video in TIB AV-Portal: Inside PostgreSQL Shared Memory

Formal Metadata

Inside PostgreSQL Shared Memory
Title of Series
Number of Parts
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date
Production Place
Ottawa, Canada

Content Metadata

Subject Area
This talk is for people who want to understand how PostgreSQL shares information among processes using shared memory. The talk covers the internal data page format, usage of the shared buffers, locking methods, and various other shared memory data structures.
Presentation of a group Enterprise architecture Read-only memory Shared memory Database Data structure
Group action Presentation of a group Multiplication sign Decision theory Execution unit File format Database Function (mathematics) Mereology Disk read-and-write head Bit rate Semiconductor memory Buffer solution Shared memory Core dump Data storage device Endliche Modelltheorie Social class Presentation of a group Enterprise architecture Computer font File format Software developer Computer file Computer simulation Bit Process (computing) Telecommunication Order (biology) Website Pattern language Quicksort Data structure Cue sports Point (geometry) Slide rule Server (computing) Enterprise architecture Service (economics) Power (physics) Wave packet Number Blackboard system Computer architecture Read-only memory Data structure Mathematical optimization Dialect Interactive television System call Complex system Marginal distribution
Building Multiplication sign System administrator Source code Open set Replication (computing) Mereology Pointer (computer programming) File system Physical system Block (periodic table) Software developer Database transaction Flow separation Laser Data storage device Drill commands Configuration space Right angle Quicksort Spacetime Point (geometry) Web page Trail Backup Server (computing) Computer file Link (knot theory) Virtual machine Gene cluster Student's t-test Raw image format Power (physics) Number Revision control Mixture model Computer hardware Installable File System Multiplication Focus (optics) Database Directory service Subject indexing Kernel (computing) Integrated development environment Commitment scheme Personal digital assistant Universe (mathematics) Table (information)
Presentation of a group Presentation of a group Enterprise architecture Database Database Directory service Number Integrated development environment Read-only memory Shared memory Right angle Data structure Installable File System Physical system Probability density function
Trail Slide rule Computer file Real number System administrator Menu (computing) Field (computer science) Product (business) Number Gastropod shell Extension (kinesiology) Installable File System Physical system Social class Scripting language Mapping Structural load Database Directory service Library catalog Subject indexing Personal digital assistant Data storage device Order (biology) Website Right angle Quicksort Table (information)
Web page Computer file Block (periodic table) Multiplication sign Web page Electronic mailing list Pointer (computer programming) File system Hardware-in-the-loop simulation Game theory Lie group Quicksort Table (information) Installable File System
Web page Tuple Greatest element Computer file Web page Database Price index Directory service Laser Subject indexing Pointer (computer programming) Block (periodic table) Table (information) International Date Line Tuple Installable File System
Point (geometry) Tuple Trail Functional (mathematics) Code Multiplication sign Water vapor Function (mathematics) Disk read-and-write head Field (computer science) Automatic differentiation Number Medical imaging Pointer (computer programming) Coefficient of determination Semiconductor memory String (computer science) Representation (politics) Installable File System Physical system Email Theory of relativity Information Key (cryptography) Block (periodic table) Forcing (mathematics) Mathematical analysis Variance Database transaction Denial-of-service attack System call Laser Type theory Pointer (computer programming) Personal digital assistant Query language Data storage device Order (biology) output Object (grammar) Block (periodic table) Table (information) Resultant Row (database)
Tuple Email Email Information State of matter Strut Bit Database transaction Host Identity Protocol Word Memory management Hill differential equation Quicksort Data structure Data structure Metropolitan area network Installable File System
Tuple Email Thread (computing) Multiplication sign System administrator Strut Execution unit Process modeling Function (mathematics) Mereology Stack (abstract data type) Database transaction Computer programming Subset Pointer (computer programming) Mathematics Mechanism design Semiconductor memory Single-precision floating-point format Endliche Modelltheorie Physical system Area Touchscreen Texture mapping Software developer Fitness function Shared memory Coordinate system Bit Sequence Flow separation Connected space Category of being Process (computing) Drill commands Quicksort Metric system Data structure Reading (process) Row (database) Spacetime Web page Point (geometry) Vacuum Overhead (computing) Computer file Coprocessor Wave packet Workload Read-only memory Operating system Software testing Address space Dependent and independent variables Database Line (geometry) Word Uniform resource locator Pointer (computer programming) Loop (music) Integrated development environment Query language Personal digital assistant Data center Window
Web page Point (geometry) Trail Computer file Multiplication sign Water vapor Proper map Mathematics Synchronization Natural number Semiconductor memory Buffer solution Shared memory File system Data structure Physical system Area Multiplication Inheritance (object-oriented programming) Block (periodic table) Structural load Shared memory Database Directory service Multilateration Measurement Flow separation Process (computing) Query language Buffer solution Hill differential equation Right angle Figurate number Spacetime
Web page Point (geometry) Tuple Functional (mathematics) Length Code Field (computer science) Front and back ends Number Pointer (computer programming) Memory management Core dump Energy level Form (programming) Continuum hypothesis Infinity Continuous function Variable (mathematics) Cache (computing) Arithmetic mean Loop (music) Buffer solution Hill differential equation Right angle Quicksort Table (information) Tuple
Tuple Continuum hypothesis Data management Uniform resource locator Pointer (computer programming) Mapping Atomic number State of matter Semiconductor memory Integer Field (computer science)
Tuple File format Translation (relic) Database Number Medical imaging Pointer (computer programming) Estimator Befehlsprozessor Semiconductor memory String (computer science) Memory management Right angle Physical system
Server (computing) Game controller Matching (graph theory) File format Shared memory Water vapor Bit Database Login Food energy Human migration Type theory Word Befehlsprozessor Process (computing) Data storage device Personal digital assistant Semiconductor memory Energy level Right angle Quicksort Figurate number Conformal map Error message
Point (geometry) Operations research Implementation Building Assembly language Multiplication sign Set (mathematics) Bit Zugriffskontrolle Type theory Latent heat Befehlsprozessor Kernel (computing) Shared memory Computing platform Energy level Software testing Data structure Data structure
Web page Slide rule Divisor Weight Physical law Semaphore line Database Directory service Control flow Amalgam (chemistry) Field (computer science) Pointer (computer programming) Mathematics Process (computing) Read-only memory Computer configuration Data storage device Semiconductor memory Hierarchy Shared memory Buffer solution Software testing Quicksort Data structure
Web page Module (mathematics) Point (geometry) Scripting language Area Default (computer science) Game controller Group action Block (periodic table) Code Multiplication sign Set (mathematics) Mass Line (geometry) Event horizon Cache (computing) Mathematics Process (computing) Buffer solution Buffer solution Software testing Right angle Endliche Modelltheorie
my so in my mind and chastened on he can lower and therefore we have died my I I I I I I I I I I I little I a my the
story of various regions of the rate of this site which somebody can tell me go and because it was me anyway all my name is Bruce mongin super excited to be here it's been a great carbon so for how many of you attended unconference appears to today's the i was that positively like the conference yeah I just wow so but it seems like we've got a great sort of pattern of all sort of exchange working on style often kind of you know brain power discussion and and that's something that sort of missing from the lecture format i tends to have a lot more interaction my format but again it's it's a personal preference on but I guess I just found it really valuable so on looking forward to that but and again numerous model for all on a posters this core team members been working with it since 1996 all and I work for the prize the billiard with them for 9 years and obviously you know we've been around for a long time units to adding enterprise of tooling and and support and capabilities to pose crescent work with parts customer so are on very happy it's work really well on well in the Conference on how many of you were just over at the compile lesions to toward 11 were yeah is the funny thing is I'm over there thank you very much and pulled up this slide and the slide like hey that looks kind of like 1 of my slides for this presentation and those you were over there you actually see some stuff that that actually is is related to what he was talking about so what kind of a bonus for you or I never I did not expect that so what is this presentation about is about all basically how PostgreSQL all uses shared memory how PostgreSQL processes communicate outputs processions communicate on I've been working obviously with this for a 19 years now also on intimately aware of how it works and I will also say that all how it works and how it does what it does is 1 of the things that fascinates me about PostScript so if you detect person who was interested in sort understanding what happens inside and how old PostgreSQL because what it does I think this will be a really interesting session for you you CDs all the sessions of again of the focuses mostly on how posters does what it does some of the sort of overarching architectural decisions the posters has all regarded as somebody internals and servicing and again a great a great presentation from PG call were a lot of people and tends to be more server development internal people on and I think I find this French again I I like questions as I speak so all feel free to to answer question and we can talk about that also makes the class were interesting over the trading margin of an action Monday development here cause I was a schoolteacher for 5 years and I am I proud professor at Drexel University so sort of like a blackboard slash lecture hall is is mapping natural of willful scary natural but again it's wonderful you came today
so what are we gonna talk about all of the talk about 6 major things on 1st order talk about how close restores its data which is a little bit of a little on the simplistic solid for people who use that much more complex systems mostly why it's dies that way that's how we do it for a while we do it that way the model but shared memory not for those who were familiar with it I will explain what that does all process creation also something that we talk about and come to give you an overarching of all the things there are a lot of times when explaining the architecture stressing in training classes I have to go to the slides because these slides really I think give you all sort of foundations of what is happening inside PostgreSQL you understand the foundation you're much more able to sort of see what was that process doing how is it doing it what is it mean when there's lock intention was at me when I create a new session all these things that are kind hidden for ordinary users you're going to have a world into a number and told about how we actually look at Rose how Rosa look up all that was actually the the point that was talked about in the optimizer session before this would below about locking in other structures and then finish off also many questions for start of a melanophores body and went back near me over the head with the questions on that so all that
kind of 1st of all surprised all I would say that the people who come to pose from other databases is the PostgreSQL uses the file system solely ought to store its data so unlike systems that made use raw devices or special hardware or work out some sort of mixture of those for everything is is stored in the file system are in fact everything is stored in the file system under a directory that typically is called data which you know very unimaginative all but but that's that's usually where it is as an environment variable called PG data that usually points at this all and this directory has everything it's got the configuration files it's got the system tables it's got the tracking of which transactions committed it's got the right had log in most cases it's got all your data files and all your indexes ok so everything is in there to the extent that if you show both present there and back up the data directory you have backup the database OK so everything is in that makes it very easy to administer makes it very easy to into LSO and you can see all your open file descriptors there's nothing hidden OK if you want more then again there is no sort of back-to-work kernel get needed this special region of the drive it's all in that directory OK if you install multiple versions of PostgreSQL multiple clusters of posters on machine machine you can have multiple data directories so typically you know if you create really clusters and promote and port numbers and a server you have 3 data directories in 3 different parts of the system they may be on a separate file systems they may be on different areas but of the somewhere in that false that's basically the way all pose resources that don't directory is created initially by the EU DB command which probably most you you've heard of all its internal commander you would you would used to initialize post-procedure using install a lot of these spores 1 in dB for you but if you had to run it yourself that's what it does it's creating that raw data directory all for you to to use so again very simplistic doesn't use anything fancy very good for administration very good for virtual machines but you when the hands came around and really what we needed this special not much really nothing at all but now you're not restricted to the data directory you can store stuff another tablespaces there's that way symbolically because at low tables this is not locked into 1 directory for all your data but again this is going to be your main source of data is going to be where you can think falls live in those cases and then you can spread out from here but this is where things start OK any questions so far about the OK to power 1st thing is that the file system is kind of don't go to place proposed rests on a lot of the database is usual for all systems and sort of fancy storage basically that was designed in the old days when fall systems were very primitive think like System 5 3 fragmented Harold sort of sort of problem with throughput or if you look at modern false assistance today versus role device there's not a whole lot of advantage there's 2 per cent if you're lucky by going over system but it's not worth the headache it's not worth the administration is not worth the rigidity that you have of using what about suppose was never implemented it I doubt we're going to all we've been very happy with the the development that has occurred within false system designed to make that easier just to know that it works you can put your money and people want PostgreSQL 1 like PS 3 is another crazy devices and whatever all so we just don't feel that 0 false as a sort of raw device were postprocessed white false system you know is really a good use of our time and we do think that it it would make it quite difficult all and if you look at what vendors recommending now originally again they originally recommended false system because the speed but now in most cases the single assuming that 2 % just below there because there's there's a lot of headache in that you know in there and having worked with Informix and unlawful will fall systems and compute how big I had to make my stance and stuff I'm glad to see that go on and in fact when you start to look at how we do replication and backups as a with a lot of that is tied to the whole fall system so you there is this synergy in how posters those things that may come across in this fall but all in that because we choose a very basic solution we can put things together all using practical refer replication geographical point power covering all of the value of the web page on top of problem graphical there's a whole bunch of sort of elegant so that we can add this review makes sense because the basic building blocks invisible 1 falls a questions so so what what is in that data directory on the probably the
most important directory that were told about 4 this is actually the base directory which is really weird for all of know I think of acids and bases when I see that we want to call it a directory base will because it's a data base right so when you put the 2 data directory in the base directory makes database on remember this was originally designed by university students and I thought they probably thought that was a cute idea all so it's called OK all you also other directories which I'm just going to talk that robbery fully the PG ex-lovers were we keep the transaction log the PGC loggers were we can we keep the US status of committed transactions this PG tablespace here is where we keep the symbolic links to the table spaces so I said that you can use tablespace impose president of the false systems and that's where they would they would exist that will have a global directory for global tables like the the user name and all the base directors were focus on here with a drill down into it but
inside the base directory you have a number of subdirectories makes sense which is the full system for that effectively you're going to have a subdirectory for every database so in this environment we have 1 2 3 4 5 directories under the base directory so therefore we have 5 databases defined in this particular cluster of all let me say
1 more thing this presentation is at that URL right there so you can download it right now all its PDF is probably 30 PDFs there so I always forget to mention that further
on so here we've got 4 directly is number 1 is actually template 1 which is sort of a special database and the other 1 to test the null marketing whatever whatever look at the production database in this particular case inside
the production database which he happens to be 1 6
3 8 5 we have a number of files OK and the number of on so in this case 2 4 6 9 2 is customer all of you know 2 7 2 1 4 is order and so forth but by normally urine and many more files in this directory and just this album to show you a couple of them there will be some system tables in there but effectively inside of here is every table and every index of those are the 2 things that get story here plus some things like the free space map assistant tables in the in the and the visibility map also going to have to be in this directory there special extensions underscore VMware reference them to track everything that's going on that particular database yes I'm getting up to and from work as was the Luminex sentence having worked in informal and having sort of to look a directory and restore 1 you can figure out what it was you try a store all the is actually in the system catalog say it particular field in the PG class table called real file load and real file move will map to that particular on example I do have a lot of administration slide that on a website that shows you how usual farmer to look things up but also to hold all idea name could run from the command line which will do a query and look up that name that goes with the numbers you can create shell scripts that you should the biggest 5 tables from the command line right and then used to work on the menu for each chill 5 you look up the name based on the number we would love to use the names originally in the nineties we 1st for you to use the names but then the problem was people always renaming indexes in naming tables and is required that atomically while because obviously you have to do some some unique allocatively for referencing coreference in following so now we allow the name to be changed because effectively for reference by the the rel file load which does not change if you bring in a table off on that is the same so if we had to do it that way and then there's other problems with the long table names and UTF-8 the table names and it just really wanna go there and so I would love to use just the names we we can other questions look at so let's take a look at 1 of these files so here
is actually let's see here this is a 2 4 6 9 2 which have be a
customer table and basically the tables are all of going to be made of K blocks can change that by recompiling posters about everything's going to be a list of an appendage of moral K blocks if the blast over sort of game we will create another file which is again the the key what we have what we call segments and you can you create 1 on 1 didn't time will create segments to make sure we all over for your file system you limits on them effectively we got these 8 pages and everything is restored that lie inside the KK page it looks
like this OK so that's 1 of the 18 pages that particularly appear in the table are at the beginning is people had at the bottom some special thing at the beginning you have what employers and at the end you have what are called Rosa tuples and the item importer's grow from the beginning to the middle the tuples grew from the end of the middle when the 2 meet tables full and we created a new page or are there idle worship middle important because of what we move this thing around a lot is risk get added deleted but the item pointers are pointed to by the indexes so the index is active pointers to these so that allows to move rose around on the page without having to touch the index and that's really the 1 of the major reasons are is allows us to move things around what keeping the reference to external reference that well that fixed offset with the page that and again there's a literary character
so let's go inside the tuple so we remember we've gone from the data to the base to the directory that matches the database to a particular file to indicate paper page now we're going to go inside or out so this is sort of like we're going smaller and smaller and smaller all we got key pages carry of particular tuple here well what is that look like it
actually looks like this for this is that while it actually has a head and then a bunch of values that are appended to that of the header has a wanted tracking information all related to when the transaction was committed in fact is a virtual there will have all of these fields and their on but we have a we have the 0 we have a number of road number of columns of the at fields in the well on all set for the nulls which is backed down here all and then after that you have a bunch of values OK this is where we start to come get specific on particular values of effectively the way it because PostgreSQL was designed as an object relational data stored mean you can add the attached to it functions as an index of for every time you access of value in the wall of the field and the road you actually have to call a special input or output function to get the value in and out because both this doesn't know what a major is doesn't know what a text field is it doesn't know wanna see ideologies on with the code that's looking at these woes doesn't know exactly what's stored there all it knows is to get the not the value out I have to call an output function and you get the value in Idaho on input functions so for example this 1st value in the room field is actually images and it will call a function call it for working with the strength and that function will return of 4 by 4 bytes of memory and the system will store 4 bytes then this is that's what the input function told me is the binary representation of the restraint in the flip side when you when get a value out of assumed that the force field here is a text field and recall what call text that text will look at the also or bytes at a particular order entry of offset within that Rollins character return of an actual string in this case is Martin happens to be stored there but again you can see that we have not hard coded information about the types all that information is kind of stored in external functions we do a couple short circuit things looking for it for a couple hours were we conversational I notes and force from the use of special you're not call function but almost every other case of this is why I oppose this is easy to extend it also why the system tables are so important why you can add dog detects PostgreSQL because you can very easily justify the input function defined the output function also post-Christian deal with that this paper but the paper did not know about before the so you're talking about the in the unimportant see
ideas effectively that the kids and so those are the that's actually by employers that sit at the flood and a point to specific roles within the within the block all of the the question was happy to use attendees to find out whether they're identical within a table there duplicates is that right yeah OK yes so as the questions can you see candidates that effectively do stuff with table we would probably discourage that because the CKD is really tied to the storage of the RelE and doesn't really track things like the world getting up the water running the query because the Sudanese government somewhere else on I don't remember if we actually follow the CKD point here if you access it so this is that the we really would not normally have people looking at those we would probably recommend some kind of primary key although again if you're saying that the rows all or identical then maybe some of you know ads and that's a good question really Robert yes so twitching and I was hoping you had and answer for this gentleman all the all on the other side of the in the in the in the analysis of the results of the of the question some of the of the story of the British we call this the variance of the point for of the of the thank you that was great which at that other questions that so what I can't show you you is the idea of drilling down in looking at an entry on the
idea of sort of doing now
let's take a look at some of the details this is actually the way out for word the header in C OK so you're actually looking at the info man all the bits for Knowles all these all are these sort of what is here this is the of the x minutes max Siemens seen that sort of a union of values also virtually all this stuff is not bad such a defined in state and we have a which Makris axis Sloman and make you know through access and when you add a well have a look and create structures that match this particular way out
on so any questions again I've covered drill down all the way I want going to go into the next topic of 10 adjusted 1 of so when would that limit the rows get show the sequence can get shuffled around in update situation particularly about what happens when you query you'll get that kind of a page lock as you're looking at the physical world but then what should be that went to the next page you will typically maintain a lot on that so this seed tijah particularly evolvable how because an update would cause perhaps that seek to be on the on visible or invisible to you and you see fit for the same role as somebody that they would be somewhere else so if there's clean up on the same page like backing model that center changes seated because the CT is a pointer into that page you can move things around the pages he does not change it's really going be something like an update or perhaps a vacuum that removes the wrong and therefore there will be a point the season up with nowhere other be another case where where Roberts idea of doing select property do have a lot of because you're 1 within the same query another problem is really not disappear by looking at but but when you 1 1 green and you go to the next 1 unit that your own mustering serializable motor recommitted were repeatable read mode-related new snapshot and subset doesn't know anything really is not maintained any because they about what that role now and it could therefore which right let's get to the
next thing I want to cover and that is how close PostgreSQL is effectively laid out the shared we talked about sort how the data stored let's look at the word shared memory portion of so called post this process based this is all often another surprising thing to people all because they're used to having databases which or basically 1 huge process which has multiple threads running inside of location of a big address space you got 1 executable all and you have mobile stacks effectively with which is effectively were thread is the stack all within that and you're using your operating system is is scheduling multiple threads within that all within that particular executable multiple threads at the same time of metrics to open up posters is not that all what we actually have a process-based model Anderson major advantages to that 1st from administrative standpoint you're not looking at sort of 1 big that you have to sort of pick apart on member would forwards went from 4 cross the threat model will happen that system overcomes also became very difficult of practical mantra developers is it's a lot easier to do development or process model all because the threads are really hard and it has a shared stack shared this space so everything you allocate all this senescence of Pollock around it all we've been very happy with through the process based model fortunately almost operating systems are very good at printing processes process creation of recreational Linux is about the same overhead so again it's it's a really not it's really cheap way of us getting a very stable environment you're much less likely for a random session to corrupt things in this process based model then you would for a base model size of the easier programs the only places were friends to actually less over the processes significantly would be on Windows and on Solaris I think those 2 have very heavy processes are and very lightweight threads on those environments ideally what about non threads but also for almost every other environment every other workload were good with this if process creation is a problem for a high-volume system we also recommend some kind connection were so the threads the processor used all those people environments so you know if you come look at the way here effectively what happens with POS spots on it starts something called post which is your to listener or you're you're connection all your initial connection all process which originally talk to the person wants to make connections assuming that connection is considered valid and we will we will use an operating system command called for work which will make a copy of the problem posed postmaster so you can see how the outcome of elegant this is you're not really creating a new holding binary you're just duplicating the binary you already have an issue with Rory listening and it inherits the file descriptor all the socket OK and then continues and the POS Mr. than just goes on to do what it was doing before and that is listed for new connections so again very elegant design all of the basic do for what happens in the 4 is the year program texture actual program is identical because again using the same pages so you're creating your to process 0 to both of the same pages they both actually share the same stack and the same data segment although the latter sharing that goes on between post Mr. depict recession is fairly small which indicate which is indicated here by various skinny lines all sort of came sharing between those 2 occur all so effectively you're creating a session here it's starting it's starting it's starting process that the loop to understand SQL it gets the query forces it you know it optimizes and execute that there is a huge amount of work obviously that goes on in a single SQL queries on training is really good if you have a lot of small tests you wanna do like you know you want your ability and you want that our lives to keep spinning wall you're updating some all part of the screen that's a very lightweight case did is a pretty heavy there's quite a bit going on here also so the idea of creating of spending time between processes is pretty logical crap but the problem so that the process model has is that is that these processes private even though you fought the child every time you change the data in your in your local data you know other processes that are missing that OK it's got a copyright mechanism all effectively preventing anybody from seeing changes you get a copy of everything is you modify get private copy so how do you get the system to work and all this very complicated orchestration and all coordination and so forth if everything is in every introducing a can be shared the entity that is shared memory so this pink area down here is effectively all of the area that you want to share of long your sessions OK and as base the outputs responses it in a typical thread model you have some shared data in your data stack some that's private to the threat posed as effectively we can set and want a share it can go or a shared memory segment and it's going back in the same address a similar addressed causal processes all the same size but changed by 1 part shared-memory people see it right away and anyone private you do in your data center okay to you can see is a very clean design because effectively you taken everything share and you put it in a separate part because that's kind the magic of what love post-Christian knowing for a lot of stuff there you know the
shared memory area has probably 15 or 20 years 25 different structures all of which is shared and again we've taken the combustor separate out everything a shared and put it in this 1 place so the biggest area that you wouldn't wanna while you're going to be using of you shared memory for is assured buffers so this is going to be all pretty much all the I O you do is the happen in the shared buffers because effectively while page somebody else wants to look at that page we 1 nature we show them the same page right like if I read some data nutrients and data and we see in both modified with going out separately 1 of them over at the other right so you don't want do that so you have a shared buffer area which is your common area for every time you read some pages every chemical modifies the pages and do not your local this space but in measure buffer space we also have a way of tracking poll published scriptors there's a proper way I'll talk about which actually has a ready for every other session which running all we have a whole bunch of tracking stuff related to things like to fisheries Committee multi exact synchronize stand all again of all of this stuff in there all that every processes able access so let's put the 2
together we talked about all the data directory Nikkei pages and everything all we talk about shared memory as a way of allowing these processes tall communicate let's put together so effectively the share buffer area which is the area I said we do all viral all that means going to the shore buffer area all the modifications happened shared buffer area of all the rights from post all sessions happened in the shared buffer area mingling get written down the file system at some later point because it's like the big staging area effectively it's broken up into 18 pp super surprise right we already said that this is a world governing k pages so we've got affecting their shared my memory area maybe this book about a pages so if you have 8 megs of shared memory you have a thousand 8 kph unlikely you have a k or leadership buffers if you have a decent job Robert let's say this was yeah to do existía buffers right all but then you have a thousand years you have what see it would be a million above is really so you have like 250 thousand shared buffers into gigs of shared memory which is a typical figure for a larger OK on and effectively what you're doing in this area is if you if a database if a particular session needs a page or a block from the false system it will go to that file system at a specified time it will seek out to the offset of where it once the load that article page Ilsa give me a k from that and from that file from that directory path at this also give me a car and write it right that share buffer area OK and then once that happens I can read it I can modify I can write it and other people can read it and modify and then effectively right I may not even write it I probably will just read the page and then just get just move 1 and some some other process like the checkpoint or the background writer will actually be the 1 that you throw that away because nobody wants to read any it has been modified but it has been modified and somebody's going to have to write that down and that's usually the background writer would attract water so I don't have to worry about writing the changes I make partially because this is a shared area so I I can like set up and down with it and just leave it there and I let the cleanup process come along in some other session that is waiting for a query that is currently excluded query it wakes up every so often earlier we have a lot of pages here these have been modified to throw away the 7 modify this through these write these down then you from 1 to have an accident what for that cleanup is happening in other sessions but my point is that wholly OK the idea that there is a big debate in the file system effectively is big right up and to the job offers and we have all the above restricted area which is doing the tracking of who's reading these pages whose of perhaps writing them and therefore we should be modified this gets back to that the whole thing with rights on so
again the AK page actually looks like this on the tuple itself looks like that and if you are reading and writing of those where fields again has the same text in its were out whatever the sort of thing and then they're all coming from all that the director was kind of interesting is that when you reading of predicative field in the shared buffer which again is a copy of what's on this problem there's not a lot of sophistication effectively just a point here all
so if this is actually some C code that's in the back end which is a which is effectively walking through all all all the particular table so we have a loop here which of is basically like as we see here so this is our desire for a right here OK and it's as the called no caching get added where we have all winter functions like this on the taken 1 of revenge what it does is it says OK I'm going to walk I would like sort of an infinite FOR loop here and I'm gonna look at the 1st field and the 1st field isn't all that i'm just gonna skip over because effectively it's not we lost 1 also just advance and then move to the next number is far more effective because field is not all that if it's a if it's a variable length field then we effectively all hopped over all of the length of that field we have to get from that field to the next 1 is variable length so we have to look at the core the length and we just say OK 520 bytes with some point form along on if it is available in field we can actually look at the of alignment and the value that you you just without reading the data we just know it's always a points that but if it's the world were looking for work it's with a mean field of work we break out of the loop if not we just kind of advance and align ourselves and forward to the next field and if want to kind that the field registered we effectively just call a French for particular by and we have an offset the for we've been accumulating also said here all which can be getting bigger and bigger because again we're no good at all set and that's what I'm going to use some kind of walking if I have a feel of hieroglyphics and you know 10 fields and I want the 5th 1 and I walk through 1 I don't walk through to you and welcome 3 and through work when I get to 5 that returned to the to the upper levels of the all this is even more complicated member
again I said we have a fetch 18 all effectively
what the state at does all is it's going to say OK is this field I'm looking at that byte integer Soviets In manager that I'm just going to actually take the field and what it might take the the memory for location of the map it as an integer pointer atoms returned to the person if it's against 16 minutes means that it until I'm going to return that as the user is not at on when actually all and 1 actually kind of map as a character pointer and returning to the user and then it not only is it true that comes from to return a pointer to that article so what I'm trying to do here is I'm trying to map that particular by of of of the data so the interesting
thing I think here is that all a lot of people think that when PostgreSQL was its data stories there some like magical format like some format that all the approach was designed from early on images look this way and strings look this way and you know this is not really true it like when you're actually stored in forward or in a imposed as were just taking whatever the CPU thinks of as and when they were just going into a memory little to memory offset so there's no quite translation of a lot of these negative values it's just that's exactly what the way that that CPU wants to store in so therefore we should take that where and when not accessing that Anderson his involved would have put the the number 5 in image is good right at that offset in the shared buffers between
white that particular value and then eventually excluded stored estimate comes back and come back exactly the same way and system to read that it work if you influence in my research so what you move database from beginning system to linear system it will not work of in fact 1 of the downside of this is very fast and very
simple right so it works really well but the downside is that it will not work all it's not only it's tied to that CPU all if you go from 32 64 that it will not work not only in that not only the energy and water is an issue by going from referred to bit or less restrictive were all on the same physical CPU also things are slightly different alignment and therefore it will not work so all you get narrower drive started at a low doesn't work all but but it's and it's kind of an interesting approach some databases have sort of a neutral format that the use of allows that kind of migration which has never had that and and again I think for performance and simplicity reasons on probabilistic before we have all you get there and start up yeah 1 of the conformity of the PG control you have a PDA control error in your server logs and said what would say agent is that yeah ahead that it's so obvious that no 1 ever asks media like what is a ceremony because they must do UCB types a lot but I whenever there is people like OK I got it you know that we'll get we never get I don't ever remember seeing question and maybe every 6 months to 9 months the a question about moving it didn't work but it's pretty clear yes the PD pretty clear when it doesn't match something on 1 final thing
I wanna look at is how we do locking all my soul locking is kind of again very low level for PostgreSQL so you forget that concept of simplicity here so this is a good illustration of what we call test-set instruction where all your actually have to processes 1 over here what every here they're both trying to put a wall and into a particular by of that in shared memory have been shared memory right would make use of the world on so if you put a 1 in and you get a 0 out then you have succeeded you have a lot if you put a 1 in and you get a 1 now out then you you you will actually so somebody already has got a walking this allows for concurrent or locking in a very efficient manner of that because successors put evaluating get out the values to be there so when you get a 0 out you know that there was no there was no 1 there before word Figure 1 of you know there was a 1 year before and if we did not have a lot on
it is so low level that we have CPU assembly language specific instructions for every CPU type that we support to implement test sets of this particular examples from Intel are actually 86 I think this is a 32 bit example all but effectively you're doing the exchange by bytes which happens to be the of the of the of the assembly language structure for test set on that particular CPU all but again very low level you're calling assembly language here to do locks on
sometimes you don't want like what if you locked fails what you do we have another type local spinlock which means that if I don't get that if I get success some fine if I don't get success on going to sleep for a wind and increased amount of time and then go back and try again also again through building on the test set you of the spinlock all this can be good this cannot be good because again the spending is not going to be always good a certain point need to
sleep on and we have something called lightweight locks which effectively allow you to sleep on a semaphore and get woken up when that law becomes available so you have sort of a hierarchy of walking that you can do all depending on how long you think you need a lot only think you have to wait forward you have you have those options
there is a mention before lot of a proper way out which tracks all active sessions again also in shared memory so that we can know what all the other sessions are doing in new coordination on
that for the properly looks at
all so that's the last slide kind of reminding us of all what we're storing ensure buffers again what we talked about is how store the data in the data directory house losing K pages have we store a well tuple all have we get database in an ad that field and so on and then we started talk about process creation and forward and shared memory and how well that works and then we sort of about shared buffers how the pages are copied initiative buffers and how we do some little walking with the spinlock in the test set this sir this is a problem on the left is a woman so have shared buffers get allocated individual processes are the factors they don't all a should buffer it is shared buffers themselves or just amalgamation of all bytes OK so what happens is if you want to if a special session wants to make a major change in a particular page it
will actually use a buffer descriptor and above a descriptor you put new were carried 1 that published represent I and in the process of making changes to that page and comes along once the modified page even see it may be required to wait until you release the white block 1 that particular page the beauty of this is because a lot of this is happening assembly and a lot of the test this year hold that buffer opinion we call very briefly in action taken back above opinions that nobody else can modify or a candidate for a buffer right lock would be don't get in there while on moving things around all and that would be where you would do it in a buffer script area any any process can find it at any page and do whatever they want with us and there is some code there event things a Catholic from large Kansas of like that so what we need do try and sort of and enumerate the major it disruptions recession but in general by default they have pretty much control the set of all of all of the all the all the all the all the so could you write a tool that actually looks Sherman without looking oppose PostgreSQL all you can do is they can train and this is a great point there is a contributor model called PG buffer cache and allows you look inside of PostgreSQL sure buffers and see what's there you could very easily Walter right a process that just attaches to shared memory that shouldn't offset of the of the evolutionary is in the post Mr. that PID follows 1 of lines to get ideas that our attention should remain and you can do pretty much whatever you want the reason most people do it from PG buffer cash-short control module is there's a lot of infrastructure inside the back and that knows how that is laid out and normally they would create a show of a plug-in to do that because then they can they can harness all of the infrastructure that are I believe I am out of time I believe it also is lunchtime and so on but of thank you very much I will be here if you have any questions