
Pluggable Table Storage in PostgreSQL


Formal Metadata

Title: Pluggable Table Storage in PostgreSQL
Author: Andres Freund
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract: For PostgreSQL 12 we have worked hard at making table storage pluggable, i.e. allowing table data to be stored in forms other than PostgreSQL's current heap. This talk will explain: why we think it's important to make Postgres's storage pluggable; how the new API works; what types of table stores can be implemented, and what types are not yet possible; and how the limitations of pluggable storage can be addressed in future versions of PostgreSQL.
Transcript: English (auto-generated)
My name is Andres Freund, I work for EnterpriseDB and I hack on Postgres way too much. And I wanted to talk today about pluggable table access methods in Postgres. So what is a pluggable table access method in Postgres?
It's basically the ability to specify, at creation time, how a table is going to be stored, which method it is stored with. So we can say CREATE TABLE, the table name, the rest of the table's specification, and then say USING heap or USING some other method. I worked on allowing different table access methods
for a good bit of the last year, and this was in collaboration with Haribabu Kommi, Alvaro Herrera, Ashutosh Bapat, Alexander Korotkov, Amit Khandekar, Dmitry Dolgov and a lot of other people. So this was a lot of work,
but it's definitely not just mine. So what do I mean by being able to switch the table storage around? Basically, the way a table stores its data, and table here means not just plain tables but also materialized views, can differ, so you can have different trade-offs.
You might have heard of something like columnar storage. You might have heard in the transaction talk earlier that the way Postgres's heap stores data is actually not particularly optimal for a lot of workloads. So there are a lot of different ways you can store data, and we want to allow storing it in different ways.
So that is what we want to be able to change. And when we say table storage, we don't mean that indexes work differently depending on which table AM you're using. They will mostly work the same; there will possibly be some minor differences in the details, but you'll have the same B-tree, GIN, GiST and so on types of indexes that we have currently. And it's not just that we change the IO layer, as in it's exactly the same as the current heap but just in memory or something; much more can change than just the lowest level of how we issue IO. And what do I mean by pluggable? Basically, the goal is that you can do something like CREATE EXTENSION magic_storage, and then afterwards be able to create tables with that storage. And you can also just say that, from now on, the default for the current connection or for the entire server is going to be this other storage method. You can set that configuration either inside the session or in the configuration file, and from then on all the tables that are created where the storage is not explicitly specified
are going to use the new access method. So it's runtime extensible to different storage methods; that's why it's pluggable. You can plug an extension in and it works. Why do we want to do this? The concrete reason I am working on it most directly is that we want to be able to develop zheap,
which is a form of row store that doesn't have some of the weaknesses that Postgres's current table format has. Namely, it doesn't have as many problems with bloating under heavy activity, which means that vacuuming is much, much less of a problem, and it's also much denser storage-wise.
That is, more rows fit into a table of a given size, and that's mostly achieved by making the per-row overhead much, much smaller. But we also want to enable columnar storage methods, because for some workloads it's pretty clear
that columnar storage is much better than a row store, in particular analytics workloads or archival workloads where compression is very important. We want to be able to allow that rather than say you just can't do that with Postgres. And there are a lot of crazy experiments that are just easier if you don't have to rip the whole storage layer out of Postgres just to experiment with how you could improve things.
to just experiment with how could you improve this. I think there's also good reasons why we wouldn't want to work on making table storage pluggable and I think the major reason is that we might end up with a lot of half-baked table access methods.
There's other databases that have had exchangeable table storage and they have a lot of different variants and most of them have very odd, different oddities so you don't really have, this is your go-to because one doesn't work for this use case, one doesn't work for this use case and one has these kind of bugs, one doesn't have these kind of bugs,
that would not be a great world to be at. So I think that is a danger and I think the second danger and that's also in a problem in said database is that there's a lot of commercial storage engines that are the best thing since sliced bread but are super expensive and you end up needing them
even though the product is normally open source and I think I'm a bit hesitant that that would be a good direction for the Postgres product to go to so I think allowing that accessibility has the danger of that more people will have to pay for random expensive access methods because it's better for their specific workload
and it also has a large architectural impact. We had to change a lot of code to make this pluggable, like a lot and that will not just be felt by the people that had worked in Postgres 12, it will, and I'll go into a small bit of detail later, also mean that a lot of extension authors have to do some changes to make their extensions compile
which is not nice. So what exactly do we want to do? We want to be able to have multiple table access methods active in the same database. That is, we want to be able to, for example, join a table that is defined with the existing heap to a zheap table to a columnar table, and it shouldn't break,
and the user shouldn't necessarily notice it when querying. It's possible that somebody will develop table access methods that only support a very small subset of features. For example, you could very well imagine an append-only access method that is optimized for exactly that: it would be faster to insert,
but you couldn't do anything else, so there the user might notice that they can't update, because they would get an error. But outside of those kinds of things, the user shouldn't necessarily notice, when querying a table, what type of access method it uses. And we want them, as I said, to be pluggable. That is, we want to be able to add new access methods at runtime,
so they don't have to be built into Postgres and we don't have to maintain every possible AM out there. Indexes should work across different table AMs, and similarly the planner should also work for most table access methods. I'm going to talk a bit about the limitations around that later.
One thing that's very important is that for now we are not promising that the API for table access methods will not change, because we already know that there are limitations in the API and it will change in the next few releases. It's just a large enough change that we couldn't get it into a perfect state in Postgres 12, so there will be iterative changes, and that will probably mean a bit of pain every time you upgrade to a new major version.
This work is also not tackling a bunch of problems that we could have tackled as part of it: we haven't made some other parts of Postgres extensible, even though that would be very nice to have. It just seemed to be a large enough project that we didn't want to tackle those at the same time.
I think the largest one people might be bothered by is that you can't actually create the catalog tables with any of the new access methods; that is hard-coded to the current heap. Some of you might think: but we have foreign data wrappers, how is this different?
You can do some of the things that a table access method can do using a foreign data wrapper, but it's pretty awkward. Foreign data wrappers were, as the name says, built to access foreign data. They don't really support all the things we want to do with a normal table. For example, you can't actually do proper DDL on them,
you can't create indexes, and all of that doesn't work because that's not what they're made for. And you don't have proper transactional integration. You can do some things, but it's pretty hacky: you can't have a commit that spans multiple foreign data wrappers and a local table and expect that it is actually correct,
because you just get something that's roughly right, which is obviously not what you want for your main transactional store. You cannot have things like foreign keys, you cannot have additional constraints; all of that doesn't work. So in my opinion it's pretty clear that the foreign data wrapper interface is very nice for what it does, but it's not built to do primary data storage.
So what did we have to do? This is a diagram that roughly represents how Postgres works, that is, which subsystems interact when a query is being processed. The client sends a query to the parser. The parser builds something out of the query;
for that, it sometimes has to query the catalogs. If it's a DDL statement, it goes to some DDL subsystem that handles it. If it's a normal user query, the planner builds a plan and hands that plan to the executor. The executor then sometimes accesses the heap; sometimes it also does index accesses
and then accesses the heap, and so on. The heap internally sometimes accesses the buffer manager. The buffer manager then has to go to the storage manager, which might actually read data from disk, or the data might already be in shared buffers. And lastly, if it's not in shared buffers, we have to go to the operating system
to read the data. The operating system might already have the data cached, and if not, it has to go to disk itself. As you might notice, the abstraction layers here are not particularly perfect. All the red arrows are essentially abstraction layer violations. Even though we have an abstraction for heap in Postgres before 12,
we bypass that abstraction in a number of places. Namely, when you execute DDL, that sometimes has to interact directly with the buffer manager, and it sometimes interacts directly with the storage manager. The executor in parts just bypasses the heap abstraction: for example, for bitmap scans,
it knows how tuples look in the heap and it just directly looks into the heap, and all those kinds of things. And if you want to make this extensible, obviously we can't have that, because all those layering violations wouldn't work as soon as heap isn't the storage. So one large part of making pluggable table storage possible
was to rearchitect things so that all these layering violations are gone. There are a few bits and pieces left here and there, but basically that was the biggest part of this work. After this, it basically looks like this: all those operations are now routed through the table access method, and if the table uses heap, then heap does all the buffer manager accesses
and all the storage manager accesses, even for DDL, because DDL is now also routed through the table access method where necessary. Or it's some other access method that doesn't use those Postgres subsystems at all. You can write an access method
that doesn't use the buffer manager; you can write one that doesn't use the storage manager. Yes, as was just noted, there is the blackhole storage method. I mean, you could write something that doesn't use the buffer manager at all
but accesses the storage manager directly. You could have one that is purely in memory and doesn't use the buffer manager, because if you already know everything is in memory you can avoid that overhead and have a much more optimized layout with internal pointers and all that. So really, there shouldn't be any dependencies
from above the table access method layer that assume anything about what is below it. There are some places where it's not perfect, but that's basically the goal, and I'm going to talk about the problems in a bit. So what do the access methods basically look like?
We have the preexisting pg_am catalog table, which was already used for access methods, but until Postgres 12 it only contained index access methods, because index access methods were already extensible: you could create a new type of index in an extension and it would work. In fact, since 9.6 or so Postgres
has had the bloom index access method, which is an extension that you can just create at runtime and it works. What we basically wanted to do is make table storage extensible in the same way, and that's why there is now an AM type 't' for table, where before there was only 'i' for index; now there are both. If you look at that catalog table,
you can see that currently there is only one table access method defined in core Postgres, and that is the existing storage method. The way the access method interface works is that the AM handler has to return a struct that defines the behavior of the access method. That handler function, as you can see,
has the result type table_am_handler, which is basically just a pointer to the struct that I'm going to show in a second, and as an argument type it has internal. The reason it takes internal is that we don't actually want this function to be callable from SQL, and since you can't create a datum of type internal at the SQL level,
you can never call this function from SQL. So what does an access method handler look like? It's the function we just saw, which basically just says: okay, I'm returning a pointer to a struct that's defined elsewhere. That struct
has to have server lifetime, and it defines a bunch of properties of the access method. On this slide there are just two examples: one is that it has to indicate that it's actually a table access method, using Postgres's somewhat odd version of virtual objects,
inheritance that isn't quite inheritance, whatever; and then there's one example callback that it specifies. Any questions so far?
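For reference, here is a minimal sketch in C of what such a handler and its routine struct can look like. The extension name myam and the single callback shown are illustrative rather than taken from the slides, and a real access method has to fill in the full set of required callbacks.

```c
#include "postgres.h"
#include "access/tableam.h"
#include "executor/tuptable.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/*
 * SQL side (illustrative):
 *   CREATE FUNCTION myam_handler(internal) RETURNS table_am_handler
 *       AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
 *   CREATE ACCESS METHOD myam TYPE TABLE HANDLER myam_handler;
 *   CREATE TABLE t (...) USING myam;
 *   SET default_table_access_method = myam;  -- or set it in postgresql.conf
 */

/* Example callback: tell callers which kind of tuple slot this AM uses. */
static const TupleTableSlotOps *
myam_slot_callbacks(Relation rel)
{
    return &TTSOpsVirtual;      /* heap returns &TTSOpsBufferHeapTuple */
}

/* Server-lifetime struct describing the AM; most callbacks omitted here. */
static const TableAmRoutine myam_methods = {
    .type = T_TableAmRoutine,               /* marks this as a table AM routine */
    .slot_callbacks = myam_slot_callbacks,
    /* ... scan, insert/update/delete, DDL and index-build callbacks ... */
};

PG_FUNCTION_INFO_V1(myam_handler);

Datum
myam_handler(PG_FUNCTION_ARGS)
{
    PG_RETURN_POINTER(&myam_methods);
}
```

The struct is static and const precisely because of the server-lifetime requirement mentioned above: Postgres keeps using the returned pointer after the handler returns.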
So, on to what the API looks like. I'm not going to go over all of the callbacks, because there are something like 42 of them and that would just bore everybody to death, so I'm going to go over the most important ones. We have, for example, callbacks for inserting tuples, because that's obviously something pretty crucial. If you ever looked at the Postgres code before 12, there was a function called heap_insert that a lot of code called, usually through some wrapper, and that just inserts a tuple. The insert callback has basically the same parameters, except that we don't pass in the tuple
as a heap tuple, which is what we did before. Instead we pass these tuple table slots. Tuple table slots were a preexisting concept in Postgres: a slot can hold a tuple, but the format of that tuple is not necessarily specified. Before Postgres 12 this was used to have things like
virtual tuples, where there is no heap tuple and we just have the datums for the individual columns. A slot could also hold minimal tuples, which are a condensed form of tuple that we use for sorting and hash joins and that kind of thing, but which doesn't have transactional information. And now we can have additional types of tuple table slots that are used for the different AMs.
For example, there's one that contains tuples for the heap access method, and in the branch with the zheap support there is a zheap tuple table slot, and the code using those tuple table slots doesn't have to know about their internals.
That's why we can have generic code that inserts tuples into a table: it just deals with a tuple table slot, it doesn't know what the tuple looks like inside, and it gets that slot either from the user input or, if you have something like an INSERT ... SELECT, it's simply the tuple table slot that the SELECT returns.
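As a sketch of that shape: the insert callback in the PostgreSQL 12 table AM receives a slot rather than a heap tuple, and the hypothetical myam_tuple_insert below shows how a heap-like AM might implement it (the actual storage write is elided).

```c
#include "postgres.h"
#include "access/htup.h"
#include "access/tableam.h"
#include "executor/tuptable.h"

/*
 * Shape of the callback (member of TableAmRoutine):
 *   void (*tuple_insert) (Relation rel, TupleTableSlot *slot,
 *                         CommandId cid, int options,
 *                         struct BulkInsertStateData *bistate);
 */
static void
myam_tuple_insert(Relation rel, TupleTableSlot *slot,
                  CommandId cid, int options,
                  struct BulkInsertStateData *bistate)
{
    bool        shouldFree;
    /* Convert whatever slot type we were handed into a heap-format tuple. */
    HeapTuple   tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);

    /* ... write the tuple into this AM's storage here ... */

    /* Report the tuple's identity back to the caller through the slot. */
    slot->tts_tid = tuple->t_self;

    if (shouldFree)
        pfree(tuple);
}
```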
Then there are a few operations for DDL. For example, if you do something like TRUNCATE, what happens is that we create a new relation file; we don't truncate the old relation file in place, because we want to be able to roll back. So a new relation file has to be created, and we don't want to restrict the table AM interface
to be very specific to the way heap does that. So your callback gets called and you get a bunch of information: what the new relfilenode is, and whether the table we're creating is an unlogged table, a temporary table or a normal table,
that is, one that is properly WAL-logged and not temporary. And there is some information you can return, because, for example, I don't know how many of you know what the relfrozenxid and relminmxid horizons of a table are, but not every access method will want to use those. Actually, probably none will want to use them
except for heap, so we have to be able to override these for each access method, because they drive how often vacuum is run. And then there's support for other DDL operations. For example vacuuming: not every access method is going to need vacuum, or it might do vacuuming
in a different way, so we can't have one general piece of code that does the vacuuming. Instead we call a callback into the AM, and that does whatever logic is needed for heap, or for zheap, or for the blackhole table access method.
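The two callbacks just described look roughly like this in the PostgreSQL 12 TableAmRoutine (member declarations excerpted from src/include/access/tableam.h; the comments are added here):

```c
/*
 * Called when a relation gets a new relfilenode, e.g. at CREATE TABLE or
 * TRUNCATE.  The AM receives the new file identity and the persistence
 * (permanent / unlogged / temporary), and reports back which transaction
 * horizons, if any, should drive anti-wraparound vacuuming for it.
 */
void        (*relation_set_new_filenode) (Relation rel,
                                          const RelFileNode *newrnode,
                                          char persistence,
                                          TransactionId *freezeXid,
                                          MultiXactId *minmulti);

/*
 * VACUUM is delegated wholesale to the AM; an AM with nothing to clean up
 * can make this a no-op.
 */
void        (*relation_vacuum) (Relation onerel,
                                struct VacuumParams *params,
                                BufferAccessStrategy bstrategy);
```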
Then there are also callbacks for building indexes, because, for example, building an index is actually a lot more complicated than one might think. One might think we could just do a sequential scan of the table and insert everything into the index, but that doesn't actually work, because we have to do things like create index entries for tuples
that aren't visible anymore, since another session might later use that index even though it has a snapshot from before the index was created. So the index build has to index tuples that are actually already deleted if there's any other session that might still see them. How that works is pretty specific to each AM,
because some access methods, like heap, keep all the deleted tuples inline in the table, while others might need to go into undo and look up the data there. So we can't really make that into general code; it's a callback.
For the various forms of scanning a table, we have a different set of callbacks. One is scan_begin; and it seems I copied the same slide twice here, I wanted to show a different callback. Anyway, there's one callback that just starts a general scan,
and then you can repeatedly say: give me the next tuple, give me the next tuple. If you want to do a parallel scan, you just pass in the parallel scan information, and that generally works independently of how your access method works internally. The second one was supposed to be the scan-get-next-slot callback, which basically just returns the next tuple every time you call it, until it reports that there are no further tuples.
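For reference, the generic scan members of the PostgreSQL 12 TableAmRoutine look roughly like this (excerpted and commented; not the exact slide contents):

```c
/* Start a scan of 'rel'; pass a ParallelTableScanDesc to make it parallel. */
TableScanDesc (*scan_begin) (Relation rel,
                             Snapshot snapshot,
                             int nkeys, struct ScanKeyData *key,
                             ParallelTableScanDesc pscan,
                             uint32 flags);

/* Release resources held by the scan. */
void        (*scan_end) (TableScanDesc scan);

/*
 * Store the next tuple (in scan direction) into 'slot'; return false once
 * the scan is exhausted.  A sequential scan calls this in a loop.
 */
bool        (*scan_getnextslot) (TableScanDesc scan,
                                 ScanDirection direction,
                                 TupleTableSlot *slot);
```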
And then there are a couple of very specific types of scans. For example, a bitmap index scan: the way we implement that is that first we scan the index, and then we have to do a bitmap heap scan to fetch all the tuples from the table, because when we build the bitmap over the tuples that we want to read we don't have all the columns, so we have to fetch them from the actual table
and return them, and that's a pretty specific operation that we can't implement in a general way. It used to be that both of those operations were basically implemented as part of the bitmap scan node, directly poking into the files, or rather into the buffers of those files, but we can't do that anymore,
so we have a separate type of scan that accesses those tuples. And there is a wrinkle there, because it doesn't necessarily make sense for all table AMs to support bitmap scans: the reason bitmap heap scans are efficient is that you assume the tuple identifier
has some correlation, some spatial locality, with the on-disk location. If it's just some random key, you're not going to gain any efficiency, and it doesn't make sense to use bitmap scans at all. That's why these two callbacks are actually optional. If they are not specified,
the planner will just never generate bitmap scan plans for that table. And there's always another way: Postgres always knows how to execute a query without a bitmap scan, it will just potentially be less efficient, so it's fine not to implement them.
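Those two optional callbacks are declared roughly as follows in PostgreSQL 12's tableam.h:

```c
/*
 * Prepare the next block of the bitmap; both callbacks may be left NULL,
 * in which case the planner simply never chooses bitmap scans for this AM.
 */
bool        (*scan_bitmap_next_block) (TableScanDesc scan,
                                       struct TBMIterateResult *tbmres);

/* Fetch the next tuple of the current bitmap block into 'slot'. */
bool        (*scan_bitmap_next_tuple) (TableScanDesc scan,
                                       struct TBMIterateResult *tbmres,
                                       TupleTableSlot *slot);
```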
Another type of node that cannot really be implemented in a generic manner is the sample scan. For the TABLESAMPLE SQL clause we have a specific executor node, and that also used to poke directly into the table storage without going through heap. We can't do that anymore, so there's another set of callbacks,
basically a scan-sample-next-block callback and a similar one for tuples, and those are at the moment mandatory, because right now there's no other way to implement a sample scan: you can't just substitute a normal sequential or index scan for it. We might want to throw an error at a different place,
but at the moment the access method would have to throw the error itself if it cannot support sample scans. For sampling, Postgres actually already has an extensible API to create new ways of doing sampling, and you can already add further sampling methods
in extensions. The problem is that that API is somewhat block-based, so there are AMs that cannot really use the TABLESAMPLE API because it assumes block-based storage. We couldn't fix that at the same time, so it's likely that we will have to change things about the sample scan API in the future.
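The sample scan callbacks look roughly like this in the PostgreSQL 12 header (currently required, unlike the bitmap ones):

```c
/* Select the next block to sample, as directed by the TABLESAMPLE method. */
bool        (*scan_sample_next_block) (TableScanDesc scan,
                                       struct SampleScanState *scanstate);

/* Return the next sampled tuple from that block in 'slot'. */
bool        (*scan_sample_next_tuple) (TableScanDesc scan,
                                       struct SampleScanState *scanstate,
                                       TupleTableSlot *slot);
```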
I should have gotten some water. So what did we have to do to even be able to implement table access methods? One thing was that we removed, or rather chose to remove, the support for declaring tables WITH OIDS, because otherwise every access method
would have to reinvent the magic way that heap stores OIDs. Since nobody really wants to use WITH OIDS anymore, and it has been deprecated for, I think, 17 years, it seemed better to just remove the support instead of duplicating all the weirdness
over and over again. That was a lot of rote work; it's not that bad, but it will affect people that want to upgrade from 11, or any pre-12 version, to 12 if they have tables WITH OIDS, because neither pg_upgrade nor pg_dump will work for those anymore. You have to remove the WITH OIDS property
before you upgrade. You can still manually create a table that has a column of type oid named oid, and that will work, but all the magic around that column, like having a column that you don't see when you select from the table unless you explicitly name it, all of that is gone.
Probably the largest change, and probably the one that will break the most extensions, is that the tuple table slot API had to change to make it general, so it can hold tuples of different types. Before, it could only hold three types of tuples: it could hold heap tuples, it could hold minimal tuples,
or it could hold virtual tuples, which are basically tuples that only exist as individual columns internally, not as a row batched together. We made that abstract, so you can add new types of tuple slots at runtime,
and the table AM API can return which kind of slot is supposed to be used for a table of that AM, so the core code has no dependency on which slot type is to be used. That will break extensions that use tuple slots internally, and unfortunately there are a number of them,
because they either have to ask the AM they're interacting with which type of tuple slot to create, or they have to specify it statically if it's, for example, just a virtual tuple table slot, which is probably the most common case.
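A small sketch of what that looks like for extension code on PostgreSQL 12 and later; make_slot_for_relation is a hypothetical helper name:

```c
#include "postgres.h"
#include "access/tableam.h"
#include "executor/tuptable.h"

/*
 * Ask the relation's access method which slot type to use, instead of
 * assuming heap tuples as pre-12 extension code often did.
 */
static TupleTableSlot *
make_slot_for_relation(Relation rel)
{
    /* table_slot_create() consults the AM's slot_callbacks internally. */
    return table_slot_create(rel, NULL);
}

/*
 * Code that only ever deals with per-column datums, never with a row tied
 * to a particular table, can instead create a virtual slot statically:
 *   slot = MakeTupleTableSlot(tupdesc, &TTSOpsVirtual);
 */
```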
And we had to make a lot of code actually use tuple table slots, because a lot of code just stored heap tuples in various places, which doesn't work anymore if you want to support other AMs: otherwise they would have to constantly convert from their own tuple format to a heap tuple, hand it to the other subsystem, convert back, and that conversion might not even preserve all the metadata necessary to keep
working with the tuple. So we had to change a lot of subsystems to work with tuple table slots, which was pretty painful, because some of that is very old code, especially the EvalPlanQual code; there are not very many people that understand it, and it's barely tested.
That was the least fun part of the whole project. Then, as I previously mentioned, we had to fix a lot of the layering violations. We had to make sure that bitmap scans and sample scans didn't poke directly into the tuples, and similarly, ANALYZE previously also poked directly into the tuples,
so we had to change those APIs so they don't do that anymore. And as I said before, we had to route ANALYZE, CLUSTER and all of that through the table AM. There are also a few boring changes: the pageinspect extension obviously cannot work for inspecting tables whose page format it doesn't know about,
so all of that had to grow error checks. Okay, now I'm going to go from where we are to what we did wrong, and then to the limitations that we might want to fix in the near future. Any questions so far?
[Audience question.] Yes, that's probably something I should have added
as one of the problems, but I didn't. At the moment, obviously, that doesn't play a role for inserts, because for an insert you always need new index entries. But for updates, at the moment, the update callback just returns whether new index entries are needed.
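That callback looks roughly like this in PostgreSQL 12, with the final parameter being how the AM reports whether new index entries are needed:

```c
/*
 * Update the tuple identified by 'otid' with the contents of 'slot'.
 * On success the AM sets *update_indexes to tell the caller whether new
 * index entries have to be created for the new tuple version.
 */
TM_Result   (*tuple_update) (Relation rel,
                             ItemPointer otid,
                             TupleTableSlot *slot,
                             CommandId cid,
                             Snapshot snapshot,
                             Snapshot crosscheck,
                             bool wait,
                             TM_FailureData *tmfd,
                             LockTupleMode *lockmode,
                             bool *update_indexes);
```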
Some people want to push all the index manipulation into the AM layer. I'm very hesitant about that, because there's actually a lot of weird complexity around it that we also need to fix, and if you have five copies of that code it will get harder; but I'm pretty sure that we're going to have to evolve that interface over time.
It might be that we just delegate the index integration to a callback that is then called from inside the AM, without duplicating all the code that does the necessary things like forming the index tuples and so on. But I don't know yet, and nobody has actually written an AM where this is a problem, so far.
Yeah. There are other limitations around types that I'm going to speak about in a few slides. [Audience question.] Yes, your callback can just decide
not to do anything. Well, you have to have the callback, but it doesn't have to do anything. I think we are probably going to need a separate callback that says: do I want to be vacuumed at all or not?
It already partially works that way: if you don't set a relfrozenxid or a relminmxid in the callback that I showed a few slides back, then it never does anti-wraparound vacuums, because those are only necessary if you have those kinds of horizons. I think that's something we're definitely
going to have to improve, but the problem is that autovacuum is such an odd area of the code, itself worthy of improvement, that I wasn't quite sure how to structure this properly without making it even worse and exposing more complexity there, because it has all these prioritization behaviors
that don't quite fully work, and integrating something half-baked in there for some AM that we don't even know exists yet was just not something that felt sensible to do at this point. [Audience question about TOAST.] Yes and no.
The creation of a TOAST table is controlled by a callback that we added just a week or so ago. There's a callback that says something like relation-wants-toast-table, and if it returns true, then we create the proper catalog entries outside of the AM and the TOAST table exists;
but the responsibility for chunking up any large datums and sending them to the TOAST table is specific to each AM, and they have to do that internally. Detoasting can basically already go through callbacks if necessary, so it would go through that path.
Yeah, somewhat; I think that's the best description of the problem. At the moment all TOAST tables are automatically created with the access method of the main table,
which might or might not be the right thing. Robert Haas has recently posted a few patches that rearchitect the whole TOAST API, because at the moment you would have to copy a lot of the toasting code: there are a bunch of weird heap-specific assumptions in there, like assuming that your tuple has a certain per-tuple overhead
and then calculating how many tuples fit on a page and so on. All of that doesn't really make sense if you're not heap. So you would have to copy some of that code and make some minor adjustments. Robert refactored that to be more generic, but it's not in 12.
[Audience question.] We don't want to duplicate the planner code.
[Audience question.] I'll come to that, I have a slide on it; we don't know everything yet. Wait a few slides, and if I haven't answered it by then, ask me again. [Audience question about performance.] I tried to measure that. Some of the changes added a bit of overhead,
some of them removed a bit of overhead. I haven't measured anything that is meaningful; initially there were definitely slowdowns, but in the latest round I haven't measured anything that's above the changes you get from code layout anyway. So for me it was a wash. There's one case where we are sometimes a lot faster
and sometimes a tiny bit slower: if you COPY into a table, you can use the multi-insert path, and if you have exactly as many rows as the batch size you never reuse the batch state, and the creation of the state for a batch
is a bit more expensive; but we also made other parts of that cheaper. That is, I think, the only place where I could measure a very small slowdown. We could probably improve that, but at the moment it didn't seem worth doing much more. Okay, I'll come to that in a few slides.
So, what are the things that are bad and cannot really be excused by saying we're just going to do it later? One thing where we just ran out of time and/or energy is that there's a lot of confusion
about function names, and not only since this work: there are a lot of functions whose names start with heap_ that don't actually have anything to do with heap. Even when you create an index, we call heap_create_with_catalog, and that kind of weirdness. So there are a lot of functions we should rename,
and a lot of files we should rename, that aren't actually related to heap but have heap in their names. It's just a very historical mess that's basically been there forever, and it would have been good to clean that up now, because now it's really weird that heap_create_with_catalog goes into the table AM's create callback and then potentially goes back into the heapam code
to do the work. That layering is not exactly obvious. I think it's mostly going to mean that we have to rename all those functions, so that instead of heap_create_with_catalog it's something like 'create catalog entry for relation', and there's a lot of that kind of boring work. There are still a few unnecessary conversions to and from heap tuples,
and I'm not sure whether that should just have been on the list of future things, because it's defensible. The problem is that the trigger interface actually expects heap tuples to be passed in. We could change that to accept a slot; the slot is passed in as part of the parameters now,
but at the moment all the PLs expect there to be a heap tuple, and all the external C triggers also expect there to be a heap tuple, and I didn't want to break all of them. It might be necessary to somehow indicate whether a trigger is using the old
or the new trigger function call interface. I don't quite know what the best way to do that is, or perhaps we are just going to have to say we'll break it in v13. I don't know how many people still write C triggers themselves; if you just write triggers in PL/pgSQL, we could fix that once inside the core code
and you would never notice, except that things would get a bit faster. I don't know what the best answer there is. Another of the problems is that index-only scans and bitmap scans currently have visibility map accesses directly in them, which really isn't great, but we didn't get around to abstracting that
or figuring out how the callbacks for it should look. It turns out you can just return false: you can just not create a visibility map and the behavior will be correct, it will just be a bit slower. So I don't like this, but we would have to do more work,
and I think if we had tried to do it in 12, we would probably have ended up with an interface that sucks in different ways. One thing that has been a long-standing issue, but is now really weird, is that pg_relation_size doesn't actually go through any of the abstractions we already had. It just goes to the file system, does a stat on each individual segment
and sums them up itself, which was pretty questionable before this work, but it's even more broken now that we don't actually know that a table will have a relation fork that looks exactly like the current one. You could easily write an AM that doesn't use segments but has one big file, and that would work; but if it did chunking
into smaller chunks, it wouldn't, because pg_relation_size checks in one-gigabyte chunks to decide whether to look at the next segment. We have the ability in the table AM to actually get the table size, we just have to change these functions. I asked whether anybody knows why we have written them in such a weird way, and nobody has replied.
If anybody knows, please do. Yeah, I think those are the ugliest warts, and now we come to the limitations of the API. I think one big problem is that at the moment WAL logging in Postgres is not extensible. There is a generic WAL record format that you can use, and that will work as long as you use table storage
that goes through the buffer manager and so on, but it is actually pretty voluminous, and you cannot do all the things that you might want to do during WAL replay. For example, you couldn't build an undo-based system on it, at least not as far as I can tell, because you can't recreate the undo during recovery.
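For reference, the generic WAL facility mentioned here is used roughly like this by an AM that goes through shared buffers (a sketch; the function name is hypothetical and buffer locking is omitted):

```c
#include "postgres.h"
#include "access/generic_xlog.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Log a change to one buffer-managed page using generic WAL.  Generic
 * records ship page deltas (or full page images), which is why they are
 * comparatively voluminous.
 */
static void
myam_log_page_change(Relation rel, Buffer buf)
{
    GenericXLogState *state = GenericXLogStart(rel);

    /* Returns a working copy of the page; modify this, not the buffer. */
    Page        page = GenericXLogRegisterBuffer(state, buf, 0);

    /* ... apply the AM-specific page modification to 'page' here ... */
    (void) page;

    /* Computes the delta, emits the WAL record, applies the change. */
    GenericXLogFinish(state);
}
```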
So that is definitely not enough. At the moment, if you want to do write-ahead logging, you either have to somehow make do with the generic xlog facility, or you have to modify Postgres itself to get a different type of WAL record registered. We have discussed making WAL extensible I don't know how many times,
but it turns out to be pretty hard, because you need a registry that you can access before the cluster is fully started: during WAL replay we don't have a consistent catalog, so it cannot be a catalog table, and it has to work across replicas; you cannot have something that works differently on the primary than on a standby. I was wondering whether the right answer is to just have a registry in core,
where an extension asks, hey, please add a new resource manager for me, we create that, and the extension has to fill in the callbacks at startup time, so that it becomes a well-defined RMGR ID. I don't know; it's not nice, but I can't really come up
with something better. I think the next biggest limitation, and it's a pretty large one, is that the format of tuple identifiers is pretty restrictive. This is the definition we've had historically: we have a block ID, which is basically a reference to a block in the table,
and then we have an offset number inside that page. You can store nearly arbitrary stuff in those, so you can have an AM that doesn't have a concept of blocks or of positions within blocks. You have to avoid a few values, because it turns out that, at the moment, code
relies on ip_posid, for example, not being zero, and ip_blkid cannot be all bits set, because that is the invalid block number, and code outside the AM checks for those.
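The identifier in question is the six-byte ItemPointerData from src/include/storage/itemptr.h (comments added here):

```c
/* The tuple identifier every table AM currently has to fit into:
 * a 4-byte block number plus a 2-byte offset, six bytes in total. */
typedef struct ItemPointerData
{
    BlockIdData     ip_blkid;   /* block number, stored as two uint16s */
    OffsetNumber    ip_posid;   /* 1-based position within the block */
} ItemPointerData;
```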
But I think the bigger limit is that it's actually only six bytes, and that is already a problem for Postgres itself, because it means we currently cannot have tables larger than two to the power of 32 times the block size; that's exactly because of this, you could have bigger tables except for it. It also means you cannot really design, for example, an index-organized table on top of this,
because usually the key in an index-organized table will be much wider than four or six bytes, so you cannot really do that unless you restrict the index-organized table to six-byte integer keys, which isn't very nice. The problem there is not really the table access method side of things;
that could be changed relatively easily. The problem is that indexes currently store a hard-coded six bytes for each index entry to point into the table, and we cannot just break binary compatibility by saying: oh, the index entries look different now. So we're probably going to have to make
the index access methods support variable-width tuple identifiers, and whether an index uses variable-width identifiers probably cannot change after the creation of the index, which I think is not a problem, but it requires a fair amount of work to do without regressing performance.
It turns out that we also want that feature for other reasons, namely global indexes, that is, indexes that span multiple partitions. There you also need a partition identifier in the index entry, because otherwise it's very slow to ever drop data, and you don't even know which table an index entry points to, so you would have to scan all partitions or something. So global indexes really need that.
Indirect indexes, that is, indexes that point to other indexes, also need it, and those are kind of nice to reduce the write amplification of updates. So we probably want that at some point anyway, but yeah, it's a fair bit of work.
I think that is the hardest limitation. There are a lot of planner and executor restrictions at the moment, and in particular for columnar stores we have a number of limitations. I think the biggest one is that you would normally do the splitting into columns or column groups inside the AM,
but right now we don't properly pass the information about which columns we are actually going to need down into the scans. That is something we very clearly have to improve; it would not only be beneficial for columnar stores, where it's very clearly needed, but also for heap, because it turns out
that at the moment we often deform columns that we never need, and we could improve performance there quite a bit. I think the other largest issue is that we don't have proper costing support. The planner asks for an estimate
of how many pages a certain type of scan is going to access, but there's no way to say, for example, that with a columnar store the WHERE clause can be evaluated much more efficiently, and things like that. I think we're going to have to improve there. There's also a lot more you can do
to make columnar execution much, much faster by doing more work later in the executor, and there's no support for any of that yet. You can do a lot of it by creating planner hooks and rearranging the plan tree, or the executor tree, by using custom scan nodes;
then you can do basically everything you want, because you can just have additional callbacks in your AM and use them. But that is obviously not the nicest way to do things; we are going to have to work on it. And I hope that Heikki and the others who work on that columnar store will provide a fair bit of feedback around it.
We currently rely on relation forks: Postgres's heap has for a long time relied on there being exactly four possible underlying logical files for each relation, and you can only have those, not more
and not really fewer. We assume those exist in some form, and that's not good, because an AM might want to associate many different types of files with a table. You can kind of make that work by naming the files inside your AM with, say,
an increasing number appended to a file name that's prefixed with the relation number, but that's pretty crummy. I think we're going to have to invent a system where you can register additional files, because otherwise things like pg_rewind and so on
will not work properly with them. Yeah, I think I'm out of time, and the rest of the limitations aren't that interesting. Any further questions?