PostgreSQL partitioning. Work In Progress.
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 637 | |
Author | ||
Contributors | ||
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/53279 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
FOSDEM 2021268 / 637
1
2
7
8
10
11
12
17
29
33
35
38
40
44
48
50
54
59
63
65
85
87
91
95
97
105
108
114
115
119
120
122
126
127
129
130
133
137
140
142
143
147
149
151
156
159
160
161
168
169
170
175
176
177
178
179
182
183
184
187
189
191
193
197
198
204
206
209
212
220
222
224
227
230
233
235
238
242
243
245
247
252
253
255
258
260
261
262
263
264
265
272
273
278
281
282
285
286
287
288
289
294
295
296
302
304
305
308
310
316
320
323
324
328
330
332
335
338
342
343
347
348
349
350
351
360
361
365
368
370
372
374
377
378
380
381
382
383
386
390
392
395
398
402
405
407
408
409
414
419
420
422
425
427
430
439
451
452
453
458
460
461
464
468
470
471
472
473
475
478
485
486
487
491
492
493
495
496
498
509
510
511
512
516
532
534
538
543
548
550
551
554
556
557
559
563
568
570
572
574
575
577
583
585
588
591
593
595
597
601
602
603
604
605
606
607
610
611
617
627
633
634
00:00
Software developerPerspective (visual)Incidence algebraScaling (geometry)Presentation of a groupDatabasePartition (number theory)Open sourceCore dumpState of matterOcean currentProjective planeError messageChord (peer-to-peer)Computer animation
00:40
Partition (number theory)Hacker (term)InformationDeclarative programmingData managementComputer configurationVolumeDisintegrationSoftware maintenanceUsabilityReplication (computing)Table (information)WaveQuery languageStrategy gameBound stateRange (statistics)Electronic mailing listHash functionRegulärer Ausdruck <Textverarbeitung>Data storage devicePrice indexSubsetDefault (computer science)Series (mathematics)Library catalogPhysical systemAttribute grammarTupleType theoryUniqueness quantificationConstraint (mathematics)UsabilityQuery languageMereologyDeclarative programmingField (computer science)Key (cryptography)Computer configurationChemical equationSet (mathematics)Partition (number theory)Web pageINTEGRALReplication (computing)Social classHierarchySpecial functionsLevel (video gaming)1 (number)Network topologyElectronic mailing listInformationCASE <Informatik>Theory of relativityHacker (term)Type theoryCodeTerm (mathematics)Revision controlData managementMathematical optimizationSinc functionOperator (mathematics)Limit (category theory)ConcentricVideoconferencingAreaoutputBlogPresentation of a groupBound stateTable (information)Default (computer science)Library catalogPhysical systemStrategy gameSubsetExpressionSubject indexingConstraint (mathematics)Analytic continuationInheritance (object-oriented programming)Exclusive orFlow separationImplementationDifferent (Kate Ryan album)Core dumpQuicksortPerspective (visual)MetreData storage deviceShape (magazine)Group actionMultiplicationGraph coloringExistenceComputer architectureAxiom of choiceRadical (chemistry)Position operatorUniformer RaumError messageSource codeMedical imagingCondition numberProjective planeRegular graphSanitary sewerComputer animation
07:40
Latent heatStrategy gameHash functionPartition (number theory)Default (computer science)Electronic mailing listMaxima and minimaRange (statistics)Inheritance (object-oriented programming)Table (information)Price indexUsabilityBound stateEvent horizonMathematicsSubject indexingSerial portRevision controlExtension (kinesiology)Declarative programmingData managementMechanism designPlanningRun time (program lifecycle phase)Function (mathematics)Extension (kinesiology)Function (mathematics)Projective planeInformationRevision controlPartition (number theory)Serial portHash functionRange (statistics)Table (information)Constraint (mathematics)Declarative programmingUsabilitySubject indexingSoftware maintenanceBound stateEvent horizonArithmetic progressionElectronic mailing listDefault (computer science)Sinc functionMultiplication signNumbering schemeSet (mathematics)Software bugSoftware developerParallel portMultiplicationOperator (mathematics)Heegaard splittingOcean currentSlide ruleOrder (biology)Concurrency (computer science)Limit (category theory)Physical systemWorkloadKey (cryptography)Complete metric spaceStrategy gameInsertion lossStatement (computer science)System callExterior algebra1 (number)Position operatorTouch typingImplementationMathematicsLatent heatFlock (web browser)Maxwell's equationsCore dumpRotationTelecommunicationOnline helpData managementGraph coloringGroup actionAddress spaceDifferent (Kate Ryan album)WordTask (computing)Computer animation
14:41
Data managementPartition (number theory)Range (statistics)Event horizonRun time (program lifecycle phase)Mechanism designPlanningFunction (mathematics)Asynchronous Transfer ModeStapeldateiConcurrency (computer science)Inheritance (object-oriented programming)Table (information)Software maintenanceExtension (kinesiology)Point cloudPrice indexElectronic mailing listHash functionBound stateStatisticsVacuumDrop (liquid)Object (grammar)Limit (category theory)Link (knot theory)WaveWeb pageWikiDeclarative programmingLengthStapeldateiFunction (mathematics)Ocean currentExtension (kinesiology)Arithmetic progressionMathematical optimizationSoftware developerSoftware maintenanceOperator (mathematics)Limit (category theory)Slide ruleTable (information)Patch (Unix)Partition (number theory)Group actionBitInheritance (object-oriented programming)outputConcurrency (computer science)Physical systemElectronic mailing listLink (knot theory)Projective planeElectric generatorEvent horizonObject (grammar)Computer architecturePlanningCore dumpRun time (program lifecycle phase)FeedbackWebsiteUsabilitySubject indexingDatabaseMathematicsSerial portVacuumStatisticsCoroutineMultiplication signData managementRange (statistics)Particle systemMoment (mathematics)Shared memorySystem callHeegaard splittingPoint cloudProcess (computing)QuicksortDataflowFlow separationLocal ringFiber bundleRevision controlMechanism designBuildingCycle (graph theory)TesselationRational numberComputer animation
21:42
Element (mathematics)Partition (number theory)Discrete groupFlow separationExclusive orBitMessage passingTransportation theory (mathematics)Expert systemBit rateMereologyPredicate (grammar)Operator (mathematics)NumberOcean current2 (number)Key (cryptography)Streaming mediaDeadlockShooting methodCausalityPerspective (visual)Grand Unified TheoryOracleRight angleRevision controlLatent heatMechanism designTable (information)InformationConstraint (mathematics)Hacker (term)Point (geometry)Event horizonServer (computing)Instance (computer science)Bound stateNormal (geometry)Concurrency (computer science)Insertion lossRegular graphComputer animationMeeting/Interview
28:54
Element (mathematics)Computer animation
Transcript: English(auto-generated)
00:07
Hello, my name is Edna Steselu-Bendikova and today I will talk about partitioning. In this talk I want to observe the current state of PostgresScale partitioning development and its perspectives. So the presentation is intended for developers and advanced users of the PostgresScale.
00:26
I am a core database developer working for Postgres Professional for about six years now. I actively contribute to PostgresScale open source project and also develop our company solutions.
00:41
There are several reasons why I am doing this presentation. I have recently started working on partitioning features and I needed to explore this area in detail. For me the best way to learn something is to tell others about it. So here I am. As I started I noticed that even though we have a lot of articles about partitioning
01:02
for PostgresScale users, we actually lack resources for hackers. So I have made an effort to collect them in this presentation and a complementary article in my blog. I am also planning to write to read me about the partitioning architecture. Today I'll concentrate on the features that we already have and our to-do list.
01:24
And to do so I'll compare PostgresScale native partitioning with existing third-party solutions. I hope that it will be a valuable input for our roadmap discussion. And finally I'll provide an overview of the features in this area that are currently
01:41
being developed. And hopefully it will encourage some people to review them. Important disclaimer about this presentation, this video was pre-recorded, so the information is relevant as of January 15th.
02:00
Well, declarative partitioning in PostgresScale is a set of features for handling large data volumes. It includes dedicated syntax for partitioning, convenient data management options, and internal performance optimizations. So we have not just one feature that enables partitioning, but rather we deal with a set
02:20
of multiple smaller features. For convenience of discussion, I divided them into three groups, usability features, the ones that are visible to the user and needed for partition maintenance, performance features such as partition pruning or partition-wise joins and aggregates, and features ensuring
02:41
the integration with other subsystems of PostgresScale. All of them are equally important to make partitioning a valuable tool. It's actually hard to fit the information on all of them into such short conference presentations, so today I'll concentrate mostly on usability features.
03:01
On this slide, you can see a quick recap of the history of declarative partitioning in the past PostgresScale releases. Limited version of this feature was introduced in PostgresScale 10. The new syntax was added, but many operations still had to be done manually. In PostgresScale 11, many usability features were implemented.
03:24
The next release significantly improved the performance of queries over partitioned data, and since version 12 declarative partitioning has become really good for many use cases. And finally, the latest release of PostgresScale 13 has improved performance even further,
03:42
and it improved integration with other PostgresScale features such as triggers or logical replication. If you want to learn more about it, check the great talk by Amit Langata. Here are some terms and definitions that we use when we talk about partitions.
04:01
I hope that you are already familiar with them, but here is just a brief reminder to ensure that we are on the same page. Well, first of all, we have a partition table, which is divided into partitions. Sometimes it's also called parent and child tables. From the PostgresScale code perspective, parent table or partition table is a special type
04:24
of relation that doesn't have any physical storage, so all data that is inserted into it will be routed to the respective child partitions. In case of sub-partitioning, we can have a tree of partitioned tables, and only the ones at the lowest level of this hierarchy will actually contain the data.
04:45
They are called leaf tables. A partition table is defined by its partitioning strategy, which can be hash, list or range, and partition key, which is a set of columns or expressions by which we divide the table.
05:03
A subset of data in a partition is defined by its partition balance. And finally, there is an option to create a default partition that stores the rows, which do not belong to any other partition of the table. The information about partition tables is stored in the system catalog.
05:23
Partition strategy and key are stored in special system catalog table, PGA partition table. And the information about partition bounds is stored in PG class system catalog, in fields rel is partition and rel part bound.
05:42
The bounds are stored as a dump of an internal structure, and they are not intended to be readable. To get human-friendly partition bound definition, one can use special functions that generate the table's DDL. On this slide, I listed some important facts about partition management and table routing,
06:02
which are for now scattered across different chapters in documentation. Some of them are just design ideas or facts, and other are limitations of a current implementation. Well, firstly, the set of partition columns and their names and types must exactly match
06:23
the parental ones. The existing tables can be attached to the target table or detached from it as partition, but currently, it is not possible to turn a regular table into a partitioned one, or vice versa.
06:42
Overlapping partition bounds are not allowed. One value always belong only to one partition, and default values can be specified for each partition separately.
07:00
Constraints such as check and not-null constraints are always inherited by all partitions of partitioned table, but currently, there is no way to enforce cross-partition restrictions, which means that unique constraints must all include all partition key columns, because
07:21
every partition has its own unique index to enforce this constraint, and also, there is no way to create a global exclusion constraint, which spans over the whole partition table. Here is the continuation of the previous slides, but with the facts that are specific
07:43
to certain partition strategies. Hash partitioning currently doesn't allow setting the default partition. It is a design choice, because we can always define a complete set of hash partitions. List partition currently doesn't allow a multi-colon partition key.
08:03
I think it's more of a limitation. And range partition bounds may take special values, min value and max value, to define open ranges, and currently, only default partition can take null keys.
08:25
This is information about locks used for partitioning. Actually, locks required for partition maintenance not only impact performance, but also they impact usability, because operations that hold exclusive locks for a long time often
08:42
can be performed in systems with heavy workloads. And usually, those are exactly the systems that need partitioning. So, most of the items listed here are the limitations of the current implementation,
09:00
and improvements for some of them have been proposed on the ongoing commit tests. I'll list them later. Well, dropping and detaching partitions now require an exclusive lock on a parent table. This is too strict a limitation, which can be simplified.
09:26
Next is the fact that concurrent index builds on partition tables are currently not supported, but there is a work in progress to fix this. And finally, attach partition currently requires an exclusive lock on the default partition,
09:41
because it must be scanned to verify that it doesn't have any information that actually belongs to the new partition. Well, I'll list the features that we already have and some design facts, and here
10:01
is the list of usability features that we don't have yet. Some of them are relatively simple to implement, or even perform them manually, others require a lot of work, so the order on the slide is pretty random. Firstly, it would be great to be able to create partitions automatically.
10:23
Now, if user wants, for example, to create a table with a hundred of hash partitions, they have to write hundreds of create statements. Of course, it can be scripted, but still it's a DBS duty. Next thing needed for partition maintenance is a callback or some trigger,
10:43
which can be defined for partition creation event. It is handy for retention of old data or partition renaming. Current trigger-based partitioning is basically built on top of this idea of triggers, but current event triggers are not detailed enough for this task.
11:05
You can define a trigger on table creation event, but it lacks specific information, which is needed to work with partition table. Next features are about changing partition bounds.
11:23
It can be done now by detaching a partition and attaching it back with new bounds, but it's kind of not so convenient. Split and merge of partitions is basically changing bounds, but for tables that already have some data in them.
11:41
And one more frequently requested feature is partitioning of existing table. It can be done now, but it takes a lot of manual operations. And finally, users want to have global constraints or indexes which span across multiple partitions.
12:02
Some of these usability features are already implemented in third-party solutions that exist in the PostgresQL ecosystem. They were first introduced around version 9.5 and they were developed in parallel with native partitioning. They contain different sets of features, but most importantly, they implement functions that simplify partitioning administrations.
12:23
I think that we can use their experience to design new features for PostgresQL native declarative partitioning. Today I will talk about two extensions, PgPathman and PgPathman. PgPathman is an extension developed and maintained by Postgres Professional Company.
12:45
This project is currently maintained, but all active development has stopped in favor of improving native built-in partitioning. PgPathman is developed and maintained by Crunchy Data Company.
13:00
I am less familiar with this project, so I can only speak about its status judging by its documentation and GitHub activity. And this project seemed to be maintained. The latest bug fix release was at the end of December, yet I found no information regarding support of PostgresQL 13.
13:23
Both of those extensions provide a set of functions that aim to operate with partitions. PgPathman implements its own partitioning scheme to work with hash and range partitions. PgPathman is optimized for time or serial ID partitioning.
13:42
It is built on top of trigger-based partitioning for older versions and on top of native declarative partitioning for versions since PostgresQL 10. PgPathman provides a lot of functions for partition management, such as attaching, detaching partitions.
14:03
You can add or drop partitions, you can split them or merge, or you can replace one partition with another, which is something like attach and detach together. You can automatically create partitions
14:21
and automatically create partitions for range partition table when your values are inserted. One of my favorite features is user-defined callbacks for partition creation event, where you can use some...
14:40
One user can create a function, which complies with an API, and then he can use it as a callback for retention or renaming of partitions that were automatically created.
15:00
Also, there are some non-blocking concurrent operations on table, such as partitioning or split, or some others of them can be done concurrently. And finally, PgPathman provides optimized planning mechanism,
15:21
such as runtime pruning, partition-wise joins and aggregates. Now they are... Now most of those features are available in core of PostgresQL, so they are not in such need.
15:40
PgPathman is a bit simpler, I would say, but still it has a lot of functions that are useful for DBAs, for managing of partitions. This extension concentrates on managing time or serial ID data. It also allows automated creation of partitions.
16:02
It allows partitioning of existing table, and implements batch mode for better concurrency. With this extension you can move data from child tables to parent table and back. And also it has background worker
16:22
to run partition maintenance operations, such as autocreation or retention, or some other routines. There are certain pros and cons in developing an extension. As I already said, we duplicated PgPathman in favor of native partitioning,
16:43
and I think that after this slide you will understand why. Let's start with upsides. First of all, the site project is easier at the start, as it needs less compromises. It's really easy to release MVP quickly and answer the user's needs.
17:04
Also, the site project doesn't have to align with upstream release schedule, and thus we can release more often and react to user feedback much faster. And also quality requirements for third-party tools
17:21
are not so strict, which also makes development much faster. But there are also certain difficulties. Firstly, not everything can be done from an extension. Some optimizations require core patches or even complex architecture redesign. Also, extensions are not so easy to maintain.
17:43
With every new release, extensions, third-party tools, need to keep up with upstream changes. Sometimes they can be really inconvenient API changes. And final drawback of having an extension is that third-party extensions are not available in clouds,
18:01
which can be very important for people who host their database systems there. And finally, here is an overview of partitioning-related patches that are currently in progress in the OpenComet Fest. It is about the middle of release cycle,
18:22
so we can still see some of them in release PostgreSQL 14. So, it isn't easy to discuss them in details in such a short talk, so please check the links and join the review.
18:40
If you are not a developer, but you want to see these features implemented, you can contribute as well by sharing an opinion from your experience. It's very valuable input. First is automatic hashing list partition creation. Actually, bound generation is pretty simple. Range partitions need a bit more work,
19:03
but it's also doable. But first of all, we need to agree on a new syntax, which needs to be extendable for future improvements. Next feature is cluster on partition tables, which will allow to perform cluster automatically
19:21
from the topmost parent to any partitions that have a suitable index. And one more feature is auto vacuum on partition tables, which will fix auto vacuum analyze, so that the statistics of parent tables were updated correctly.
19:43
And here is another group of patches that improve visibility. I already mentioned them on slide about locking. All three of them are intended to impose fewer locking requirements on partition maintenance operations. The first one is about detaching partitions concurrently.
20:04
Next two patches are intended to improve locking for index maintenance operations. Here I listed some obstacles we are currently facing on our way of partitioning development. For many usability features, we need to agree on a new syntax.
20:23
It is really important stuff that impacts everything, and it impacts future development. So again, I ask you to join the discussion. The next thing is about partition creation trigger.
20:45
I already mentioned it. Current event triggers are not detailed enough to handle this, so we can maybe improve this subsystem, or we can create some callback, which will allow user to define their own function.
21:05
And final small obstacle is object naming length limit of 63 characters, which makes partition name generation quite difficult.
21:21
Finally, here is the list of some useful links. Please follow them and learn more about partitions and join the discussions and review. And that's it. Thank you for your attention. I'm ready to answer any questions that you have.
22:03
Okay, we should be going live now in a few seconds. I hope. I still don't see in the live stream, but I guess we are about live now.
22:21
So this was a very, very nice talk. I definitely learned a lot about partitioning. Thank you very much for that. Thank you. I can see that this talk has already received quite some discussion in the chats, and there are some very, very interesting questions. I will start with the top one.
22:44
I'm really missing two things when migrating from Oracle, reference partitioning and partitioning by a key that is not part of primary key. Any chance for those to be added? Well, let's start with reference partitioning.
23:01
I think it's doable, definitely, and probably it's not that difficult when you just play a little bit with partition bounds and how they are implemented. Like, I speak from hackers' perspective, of course. But I haven't seen any active works on this feature.
23:22
Speaking of partitioning by a key that is not part of primary key, I think it's way more difficult because here we need to have some global primary key or global constraints at first, and there were some beginnings.
23:41
I think there were actually several takes on this feature, but I don't see anything really active for now, active, like, open at Canvas Fest and actively discussed, so I think we'll maybe need to start this discussion again.
24:05
Yeah, but this is also doable, of course, but probably not for the version 14. Right. I understand that it's not actively worked upon, as far as you know, right?
24:21
Yeah. Right. That sounds reasonable. So the next question is, what's a current way to avoid deadlocking on partition operations? And I will also add a small personal question. What is the actual danger for deadlocking on partition operations?
24:42
Well, I a bit struggle to understand the question. Is it deadlocking of, like, regular operations, like regular deadlock of updates, but over partition table, or it has something to do with DDL on partition tables?
25:03
So maybe I can simplify. Right. That's why I asked a little bit that question because it seems to come, as far as I can see from the chat, from someone claiming that was able to perform,
25:21
to add in a deadlock for some reason, but I cannot really understand how, so maybe we can move to... I think that the way to deal with deadlocks, like with regular DML operations, such as inserts and updates, is just the same way as with non-partition tables.
25:45
You just need to somehow serialize your operations, maybe based on the key or some, I don't know, other information. So I don't think that it really differs.
26:00
But speaking about deadlocks on detail, like when you, for example, attach and detach partition, I think it's more complicated, so I'd love to see more details, and maybe we can think about it and introduce something like concurrent operations, or just improve our current lock-in.
26:25
Right. So yeah, if I understand correctly, DML is not more specific for partition tables versus normal tables, right? Yeah, I think so. Right. Great. So there was a follow-up to the first question,
26:42
which is what exactly is reference partitioning, which, let's go through a little bit. Let's vote. Sorry about that. Can you explain reference partitioning a bit faster? Well, I'm not an Oracle expert, naturally, so I think that idea of reference partitioning
27:01
is that we partition our table based on the other table's partitioning schema. So we have table users, which is partitioned by some user ID, and then we have table, let's say, messages, which references this user ID,
27:21
and we want to partition our message table using this reference. Yeah, I think that is the idea. Right. Thank you very much for that. That probably helped people to follow a bit more. So the next question I see, it's asking if everybody agrees that horizontal partitioning
27:44
is the same as sharding, or implies sharding, that sub-tables are on separate nodes, et cetera. Yeah, I saw some discussion in the chat, and I would agree with many people who answered this question,
28:02
that sharding actually implies that we have shards, and usually people mean that those shards are separate instances or separate event servers. So I think that partitioning is a part of sharding mechanism,
28:21
so we need to partition our data to split it between shards, but it's not the same. Right, that's actually quite clear. And from what I can see in the discussion, this is more or less most of the questions that we have had up to this point,
28:47
but I'm really certain that partitioning is a very, very big topic, so there will be some more...