
When PostgreSQL Can't, You Can


Formal Metadata

Title: When PostgreSQL Can't, You Can
Number of Parts: 31
Author: Keith Fiske
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Place: Ottawa, Canada

Content Metadata

Abstract
After using PostgreSQL for a while, you realize that there are missing features that would make it significantly easier to use in large production environments. Thankfully, it's extremely easy to build add-ons that enable some of those features right now, without knowing a single line of C. Over the past few years I've been developing tools that have made PostgreSQL administration much easier for our clients. Table partitioning is one of the best methods for improving query performance on large tables in PostgreSQL. While the documentation explains how to set it up, it's still a very manual process, and for time- or serial-based partitioning, creating new partitions is a huge part of the ongoing maintenance. PG Partition Manager is an extension that aims to provide easy setup and maintenance for the most common partitioning types. It also provides advanced features for taking advantage of constraint exclusion and for retention, dropping unneeded tables. Another area where we ran into limitations was when one of our clients needed an easy way to keep the database schema checked into version control. The pg_extractor tool grew from this and has become a popular method for finely tuning the extraction of database objects from PostgreSQL into individual files. Autonomous transactions are another popular feature of other RDBMSs that is missing from PostgreSQL. Being able to reliably log a function's run status requires that a function failure not roll back the entire transaction and erase the entries made to the log table. PG Job Monitor takes advantage of the dblink contrib module to give the behavior of an autonomous transaction, providing logging and monitoring for mission-critical functions. PostgreSQL's built-in replication is all or nothing, but sometimes you just need to replicate specific pieces of your database. There are several third-party systems capable of accomplishing this, but their setup can be daunting to new users and overkill if you just need to grab a few tables. Mimeo is a logical replication tool that requires minimal permissions and setup and provides several specialized means of replicating from one or many databases to a single destination. This talk will discuss these projects and hopefully show how easy it is to contribute to the PostgreSQL community.
Transcript: English (auto-generated)
My name's Keith Fiske, and I'm talking about some projects I've been working on with Postgres lately. I'm a DBA for OmniTI. We do full-stack support for high-traffic websites and applications, basically providing support from the hardware level all the way up to application development,
mostly for companies that have a very high volume of traffic and very large amounts of data. Some examples of companies we've worked with that you may have heard of are Gilt, Etsy, OraTV, Freelotto, and a few others. Every year we also put on a conference called Surge. The prevailing theme is disaster porn: how things go horribly wrong
and how you recover and fix things from that in very large environments. It's annually in September; you can check OmniTI's website for the exact date. And of course, like many other companies here, we are also hiring. I'm here, Robert Treat, our CEO, is here,
he did the previous talk about the 9.4 features, and Denish Patel is another person here, so if you're looking for a job, you can speak to any of us. This is the first project I worked on, called pg_extractor. pg_dump is pretty limited in what you can actually filter when you're trying to get objects out of the database;
those are pretty much the only filters you have there, table and schema, and the -P option is for functions but only works on pg_restore, it does not work on pg_dump. And it only dumps out the access control lists, privileges, and comments for tables,
so it's pretty limiting if you just want to get a function out, or you're just trying to get a specific piece of something out; it's actually rather challenging. I originally wrote this in Perl, and recently did a version 2.0 where I rewrote it in Python, actually Python 3.
You can run the 3to2 script and it works perfectly fine in Python 2. The major thing is, you can see all the things up there that it can filter by, and it actually puts each object into its own individual file; it's not one giant SQL file, it organizes them into folders and files.
You can also do regular expression matching, so if you need to dump out partition tables and you know the naming pattern of your tables, you can give it a pattern like that. You can also put a list of tables in a file and give it that file as a filter,
if you only want to pull out those specific pieces. I've added some other features since; you can extract the default privileges and things like that now. I figure I can just go do a demo of this; it shows it a lot easier than me talking about it.
You just want to give it a database name. You don't have to give it a database name, but if you don't, it dumps everything out into the folder you specify, or the default folder you're running it from, at the top level. I usually always give it a database name, so it will put it into a database folder.
And then you just tell it to get everything, and whether you want to keep the dump file it makes. You can see it made the keith folder; that's the database.
So there's aggregates, functions, roles, schema, tables, views, all in their own folders.
Go into tables, and there's every table in its own file. It all outputs in plain text format; if you want binary format you can give it the -Fc option. It has most of the same options as pg_dump. Then you can go to roles,
and there are all of the roles it has. I actually have an option to filter passwords out of the role file, so if you're checking this into version control,
it doesn't include the password hashes. [In answer to a question:] The only thing pg_extractor uses underneath is pg_dump and pg_restore.
The original version of this program would actually log into the database and rebuild things from the system catalogs. That was for 8.3 and 8.4; then 9.0 came out, the system catalogs changed, and it broke the whole program. So, to avoid having to rewrite this and to keep it future compatible, it only uses pg_dump
and a combination of pg_dump and the pg_restore -l object list; it parses through that to get everything out. If you have overloaded functions, it will put all of the versions of that function into the same file, and it puts the access control list for the functions in that file as well, so it tries to keep everything together.
Tons of filtering options. I gave it a get-all, but you can do each one individually: get tables, schemas, types, roles; there's get default privileges; and you can get sequences out.
By default it doesn't output data. If you want data, you can give it the --getdata option and it'll get the data as well, and it'll also give you the sequence values. You can get rules. I tried to give things options similar to what they are in pg_dump,
like -n is schema, capital N is exclude schema, same with table. And you can give it a file name with each object on its own line, same with functions. You can filter by owner, which a lot of people have found useful;
if you only want to dump objects owned by a specific user, you can do that here. You can run things in parallel, which helps it go a little bit faster. There's the remove-passwords option for the role file.
It's also a full Python class, so you can import it and use the public functions to get the structured list of the pg_dump object list out,
and use it however else you may need to, so I tried to make it useful in other situations as well. Any questions about this tool? OmniTI has its own labs git repository, and I have my own git repository;
I have links to all of these at the end. Any other questions? All of my other tools that I'll be talking about from here on out are extensions.
The extension system was introduced in 9.1. Basically, an extension is a way of logically grouping things together so they're easier to control. All of the contrib modules that ship with Postgres are actually extensions now as of 9.1 (they're all version 1.0 as far as I can tell, and haven't been updated),
but you can now have version-controlled groups of objects in the database. That allows more controlled updates of your code, and if you have the same code installed in multiple databases, like these tools probably would be,
you can make sure they're all running the same versions of the same code. It allows predictable updates, and if the author of the extension provides it, you can downgrade in a predictable way as well. So it gives you control of the code in your database, mostly functions. You can put tables in extensions, but the extension authors didn't really intend tables for anything other than configuration data,
because when you do a pg_dump, you don't actually dump out any of the extension code; all it writes is a CREATE EXTENSION statement in your dump file. If you need the data, you can set a flag on your extension table to have it included in pg_dump,
but then that also includes the table's data in your schema dumps, so if you have very, very large tables as part of your extension, you'll be dumping out all of that data as part of your schema. I'm hoping they make the extension system a little more flexible, but so far it's worked great. You can write extensions in any language, just like Postgres functions: C, Perl, Python, whatever you need.
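To make that concrete, here is a minimal sketch of the commands involved; the extension name is just an example, and the last call is what an extension author would put in the install script to flag a configuration table:

    -- Install an extension and move between the versions its author provides
    CREATE EXTENSION pg_jobmon;
    ALTER EXTENSION pg_jobmon UPDATE;                 -- upgrade to the latest available version
    -- ALTER EXTENSION pg_jobmon UPDATE TO '1.0.1';   -- or pin a specific version

    -- Inside an extension's install script: mark a configuration table so that its
    -- data is included by pg_dump, even though extension objects normally are not
    SELECT pg_catalog.pg_extension_config_dump('my_config_table', '');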
This is the first extension I worked on. Autonomous transactions are a feature that's missing from Postgres. What that basically means is that you can't do multiple BEGINs and COMMITs inside a single function,
so you can't loop over things and keep committing as you go, and everything that runs in a function rolls back if that function fails. So if you try to have monitoring on a function, where you log the steps of the function to see where it fails, and that function fails, you roll back everything you logged,
and you'd have no idea where or why it failed. So this extension was written. It was actually code we had written back in 8.2 and 8.3, but when 9.1 came out I organized it a little better so we could keep the code in sync across our different clients. It allows you to log the individual steps of a function by using a dblink trick.
dblink is another extension; all dblink does is let you connect to another Postgres database and run whatever you want. So you use dblink to connect back to the same database, and you basically make separate sessions with their own transactions that can be committed independently. It does add a little bit of overhead to your functions,
but usually for the functions we're running the overhead is negligible, and the monitoring of the function is much more important than the overhead it adds. It also lets you integrate monitoring tools, so that when something fails it will alert you.
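The underlying trick looks roughly like this. This is just a sketch of the idea, not pg_jobmon's actual code; the log table and connection string are made up for illustration:

    -- Requires the dblink extension. The INSERT runs in a separate session that
    -- connects back to the same database, so it commits immediately and survives
    -- a rollback of the calling transaction.
    CREATE OR REPLACE FUNCTION log_step_autonomously(p_msg text) RETURNS void AS $$
    BEGIN
        PERFORM dblink_exec(
            'dbname=' || current_database(),
            format('INSERT INTO my_log (logged_at, msg) VALUES (now(), %L)', p_msg)
        );
    END;
    $$ LANGUAGE plpgsql;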
This is basically how it works: you add a job by calling the add job function and giving it a name, you add a step to the job describing what that step will do, and then you update the step with what it did and how it went. This makes it easy to keep track of really long-running functions:
if you want to see how they're progressing, you can just call updates in a loop and then go query the jobmon tables, and whatever you're writing to that job it will keep track of, how many rows it's done, whatever it's doing. If you want another step, you just add another step and it keeps going.
It also accounts for failure: you can put a fail-job call in the exception handler, and then whatever the error message was ends up in your log. This is basically what the output looks like. There are two tables, a job log table and a job detail table.
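Put together, a monitored function looks something like the sketch below. The calls match what the talk describes (add job, add step, update step, close and fail), but the schema name and exact signatures are assumptions; check pg_jobmon's documentation for your version:

    CREATE OR REPLACE FUNCTION nightly_rollup() RETURNS void AS $$
    DECLARE
        v_job_id  bigint;
        v_step_id bigint;
    BEGIN
        v_job_id  := jobmon.add_job('nightly rollup');                       -- start a job
        v_step_id := jobmon.add_step(v_job_id, 'Aggregate yesterday''s data');

        -- ... the real work happens here ...

        PERFORM jobmon.update_step(v_step_id, 'OK', 'Finished aggregation');
        PERFORM jobmon.close_job(v_job_id);
    EXCEPTION WHEN OTHERS THEN
        -- Because logging goes through dblink, these entries survive the rollback
        PERFORM jobmon.update_step(v_step_id, 'CRITICAL', 'ERROR: ' || SQLERRM);
        PERFORM jobmon.fail_job(v_job_id);
        RAISE;
    END;
    $$ LANGUAGE plpgsql;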
I made these functions just as convenience functions, but you can query the job log tables directly. These here are individual jobs, and then if you want to get into the individual steps of a job, that was job ID 9 here, so these are the individual steps of that job
and what it did. I use this one a lot because I can never remember exactly what the job names are, so I made a convenience function for it. By default it gives you back the most recent 10; if you need more than that, you just put a number in there and it will give you the most recent 20 or 50 jobs,
whatever you need. It also provides monitoring: this one function pretty much does all of that. It returns a record, so you can query it however you need to.
We've used this in Nagios, and OmniTI also has a monitoring service called Circonus that we use. I have posts on my blog about how to hook it up, but if you've used Nagios before, these should look familiar; they're what basic Nagios replies usually look like from a lot of the tools people have already made.
So this is what it looks like when everything's working okay, and this is when stuff's gone bad. Say you have a function that must run every day:
there are some configuration options in here where you give it the job name and tell it this function must run every 24 hours. If it hasn't run in the last 24 hours, when you run check job status it will come back and say that process is missing
and when it last ran. Nagios knows what CRITICAL means when it gets back a response like that, so you'd make the command and service checks in Nagios, and Nagios would tell you this function didn't run.
By default, if a function ever fails three times in a row, that sets off a critical alarm. So if something runs every five minutes, it may occasionally fail once and then recover, fix itself; this will only alert you if it's failed three times in a row.
You can change that however you want. It also has escalation: you can see this one is actually just a warning, so if something goes into warning three times, it escalates it to critical and lets you know that it failed.
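For reference, the monitoring side is just queries against pg_jobmon's functions, roughly like the sketch below. The names come from the talk (check job status, the convenience listing functions), but treat the exact names and arguments as assumptions to verify against the docs:

    SELECT * FROM jobmon.check_job_status();          -- overall OK / WARNING / CRITICAL summary for Nagios etc.
    SELECT * FROM jobmon.show_job('nightly rollup');  -- most recent runs of one job (default limit 10)
    SELECT * FROM jobmon.show_detail(9);              -- the individual steps of job id 9
    -- A configuration table in the extension is where you declare that a given job
    -- must run every N hours and how many failures should trigger a critical alert.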
Any other questions about this? This is the other extension I've been working on. It's a logical, per-table replication extension. I was looking for a name, so I went to thesaurus.com looking for synonyms of copy,
and that definition came up. The thing that caught my mind about the mimeograph was low cost and small quantity. This is a very, very simplified logical replication solution compared to things like Bucardo and Slony, which are, I will admit, much more powerful than this replication tool is. If you're going to be replicating a lot of tables
and need them to be more easily managed, those would probably be better. If you're looking for something to just replicate a few tables, or need to do some of the specialized replication this can do, this may work better for you. The big thing is, let me start here:
built in to Postgres, as most of you know, is streaming and log-shipping replication, but that's pretty much all or nothing; you can't pick and choose. Though if you go to the next talk about how they're getting logical replication built in starting with 9.4, hopefully that may make this extension obsolete, but you never know.
There are three basic types of replication; I'll go over each of them individually. The biggest thing is that this requires no superuser, other than for installation. You just grant the permissions to the users that are running it and it will just work. And it's installed and run from the destination database, so this is a pull thing;
it's pull replication, not push. You set this up on your destination, where you want the data to go, and it pulls the data from however many sources you define. You can define as many sources as you want, all pulling to the same database. It has some other features: you can tell it to only replicate certain columns, and you can give it a WHERE condition
to only replicate certain rows as well. It also uses the pg_jobmon extension to provide monitoring and logging of what it does. Snapshot is the easiest way: it copies the whole table every time. So if this is a very small table,
or a table that doesn't change very often, you can see the last option there: if the source data hasn't changed, it just does nothing. But it does the copy in a specialized way to minimize locking. It's actually a view on top of two tables, and when replication runs, it populates the other table and swaps the view with a brief lock,
so while replication is running, it doesn't have your whole table locked; it does all that in the background and then swaps it for a brief second. If the majority of the table changes all of the time, this is much, much faster than replaying the DML on the replication system;
you just re-pull the whole table: select star, grab it, truncate, and re-populate the table. For detecting changes, I actually looked at the system statistics, in pg_stat_all_tables or something like that, I can't remember the exact view right now; there are columns in there that keep an incremental count of the number of inserts,
updates, and deletes done on every table, so I just store that number and watch for it to change. And since it's redoing the whole table, I made it so it will replicate column changes too: if you add new columns, drop columns, or change types, it'll catch all that stuff.
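As a rough sketch, setting up and running a snapshot copy from the destination side looks something like this. The function names (snapshot_maker, refresh_snap) are mimeo's, but the argument lists here are abbreviated and the source-connection id is assumed to already exist in mimeo's dblink mapping table:

    -- Run on the destination database, after registering the source connection
    -- in mimeo's dblink mapping table (see the mimeo docs for that step).
    SELECT mimeo.snapshot_maker('public.small_lookup', 1);   -- source table, dblink mapping id
    -- Then refresh on whatever schedule you like (cron, a scheduler, etc.):
    SELECT mimeo.refresh_snap('public.small_lookup');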
One of our clients had an issue with this, though, because they were using it on one of their development databases, and a view is not the same thing as a real table; it has no sequences, none of that kind of stuff. So I made this plain table replication option. It locks the whole table every time you do replication,
but it lets you have the sequences and foreign keys and all that kind of stuff working, and it will manage resetting the sequences for you, so it's good for a development database, if you just want to copy data from production to development.
How big a table this is practical for is kind of up to you; could be a gigabyte, could be 100K. This is the other specialized type of replication, which I call incremental. It's based on there being a column on the source
that changes with every single insert or update, and that's a key part of it: it has to change to a new value on every single insert or update. This is ideal for something like a web hits tracking table that's constantly getting hammered with new data you want to replicate over.
Most logical replication relies on triggers that populate a queue table, which means your high-traffic tables have to write to two tables on every write when you want to replicate them. This lets you avoid that; both snapshot and this only require read access on the source database, nothing else,
so it's a very non-invasive form of replication. You do run into issues: with serial columns you don't really hit too many, but with timestamps you run into the daylight saving time problem. The easiest fix is to use GMT or UTC as your database time,
which eliminates the problem entirely, but not everybody can do that, and the client we developed this for runs their database in Eastern time. So the easiest way to avoid it is to just not run replication during the DST changeover period, and that's the only way to really account for it.
Different places have different DST times, so when it doesn't run is configurable, but that's how I solve that. [In answer to a question:] Yes, that can be a problem and data can be missed; that is one of the other issues with this type of replication.
By default, for time it's 10 minutes, so your destination will essentially always be 10 minutes behind. You set that boundary variable to a point where you know the source is done doing what it's doing; if it takes 15 minutes for all your inserts to finish,
you set the boundary to 15 minutes and it accounts for that. The same thing applies to serial: usually a serial column is a constantly incrementing number so it's not a big deal, but if you're inserting the same value over a long transaction you can have the same issue, and you can set the same kind of boundary. Boundaries are actually a very big problem. You'd think this would be an easy replication type; it is not.
There are a lot of edge cases, boundaries being one of the big ones. When replication runs, you grab a value, but if rows with that value are still being inserted on the source and you then use that boundary as the start of your next batch,
you'll miss them completely, so there are some configuration options to let you handle that. I also added some functions recently, for this type and the next one, that will monitor whether the source columns change. It can't replicate column changes automatically, but at least there's some monitoring to let you know the source changed so you can go fix it.
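A sketch of the incremental setup, again run from the destination. The maker/refresh function names are mimeo's, but the exact argument names, order, and boundary syntax vary between versions, so treat this as illustrative:

    -- Incremental ("inserter") replication keyed on a time column, with a boundary
    -- lag so rows still being inserted near the cutoff aren't missed.
    SELECT mimeo.inserter_maker('public.hits', 'hit_time', 1);   -- source table, control column, dblink id
    SELECT mimeo.refresh_inserter('public.hits');                -- run on a schedule; stays roughly one boundary behind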
This is the type of logical replication that almost everybody does. I call it DML: it basically just replays inserts, updates, and deletes using a trigger and a queue table to keep track of everything. Like I said, it doesn't actually replay everything:
it replays all inserts and deletes, but if you have, say, 100,000 updates in that time period, it will only replay the last state, because it's just going by the primary key values. This option does require a primary or unique key, so if you don't have one, the incremental or snapshot replication methods can work for you.
I did handle it so that the trigger and the queue table it puts on the source automatically get the right grants they need, and you can have multiple destinations. That does mean putting multiple triggers and multiple queue tables on your source,
so for every destination you're writing to that many extra tables on your source. I have a hard limit of 100, but if you're writing to that many tables, I think you're already running into other issues. I also made another option for data warehousing uses:
you don't want to audit every update, but you do want to know when a row is deleted and keep that row around. This one lets you keep the deleted rows on your destination; it just adds an extra column on the destination recording when the row was deleted and keeps it around.
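A sketch of the two trigger-based variants; the function names (dml_maker, logdel_maker and their refresh counterparts) are mimeo's, with the argument lists abbreviated:

    -- Standard DML replication: installs a trigger and queue table on the source,
    -- then replays changes by primary key on the destination.
    SELECT mimeo.dml_maker('public.orders', 1);        -- source table, dblink mapping id
    SELECT mimeo.refresh_dml('public.orders');

    -- The data-warehouse variant: same idea, but deleted rows are kept on the
    -- destination with an extra column recording when they were deleted.
    SELECT mimeo.logdel_maker('public.order_audit', 1);
    SELECT mimeo.refresh_logdel('public.order_audit');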
Some uses for this where we do it: you have an audit table in your source database, and you want to keep the entire audit log on another offsite data warehouse, because you don't want to keep the entire audit log on your production database.
That works really well with the incremental replication, so you can do partitioning on your source table, drop all the old data there, and keep all the audit log tables on your offsite database. We've also used this to do upgrades across major versions.
We did an upgrade from 8.1 to 9.2 with this. Basically, you take the largest tables, which is usually not the majority of the tables, usually maybe the top 20 or 50. This does require you to set each individual table up for replication,
but setting up 20 or 30 tables isn't that big a deal. You set those tables up to replicate from the old database to the new one, and that gets the majority of the data across to your new database, and then you just do a dump of the smaller tables, which significantly reduces the downtime for the upgrade. What could have taken five or six hours is now down to maybe five minutes,
because you've already got the majority of the data replicating over. We also had somebody else use this who had a sharded Postgres system that used UUIDs, sharded
across 512 databases, but they needed to do reporting on it, so they set up 512 replication jobs to pull from all 512 shards into a single database and then did reporting on that. Use cases I would never have thought of, that somebody else did with it.
[In answer to a question:] The snapshot one can handle getting new column changes over; the other replication methods cannot, so it won't do that automatically.
I have some monitoring functions to watch for when these things change. With DML, that's actually tricky: you have to do the column change in a specific order. First you add the new column to the destination, then you add it to the trigger, then you add it to your actual source table, because if you do it out of order
you start getting errors on the replication. It is a tricky issue; I have in the documentation how to handle column changes with DML. It can't be handled automatically except in the snapshot case.
Any other questions about this extension? This next one has been my biggest project; it's the one I've gotten the most feedback on. Postgres has no automated partitioning built into it. It has very extensive documentation
on how to set up partitioning using table inheritance in the core docs, and that's pretty much what I based this extension on. If you look at its source code, you'll see it looks exactly like the way the core developers intend partitioning to work.
If you want to write partitioning yourself, you have to handle keeping all the tables up to date, keeping the trigger up to date, keeping the functions up to date. If you're just doing basic alphabetical partitioning, that's kind of a once-and-done setup and you're done. But for time and serial ID partitioning, you need ongoing maintenance.
That's what this one is designed to do. It only handles time- and serial-based partitioning. I started out including some basic pre-made time intervals: yearly, quarterly, monthly, hourly. Recently I added custom time:
you can set any time interval you want. It adds a little bit more overhead; I'll get into what static and dynamic triggers are a little later. But you can basically partition by three and a quarter hours if you want to. It allows you to do that.
The other thing with partitioning is that if you wait to create a partition until it's needed, like with serial, creating the new table right when you need it, you run into race conditions. This avoids the race conditions by pre-creating the partitions.
By default, it stays four ahead; you can set that to whatever you want. I also have it managing most of the properties of the child tables listed there: indexes, constraints, defaults, privileges, ownership. I actually had somebody who had OIDs set on their parent table,
and I had to do a patch recently to account for OIDs being inherited as well. You set all of that on the parent table and it takes care of putting it on the child tables. Just two days ago, I got foreign key inheritance working: if you have foreign keys on the partition set pointing to another table,
they will be inherited from the parent to the children. Foreign keys in the other direction, pointing to a partition set, do not work in Postgres. Actually, there is a way to do it. The reason it doesn't work is that when you have a foreign key referencing a partition set,
it only looks at the parent table; it doesn't look at any of the child tables in the inheritance tree. You can write a trigger to do that check for you, but if you have a really, really large partition set, every single insert to that table will take a very long time. This extension takes care of updating the triggers as they're needed.
Object name length is one of those things you don't usually think about when you're setting up partitioning, especially serial partitioning. That can be a big problem: you usually use an underscore-p plus a date or a number for the partition name,
and for serial, that number keeps getting bigger and bigger. Eventually you run into the 63-character limit. Usually your partition suffix is how you identify what data is in there, and Postgres just truncates it; it doesn't stop you from creating the table, it just cuts the name off. So if you had a very long table name
that was partitioned by day and it cut the name off right in the middle of the year, you've kind of lost the whole semantic naming of how your partitioning works. So this handles it by truncating the actual table name and then adding the suffix, to get around that. And also, there's something I added in a recent version:
constraint exclusion is one of the reasons people do partitioning. If you have ten years of data but you only want the most recent month, and you do a query with a WHERE condition on the partitioning column, it will only look at the tables for the most recent month;
it won't even consider the previous ones. But as soon as you put a WHERE condition on one of the other columns, without including the control column, you get sequential scans or index scans across the entire partition set, which is a downfall of partitioning. So what I did was use this premake number.
Like I said, the default is four. That controls which tables the trigger is handling for incoming data: with four, it handles four tables in the future and four in the past, and keeps moving that window along. So what I did was, if you give it other column names,
then for any tables older than those four, it will look at the current values in the table and add a constraint based on those values, so you can actually get constraint exclusion on the other columns in the partition set. It only works if the old data is static; if your old data is changing, obviously this won't work. But most of the time, where people need this,
the old data is static. Does that make sense? Anybody have any questions about that? Actually, somebody was just talking to me about that yesterday at the Royal Oak. [In answer to a question:] It does not support sub-partitioning automatically yet. It would be kind of hard to manage,
but you could manually manage sub-partitioning with this; it just doesn't do it automatically. I'm not sure if I can get that built in; I have it in my notes to see if I can. It's basically all functions.
First you create the parent table, so the parent table already exists, and then there's a function called create_parent, and you give it the parent table, the column, and whatever other options you need. You can do that with either a brand new table or an existing table. If you have an existing table
with data you need to partition, this will do the partitioning for you. There's a Python script to do it automatically. By default it does a commit after each partition is created, but you can have it work in smaller batches, like the example I'll show in a minute with the hits table.
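A sketch of the setup call and the data-migration step. create_parent and the partition_data function are pg_partman's, but the partitioning type names and parameter names here match the 1.x-era interface described in the talk and may differ in other versions:

    -- Turn an existing (or empty) parent table into a daily, statically triggered partition set
    SELECT partman.create_parent('public.hits', 'hit_time', 'time-static', 'daily');

    -- Move existing rows out of the parent into the children in small batches,
    -- committing as it goes (there is also a partition_data.py script that wraps this)
    SELECT partman.partition_data_time('public.hits', p_batch_count := 24, p_batch_interval := '1 hour');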
We actually had a hits table, a web hits table, that we partitioned live while traffic was still coming in. The only lock this ever requires is a brief two or three second lock on the parent, and that's actually long; this was a very, very busy table, which is the only reason it took two or three seconds to get the lock.
Usually it's pretty much instant. You can see we partitioned it by day, but there are options to the partitioning function to do the commits in smaller blocks so there's less contention. So we did it in hourly blocks with a five second wait, to not overwhelm the WAL stream.
It probably took about 10 hours, but there was no interruption to production and the partitioning was done. [In answer to a question:]
That's actually the opposite direction from what I was thinking of doing. I'm working on a consolidation script, so if you're partitioning hourly or daily but after a year you want to change to yearly partitioning, there will be a script to do that. I'll see if I can do the opposite direction as well.
I don't see why not; it might be a little more tricky, but actually, I'll get to that in a bit: you can undo partitioning with this. It just undoes it completely, but it does undo partitioning. This gets into the difference between static
and dynamic partitioning. One thing that drives me nuts is when people have functions that write functions or write SQL, and they just emit one long line that's unreadable. This is an actual trigger function that this writes, so I tried to make them readable. Basically, static means
you explicitly name each individual partition in each condition of the IF statement. That allows Postgres to cache the query plan, which makes it a lot faster. In this instance, and I also have it alternating, this is with the premake value set to two,
you can see this is the current table, this is one in the past, one in the future, two in the past, two in the future. It alternates like that to keep the most recent values near the top of the IF condition, so it doesn't have to go down through the entire list. This is the most efficient version of partitioning;
it's what people mostly do for their partition sets. The issue is that if you want to keep being able to insert into a lot of tables, hundreds and hundreds of them, then you have a list of IF conditions that's hundreds and hundreds of lines long, and that impacts the performance of the partitioning.
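For illustration, a static trigger function has roughly this shape. The table names and dates are made up, and a real premake of two would list more branches; the point is the explicit, plan-cacheable IF/ELSIF chain ordered around the current partition:

    CREATE OR REPLACE FUNCTION public.hits_part_trig_func() RETURNS trigger AS $$
    BEGIN
        IF NEW.hit_time >= '2014-05-23' AND NEW.hit_time < '2014-05-24' THEN
            INSERT INTO public.hits_p2014_05_23 VALUES (NEW.*);      -- current day
        ELSIF NEW.hit_time >= '2014-05-22' AND NEW.hit_time < '2014-05-23' THEN
            INSERT INTO public.hits_p2014_05_22 VALUES (NEW.*);      -- one in the past
        ELSIF NEW.hit_time >= '2014-05-24' AND NEW.hit_time < '2014-05-25' THEN
            INSERT INTO public.hits_p2014_05_24 VALUES (NEW.*);      -- one in the future
        ELSE
            RETURN NEW;    -- no matching child; let the row land in the parent
        END IF;
        RETURN NULL;       -- row was redirected to a child table
    END;
    $$ LANGUAGE plpgsql;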
If you run into that, that's usually when it's time to go dynamic. I can't explain that part in too much detail right now, but basically what it's doing is an EXECUTE statement where the partition name is a variable: at the time of insertion, it determines which partition the row goes into
and then inserts it there. The problem is that EXECUTE statements' plans aren't cacheable, so it has to re-evaluate the query plan every single time an insert is done. So depending on your data model, you need one or the other, which is why I included both of these options.
The custom time partitioning option actually uses the dynamic method plus a lookup table, so it's the least performant of the partitioning methods, but a lookup table is basically the only way I've been able to figure out
to do custom time partitioning. It's still pretty fast, but obviously if you start running into really high insert rates, you can hit those kinds of issues.
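The dynamic version, by contrast, computes the child table name per row and runs the insert with EXECUTE, so nothing can be plan-cached. A rough sketch, again with made-up table names:

    CREATE OR REPLACE FUNCTION public.hits_part_trig_func() RETURNS trigger AS $$
    DECLARE
        v_child text;
    BEGIN
        -- Work out which daily child table this row belongs in
        v_child := 'public.hits_p' || to_char(date_trunc('day', NEW.hit_time), 'YYYY_MM_DD');
        -- EXECUTE means the plan is rebuilt on every insert
        EXECUTE 'INSERT INTO ' || v_child || ' SELECT ($1).*' USING NEW;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;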
The other thing, for time-based partitioning, is that the creation of new partitions is based on running a cron job that calls a function this includes. By default, serial partitioning instead works by watching the current table: when it reaches 50% of its maximum constraint, it goes ahead and makes the next partition.
Some people have reported that for very, very high traffic tables, that causes a lot of contention, because one insert comes along and says it needs to make the new table, then the next insert comes along right away before the new table is even created and has to wait. So that method has its issues.
I did it that way because with ID-based partitioning it's usually not predictable how often you have to run the cron job to keep it up to date, but I've made it optional now: if you know how often you need to run it, you run it that often and it doesn't do the 50% creation anymore, so it avoids the contention.
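The maintenance call itself is just a function you schedule however you like; run_maintenance() is pg_partman's function, and it is what creates the upcoming partitions (and applies any retention you configure, described below):

    -- Run periodically from cron, pgAgent, or any scheduler
    SELECT partman.run_maintenance();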
Any questions about static versus dynamic partitioning? Along with automated creation, you can also have automated destruction. This is the other big reason people do partitioning: if you have very, very large tables and you have to delete 100 or 200 million rows,
that's a very expensive delete operation and it massively bloats your table, and the only way to clean that up is either a VACUUM FULL, which locks your table, or something like pg_repack. Partitioning lets you just drop the table, which is a very fast operation, and this automates that.
For time, you can set it to, say, anything older than three months, and it will automatically drop those tables. By default it just un-inherits them; it doesn't actually drop them, to try to be safe. If you actually want the tables dropped, you have to tell it so. You can also have it drop only the indexes if you want to keep the table but save the space,
if you just want the data to stay around. I also do that for serial. Serial is a little more complex; I've tried to explain it as best I could: you give it a boundary value, and anything below the current max minus that value, it will drop those older tables.
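Retention is configured per partition set in pg_partman's config table. The column names below are from memory of the 1.x series, so verify them against the docs for your version:

    -- Keep three months of data; actually drop old children rather than just un-inheriting them
    UPDATE partman.part_config
    SET    retention = '3 months',
           retention_keep_table = false
    WHERE  parent_table = 'public.hits';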
Somebody at another conference, and this is another reason I give these talks, because I get ideas for stuff like this, actually needed to dump the tables out: they don't even want the data in the database, they want it compressed and archived outside the database. The way he did it, he used the PL/sh language, which I started to try, but realized it was very limiting:
you'd essentially require a shell to be available on whatever platform you're on, which makes it pretty much useless on Windows. So what I did instead is have it move the table out of the schema, and then there's a script that will dump out the tables
from that schema, and it will only drop a table if the pg_dump command comes back successfully. So I try to keep it as safe as I can. And like I said, I do have a partitioning undo. It won't do what you're asking, automatically changing from yearly to monthly or something like that, but you can undo your current partitioning and redo the partitioning again.
The undo portion, since partitioning works with the table inheritance feature of Postgres, will actually work for any inheritance table,
not just partition sets that pg_partman manages. So if you have some other inheritance setup and you want to undo it, this tool will work with any of them. This next tool, and I don't think David Wheeler is in this talk, has saved my sanity while developing these extensions.
pgTAP is unit testing for Postgres. Basically you can tell it: I run this query, I want this result back. Does this table exist? Does this primary key exist? Does this trigger exist? All of those conditions you'd want to check after you run something, this lets you write unit tests for.
It has been the only way I've been able to release things even semi bug-free. Mimeo has about 300 pgTAP tests. The partition manager has close to 4,000, to cover all of the different partition types for time and all that kind of stuff. I'd never be able to make sure all of that works
without something like this. So if you use this, go find David Wheeler; he can talk to you about it. But I love this tool. I would say it is essential if you're doing any extension development at all. It may be useful for your own functions too, but if you start looking at extensions you want to release to the public, I would highly recommend using it.
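A tiny sketch of what a pgTAP test file looks like (the object names here are just examples; tests are typically run with pg_prove):

    SELECT plan(3);

    SELECT has_table('partman', 'part_config', 'config table should exist');
    SELECT has_function('create_parent');
    SELECT results_eq(
        'SELECT count(*)::int FROM partman.part_config',
        ARRAY[0],
        'no partition sets configured yet');

    SELECT * FROM finish();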
That's it. These are the links to all my tools. Like I said, I knew zero C; all of these extensions are written in PL/pgSQL, not C, so they're a little easier to dig into
to see how they work. I knew zero C before I started working on any of this, and I still don't know very much. My only attempt at writing a C function segfaulted the database, so I don't know what I'm doing in that sense. But I've still been able to write extremely useful tools for Postgres. And thanks to the community,
Planet Postgres and the announce lists and the conferences and the PUGs and all that kind of stuff, that has allowed these things to exist and become known so people can find out about them. Thank you.
Any other questions about anything? [In answer to a question:] No, Postgres doesn't have partitioning built in; you can't do that. There's no partitioning built in.
There's table inheritance, but that's not partitioning by itself.