Around the world with Postgres extensions
Formal Metadata
Title: Around the world with Postgres extensions
Number of Parts: 561
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/44146 (DOI)
Transcript: English (auto-generated)
00:05
So, our first speaker is Craig. He works for Citus and Microsoft, and then he's going to talk about Postgres extensions. Thank you. Good morning, everyone. Thanks for being here.
00:21
I know the first talk of the day is always the hardest one to wake up and get to. So, I'm going to spend some time, kind of a tour of extensions. I don't know if this resonates over here in Europe like it does in the US. Has anyone seen Carmen Sandiego or familiar with that? A few hands, okay. It was this computer game that, like, you would learn geography growing up in elementary school,
00:46
like there would be clues and mysteries and all that sort of thing. I'm actually really excited because now it's on Netflix, so my daughter, I can introduce it to her, similar to DuckTales. Is anyone excited that's back? Okay, just me. So, Postgres has been around for over 20 years now, and it keeps getting better and
01:04
better and better. But I'm of the strong opinion that I think some of the most exciting things for Postgres in the future are not in the core of Postgres. The core of Postgres moves as a really solid, stable foundation. It used to do one thing right, which was not lose your data. That was kind of the goal of Postgres.
01:23
In the last five years or so, it's become, I would say, a sexier database, if you can call a database sexy. We got things like rich geospatial support. We have JSONB. We have full text search. Now we have native partitioning. So it's less and less of just a stodgy relational database.
01:40
But extensions are a really interesting approach to extend Postgres without having to change the core code base, really taking it to be more of a data platform than just a relational database. Before I get into it too much, a little bit more about me. If you see me online, that's what I look like. I work at Citus Data. We turn Postgres into a horizontally distributed scale out database.
02:06
We were actually, I guess I used to work for Citus. We were acquired about two weeks ago by Microsoft. So happy to chat and answer any questions about that afterwards as well. Previously, I ran Heroku Postgres. So there we ran around 1.5 million Postgres databases for customers, so a pretty reasonable
02:23
scale. I also curate Postgres Weekly. So if you're not subscribed, I encourage you to if you care about Postgres at all. It's focused less on DBA content and more on an app dev rundown of here's what happened this week and here's interesting articles and content.
02:41
All right, so what are extensions? So per the Postgres manual, extensions are basically low-level hooks that allow you to change or extend the behavior of Postgres. They can be written in C or other languages. You can write them in pure SQL. You can write them in Python in some cases.
03:01
That sort of thing, but they allow you to basically create and change the behavior of Postgres. You can have new data types. You can have new index types, all sorts of things directly within your database without changing the core underlying code. Postgres itself, you've probably used some extensions without realizing it.
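For reference, a minimal sketch of the mechanics (pg_stat_statements is just the example here; note it also has to be added to shared_preload_libraries before it records anything):

    -- see which extensions this server has available to install
    SELECT name, default_version, comment
    FROM pg_available_extensions
    ORDER BY name;

    -- enable one in the current database
    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;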
03:23
Postgres ships with some native extensions already, known as contrib ones. Others have to be built and installed. If you're running on somewhere like Amazon RDS, they already have a whitelist of extensions, some that are contrib, some that are outside of contrib, that you can install. So a few of the examples of the contrib ones, and we'll dive deeper into a few of these,
03:44
not all. If you've ever used the UUID data type, you've probably used the extension for it. citext, if you're migrating from MySQL, which doesn't preserve case sensitivity, to Postgres, which thinks that's important, but you maybe want to preserve the value,
04:01
you can have a case-insensitive text data type. hstore was one of the first NoSQL data types within Postgres. It's a key value store directly in Postgres. I say one of the first because I think the first technically was XML, which came about 15 years ago.
04:22
Now, the non-contrib ones get really interesting, and I think we'll see less in contrib over time. For a while, that was the primary way you would find extensions. Now there's more and more in an ecosystem starting to develop on its own. These include things like PostGIS, which if you're doing anything geospatial, PostGIS is well-regarded as one of the most advanced geospatial databases that exist.
04:44
Things like Citus, which turned it into a sharded distributed database. HyperLogLog, which I think is just fun to say. And there's a list of probably north of 100 extensions. I'll get into it more where you can go and explore, look and find new ones.
05:02
So today, what we're going to do is take a tour of just a quick drive-by of a bunch of these. To give you a sampling of the various things you can do. I don't expect this to be exhaustive and learn everything you need to know, but hopefully you get a taste of, hey, this one caught my attention. Now I want to go and use it or learn more.
05:21
So the first one I would consider is the most critical extension for anyone running a database. pg_stat_statements has existed for a while and then got an update in version 9.2 where it became immensely useful. So what's it do? It records all the queries that were run against your database
05:40
and basically normalizes these or parameterizes these so that you can see things like how long it ran, how many times it ran, things like how many IO blocks were read, how many pages were dirtied, all these sort of things under the covers. So what's it actually look like? I don't know if you can see that perfectly.
06:01
Can we kill the lights anymore? Devroom? All right, we'll see if we can dim it a little bit more.
06:24
But basically, if you can squint and see that, oh, that's much better. Thanks. Oh, success. Well, you can't see me now. That's fine. All right, so you've got all these things, right?
06:42
Like the query of select star from users where email is like my email address at citusdata.com. You've got all these different shared blocks that were dirtied and read. A lot of this you actually don't need to know. This is what you're going to get if you query the pg_stat_statements view. What you can do from this, though, is write this really, really simple query
07:03
that will aggregate the total amount of time a query has been consuming resources from my database and how long it takes on average. So, I can get this really, really nice pie-in-the-sky view of like this query has run for a total of 295 seconds
07:21
against my database and on average takes 10 milliseconds. Similarly, I've got another one that's consumed 219 seconds. On average, it takes 80 milliseconds. Now, just by a rule of thumb of what I know on performance, I can probably get most queries down to about a millisecond or so. So, if I wanted to go and optimize things without having to look at any of my application code,
07:41
I can just hop in here and say, this one's slow, let me go add an index on this second query, and give a lot more performance back to my database.
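That simple query looks roughly like this (these are the pre-Postgres-13 column names, where timings live in a total_time column, in milliseconds):

    -- total time consumed and average per call, worst offenders first
    SELECT query,
           calls,
           round((total_time / 1000)::numeric, 1) AS total_seconds,
           round((total_time / calls)::numeric, 1) AS avg_ms
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;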
08:03
At Citus, we actually took this and extended it as well. So we have citus_stat_statements within Citus, which extends pg_stat_statements. If you're doing something that's very multi-tenant based, think like a B2B application, if you're a Salesforce-style CRM and one customer's data is their own, but you want to know which one of your customers is using the most resources from your database, what we do is preserve that tenant ID. We preserve who your shard key is,
08:21
that sort of thing. So, you can see who is noisiest, who's consuming the most resources, maybe who's underutilized, that sort of thing. All right, PostGIS. PostGIS has a wealth of information about it. It is probably the largest extension
08:42
that exists for Postgres, and it has its own complete ecosystem that runs parallel to the Postgres ecosystem. It is well regarded as the most advanced geospatial database. When you install PostGIS, what you get are all these new geospatial data types.
09:00
So, you have things like points and polygons, so you can start to draw maps. You can put those directly in the database and then find points that overlap within these polygons, or what is the shortest distance from one point to another based on this map that exists. Now, when you install PostGIS, you get a bunch of new built-in indexing and operators
09:23
you'll want to start to use. So, Postgres itself has a number of different index types. You've got the standard B-tree index, you've got GiST, GIN, SP-GiST, and I'm missing one, BRIN.
09:41
BRIN and SP-GiST, I know, are used for really, really large data sets that can naturally cluster together. GIN, the easiest way to describe it is when you have multiple values within the same column. So, if you think about text sentences or arrays, or JSON is an obvious one, right?
10:01
Like, you've got a large document in a single column. GiST is most commonly used on things like full-text search where you've got values that can overlap between rows, and then very, very heavily on the geospatial side. So, here you can have a GiST index, and you can say, hey, select the distance between these two points directly within the database, and Postgres is going to do all the heavy lifting for you.
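As a minimal sketch of that, with an invented table (the GiST index is what makes the spatial operators fast):

    CREATE EXTENSION postgis;

    CREATE TABLE places (
        id   serial PRIMARY KEY,
        name text,
        geom geometry(Point, 4326)  -- lon/lat points
    );

    CREATE INDEX places_geom_idx ON places USING GIST (geom);

    -- distance between two stored points, in meters
    SELECT ST_Distance(a.geom::geography, b.geom::geography) AS meters
    FROM places a, places b
    WHERE a.name = 'Brussels' AND b.name = 'Ghent';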
10:21
I'm just gonna fly by PostGIS because otherwise I would spend 45 minutes on it. There's a number of other extensions that you tend to use in coordination with it as well. pgRouting, which is useful for route finding, right?
10:41
Like, how do I get from point A to point B based on these roads and otherwise? There's a number of others around connecting to remote geospatial data sources. That's really common in the geospatial space is, hey, I've got this other data source that's over here, open maps and that sort of thing,
11:01
and you can connect directly from within Postgres to those externally, you don't have to pull all those in. All right, gonna shift a bit to HyperLogLog. This is one of my favorite extensions just because I think it's really fun to say, and I think I sound way more intelligent when I start to talk about it as well.
11:21
So, if you read the paper on HyperLogLog, it's a paper out of Google. It has all of these things in there, things like k-minimum values, bit pattern observables, stochastic and harmonic averaging. Do all these things make perfect sense to everyone in here?
11:42
Cool, I'm not alone. I read the paper and there's all sorts of math in it that makes a lot of sense for all sorts of reasons that I don't understand at all. To simplify it, it's basically probabilistic uniques with a small footprint, or I like to think of it as close-enough uniques.
12:01
So what happens is it's doing some sampling of data as it comes in, basically how many zeros are in the front of the value, sampling that down, combining it with other sets so they can do really interesting things. So, taking a look at it, we've got the extension HLL we're gonna create,
12:21
and then we've got this new data type. So we can see that I've got a column called set, of type hll. Now when I wanna insert into it, I'm not just gonna insert into it, I'm gonna use a function that says, hey, hash this value, and so we can hash anything pretty much. We can hash an integer, we can hash text, it's gonna take a hash of this and store this directly in this data type now.
12:43
And what I'm gonna do is create this table. So in this case, I'm taking all of the raw site visits I have every single day. I'm just gonna record an impression to my website, save that, and then I'm gonna roll this up into a uniques table at the end of the day. So I'm gonna do insert daily uniques and I'm gonna select all of my values
13:01
and I'm gonna hash them together. Now, when I query this table, it's gonna get, I'm gonna get this. So it's super intelligible, right? What I can do with this though, is I've got this daily uniques table and I can query and get a record back that doesn't make any sense.
13:20
But what I can do is basically say how many uniques are there and extract it from this. I can also start to combine this. So I've got like, I saw 100 unique people on Monday. I saw 100 unique people on Tuesday, right? But how many did I see on the combination of Monday and Tuesday? And HLL is really interesting in that it can combine those two. So I can see how many people did I see both days?
13:40
How many people did I see on just one of the days? Those sort of things. So a few best practices for it. It uses update. So you're not gonna insert directly into it. You're gonna take data from somewhere else, update into that data type. And you do wanna tweak the config. So I said it's close enough to unique. It's usually quite accurate, even right out of the box.
14:03
But you can tune a lot of things, like how big is the data structure itself? How accurate is it? How sparse is your data? So you can actually come in here and tune a lot of these config settings for it. So is it better than just having the raw data? With it, you can store an estimated count of around tens of billions in a little over 1,000 bytes.
14:25
So I'd say that's some pretty good compression with a few percent of error. And for most cases, if you're like an ad network, if you're in advertising, that can work really well versus storing all the raw data.
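Pulled together, the rollup pattern described here looks roughly like this (site_visits and its text user_id column are invented for illustration):

    CREATE EXTENSION hll;

    CREATE TABLE daily_uniques (
        day   date PRIMARY KEY,
        users hll
    );

    -- roll a day's raw visits up into a single hll value
    INSERT INTO daily_uniques
    SELECT visited_at::date, hll_add_agg(hll_hash_text(user_id))
    FROM site_visits
    GROUP BY 1;

    -- estimated uniques per day
    SELECT day, hll_cardinality(users) FROM daily_uniques;

    -- estimated uniques across Monday and Tuesday combined
    SELECT hll_cardinality(hll_union_agg(users))
    FROM daily_uniques
    WHERE day IN ('2019-02-04', '2019-02-05');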
14:40
All right, so TopN. TopN is another approximation extension as well. So HLL, like if you have data that's too big to count uniques on or it's too costly, is a great one. TopN, also often known as top-k, is the top set of people that have done X or Y. So if you want to see what are your top 10 pages
15:02
on your website by visit, it's a great one. So TopN, we're going to create the extension and then we're going to have a TopN column. But instead of storing the TopN as its own data type, we're actually going to store it as a JSONB.
15:20
Now when we insert into it, we're going to do this topn_add_agg. So we're going to insert in a similar fashion to how we did with HLL, but our data type is actually a pure JSONB data type, which is pretty nice. So instead of that, when we queried the HLL table before, and we got that unintelligible set of bytes,
15:41
here what we can see is a JSONB data type that's pretty understandable. So this is my top thousand users. This could be my top pages on my website. Now to query this, we've also got a very, very similar aggregate to what we had with HLL. So to parse this out, hey, if I actually want to know
16:02
what are my top pages for a set of days or top GitHub repositories, here I'm going to use the topn_union_agg aggregate and feed that into the topn function itself, which expects this data type. And I get a pretty intelligible, nice output here.
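Roughly, that flow looks like this (page_views and its columns are invented for illustration):

    CREATE EXTENSION topn;

    CREATE TABLE daily_top_pages (
        day       date PRIMARY KEY,
        top_pages jsonb
    );

    -- roll raw page views up into a TopN jsonb value
    INSERT INTO daily_top_pages
    SELECT visited_at::date, topn_add_agg(page)
    FROM page_views
    GROUP BY 1;

    -- top 10 pages across a week, back as readable rows
    SELECT (topn(topn_union_agg(top_pages), 10)).*
    FROM daily_top_pages
    WHERE day BETWEEN '2019-02-01' AND '2019-02-07';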
16:23
Cool. So shifting a bit, there's a lot of interesting ones like HyperLogLog and TopN that are approximations, that allow us to do some new kind of operation. There's also extensions that kind of change fundamentally what Postgres can do.
16:41
So Timescale is a company that actually builds and maintains a time series extension, called Timescale itself. So if you're looking at Timescale, a few kind of requirements generally: you want to have data records that always have a timestamp. Here, sensor data is probably the most common,
17:01
and you're looking at append-only data. So sensor data, again, a good fit, like it's append-only, I'm not going back, the sensor itself isn't getting updated, edited, saving values, it's just saying here I read this, send it off to the database. And this is really key, I think, for a lot of time series databases where they often get overused.
17:21
Recency is really, really important and key. If you're always recording all of the data and always querying on all of it, a time series database isn't going to help you as much because you're having to scan the data anyways. So you're really looking for no updates to your data and a recency bias when you're querying it.
17:43
So taking a look at Timescale, we're going to go ahead and create a table that has taxi rides. So it has the pickup, the from, the to, things like rate and all that sort of data. Now this is just a standard Postgres table. To start to put Timescale in place,
18:02
what we're going to do is take this table and run a function on it that's called create_hypertable. Under the covers, this is going to create all sorts of automatic partitions. So it's going to create like maybe a one minute partition for every set of new time series data that comes in. At this point, I don't really have to think about it or worry about it,
18:20
it's just doing it for me under the covers. And then when I start to query it, we're not going to do anything special really again. So here we have a pretty standard query that's finding the average fare amounts grouped by day. Under the covers, what this is going to do
18:42
is start to split it up and say, hey, five minutes of this data is stored here, five minutes of this data is stored here, and aggregate it together. So it actually knows how to go to those underlying partitions without you having to think about it. And you can go to basically really, really granular buckets as well.
19:01
You can query broader or more granular than the underlying partitions; it knows how to span partitions appropriately. The other nice thing is you can go in here and tune the config to start to roll off the old ones. This is generally pretty key for the time series databases where, hey, you have a lot of data, but if you're mostly using recent data, you want to get rid of the old stuff. So you can move it either to colder storage, do pre-aggregates and save them, and then delete the raw data. That's generally a pretty common workflow.
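In rough outline, with invented column names (the drop_chunks signature shown here is the 1.x one and has changed across releases):

    CREATE EXTENSION timescaledb;

    CREATE TABLE rides (
        pickup_time  timestamptz NOT NULL,
        pickup_zone  text,
        dropoff_zone text,
        fare         numeric
    );

    -- turn the plain table into a hypertable; chunks get managed for you
    SELECT create_hypertable('rides', 'pickup_time');

    -- plain SQL; time_bucket gives you whatever granularity you want
    SELECT time_bucket('1 day', pickup_time) AS day, avg(fare)
    FROM rides
    GROUP BY 1
    ORDER BY 1;

    -- roll old raw data off
    SELECT drop_chunks(interval '90 days', 'rides');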
19:22
Within the time series space, there's another one, pg_partman, as well. So Postgres itself got native time partitioning a couple of releases ago.
19:41
At the time, it had some rough edges. It's gotten better. I'd still say generally don't use the native Postgres partitioning without some extra utility. pg_partman is really kind of nice to smooth out the rough edges. I think by 12, knock on wood, it'll be where we don't need external extensions.
20:00
It's getting to the point where it's super solid and robust on its own. But for now, you generally want to rely on something like pg_partman or Timescale. So, Timescale kind of does its own thing in regards to partitioning; pg_partman builds on Postgres itself. So it takes all of those,
20:21
the native Postgres functionality, and gives you basically some helper utilities and maintenance commands so that you don't have to do some of the manual overhead yourself. So for pg_partman, we're going to create our table, and we're going to use native partitioning here and say partition by created_at. So a little different from Timescale
20:41
where we created the table first, then came in and created this hypertable. Here up front, we're going to specify that this table is going to be partitioned by something. Then I'm going to have pg_partman come in here and create my parent.
21:01
And I'm going to go ahead and update this config right away. So this is the thing I personally like to do. You don't have to; it's up to you on your partition settings. But basically, I'm going to say, hey, keep creating partitions forever. As long as there's new data coming in, don't just stop and run out. This is a common thing I think I see with a lot of people
21:20
that manually set up time partitioning. They say, hey, I'm going to create five years worth of tables. That's more than I will ever need. And five years later, you're not around, and no one who knows how that system works is around. And you're getting errors in the log that you can't insert into a partition that doesn't exist. So please set up automation
21:41
around most of your partitioning items. Now under the covers, what this is going to do is create all these events tables. So we're going to create events_2018 for nine o'clock, 9:05, 9:10, et cetera. So pg_partman has a pretty robust config.
22:01
You can come in here and tune a lot of things. Now, the config itself is just within a Postgres table. So things like versioning of it, you maybe want to be a little careful and that sort of thing. But there's a number of things you want to come in here and probably set, like how many partitions you want to pre-make. Say, when data comes in right now, I want to have four partitions created into the future,
22:21
so that when your data comes in, I'm not creating that on the fly. How often do you want your jobs to run? How long do you want to retain things? Do you want to just retain things for 30 days, for one day? And pg_partman will then go in and automatically take care of dropping these old partitions,
22:41
creating new ones, that sort of thing. So pg_partman basically builds on that native partitioning. I would say if you really want to use Postgres native partitioning, which you should, it's there, it's going to keep getting better. You should also consider pg_partman alongside it.
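A sketch of that setup, in the pg_partman 4.x style (the events table is invented; the part_config columns are the knobs just described):

    CREATE TABLE events (
        id         bigint,
        payload    jsonb,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE SCHEMA partman;
    CREATE EXTENSION pg_partman SCHEMA partman;

    SELECT partman.create_parent('public.events', 'created_at', 'native', 'daily');

    -- never stop making partitions, pre-make a few, keep 30 days
    UPDATE partman.part_config
    SET infinite_time_partitions = true,
        premake = 4,
        retention = '30 days'
    WHERE parent_table = 'public.events';

    -- creation and retention happen whenever maintenance runs, so schedule this
    SELECT partman.run_maintenance();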
23:02
If for some reason you don't want the native partitioning, take a look at Timescale. All right, so Citus. So I mentioned I work at Citus, but Citus is also an open source extension as well. So you can take it like all these others,
23:21
download it, run it, install it. What Citus does is turn Postgres into a horizontally scalable distributed database. So to your application, it still looks like a single node Postgres database. It looks like Postgres because it is. Under the covers, what we've done is spread that out across multiple physical nodes. So basically it turns it into a sharded database
23:40
without you having to worry about it. This generally is the case when you're outgrowing the limits of a single node. I've seen that happen often in the terabytes. I've seen it happen as early as 100 gigs. If you're not ever going to approach that level, you don't need sharding under the covers. So really quickly, for those that may not be familiar,
24:00
what is sharding? It's basically the process of splitting your database into a bunch of smaller bits. So here if I've got one database right there and I've got a bunch of smaller tables similar to how we had with partitioning, maybe that was 16 partitions.
24:20
Here what I'm going to do is actually take this and in this case I've got 32. I'm going to spread them out across two physical nodes. So a couple of important things to note here. I think a lot of people get a little mixed up when they first start on sharding and say, hey, I'm going to create two shards because I have two physical nodes. And what that means is that's great. You split it up across two nodes,
24:41
but if you want to go split that up across four nodes now, now how are you going to split up those tables that were on there? So it's really important to create a larger number of logical shards than you have of physical nodes up front. You don't have to use factors of two, but usually it works pretty nicely. So here we've got two physical nodes, but 32 actual shards under the covers.
25:05
What you're going to do up front usually is hash based on some ID. So a hash on your primary key, it can be an integer, it can be text, whatever. There are actually a lot of talks on the internet of, hey, what's the right appropriate hashing function?
25:20
It's not going to determine whether you succeed or fail with sharding. The Postgres internal ones work great. Postgres has its own internal hashing functions. You can just use those. And again, you're going to define a large number of shards up front. Two is generally bad. You also don't want to go overkill. Two million is probably on the extreme.
25:42
A lot of people in production I see with like 128, 256, and they'll be fine for the next 10, 15, 20 years probably. So generally you don't want to take and just route values. This is the other mistake I see is that people say hey, I want to set up sharding early on, I'm going to put my first 10 customers
26:01
on this node, my next 10 customers on this node, my next 10 customers on the other node. The problem with that is you're going to have some big hotspots. And so my first 10 customers are my oldest. There's the ones that have been in my platform the longest, and then there are ones with most data. So now I've got all those together
26:21
competing for resources, and my new node with my last 10 customers is a completely unoccupied box. So what you actually want to do is take a range of hashes right up front. So you're going to say okay, if I'm going to have 32 shards, look at the resulting hash range and split that up. So if I've got a hash of one, that's 46,000.
26:43
I've got a hash of two, that's 27,000. And if I divide up the entire space, like shard 13 out of 32 shards would have the hash values from 26,000 to 28,000. So I can see that user number two would go into bucket number 13 right away.
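As a toy illustration of dividing a hash range into buckets with one of Postgres's built-in hash functions (the numbers won't match the ones above, but the idea is the same):

    -- map ids onto 32 shards by splitting the 32-bit hash space into equal ranges
    SELECT user_id,
           width_bucket(hashtext(user_id::text)::bigint,
                        -2147483648, 2147483648, 32) AS shard
    FROM (VALUES (1), (2), (3)) AS t(user_id);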
27:01
This is going to create a nice distribution of my data without me having to worry about those hotspots. It's not perfect, but it covers 90% of cases pretty well. And then what I'm going to do is have something like an events table. So here's how it works. We've got something like GitHub events and the users.
27:23
We're going to shard both of these by the same column here, I believe, user ID. Yep. And so with Citus, we've gone ahead and already created those tables, just like we did with Timescale. Just a standard Postgres table, nothing special about it.
27:43
Then I'm going to have this other function called create_distributed_table, which under the covers is going to go create all those shards. So under the covers, I didn't have to go and create all those different ones, figure out how to route them. I just said create_distributed_table and now I've got github_events_001, 002, 003, et cetera.
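The whole flow, roughly, with an abbreviated schema:

    CREATE EXTENSION citus;

    SET citus.shard_count = 32;  -- pick the logical shard count up front

    CREATE TABLE github_events (
        user_id    bigint,
        event_type text,
        payload    jsonb,
        created_at timestamptz
    );

    -- creates and places all the shards under the covers
    SELECT create_distributed_table('github_events', 'user_id');

    -- inserts and queries stay plain SQL; routing happens for you
    INSERT INTO github_events VALUES (2, 'push', '{}', now());
    SELECT count(*) FROM github_events;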
28:01
And then when I want to insert data, I just insert it into GitHub events like I normally would. And it'll rewrite and route that. Then for querying, we just execute normal SQL. We do a select count star. And what this is actually going to do is rewrite the execution under the covers.
28:22
So because I'm doing a count star from GitHub events, all that data is in 32 different tables. And now I've got to get a count from events underscore one, events underscore two, three, et cetera. And we can see right here that we've got different executors under the covers. This is going to take this and rewrite it and say,
28:41
hey, this is the explain plan, one of 32. And I'm actually executing 32 of these different explains, or these queries, pulling it together back on the coordinator and giving you the value back. So rewriting the query in the executor under the covers, distributing it across multiple nodes, getting the results back and returning them. To you, this just looked like a standard SQL query.
29:05
If you're routing to a single node, we don't want to hit all 32 nodes, right? So if you're saying, hey, I just want the events for user ID two, well, I know that lives on shard number 13, I want to rewrite that query under the covers and basically say, okay, now send this from GitHub events
29:22
down to GitHub events underscore 13. Just that one node, get the data and bring it back and not wait for everything else. And then I have to put this in here because this is like the one graphic I have. We're database people, we don't do a lot of UI, but this reminds me of like the Windows 95 disk defrag.
29:45
Everyone remember that? I never knew if it was actually doing anything, but I have always felt like it did. And so for us, basically when you're rebalancing shards, we're gonna take and move these in an online fashion. So you've got 32 shards on one node, 32 on another. If you wanted to scale that to four nodes,
30:01
we're gonna take 16 from one, move them over, 16 from the other. All right, FDWs, foreign data wrappers. So FDWs, I put up here, are almost like their own special class of extension. They allow you to connect from within Postgres
30:21
to another system and query it directly. So if you have a bunch of disparate data that you wanna pull in for like ETL or one-off reporting, that sort of thing, they're really, really useful. Now I say they're like their own class because writing a foreign data wrapper
30:41
is like writing an extension, but a little bit different as well. You've got certain different things that you can do with them. And there's an entire ecosystem and collection of foreign data wrappers. So while we've got like 100 extensions or so, we've also got, I would estimate 30, 40, 50 foreign data wrappers. I could be off on that number, but it's probably roughly in that ballpark.
31:04
And there are some really crazy foreign data wrappers as well. So like these are some obvious ones like Redis, Mongo, C-Store, which is a columnar store. There are ones like the dev-null foreign data wrapper that just lets you write data to nowhere.
31:21
There is a Twitter foreign data wrapper. So you can query Twitter from directly within Postgres. There's an email one. So there's a lot of foreign data wrappers that allow you to connect to anything, including a CSV one. So if you've got a bunch of CSV data and you wanna parse that, you can do it directly within Postgres.
31:40
So when you use a foreign data wrapper, you're gonna install it like the extension first, and then you're gonna create a foreign server. And you're basically gonna give a connection to this other database. Postgres itself already comes with a Postgres foreign data wrapper, which is really useful. So you can query from one Postgres database into another. Really, really handy in a lot of situations.
32:02
Here we're gonna actually connect to Redis. I'm gonna create my foreign server to say, hey, it's over here. Then I'm gonna create my foreign table. So for Redis, it's just key value. So we're not gonna have multiple tables, but something like a Postgres FDW, you may wanna map all of your tables from some other database directly into your local one.
32:22
And then you're gonna create a mapping of, hey, who can see this? How do I connect to it? Now when I describe my database, I can see I've got products, purchases, users, and I've got this Redis database. And this Redis database in this case is a cache
32:40
of who's visited my site and how many times. So I've got like, hey, this user showed up five times in the past three days. If I wanna query something really basic like, hey, user 40, how many times have they been here, I can just ask how many times I've seen this user.
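A sketch of that wiring with redis_fdw (the connection options and the user:40 key naming are illustrative and vary by wrapper version):

    CREATE EXTENSION redis_fdw;

    CREATE SERVER redis_server
        FOREIGN DATA WRAPPER redis_fdw
        OPTIONS (address '127.0.0.1', port '6379');

    CREATE USER MAPPING FOR PUBLIC SERVER redis_server;

    -- Redis is key/value, so the foreign table is just two text columns
    CREATE FOREIGN TABLE visits (key text, value text)
        SERVER redis_server
        OPTIONS (database '0');

    -- how many times have I seen user 40?
    SELECT value FROM visits WHERE key = 'user:40';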
33:02
Great, but I can also then go in and join this. So with Redis, it's in a text value, so I have to do some casting. But I can say, show me my top users that have been here more than 40 times. And then I can maybe look at things like,
33:20
okay, who's been here more than 40 times, but not bought something? Or who had something in their checkout and was here yesterday, but I haven't seen since? So I can do some really interesting things here. I would say use caution when putting some of the foreign data wrappers into production. They don't always push down things well, so you may bring back your entire Redis database
33:43
as you're querying this, which may be fast. But if you have a 10 gig Postgres database, you probably don't wanna pull back that full table. Now, fortunately, the Postgres FDW is getting better at pushing down predicates, which is really exciting. And I think more and more, in a user-facing production website,
34:01
we'll be able to use the Postgres one. We'll see in time with the others. All right, so a bit of a wrap-up. Postgres is definitely more than just Postgres. I should have asked at the start, how many people here are using any of these extensions already?
34:22
Awesome, a few hands. How many here have used all of these extensions? Someone's kind of cheating back there. So there's a whole other world of extensions out there beyond just these five, six. There are new ones created every week,
34:41
and I think they're really pushing the boundary of what you can do with Postgres. And now we don't have to wait on the core community, which has a higher barrier to what gets committed, right? It has to be well maintained for some X number of years. Here we can go and experiment and have fun and create a lot of new things, but also add a lot of value without having to wait a year, a year and a half
35:02
to see something directly in core. A few honorable mentions. MADlib is a really awesome one. I don't know how well it's maintained these days. It was originally, I believe, out of UC Berkeley, but it's like machine learning, data science directly in Postgres, so a pretty exciting one.
35:22
ZomboDB is interesting. You can connect directly from within Postgres to Elasticsearch, automatically send your data there, and then when you query it, it'll use Elasticsearch indexes. So you can have Elasticsearch indexes basically backing your Postgres data. C-Store, I mentioned briefly earlier,
35:41
that's a columnar store directly in Postgres. pg_cron is actually really, really handy. If you ever need to go and delete things on a regular basis, why set up a cron job on a server somewhere else that could fail, when you could run it directly within your database? So if Postgres is up, your cron job is up and running.
36:03
pg_cron is also really useful for rollups. So if you have a bunch of raw data you ingest and you want to do rollups every five minutes or every day, it's really, really useful to run all that directly in the database.
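For instance, a five-minute rollup scheduled inside the database might look like this (the tables are invented; pg_cron also needs to be in shared_preload_libraries):

    CREATE EXTENSION pg_cron;

    -- every five minutes, roll recent raw visits up into a summary table
    SELECT cron.schedule('*/5 * * * *', $$
        INSERT INTO visit_rollups (bucket, visits)
        SELECT date_trunc('hour', visited_at), count(*)
        FROM site_visits
        WHERE visited_at >= now() - interval '5 minutes'
        GROUP BY 1
    $$);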
36:22
So, a little bit of further reading. There's a blog post we wrote on what it means to be an extension. pgxn.org is kind of the current Postgres extension community. So new versions of extensions are posted there. New extensions are there. You can get a description of what they are. They're tagged in some way so you can kind of easily browse them if you're looking for data types or for data wrappers, that sort of thing.
36:41
And then if you're feeling adventurous, take a shot at writing your own extension. You could do a lot of fun things. If there's an extension that you want that's not there, give it a shot. There's a few I'd love to see, like an email data type directly in Postgres. I don't think it's going to land directly in core super soon, but maybe if we start as an extension, we can put a little pressure and see if it gets there.
37:03
And that's it. And I think I've got a few minutes for questions. Questions, anyone? I have a question regarding the Citus extension.
37:20
In which way are you handling backups? Do you have many small backups for each node or somehow? Which way are we handling what? Backups. Backups. Yeah, so it really kind of depends. What we generally do is use just the standard Postgres tools. So every node within Citus is just a Postgres database. So if you have a coordinator and two data nodes,
37:42
you would just have three backup processes running. So we have customers that, ourselves, we use WAL-E. We have customers that use Barman, pgBackRest, a mix of things. But it's kind of whatever you would normally do for Postgres, because it is just Postgres, follow that process. You just get to do more of it, because you have more nodes. Other than contrib, where do you find these extensions?
38:04
So PGXN is a great place. I would also say GitHub, and then Postgres Weekly. So some obscure ones will show up on GitHub and never hit PGXN. Postgres Weekly tries to highlight most ones that come up that are of a certain quality.
38:21
Not every extension that exists. But then PGXN, I would say, is kind of the de facto directory. PGXN.org. Oh, yes.
38:43
What's your favorite extension? You weren't here. Oh, so this shows that you were late. So, yeah, so I think in order, I would rank my top three. pg_stat_statements: it is the most useful extension for application developers, hands down.
39:02
I'm biased, but Citus is a pretty cool one that turns Postgres into a sharded database without having to worry about all that. And then I really do just love saying HyperLogLog, and I sound, I think, way more intelligent when I talk about all the things it does. So it probably pulls out the number three spot.
39:20
All right, thank you. Thank you.