
Let's Build a Complex, Real-Time Data Management Application


Formal Metadata

Title
Let's Build a Complex, Real-Time Data Management Application
Subtitle
...before the training session ends!
Number of Parts
37
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Congratulations: you've been selected to build an application that will manage whether or not the rooms for PGCon are being occupied by a session! On the surface, this sounds simple, but we will be managing the rooms of PGCon, so we know that a lot of people will be accessing the system. Therefore, we need to ensure that the system can handle all of the eager users that will be flooding the PGCon website checking to see what availability each of the PGCon rooms has. To do this, we will explore the following PostgreSQL features:
- Data types and their functionality, such as: Date/Time types, Ranges, Arrays
- Indexes such as: GiST, SP-GiST
- Common Table Expressions and Recursion
- Set generating functions and LATERAL queries
- Functions and PL/pgSQL
- Triggers
- Logical decoding and streaming
We will be writing our application primarily with SQL, though we will sneak in a little bit of Python to demonstrate the power of logical decoding. This will be an interactive session and we will provide some containers to ensure you have everything you need to be up and running when the bell rings. At the end of the session, we will have a working application, and you will be happy knowing that you provided a wonderful user experience for all PGCon attendees, made possible by the innovation of PostgreSQL!
Transcript: English (auto-generated)
So good morning. Welcome to PGCon, for those of you who just got here.
If not, welcome back. So my name's Jonathan Katz, and today we're going to be doing an exploration and application building. And just maybe, well, I'll give the high-level picture of the talk in a couple of slides. But I'd say the genesis of this was I actually did this as a much shorter talk. And the first time I gave this much shorter talk,
I was like, wow, there's a lot of material in here. And even trying to do it in 40 minutes ended up going over into an hour. I was like, well, maybe I can expand this into a tutorial and just go into all these really interesting concepts about Postgres. Then, of course, I tried writing the tutorial, which you're about to see today. And I was like, wow, it's actually a little bit harder to expand everything. It's somewhere in between a three-hour talk and an hour
and a half talk. But that said, my background is in application development, and particularly using Postgres. I find every single excuse to take advantage of all the many features. And the idea is that a lot of these concepts in the talk are going to be things that have been in Postgres for 20 years, if not 30 years.
And it's really about taking all these pieces together to see what we've been able to do with Postgres for so long, and taking a brand new feature added in 9.4 that's been expanded upon a lot since then, and saying, all right, well, here's a completely different way to do it. What's funny is the punchline for all that is maybe 10 to 15 minutes long at the very end.
This talk is going to build up from some fundamental database concepts to some advanced database concepts to get there. And it's going to be funny, because we're going to spend a lot of time on one particular aspect and then say, oh, by the way, here's a completely different way to do it. And that's really the theme of it. So I want to give you the chance to escape in the beginning in case this is not
what you're interested in. And that's pretty much the summary. So perhaps to kick it off a little bit, just a little bit about myself. So the other thing I should also say, the reason I'm not doing the full presenter mode is that there's going to be a lot of going back and forth. We're going to run a lot of example code.
I mean, in some cases, we're just going to go over every single line of code to see what it means, particularly when you get to some of the advanced queries. So I will be doing a lot of alt-tabbing like this. I'm also going to make it so you can try to follow along and put it into your code base if you want to play along as we go along. Let's see how that goes.
But I'm going to do my best to do so. So without further ado, OK. So far, so good. There we go. There we go. All right. So first, really briefly about my company, which because it's PGCon, I don't
want to get very corporate about it. But CrunchyData enables me to talk about really fun Postgres stuff. I love it there. And we're hiring, like every other company. Next. About me. So actually, the way I got active in the Postgres community was around event organizing, because I saw a lot of brilliant developers, particularly at PGCon.
I was like, wow, I cannot write C that well. But it seems like there's an opportunity to organize events, bring people together, and do fun things like that. And I found myself doing that. I've also been using Postgres a long time. I usually say I've been using it for about 15 years, maybe using it well for seven. And for me, it's always something that I gravitate towards, even
a lot of my earlier web projects. I just wanted to use Postgres. I was building web stuff, and I enjoyed it, but for me, it was really all about the data and the database. For me, it's just so exciting to see how the Postgres community, well, one, has grown itself. But also, five years ago, I'd talk about Postgres with people, and the first question after that would be, what's Postgres?
And we don't have that problem anymore. And to me, it's still mind blowing. And I think it's cool. And now we're at the point where, in my humble opinion, it's now educating more how to use Postgres to its fullest potential, that people now realize, oh, this is a really cool piece of technology. But look, like I said, I've been using it for 15 years.
I still find out about features every single day. Well, maybe not every single day, but quite often. I'm like, oh, I didn't know it could do this. And it really takes a community to teach it everything that Postgres can do. And also just throwing it in practice and seeing different things of doing it. The one shameless plug I will give here
is please follow @PostgreSQL on Twitter. It launched in January. It is a lot of fun. And yeah, please follow. No, let's go through it this way. All right, so I kind of gave the overview of this already, because I got too excited. And the coffee's starting to kick in. But really, we're going to be exploring building
an application together. I mean, that's the theme of it today. First, we'll do a little bit about setting up the requirements in case you want to try to follow along. Again, your mileage may vary. I think it will probably vary a lot. A little bit of overview of the problem that we're going to try to solve, because this is a real world problem. I actually ripped this from something that I did set up,
and basically how I evolved a system. This is more of the simplified version of it, which maybe you'll see some of the SQL, and you'll be like, this is the simple version. But yeah, no, this is something that's very real. And like good developers, we're going to code it. We're going to build. We're going to test. We're going to iterate until we get it to the way that we want to. And of course, I don't want this to just be me yammering for three hours.
If questions do come up, I'm more happy to field them. If you want to say, you're doing this stupidly, I'm also more than willing to hear that too. If I say, please wait till the end, it's probably because I do think we're hitting the talk becoming a four and a half hour talk, and I want to make sure we have enough time for lunch.
All right, what are our requirements for this talk? First, we're going to be using Postgres 10. Why? Because it's the latest version, and I actually do use some things that are specific to Postgres 10, and I will point those out. wal2json. wal2json is an extension for the punchline at the end using the logical decoder. Basically, it allows you to stream all the changes that
are coming from your database in a highly digestible JSON format. Thank you so much. Oh. And Python 3. And why Python 3? Because Python 2 is being deprecated in a year and a half. It is coming. It is real now. This is not 2014, when they originally planned to deprecate it.
And actually, since Python 3.5, when they introduced the new asyncio framework, it's really fast. They did a really good job with it, and that's a whole separate discussion. Specifically, we'll be using the psycopg2 library, which is one of the drivers for interfacing between Postgres and Python. The reason specifically we're using this one,
other than it being around for a very long time, is that it supports the logical decoding. And basically, reading in changes from the logical decoding stream, which is very important for this talk. So again, for those following along, trying to build it as we go, I will assume that you know how to install Postgres.
Actually, real quick, before I go on, who's new to Postgres? Who's your first Postgres conference? Cool, welcome. And if you're new to Postgres, have you used other databases before? Cool. So are you coming from an Oracle background, MongoDB background, SQL Server? Yeah, cool. So you'll see actually there's some certain feature
parallels between all of them, except maybe MongoDB. And yeah, that's actually good to note in order to frame the talk. So one thing that's great about Postgres is that it's very extensible. And there's a lot of extensions that are built into it, particularly in the, we call them contrib modules. But there are also some that exist outside of it.
You can install some of those using PGXN, which is the PostgreSQL Extension Network. And there's a Python wrapper that lets you do that. And then there are some that live off on their own that are incredibly useful. And you have to do a little bit more work to get them in. And wal2json is one of those. So essentially, you do a git clone, go into the directory, set this flag, USE_PGXS=1, and run make.
And then you have to sudo make install using that same flag. So currently, it does not work with Postgres 11 beta 1 right out of the box. I did hack that to get that to work. And I should probably submit that upstream. Again, ugly C, so I don't know.
But yeah, but with 10 and below, this definitely works very well. One other thing: after you install it, you do need to restart Postgres. Yes, Mr. Fetter? What broke? What? What broke? So it was a function definition that changed for,
I can't remember exactly what. But it was an API function definition that changed. And you just need to put in compiler flags, and it works with the new interface. I can show you the code during the break. Oh yeah, and key thing, restart Postgres after you do this, so it loads it. All right, setting up Python 3 and psycopg2.
So if you don't have Python 3 on your machine, you can download it from python.org. They have an installer pretty much for every platform. Or you can get Python, yum install, Python 3, whatever floats your boat. I like to use virtual environments with Python. They basically allow you to install your requirements locally
to a particular environment. I love Python. Python's probably my favorite programming language. I do think some other languages do the packaging a little bit better than Python. But Python's getting there. It's improved a lot. You activate that virtual environment, and then you can install psycopg2. This installs, by default, I think it's like 2.7.8 now, maybe 2.7.5, but basically the 2.7 line.
All right, enough boilerplate. Let's start building an application. So what's the problem we're trying to solve? So imagine we're managing this room. We're in DMS 1150. And a lot of events happen here. Classes happen. PGCon happens.
Wherever it may be, we need to manage this space. We know that there's a set of operating hours in which we can book this space. Let's say it's probably 8 AM to 10 PM. College runs a little bit later. I don't think the University of Ottawa wants people using this in the middle of the night, but I don't know. But yeah, typically you have an opening and a closing hour.
And to keep it simple, only one booking can occur in the room at a given time. And that makes sense for a lot of things. If you're booking a meeting in here, we probably don't want to have two meetings going on at the same time. Now in the real world, if you're at, let's say, a restaurant, you can have multiple events going on at the same time. And that's where it starts getting complicated.
But for us, let's just keep it simple. If the room is booked, it's booked. If it's not booked, well, it might be open. For business, it might be closed. Do we understand that? Good with that? If we don't understand this, the rest of the talk will not make sense. So here's the example.
Right now, we've booked DMS 1150 for three hours. And there's some rooms next to it that might have some other events going on. I mean, this is your typical booking schedule. So if you're planning to go to the unconference and you want to bail out on the talk early, you now know exactly when to leave.
So what does this actually mean on the database level? Because I always like to think of an analogy between math and computer science. You do math. You can write everything out. You see how it all works. You can see the proof. And then you have to code it. And there's certain assumptions, or I guess maybe not assumptions, but there's certain things that you can kind of hand wave when you're doing the math that you can't when you're actually
programming it. So the same thing. What do we need to know to actually set this up? What data do we need to have in our database to make this application work? Well, first off, we need to know all the rooms that are available to book, which makes sense, because we need to know the rooms that we need to book. We need to know when they are available to be booked: their operating hours. So we need to store what their given operating
hours are for a given day, which can change. Maybe this is not open on Sunday, so there's zero operating hours. We also need to know when the rooms have been booked, so that way we don't book things overlapping them. That's also important. Double booking is annoying. So the other thing is, of course, we're
trying to build up an application. So we need to do our standard CRUD operations. We need to create, read, update, delete. That makes sense. And let's think about it in two different aspects. There's the create, update, delete part, essentially the management part. And then there's the reading part. And the goal is that we want both of those systems to be efficient.
We, of course, want to be able to do lookups quickly. We also want to be able to manage it quickly. So believe it or not, there's a lot going on here. Because if you think about the system, it sounds fairly simple, but there's actually a lot going on behind the scenes. Or we're going to have to put a lot going on behind the scenes to make this work.
So first, this is where we introduce availability and how to measure availability, find availability, and look at it. And we're basically going to start with some of the building blocks of Postgres. The first and probably most important and probably the feature has been most useful in my career in Postgres are dates and times.
Who's worked with dates and times in Postgres? Yeah. They're really, really powerful. So these are the fundamental dates and times types within Postgres. We have time stamps, with or without time zones. You can definitely get into a very long debate about whether or not you should store your time zone within your data type.
I choose the yes side. The one thing I strongly advocate is keep your database in UTC, because that transfers very well. And you will avoid weird time zone issues. That's, again, I think I have a talk where we spend just like 30 minutes talking about time zones. So we're hand waving this.
Dates. Dates are one of my favorite types. You'll see that dates appear throughout here. But basically, you can just store the date, no time stamp, no time. They're very easy to work with. Then there's times. There's time with time zone and time without time zone. So the time is just like 9 AM, 12 PM, no concept of date. I typically use time without time zone, which you just
specify it by doing time. And I've had very few issues with that. And intervals. Intervals are one of my favorite. An interval is something like, you could say, three days, five hours, 10 seconds. And essentially, it's your delta. And usually, I use the delta when I'm doing calculations and operations.
But you can store the delta, which is particularly useful if you're generating rules or need an incremental offset as you go along. So how do these work? Yeah, and in fact, in this exercise, we're going to look a little bit underneath the covers
and what happens when you do math. Because there is some magic in these statements. So the first one is, let's see what our current date is and what returns when we do that. Sure enough, lo and behold, it's correct. My system clock is correct. And what's the type of it?
It is a date. Nothing too magical. What about current time? So current time returns the time using the time with time zone format. That's good to know. And actually, if you want the current time stamp, you just do select current time stamp and voila.
Of course, the type of current time. I probably should have done that to say what the data type was. Now, what happens if we do current date plus current time? Lo and behold, we get a time stamp. And we've actually kind of magically transformed into that type.
The point of this is that we're going to look a little bit about the operators and functions throughout date and time. But they're very robust. And it's definitely worthwhile reading the documentation around them because it gets very deep very quickly. And here's another one of my favorites. If I do select current date plus three, voila, it's three days from now.
I don't need to use an interval for that, which is also really important. Because when you do the interval, you do have to do the cast. And while they're very useful, there is a little bit of a performance penalty because you saw that's a 12 byte object. And by the way, just to show, it's still a date type.
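[If you're following along in psql, the statements on screen look roughly like this; the results will reflect your own system clock and time zone:]
    SELECT current_date;                      -- a date, e.g. 2018-05-30
    SELECT pg_typeof(current_date);           -- date
    SELECT current_time;                      -- time with time zone
    SELECT current_date + current_time;       -- combines into a timestamp with time zone
    SELECT current_date + 3;                  -- three days later, still a date
    SELECT current_date + interval '3 days';  -- same idea, but the result is a timestamp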
Cool stuff. So some other useful functions, particularly ones that will be in this exercise. There's date_trunc. date_trunc is short for date truncation. Basically, you can specify a period and take your timestamp or your date
or whatever it is and truncate it to that period. So let's go through a few of these. Maybe we won't do it one at a time. So the first one, we say, hey, take the current date and truncate it to the beginning of the week. Lo and behold, it truncates it to 5/28. Today's Wednesday, right?
So that's the Monday. So it truncates it to the ISO start of the week. Because you see some calendars, they start on Sunday. And we see some calendars, they start on Monday. By default, this is going to Monday. Well, if we truncate to the first day of the month, lo and behold, it does get us to May 1. How about the quarter? I mean, how many times do you get asked, hey, run this quarterly report
to see what our cash flow was? Well, sure enough, you can do that very easily in Postgres, where you can at least truncate the date. And last but not least, the year. Yep, we can get back to the beginning of the year. One of the reasons I show this one off is that this is very useful in reporting. I've used this many, many, many times to generate all those fun business reports.
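[The date_trunc calls being shown are roughly these; the example results assume a current date of 2018-05-30:]
    SELECT date_trunc('week', current_date);     -- 2018-05-28, the ISO Monday of the week
    SELECT date_trunc('month', current_date);    -- 2018-05-01
    SELECT date_trunc('quarter', current_date);  -- 2018-04-01
    SELECT date_trunc('year', current_date);     -- 2018-01-01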
And what's nice is that, particularly in aggregate queries, this is where I find date_trunc very useful. I actually had a reporting sub-application where you could specify how you want to view this report: year, month, week, day, whatever it may be. And in an SQL-injection-safe way, it basically
would map to the appropriate time period, pull everything together, do the date truncation, do the aggregate, and spit out all the information. Just so cool, because it was so simple to do. So another thing, which is part of the SQL standard, is the extract function, which is basically
you can extract a date part from a date or a time stamp. So it does exactly what I think it does. You can extract the year, you can extract the month, extract the day. And there's also this thing, ISO-DOW, which is the ISO day of the week. So Postgres has two ways to extract the day of the week,
where the day of the week is an integer corresponding to Monday, Tuesday, Wednesday, Thursday, Friday. The ISO standard way of doing it is assigning one to Monday, up through seven, which is Sunday. The Postgres one, I don't want to say it's the default one, because it's not, but the regular dow function in Postgres, it does, I believe, zero for, I don't want to mess this up.
And that's the other thing. This is why you've got to use the ISO one, because every language tries to do it its own way. I want to say that the regular one in Postgres starts at zero and goes to six. I can't remember if zero is Monday or Sunday, and that's kind of the point. Use the ISO standard one, where Monday is one, Sunday is seven, and you just increment through there.
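[A quick sketch of extract and the two day-of-week variants:]
    SELECT extract(year FROM current_date);    -- e.g. 2018
    SELECT extract(month FROM current_date);   -- e.g. 5
    SELECT extract(isodow FROM current_date);  -- ISO: 1 = Monday through 7 = Sunday
    SELECT extract(dow FROM current_date);     -- Postgres dow: 0 = Sunday through 6 = Saturday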
So this is getting us towards trying to build our scheduling application. So one of the operators you have in Postgres is something called between and and, where basically you
can look to see if something is in a particular range. You often see this with integer lookups. You can do this with time stamps or dates. So in the first one, we say, hey, what week is our current date in? This is actually between Monday and Sunday of the week, where I got lazy.
I didn't put in 5/28 and 6/3. I basically said, use this date truncation function, and I used an interval, too, to show it off. And survey says, yes. This is actually in the week that we believe it is. If we want to expand on that a little bit, or just to see what's going on, let's see how we truncated the date.
Oh, no. Clearly I'm a Vim pro.
Yeah, sure enough, it goes from Monday to, that's actually, that's the next Monday, right? All right, but that's fine. And for the purposes of this exercise, we can see that 5/30 falls within that week. The other important function that we have is overlaps.
So basically, overlap says, can you see if two ranges of dates overlap with each other, or two ranges of timestamps overlap with each other? So in this case, I took, originally I actually thought this was from 9 to 1230, so I built the examples around that, only to find out, oh, no, we actually have 30 less minutes together. So basically, I want to see, does the time we have this tutorial
overlap with 11 AM and 1 PM? This is also part of the SQL standard. And yes, that makes sense. They do overlap. This is the proof that it works. But here's an interesting thing with overlaps.
An important thing, particularly in scheduling, is to see if things line up next to each other. So does my end time of 1230, from 9 AM to 1230, overlap with starting from 1230 and going to 1 PM? Because that's important, because if we say the two 1230 endpoints overlap, then we
can't actually book that second event. But thinking about it as humans, well, we can, because they're both right next to each other. So does it work? Let's take a poll. Do you think this will return true or false? True basically says they both overlap. False says they don't. You have false. False is the consensus.
It is false. Good, it's sane. It works the way we think it does. But that's actually a very important point, because that could lead to, if it didn't work the way we thought it did, it could lead to double bookings, which is bad. So good, this is sane. Now, of course, right now this is still academic.
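[The lookups being discussed look roughly like this; the timestamps are illustrative:]
    -- is today inside the current ISO week?
    SELECT current_date BETWEEN date_trunc('week', current_date)
                            AND date_trunc('week', current_date) + interval '1 week';

    -- the tutorial from 9:00 to 12:30 versus an 11:00 to 13:00 window: true
    SELECT (timestamp '2018-05-30 09:00', timestamp '2018-05-30 12:30')
           OVERLAPS (timestamp '2018-05-30 11:00', timestamp '2018-05-30 13:00');

    -- only the 12:30 endpoint is shared: OVERLAPS treats the end as exclusive, so this is false
    SELECT (timestamp '2018-05-30 09:00', timestamp '2018-05-30 12:30')
           OVERLAPS (timestamp '2018-05-30 12:30', timestamp '2018-05-30 13:00');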
We're basically doing one-off lookups. We need to actually see how this works over an entire data set, because that's going to determine how we structure our application. So now the fun part. Let's start benchmarking, or doing hand-waving benchmarking, because benchmarking is a very deep topic. If you ever want to do something fun,
look up Greg Smith and benchmarking. Those two terms. And just see his talks and or buy his book about Postgres performance. That is essentially the tome on not only Postgres performance, but database performance. And Greg is not paying me to say that. So we're going to create a table. It's going to be called bookings.
Now, here's a Postgres 10 specific feature that's very important, that's part of the SQL standard. Identity columns. So who's had to create primary keys? Let me put it this way. Who's created surrogate keys using Postgres with the serial data type? Yeah. Who likes using the serial data type, because it's obviously SQL compliant?
I mean, I do like using it, but here's the awesome thing about Postgres 10. We've actually introduced something that's part of the SQL standard for generating key increments, similar to MySQL auto increment. And this is called identity keys. I don't think there could be a full talk on them themselves. I think there could be a lightning talk on it,
hinting to anyone who wants to talk about it. But basically, what it's saying there, that line says ID integer, generated by default as identity. That's essentially saying the same thing as serial. Hey, create a sequence that's incrementing, in this case, by the default, which is 1. And this is essentially enabling an identity-like function.
It supports, I believe, small int, int, and big int. I don't think anyone uses small int anymore. If you do, please talk to me, because I'm curious what you're using it for. Either you use integer or big integer. And please only use big integer if you truly need big integer, because there is a performance hit,
because it's 8 bytes instead of 4 bytes. And if this is your primary key, you're probably looking up that key quite often. And you need speed. It is. I believe so, yeah. I think it's in the SQL:2011 standard, but I could be wrong.
I can now just be spouting things. I think that's the reason why we did it, because in true Postgres mentality, serial worked. There's nothing wrong with it. I'm sorry? Well, you can just do a sequence and then it works. You can do sequences as well. Ultimately, that's what it's interfacing with.
Now it's a little bit more behind the scenes. Yeah, exactly. And I think that's actually one. If you've seen one pattern in the current Postgres release, it's about making life easier. And this could lead to gigantic tangent on partitioning, which we will save for later. Cool. So let's actually create.
So before we get way too off topic, so here's our bookings table. And again, very basic. We have our primary key. We have a start time and an end time, which is in the time stamp format. And essentially what we're going to do is that we're going to insert a bunch of bookings into this table.
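[A sketch of the table being described; the column names here are assumptions, since the exact DDL is on the slides:]
    CREATE TABLE bookings (
        id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        start_time timestamptz,
        end_time timestamptz
    );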
Now, I'm trying to think. We're going to parse this function a little bit later, but generate series. Often you see this function called generate series, which is a set returning function that can return a series of information or a series of data. In this case, I'm using it for integers. You can actually also use it to return dates and times and basically generate a whole slew of data. Often you see this as a function used for generating random test garbage
data for the purposes of presenting. But it does more than that. And you can actually use it in real applications, which we're going to see later. But I did want to introduce this concept now. So basically what I'm doing is I said, hey, from one to 400,000 times,
starting from April 1, 2003, insert this range where we increment by one days, two days, three days, four days. So basically 400,000 days into the future. As you can see, I spent a lot of time figuring out
how to know what the test data set should be. So actually, let's turn on the timing mode. So they all inserted fairly quickly, about 1.6 seconds on my machine. But the point of this is that we're not benchmarking it just yet.
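[The insert that just ran is along these lines; the seed rows and dates are approximations of what is on the slides:]
    INSERT INTO bookings (start_time, end_time)
    SELECT b.start_time + x * interval '1 day',
           b.end_time   + x * interval '1 day'
    FROM (VALUES
            (timestamptz '2018-04-01 09:00', timestamptz '2018-04-01 12:30'),
            (timestamptz '2018-04-01 13:00', timestamptz '2018-04-01 16:30'),
            (timestamptz '2018-04-01 19:00', timestamptz '2018-04-01 22:00')
         ) AS b (start_time, end_time),
         generate_series(1, 400000) AS x;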
Now, of course, if we tried to start looking up over 1.2 million rows, Postgres is fairly fast. But if we're doing a specific operational lookup and we're doing a sequential scan, basically we're scanning all 1.2 million rows to get our information, that could take a little time. In fact, let's do that real quick.
Let's see if we have any bookings during the time of our tutorial. So I'm going to use explain analyze, which explain generates a query plan. And analyze actually executes that query plan and gives you your actual results for executing each step of the plan.
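[Roughly the query being run, assuming the column names sketched above:]
    EXPLAIN ANALYZE
    SELECT *
    FROM bookings
    WHERE (start_time, end_time)
          OVERLAPS (timestamptz '2018-05-30 09:00', timestamptz '2018-05-30 12:30');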
Oh, I see. Yeah. It's going horribly wrong. So as you can see, we do a sequential scan. The plan is able to work very quickly, but the execution time
is 83 milliseconds, which actually I could say is not that bad. It's fairly quick. Now, this is an OLTP application. And there's, let's say, even 1,000 of us hitting all of this at the same time. Your performance is going to get very shitty very quickly. Pardon my language. I'm a New Yorker. So we need to do a little bit better.
And if you look at most databases, the index that you're offered is the B-tree, which is short for balanced tree. But essentially, it's a balanced indexing tree. And it's not like, you know, it's not, well, oh yeah, hang on. I forgot. We're using a basic type. So yeah, we will be using a B-tree index for this.
So the reason I chose a multi-column index will become clearer in a second. So actually, first off, are you familiar with multi-column indexes? Cool. All right. So in a single-column index, it does, basically, you're doing an index lookup on one value.
In a multi-column index, you're able to use the index lookup for two values. And this is particularly important if you find that you're frequently looking up the same two values in an index, you might want a multi-column index, particularly if you have a lot of columns in the table. Or if you have a lot of, you have additional information associated with that single column. So this is actually a good example.
If I'm looking over a table where there's dates and a lot of starting times, and the same starting time is associated with similar end times, the multi-column index approach should be very useful. Ordering does matter in Postgres multi-column indexes. Because if you're doing a lookup only on the first column,
particularly start time, then it basically looks at the first column first and the second column second. That's how it orders it. So if you're doing a lot of individual lookups on the end time, you might actually need two separate indexes here. And that would get better. The advantage of the multi-column index is that they will be, it is a little bit smaller in size compared
to having two separate single-column indexes. But of course, buyer beware. You need to make sure that it makes sense for your data type. In this case, the reason I chose a multi-column index is that I know, well, one, I want to show it off. But two, I know that I'm more likely to start looking at my start time versus my end time here.
It might not actually be all that useful in this case, but we can create it. And actually, what would have been more useful is comparing it to the other size. But as you can see, it's actually not that large. Sorry, if you don't mind, let's play around real quick.
Of course, that's legible. So we'll create an index for start time. Do one for end time.
And then if we look up the size, I'll show you that I'm not crazy.
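[The index names below are made up for illustration; the comparison being shown is roughly:]
    CREATE INDEX bookings_start_end_idx ON bookings (start_time, end_time);
    CREATE INDEX bookings_start_idx     ON bookings (start_time);
    CREATE INDEX bookings_end_idx       ON bookings (end_time);

    SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class
    WHERE relname LIKE 'bookings_%_idx';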
So as you see, the size of the two separate indexes is 17 megabytes each for a grand total of 34 megabytes. But our multi-column index is 24 megabytes. So if you can use them and it makes sense for your data set, try to use a multi-column index. But again, you really need to understand your lookups and your data sets.
So let's actually drop those indexes because I don't want to mess up our query plans. Cool. So what happened? So now let's go back. Yeah, sorry.
Sorry, which index scan? Yeah. No. So there's this feature coming out in 11 called covering indexes
where that could be the case. But 11 is in beta right now. That might be actually worthwhile testing, maybe at a break. But yeah. Right now, no, it would not use it. You'd have to have the separate index for it. All right. So remember those two queries we did on the other slide where we first were looking to see, well, one was the overlaps.
We want to see, are there any bookings in the room right now when we're doing the tutorial? And the second one is, well, let's see if there's any bookings in our room right now. And let's see if it uses the index.
So we run it. And it says, no, I'm not going to use the index. I'm going to do a sequential scan on this, which really sucks because that's a very important query for us to run. So the lesson is, even though we have the ability to do the SQL standard overlapping lookups in Postgres, it's basically not programmed to use the index.
Yeah. Yes. Let's try that.
Good point. Actually, full disclosure, I did do this earlier. It is not. Yes. I meant to run Analyze Target, I'm sorry. But what about, well, also because there's a lot of rows there.
So it really should be picking up the index. That's why I wanted to make sure we had a fairly large data set. All right, so what about this one, which basically says, so the next way is, can we find any bookings that are essentially today, like between 5.30 and 5.31?
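[Something along these lines, using plain comparisons the B-tree understands:]
    EXPLAIN ANALYZE
    SELECT *
    FROM bookings
    WHERE start_time >= '2018-05-30 00:00'
      AND start_time <  '2018-05-31 00:00';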
Does this use the index? It does, and it returns much more quickly than the other one. It's sub-millisecond time. And the reason is that it's using some standard B-tree operations
that it can recognize, which are greater than or equal to and less than or equal to, which is a very well understood function for B-trees. So why would this function be important, though? If we go back to the beginning, this is saying, in this case, return an unordered list of all the events occurring
on 5.30 within our given space, which is something that we will be seeing as we get late into the calendaring app. Long story short, as you can see in the notes at the bottom, the overlap did not hit the index. But if we're able to basically say, hey, let me look up everything
within a particular range, I can basically get to the equivalent of that overlap function. But that's not convenient, because then we need to understand some range math. And we don't want to understand math. We just want to be able to program. So the good news is that we actually have something called a range data type.
So how many people are familiar with range types in Postgres? Cool. So range types basically allow you to store ranges and work with ranges and perform operations on ranges. And ranges can be all sorts of things. Like we're going to be working with dates and times and timestamps. But there's also integer ranges and numeric ranges that are built in.
And Postgres itself being very extensible, you can write your own range types. It's actually fairly easy to create an inet range type, or an internet range type, within Postgres. You don't need any additional hooks or functions into it. But you can come up with your own. I'm not very imaginative, so I'm not sure what other ones you'd want. I've been able to use all the standard built-in ones.
But they're a lot of fun. And they actually changed my life significantly. Actually, the author of the range types is at this conference, Jeff Davis. If you do find this useful, I suggest you go up to him and say thank you, because he did it as a complete labor of love for, I think it was like three or four years, to try to get it in.
So how does it work? So we'll create a similar table to what we did before. We'll call it bookings2, so we don't mix it up with the other one. We're going to have an identity key. We're also going to have a new column called
booking_time, which is a tstzrange. tstzrange stands for a timestamp with time zone range. Essentially, instead of storing start time and end time as separate columns, we can store the range all within one. There are a lot of nuances with range types. Actually, let's go over some of the nuances real quick with tstzrange.
Let's do time stamp. So it's actually kind of challenging to turn your head and type.
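[The statement being typed is roughly this; the bracket style of the output is the part to notice:]
    SELECT tstzrange('2018-05-30 09:00', '2018-05-30 12:30');
    -- ["2018-05-30 09:00:00+00","2018-05-30 12:30:00+00")  (the offset depends on your time zone setting)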
So we get this column returned. And we see that at the beginning, there's a square bracket. At the end, there's a parentheses. What does that mean? Who remembers precalculus dealing with ranges and range notation?
Yeah. Yeah. This has come back, and it's come back in a real way. So basically, if you see something like this, what this really means is that 1 is less than or equal to x is less than or equal to 2. If you see something like this, it means
that 1 is less than or equal to x, which is less than 2. And then any combination in between. Now, the reason that's important is this is how the range type data is returned in Postgres. And you actually might see some odd things, too, based on some particular defaults. And there's one I'd like to show you.
Let's say we want to create an integer range, which you have to use int4range to do. And I just try the default. I say I want the range to be between 1 and 10. Well, shit, that didn't do what I thought it would do.
OK, never mind. There used to be a case where, I'm trying to figure out which type it is. But basically, you'd put in something you think you'd be getting the range that you wanted. Oh, here we go. I remembered the case. So let's say I want to get all integers between 1 and 10 inclusive. So 1 less than or equal to x less than or equal to 10. I hit Enter. And it returns 1 to 11.
And you're looking at that. You're like, what? Why would it do that? Well, the idea is that they want the output to resemble something where you could link ranges together. So basically, if you want to check for contiguous ranges next to each other, you could link them all together. It's actually something Mr. David Fetter pointed out to me many years ago. So even though you're saying, I want this inclusive range,
it's going to return 1 inclusive, 11 exclusive to put it in just for that. The first time I saw it, I was like, I don't understand why you would do that. Because I saw it on the discrete integer range. But when dealing with the date ranges, that became incredibly useful, which we'll see implicitly as we go along here.
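[The surprise being described, in two lines:]
    SELECT int4range(1, 10);        -- [1,10)  default bounds: inclusive lower, exclusive upper
    SELECT int4range(1, 10, '[]');  -- [1,11)  canonicalized so adjacent ranges line up cleanly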
So TLDR, ranges are very powerful. You should play around with them to understand all the nuances. And there's actually a lot of operators and functions with ranges that we will be covering as we go through this talk. But I highly suggest, like most things, to read the manual on. So with that said, let's start creating our data. And basically, we have a population function similar to what we saw before.
We're populating all the other things. I'm actually using this as an opportunity to also introduce lateral. Who are familiar with lateral queries? Cool. All right. Lateral was introduced in Postgres 9.4, I believe.
And basically, what it says, I would say, the way I would define it in my simple brain, is that it essentially allows you to do a join that operates on the FROM query and basically allows you to say, hey, apply a certain function to the data coming out of the FROM. So that way, we can have it all together and push it up or push it down to whatever part of the query.
That is like my hand-wavy, simple brain definition. So in this case, I'm saying, hey, we have four of these basic values, these start time and end time from these particular time stamps.
Let's actually turn this into a table here, so there's a lot going on. I'm like, let's turn this into a table. And we'll call the table z. And we'll say, this part of the tuple is a start time. This part of the tuple is an end time. Select all of it. Then we do the lateral join to the generate series function
and say, all right, well, for 400,000 times, 1 to 400,000, join this 400,000 times, bring it up to the select part of the query, and then say, OK, we'll take the x value that is returned from this function and add the start time to it. Then same thing, add the end time to it using this x days interval,
put into a range, and insert into bookings. Easy, right? Are we good? Maybe? All right, so here's the semi-good news for this. This is just me trying to be fancy in terms of inserting test data that will never be used again. But the point is, I just want to show off
some of the functionality of Postgres, because believe it or not, some of these are going to reappear later in the talk. So I just wanted us to get used to it in kind of a contrived example. I'm also happy to make the slides available to everyone at the end of the talk, as well as the additional coding examples we will see. So let's run it.
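[A sketch of the table and insert being run; the table and column names follow the talk, and the seed rows are abbreviated stand-ins for what is on the slides:]
    CREATE TABLE bookings2 (
        id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
        booking_time tstzrange
    );

    INSERT INTO bookings2 (booking_time)
    SELECT tstzrange(z.start_time + x * interval '1 day',
                     z.end_time   + x * interval '1 day')
    FROM (VALUES
            (timestamptz '2018-04-01 08:00', timestamptz '2018-04-01 09:30'),
            (timestamptz '2018-04-01 10:00', timestamptz '2018-04-01 12:30'),
            (timestamptz '2018-04-01 13:00', timestamptz '2018-04-01 16:00')
         ) AS z (start_time, end_time)
    CROSS JOIN LATERAL generate_series(1, 400000) AS x;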
So it inserted 1.2 million rows in about five seconds using some fancy SQL. And just to prove it, I don't want to do that.
OK, it's all there. Actually, let's actually look at the data, too. So as you see, we have an incrementing primary key, and we have our ranges.
And sure enough, we're actually seeing it in order, magically. And we see that they're incrementing by day, and they're incrementing by time, and all is well with the world. It's the data that we expect. But this is still boring, because we just inserted data. We want to look up the data.
So here's the thing. So the way B-trees work are that they work on the standard equality and inequality operators. So equals, less than, greater than, greater than, equals, et cetera. With ranges, we're going to get a whole new set of operators, because we're going to need to be able to look inside the range, or see if ranges overlap, or see if ranges are next to each other. And we use all these funny little symbols
that we put together in order to do it. These funny little symbols cannot be indexed by B-trees. So we need a new index type to do that. Now, some other databases might give up on that, say, all right, I'm done, I'm going to do whatever I want. Postgres said, no, we're going to create a specialized index. We're dealing with this kind of data. There's actually a few indexes that can do this.
We're first going to start with the GIST index. So GIST indexes have been around for a while, and I believe they were first introduced to deal with full text search queries, but Postgres historian can correct me on that. It stands for generalized search tree, because basically it allows you to use a more generalized set of operators to work with it.
It's basically a balanced tree, so kind of similar to the B-tree. And the key is that it allows these arbitrary indexing schemes. So basically, it can do things similar to B-trees. It can also act more like R-trees. And the key thing is it allows us to do this indexing on custom data types.
It supports a lot more different operators, as I said, all funny looking things. But for instance, so this one's the distance operator. You definitely see this in PostGIS, which is the geospatial extension to Postgres. This is overlapping. These are, do I have a value included in the range? Well, in the case of range types. Or do I have a range included within a range?
These are all related to, is my range all the way to the left of another range? Or is it all the way to the left and overlapping a range? Which actually, I use quite a bit in my career. And I do not remember what these are off the top of my head. But what?
Oh, intersection? Oh, adjacent. I thought adjacent was two minus signs in this one, a pipe. Yeah, and for different things, they could mean different things, too. But actually, if you use PostGIS, we definitely see this one and this one a lot.
Distance and overlapping are very important. In full text search, you see, well, for full text search, you actually use some functions, which I think interface with those operators. Let's see, GIST also supports, I think it supports, there's some support for arrays with it.
I should have wrote it out, that was my mistake. And the cool thing is you can actually implement your own indexing scheme. So basically, actually, a lot of the data type authors that are in core Postgres that need to use these operators interface with the GIST indexing and basically write their own index to it. And that's the really cool functionality of Postgres.
So if you decide that you want to implement your own data type, which actually you can do without recompiling the database, you can then implement your own indexing scheme to make sure you can officially take advantage of it and take advantage of all of the GIST mechanics. So we're all going to do that right when we're done with this. Because we all love C. Actually, I do like C.
Oh, I did write it all out, of course. All right. I do like C, but I'm not very good at it. So GIST works with full text search. I will say, though, you want to use the GIN index with full text search, which is outside of the scope of this talk. But essentially, the way I describe GIN index is that it's like a hash table. And it's like so, so, so fast.
It's basically used for things where you need to introspect your data, such as like a JSON object or a full text search document. And basically, it does constant time lookups. It is really fast. Just also works with arrays. PostGIS. Actually, how many people use PostGIS?
Cool. It's awesome. Actually, I was just at FOSS4G. It was such an awesome conference. Or I should say FOSS4G NA. I had such a good time. So basically, you can work with geometric types. Trigrams. Trigrams are basically, you use them if you're trying to write like an autocorrector or a spell checker. It works well with those.
And of course, ranges. Actually, one quick note on geometric types, particularly points. So there's a feature in the GIST index. I think it was added in either 9.2 or 9.3. Well, let's say 9.3, because 9.2 is deprecated. I think it was added 9.2, though. Basically, you can do k-nearest neighbor lookups, which are useful for saying like, hey, I'm at this point here.
What are the five nearest Starbucks to the University of Ottawa? So as long as you have those points and you have the GIST index there, you can basically, again, it's almost like a constant look up, say like, oh, here's all the five closest Starbucks. Awesome. Very, very useful. I've used that in the real world. Those indexes get to be a little bit bigger.
But what you pay for the index size, you gain in lookup performance. And that's really what you need for it. So Postgres ranges. So basically, we're going to use the GIST index to do our range lookups.
And let's do that. Yep. There it is, the GIST index. Oh, but the overlaps operator in this case is the && that we're using with the range type. Yeah, yeah, yeah.
All right. So we're creating our GIST index. So because we're indexing more operators, it's going to be a little bit larger, especially with this indexing scheme. So it's going to take a little bit more time. Usually not this much time. There we go.
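[The index creation and size check, with an assumed index name:]
    CREATE INDEX bookings2_booking_time_idx
        ON bookings2 USING gist (booking_time);

    SELECT pg_size_pretty(pg_relation_size('bookings2_booking_time_idx'));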
And basically, if you look at the size of the index, this is 93 megabytes, which is a little bit larger than our B-tree. But again, our data type, our range type, is a little bit larger. I believe these are 16 bytes for the timestamps. And there are more operators involved. So we need to do a little bit more index path searching.
And actually, I actually have a thread on the hackers list about this particular data set with indexing schemes, which is outside the scope of this talk. But we actually hit a little bit of an edge case with the data set that we generated here.
That all said, we created the index. So let's do some look ups. And now the most important look up is, well, can we see if this room is booked at this given time? Let's see what happens. We can. And it's fast. It's like total planning time execution time is,
it took more time to plan than execute, which is pretty cool. Which means if we have the statement already prepared, it's even faster. But yeah, we use that index. And actually, if we want to see what was returned, cool.
It turns out, yes, the room is booked from 9 AM to 12.30 PM today for this talk. So it does return that data, and it returns it lightning fast. That's really cool. This is a really cool data type. Like I said, it literally changed my life. So what if we want to pull all the events for a given day?
We can use that with, this is the am I included in operator. And I should say, this operator here is the overlaps operator. So let's see, can we get all the events in a given day in this room?
Survey says, yes, we can. And we can use the index. And it's pretty fast. It's like, in this case, two milliseconds. Let's see if we get, yeah. So basically, part of it, and I probably should run it multiple times, is to let the data warm up the cache. And we can basically do the index lookups in memory at that point, because everything's been loaded into the cache. This is also a static system.
So in a real system, you have things dropping in and out of your shared buffer cache anyway. So again, your mileage may vary with performance. But the long story short is, yes, basically all those queries you're trying to do before, we can do them with the range types and use the index and get that performance lookup. So we're doing better now in terms of looking up things
for the eventual calendar we're going to build. And actually, just to prove what data is returned, we have our randomly ordered list, which happens to be in order of all the events for this given day.
It works. Cool. So let's start architecting our application, because the most important thing when we're writing software is to plan. I had a professor in college who said minutes of planning
saves hours of coding. I've taken that with me to this very day. So how should we think about managing availability? Well, as I mentioned, if we think about a space, there's basically three statuses that can occur. It can be closed. Nothing can book at all. It can be available, or it could be booked or unavailable.
And the reason I separated it into three different things is that you could say, well, isn't closed technically unavailable? Well, one could say, yes, it is. But sometimes I like to open up my space additional hours, even when it's closed. So I want to be able to keep track of the difference between being closed, available, and booked.
So yeah, that's actually one of the real world complications I did throw into this, because I found it useful. Also, it's going to allow us to use prettier pictures, which you'll see in a moment. So ultimately, what's our calendar tuple? What are the things that we want to look up? We ultimately need to look up what room we're in, what the status is, closed, available, booked,
and what range this all applies to, which is essentially, is DMS 1150 available from 1230 PM to 8 PM?
So let's start doing some pictures to visualize it, because I find a picture is worth, indeed, 1,000 words. So let's take today. Let's start with nothing. We have no rules in our system. Basically, if we have no rules in our system, or maybe I've inserted one rule, basically I'm saying we're closed. We're closed from midnight to midnight on this given day.
So our range would be tstzrange('2018-05-30 00:00', '2018-05-31 00:00'). We good? Basic range? The color came out awful. So that's supposed to be like a lightish green on it.
I don't know why it looks that bad. Basically, let's say, all right, we want to say that we can book anything in DMS 1150 from 8 AM to 8 PM, or 2000 hours. So that would basically give us a range of that time.
In that ugly green color, that's really hard to see. Now let's say we know that there's two tutorials booked in this room today. There's one from 9 to 1230, and there's one from 430 to 630, which is actually not true, because I just made that up as an example. But yeah, so basically, we have two unavailability ranges,
one from that first time, one from that second time. And if we put it all together, ultimately, this is what our calendar should look like if we're representing it in range format, that we're closed from midnight to 8. Then we have some availability from 8 to 9.
Then from 9 to 1230 PM, we have a tutorial. From 1230 to, what is it, 4 PM, we're available. Then from 4 to 630, we're unavailable. And then 630 to 8, we're available. 8 to midnight, we're closed. So ultimately, we're going to be able to create 1, 2, 3, 4, 5, 6, 7 different ranges in order to get that full gamut of the day.
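[The end state being described, written out as ranges; the exact times are approximated from the example:]
    SELECT *
    FROM (VALUES
        ('closed',      tstzrange('2018-05-30 00:00', '2018-05-30 08:00')),
        ('available',   tstzrange('2018-05-30 08:00', '2018-05-30 09:00')),
        ('unavailable', tstzrange('2018-05-30 09:00', '2018-05-30 12:30')),
        ('available',   tstzrange('2018-05-30 12:30', '2018-05-30 16:30')),
        ('unavailable', tstzrange('2018-05-30 16:30', '2018-05-30 18:30')),
        ('available',   tstzrange('2018-05-30 18:30', '2018-05-30 20:00')),
        ('closed',      tstzrange('2018-05-30 20:00', '2018-05-31 00:00'))
    ) AS calendar (status, calendar_range);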
We're good with that. Cool. But the way I described it is that we only have these four range tuples. Because as we went through that process, all I said was, well, we know we're closed from here. We know we're available from here.
And we know we're booked at these different times. And we could definitely do some math every single time to be able to generate what is our calendar. But I can tell you from experience, that is a very inefficient lookup. We ultimately want to be able to turn these four ranges into these seven ranges. Because from a performance perspective, particularly for the end users looking
at the calendar, we've seen that we can look up ranges really quickly. But we're going to see that doing math around getting to this type of formula is going to take a little bit of work. So basically, on paper, hand-waving it, this is easy. But here's other things we need to consider.
If we're trying to build a calendar, we need to divide up those ranges. And out of the box, Postgres does not deal with non-contiguous ranges. Operative word is yet, because I believe people have been trying to do some work on improving that. But we need to deal with what is Postgres 10, what is the stable release right now, not like some theoretical patch that
doesn't exist. We also just looked at availability for one day. If you think about it, we want to look at the entire calendar for DMS. Well, that could be like one giant range starting from midnight of today going on to infinity, because we can do infinite ranges. And you could probably see from both a data perspective and a math
perspective, it might be hard to look at an infinite chunk of time. In the real world, I like to chunk things up by days for a lot of different reasons, particularly because I like to say, hey, show me all my calendar for a given day. Also, what happens to data in the past? We don't care about booking a room on 5/29,
unless you have a time turner. But if you don't have a time turner, you really don't care about that. So we probably want to be able to archive all that data or do something with that. That's a real world concept. We're not going to deal with that here. And also, data in the future. How do I keep generating more calendar? If I'm not dealing with infinite ranges,
I'm only dealing with chunks day by day by day, I'm going to need to be able to generate new chunks as we move into the future. How far in the future do I want to generate that data that has real world performance ramifications? And also, I need to have something to keep generating that data in the future as time increases. So there's some real management things to consider.
Also, with bookings, we want to make sure that we don't double book. But in the real world, as I said, there are places where you do double book. I have a private room in a restaurant that seats 50 people. Well, I can have a party of 20 over here and a party of 25 over here. And maybe I can allow a five top to come into it. Also, this is just for one space.
As you see, DMS 1150 is one of many rooms within the University of Ottawa, let alone the Demarche building. So there's a lot of things to consider in the real world. And this is basically to say, oh, I thought I had a little icon there, like the face emoji.
This is basically to say there's a lot to consider here. Even though we are going to look at a smaller scope of the problem, there's still a lot that we need to take into consideration. So I have been yammering for an hour. And this is originally where I thought I would do a short five-minute break if we needed to stretch. Are we feeling like we need to get into this? Because after this, we're going to be hitting it even harder.
So let's take five minutes, because I know there's a lot of content coming up. And at 10:07, we will begin again. So I realize if we're doing the theoretical one minute per slide, I do have 90 slides left.
So let's hit this hard. So let's start designing our application. So we're going to start building out our schema, because it's good to build out your schema. I am all for building out schemas. I'm not taking a dig at any schema-less databases at all. So first, we're going to start with a room.
So a room is very simple. And also, I realized I made annotations of the data types we're going to use. And I did not include the identity columns, because this was from a previous talk. I'm sorry. But anyway, all the IDs where it says serial, let's switch it with identity columns. So room, we're going to use surrogate keys. I don't want to do a surrogate key versus natural key debate.
We're just using surrogate keys. And actually, speaking of which, the original application was written with Django as my web interface to the database. But a lot of this functionality was captured within the database. And I just made it work with Django. But we'll get there. So the room has a name. That's all. Then we're going to create something called
the availability rule. Now, the availability rule is essentially what's going to allow us to manage whether or not we're open into the future. Basically, I'm going to take enough parameters that I would essentially either be able to have a script, or a daemon, or a program that's able to say, all right, we'll keep generating these rules 52 weeks in the future, or 80 weeks in the future.
Or remember, I had a customer who was like, we need to have, we do a lot of major life events. We need to have five years of availability into the future. So basically, we have a foreign key to the room, because it needs to be associated with a room. We're going to have an integer array of the days of the week. Now, that's going to store, remember our ISO day of the week
where Monday is 1 and Sunday is 7? That's what that's going to store. And that can store a collection of it. Because instead of having an availability rule for every single day of the week, which is wasteful of space, we can take advantage of our integer array type and say, all right, well, this rule can apply from Monday through Friday. Because for instance, maybe my operating hours Monday through Friday
are always like 8 AM to 10 PM. And Saturday, we're going to stay open to 4 AM, because we're going to get crunked. So that's probably an American term. So then start time, end time. And notice we're using the time type here. Why? Because this is a generic rule.
I'm saying that, all right, because on any given Monday, or Wednesday, or whatever it is, from 8 AM to 8 PM, I'm going to be open. We don't need the time stamp there. In fact, we don't want the time stamp there, because we're going to be generating some generic information. And our time type, I believe, is a little bit smaller. And then how many weeks into the future do we want to generate it?
Cool. Like I said, this is kind of a real world application. You can build a calendaring system off this. I want it to be practical. So here is the SQL that does it. Why am I showing all of this SQL? Because we're going to run it, that's why.
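The slide SQL itself isn't captured in the transcript, but a minimal sketch of what those first two tables might look like (hypothetical names, using identity columns as mentioned above) would be:

```sql
CREATE TABLE room (
    id   int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE availability_rule (
    id                         int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id                    int NOT NULL REFERENCES room (id) ON DELETE CASCADE,
    days_of_week               int[] NOT NULL,  -- ISO days of the week: Monday = 1 .. Sunday = 7
    start_time                 time NOT NULL,
    end_time                   time NOT NULL,
    generate_weeks_into_future int NOT NULL DEFAULT 52
);
```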
Also, just to point out some things here, I'm using a specific foreign key definition here. That's the references, the table name, and the column that I want to use as my foreign key. On delete cascade, what does that mean? By default in Postgres, if you have a foreign key, and I delete the parent, and I say, all right, I delete the room,
it's going to bark and say, no, you still have data referencing it. Like, why are you deleting it? Don't do that. That's one of the points of a foreign key, it's a safety check. If I say on delete cascade, I basically say, well, if I delete the room, then also delete all the availability rules associated with the room. The reason I'm doing that is actually, this is primarily
for exercise purposes, because we're going to be deleting things a lot, and I don't want to get that error. In the real world, what I actually would do is I actually would not delete things. I'd have one additional column on it. I'd call it, like, is active or something. And basically, if the row was active, I would set that to true.
And if I wanted to delete it, I would take that flag and set it to false. And I'd make sure my queries would not query it. I would say Postgres 10 and really the upcoming Postgres 11, what I can do is I can create an archive table and basically have this as a partition table, where all my active rows are in one partition
and all my inactive rows are in the other. The other reason I try not to delete things is one of it's for auditing purposes, to see what actually is happening in the system. The other thing is, let's say the application code doesn't do what you think it's going to do, and you start deleting things that you really don't want to delete.
Because, of course, you're taking backups, and you have a point in time recovery available, you can go back to the point in time before you deleted it and restore everything. So I mean, you're kind of guarded against that. But then you also lose all the other good transactions that occurred in the same time. So it can be a whole mess. If you can afford it, you probably don't want to delete things. But again, it's up to you.
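A sketch of that soft-delete alternative, assuming a hypothetical is_active flag instead of actually deleting rows:

```sql
-- keep the row around instead of deleting it
ALTER TABLE availability_rule ADD COLUMN is_active boolean NOT NULL DEFAULT true;

-- "deleting" a rule just flips the flag...
UPDATE availability_rule SET is_active = false WHERE id = 123;

-- ...and the application only ever queries active rows
SELECT * FROM availability_rule WHERE room_id = 1 AND is_active;
```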
So long story short, let's create our tables. And they're created. And they're created validly. The other thing to note, too, is that when you do create a foreign key, you do get an index by default, which is pretty cool. And that's actually important to know, because a common mistake
is you add a foreign key, and then you create an index to look up things specifically on your foreign key. So now you have two indexes doing the same thing. And that, we call, is really bad. So don't do that. Also, and actually the way you can verify this, here's a fun command line trick.
We'll call it on the table availability rule. If you do backslash d, and this is very intuitive, you basically get a printout of everything that's in the table, and indexes, and other things that are associated directly with that table. And sure enough, here's your foreign key, which shows your foreign key constraint, which is basically using an index.
I mean, I think if you don't know what the natural key would be,
I think it would be a combination of these four columns. That's what the natural key would be. Because you need those four things to uniquely identify it.
And if I do something where I feel like, if you don't get where I'm explaining this, this was my life for six years. So if I'm making assumptions and you're confused, please stop me. I'm more than happy to go into it.
All right. So these are the generic rules. But as we see, if we are going to just use generic rules to say when we're open, we're going to have to do a lot of math to figure out when we're actually open. And for the purposes of our users who are looking up our calendar, math is bad because math is slow. We want to be able to just say, oh, the math is already pre-computed, just look things up.
So that's what this availability table is. Basically, the availability table are all the rows that will be generated from our availability rule. And as we say, by default, we will be generating, well, 52 rows times the number of days of the week that we have in this given rule. So if we basically say, we're creating a rule for being open 8 AM to 8 PM Monday through Friday,
well, 5 times 52 gives us 260 of these rules that we'll be generating. And basically, these rules get specific. So now we're using our range type, the available range. So this would be specifically, let's say, we generate a rule for today. It would be May 30 from 8 AM to May 30 at 8 PM.
We're also storing the date here. And the reason I'm storing the date is just for the convenience of lookup. Let's just say I'm going to look up all the operating hours in the system for May 30. I may want to do that.
It's more for convenience than necessity. You could do everything with ranges. But the application developer in me sometimes likes to just add a little bit of extra data to make it easier. So, yep. Did you consider some kind of a mistake?
I did. And maybe the missing piece of this talk was what I did before the punchline, which is we actually had a materialized view for the final calendar. Let's get through. I'll tell you my experience with the materialized view. We did not have a good experience with it.
I think it had to do with our data set. It probably could have been just my poor programming. But I'll explain why this actually led us to build out the final 10 minutes of the talk, that materialized view. So before we create this table, there was actually, I had a digression about the SP-GiST index, which I mentioned before.
So SP-GiST stands for Space-Partitioned Generalized Search Tree. And actually, it was designed for dealing with unbalanced data, which is something you can often see with ranges. So actually, when I originally gave this talk, and actually in my real world application, I use the SP-GiST index. While preparing for this talk, I
basically hit a performance edge case, which I reported to the hackers list, because I couldn't actually build the SP-GiST index on this example. And it's probably because when I was generating the data for the example, those 1.2 million rows, I was actually generating it in a balanced way, because it was
constantly increasing. In a real scheduling application, while time does sort of constantly increase, because we keep moving into the future, we don't have time turners, you tend to insert the data kind of randomly, though, because maybe, picking on David, he generates some data, he inserts a booking for May 31, but then I insert a booking for May 30,
and then Michael inserts a booking for June 2. But even though it is sort of increasing over time, it's still being inserted in an unbalanced way, because we're inserting the information randomly. And that's why I never really hit that issue in production. So normally, I was going to do a demo about how to use SP-GiST indexes, but I couldn't,
because I hit this edge case with it. And of course, I'm writing the tutorial at the last minute, so I didn't get to write the correct one. But mostly, I want to throw out that there's this thing called SP-GiST, and it's definitely worth looking at in your own time. And we had really good performance using the index in the real world. So I just want to make sure to say this actually
is a really useful tool. I'm sorry, I can't actually demo it. So the other lesson for that is always test your indexes before you put it into production to see what makes the most sense for your data set. There's a little of an art to index selection.
So anyway, here's our table for the availability rules. And we're going to use a GIST index on our available ranges, because as you recall, we're going to need to be able to look up our ranges fairly quickly, and we saw that we are able to get good performance with our data using the GIST index. We're looking it up on ranges.
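A sketch of what that table and its GiST index might look like, with hypothetical names, assuming tstzrange for the range column:

```sql
CREATE TABLE availability (
    id                   int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id              int NOT NULL REFERENCES room (id) ON DELETE CASCADE,
    availability_rule_id int NOT NULL REFERENCES availability_rule (id) ON DELETE CASCADE,
    available_date       date NOT NULL,
    available_range      tstzrange NOT NULL
);

-- GiST supports the range operators (overlaps, contains, etc.) we'll be querying with
CREATE INDEX availability_available_range_gist_idx
    ON availability
    USING gist (available_range);
```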
So everything creates. Now, of course, maybe arguably the most important thing, we need to be able to book our room. And that's what this unavailability table is for.
It's actually very simple. It says, I have an unavailable date, I have an unavailable range, and I have a room ID. And the range, of course, is our range. We're booked right now from 9 AM to 12:30 PM. That's our range, and it's the very specific range of being on May 30. And again, the data is just there for convenience.
For instance, I want to be able to look up all the bookings within a given day. We'll also put a GIST index on the unavailable range, because again, we need to be able to do those fast range lookups, particularly, as you see, when we're generating our calendar. Everything's created.
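For reference, a sketch of the bookings table just created, again with hypothetical names:

```sql
CREATE TABLE unavailability (
    id                int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id           int NOT NULL REFERENCES room (id) ON DELETE CASCADE,
    unavailable_date  date NOT NULL,
    unavailable_range tstzrange NOT NULL
);

-- fast range lookups on bookings, e.g. overlap checks while generating the calendar
CREATE INDEX unavailability_unavailable_range_gist_idx
    ON unavailability
    USING gist (unavailable_range);
```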
Last but not least, I lied. This table is the most important table, because this is the table that's going to let us basically build a calendar where we can look up things very, very, very, very quickly. So we call it calendar. It's associated with room ID. Basically, it keeps track of our status.
In this case, I call the statuses available, unavailable, and closed. And it has a date, because here's where the date is actually very useful, because while I keep showing these really fast range lookups, it's still faster to do an equals lookup on a given date, because it's a smaller data type.
It's a discrete value. And we can find it in the index really quickly. And then, of course, we have the range. So recall that we showed what our availability tuple was earlier, where we need to know what room we're in, what the status of that room is, and how long that status applies to in that room. This is that tuple. We're going to be storing that information in this table.
And this is going to let us do these really fast lookups. No, and here's why. Because this is not, for the purposes of this exercise, we're not going to need that. In the real world, let's talk about that after,
just because I know your application. In the real world, it's a much more complicated table. Anyway, TLDR, so basically for this table,
I actually use the multi-column index, because often when I'm looking, this is really just to say, I want to look up the calendar. I just want to see all the bookings within, or bookings open and closed within this given calendar day. And we can use the multi-column index here, because we have a room. And we know that a room is going to have a lot of different calendars.
So I can actually take advantage of the multi-column index there, on the room and then on the date. We actually could probably create a GiST index on this too. But this works.
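A sketch of the calendar table and that multi-column index; the text status column with a CHECK constraint is an assumption, since the talk only names the three statuses:

```sql
CREATE TABLE calendar (
    id             int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    room_id        int NOT NULL REFERENCES room (id) ON DELETE CASCADE,
    status         text NOT NULL
        CHECK (status IN ('available', 'unavailable', 'closed')),
    calendar_date  date NOT NULL,
    calendar_range tstzrange NOT NULL
);

-- "show me this room's calendar for this day" is the hot query path
CREATE INDEX calendar_room_id_calendar_date_idx
    ON calendar (room_id, calendar_date);
```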
The date represents, that would be correct. Because we're basically going to be logically chunking our calendar by a given date. So we know that nothing is ever, so even if I'm booking a room, let's say from 11 PM to 1 AM, I'm going to cut off the calendar at midnight. So that way, that booking from, basically you'd
see a booking within two chunks. You'd see it from 11 PM to midnight, and then the next day, midnight to 1 AM. That's me as an application architect saying, I want to do it this way. And you'd have two rows for that example? Correct. I purposely made sure we did not hit that example in this talk, but the code does account for that example.
Yes? It's OK. I could. I could. And actually, it's funny. So there's a way to do something called exclusion constraints in Postgres,
where I can say, make sure these two rows, it's a little bit more complex way of doing a unique constraint. It's basically saying, these two rows don't overlap. In the real world, I had something like that, but it was too good. And there are cases where you do want to double book, so I did not remove that. And also, I want to make sure there's enough time to get to the punchline of the talk.
So yeah, could I put a guardrail in there? Absolutely. So I could do a check constraint, for instance. Maybe in the next version of this talk, I will do that. So anyway, we've created all of our models. So great, we're done, right? No. Because this is just the schema.
There's a lot of business logic here that we need to create. For instance, we need to generate the initial calendar. When I create a room, I want to make sure I can create enough. I basically need to be able to say, all right, I have a room. Well, I don't have any rules around it, so I'm always closed. So I need to create rules to say, I'm closed for our default, which is 52 weeks.
We also need to generate some availability rules. And when we create those rules, we'll be able to generate all of the different individual availability chunks off of those rules. And of course, we want to insert some bookings as well. So this is the fun part that's built an application.
So to do this, though, we're actually going to be able to build everything within Postgres to build and manage and keep our information in sync. But we're going to cover a few topics in order to get there. And this was actually, when I was first giving this talk, I was like, well, this is really a lot where the tutorial is. Because in order to build this, there's
a lot of advanced features of Postgres that one needs to take advantage of. So we'll be looking a little bit more of that generate series function that we saw. We're going to be writing some recursive queries. Did you know that you can do recursion within Postgres? It's a way, within SQL itself, you can loop through things, which is pretty cool. There's also, we're also going to cover SQL functions, including
set returning functions. And then if you don't like recursion, there are procedural languages, and you can actually use iterative programming languages within Postgres, including the built-in PL/pgSQL, and then triggers. Because we need to keep everything in sync, and triggers do that. And we can get into a whole trigger debate
later, because that's actually part of the, we actually will get into that, because that's the point of the talk. So this was originally why I planned for our first break. But I realized that didn't make sense, moving on. Generate series. So as I joked before, most often when you see generate series, you see it for generating test data.
But actually, I'd say a really nice thing about generate series is that it actually helps generate missing data. So let's take reporting, for instance. Now let's say you need to report all the sales for a given month. And let's say you don't have any sales on May 29th. It was just a slow day. Well, if you were just to generate the report point blank, just using the data from the database,
you'd have a gap in the report. You would never see May 29th show up. Because of course, you cannot account for the absence of data in a database. But with generate series, what you could do is I could say, hey, select from generate series my interval of May 1st to May 31st increment by one day, left outer join sales on the given dates,
and then aggregate it up. And sure enough, you'd see May 29th there, and a sum total of zero. So it's a really useful function. It's more than just generating a lot of BS data that causes you to send bug reports to the hackers list. It actually does useful applications.
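A sketch of that gap-filling pattern, assuming a hypothetical sales table with a sale_date and an amount:

```sql
SELECT days.day::date AS sale_day,
       COALESCE(SUM(sales.amount), 0) AS total
FROM generate_series('2018-05-01'::date, '2018-05-31'::date, interval '1 day') AS days(day)
LEFT OUTER JOIN sales ON sales.sale_date = days.day::date
GROUP BY days.day
ORDER BY days.day;
-- May 29th now shows up with a total of 0 instead of silently disappearing
```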
And here, for our calendar application, we're going to use it to help basically bootstrap our calendar. We'll be able to quickly generate our closed blocks, or our availability rules. The key thing about this, too, and I think I mentioned this before, is that it's a set returning function.
It doesn't return one value, it returns a set of values, which allows you to perform the various database operations on it, such as joins. So just for fun, because I guess we haven't done a date generating function yet, so generating a series of dates.
Using that function, I probably should show it off for one more second. Select from generate_series between two dates on an interval of one day. I generated all the dates for the year. So now I could break down my sales report by every single day of the year. And actually, just to prove it, as you see, everything generates.
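The query being run is roughly something like this:

```sql
-- one row per date for the whole year
SELECT generate_series('2018-01-01'::date, '2018-12-31'::date, interval '1 day')::date AS day;
```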
Recursion. Since Postgres 8.4, you've been able to do recursive queries. Who has actually used recursion within Postgres? Cool.
So the reason Mr. Fetter is very excited is that he was a big proponent for getting this feature in. Actually, I think Postgres was the first database to have recursive queries. Definitely before SQL Server, anyway. And then we did writable. Didn't we do writable? Yeah, we did writable queries first, writable CTEs.
Anyway, so basically with the syntax with recursive, which you put at the very top of your query, the recursive flag lets you turn on the ability to do recursive queries. You need to have defined a base case and a recursive case. That's straight out of math. Important thing to note is that we'll see some of these queries come in.
The way you separate your base case from your recursive case is with union. If you use union all, you basically say, I want the full results of the query to come out. Like essentially, even if there's duplicate rows, it will return those duplicate rows. Union will eliminate the duplicates.
And actually, I used this to speed up queries that I was running in the real world, where I need to eliminate duplicates, and I was running a distinct after. I was paying that cost penalty for running a distinct at the very end. When I was unioning as I went along the iterations, it greatly sped up those queries. I think it was one case, like 100x performance boost.
So yeah, that's what I wanted to point that out. Like all good recursive queries, you can hit infinite loops. In this case, you can actually hit infinite loops. I did have safeguards in our production code, where I said, well, if we're looping more than, depending on what it was,
let's say I'm looping more than 50 times, just exit out of the recursive case, because something is wrong. Of course, the better fix could be not writing code that can recursively loop, but there could be real world cases. Well, basically, the real world case I would have is that I'd have a hierarchical table, and someone would say, all right,
we'll point this row at this row, but then point this row at this row. And then we'd do a recursive descent on it, and it would be like doing this, this, this, this. So it's good to break out for infinite cycles. The code examples you're going to see are going to assume that we're perfect, and we'll always be able to break out of our recursive cases.
So let's build a recursive function. So I chose to use a factorial. Hey, guess what? Now you can do factorials in Postgres. And I actually do this with window functions, too, but I decided to do it this way. So what does it say?
So we have, starting up here, we're going to store our factorial in a variable called n, and we're going to basically keep track of how deep we are using a variable called i. So this is our base case, because we're going to start with 1. And then we say, all right, well, here's our recursive case.
We're going to select from our factorial table. This is our recursion right here. And as long as i plus 1 is less than or equal to 100, we're going to calculate the factorial. And we're also going to increment i. Pretty simple. So basically, this is hard-coded to say calculate 100 factorial.
And basically, to prove that we're running 100 factorial, instead of just selecting n, I also select our iterator. So let's see if it works. Who thinks 100 factorial is going to blow up my machine? Actually, I should have picked something to blow up the machine specifically.
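The slide query isn't reproduced in the transcript, but a recursive CTE along the lines being described would be:

```sql
WITH RECURSIVE factorial (n, i) AS (
    -- base case: 1! = 1
    SELECT 1::numeric, 1
    UNION ALL
    -- recursive case: keep multiplying by the next i until we reach 100
    SELECT n * (i + 1), i + 1
    FROM factorial
    WHERE i + 1 <= 100
)
SELECT i, n FROM factorial;
```

The numeric cast matters here: 100 factorial overflows every integer type, but numeric can hold it.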
Sure enough, it works. And actually, in this case, I didn't say return 100 factorial. I said return every single step along the way, so we can see how quickly everything grows. And by the way, that ran pretty fast, too, which is awesome. Postgres is awesome. I mean, obviously I'm biased, but cool.
Next step. In this case, is this one going to blow up my machine? Let's find out. Yes, because I never canceled out of, if you look at it,
if you look over here, I never canceled out of the recursive case. I never said run 100 factorial or anything. I just said keep iterating, iterating, iterating. And sure enough, in this case, we overflowed. So yeah, exactly.
So actually, the good news is we overflowed because we had a numeric overflow. Basically, our numeric got so big. And that blows my mind, because the numeric data type is gigantic in Postgres. But yeah, we hit the numeric overflow. If I did not hit that numeric overflow, we would still be iterating.
Yeah, you would have to be very careful. Again, this is one of those things where with great power comes great responsibility. Next slide. Cool. So in this case, I said instead of having this set returning
table of all factorials, find the actual 100 factorial. In this case, I was lazy and said pick the max, because I know in this case the max will be 100 factorial. Sure enough, it's going to do it. We get the one row, and it's pretty fast.
Cool. Now here's the one thing, though. If we're thinking about a factorial function, sorry, yeah, I think there's a setting for max recursion depth.
But I've actually not, in the real world, I've not hit it. I've just had infinite loops. And basically getting alerts saying, hey, your database is using all the CPU. What's going on? That's why I always say, you know, Postgres is a great system, but I also trust making sure I have my guards in place
to not loop the system. So yeah, again, I say this as a self-deprecating application developer, but sometimes I trust the application developer more, because at least I know the code is there. All right, so the cool thing is that we're doing these factorials,
and we're using recursion in Postgres. The bad thing is that every single time I'm hard coding which factorial that I want. And I'd like to be able to do this in a more scalable way, because as an application developer, I do like being able to make things generic, and then handing it off and not needing to worry about it. So we can do this with functions.
Who's used functions in Postgres before? Cool. Functions are, this could actually be a very long story. So I'm going to try and do a long story short. Functions are a great way, in any programming language, to capture repeated behavior, because writing the same thing, copying and pasting over and over is terrible, and I hope nobody does that in here.
Postgres 11 introduces something called stored procedures, which one of the big advantages with those is that you can actually do transactions inside the function. Up until Postgres 11, you can have a function existing inside of a transaction, but you can't do anything transactional. Well, you can't run separate transactions inside the function.
This is all changing in Postgres 11. I can't wait to see the kind of code, the kind of systems people are able to write with that. My mind's already racing for things that I want to build because of it. What can SQL functions do? Well, they have a lot of properties. If you go to the documentation page with the create function syntax, there's a lot you can see there. And I want to summarize a few things.
First, functions can have input and output, duh. That's like math. Volatility. This is actually very important. There's three volatility types, immutable, stable, and volatile. And all affect the ability, basically how the function is used in the query planner, which
can basically affect performance. Volatile is the default type. So if you don't set the volatility type, it will be volatile. And volatile will perform the worst. Now, volatile you need if you're doing a write operation inside the function. But if you're doing only reads, let alone only calculations, and you don't set your volatility type,
your code will execute more slowly. And you're going to be like, oh my god, SQL functions suck in Postgres. Why am I using this? The way I summarize stable is if you're using a select query where you're reading data from the table, you can use stable. Basically, what the stable type says is that within the context of my transaction,
my data won't change. So if I'm calling this function repeatedly, know that the underlying table won't change for this transaction. Immutable basically says, immutable is immutable. My data will never change. So something like a factorial could be an immutable function. Because basically, based upon my input,
my output is never going to change based upon that input. Parallel safety, this was added in 9.6. So since 9.6 Postgres has been supporting parallel query. So the default for this is a function is parallel unsafe. But if you're pretty sure a function is parallel safe, like a factorial, you can mark it
as parallel safe, which is what I do in these examples. I don't believe my immutable factorial function that we will write, or I will write, will be able to take advantage of parallelism. But I'm marking it as parallel safe because I think it's just good for example purposes to get that concept out there. There's also leak proof and security definer.
If you're working with row level security, which is a topic, well, row level security was introduced in Postgres 9.5 and is a topic way out of scope for this talk. If you know the function is leak proof, meaning basically it's not going to let you leak information or put information in a place where it shouldn't be, you can mark the function as leak proof. Security definer basically says,
you can execute this function as a particular user within the system. These are two things that are very important if you're writing applications that need to have security in it, which should be everything that you write in production. So you should be very familiar with them. You might not need row level security for production. That is like a whole different level of security, but it's there.
Execution cost. If you really want to play with the query planner and try to optimize things, and you think you're smarter than the defaults, which maybe you are, you can actually set what you think the execution cost of the function is, which will help with query planning purposes, which is pretty cool. I've actually never done that. Maybe I should have. And you can also say what language the function's going to be in, which we're
going to talk about in a little bit, because it doesn't have to be in SQL. So let's create a function. So as I complained, that recursive factorial function we had was not generic. It was hard coded to 100. In this case, we can say, hey, run any factorial.
So the basic thing is I basically have a function definition. I take an input, integer of n. Basically, I say that calculate whatever factorial. It's the same function as before, but I basically have this variable substitution here.
The default way to substitute variables in Postgres functions is the dollar sign and the positional argument. So in this case, this is dollar sign 1. And then, as before, as we saw in that final example, we take the max value from that calculation, and that's what we're going to return. So let's create this function.
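A sketch of that generic version, with a hypothetical function name, the positional $1 argument, and the volatility and parallel-safety markings just discussed:

```sql
CREATE OR REPLACE FUNCTION factorial(int)
RETURNS numeric
AS $$
    WITH RECURSIVE factorial (n, i) AS (
        SELECT 1::numeric, 1
        UNION ALL
        SELECT n * (i + 1), i + 1
        FROM factorial
        WHERE i + 1 <= $1    -- the positional argument
    )
    SELECT max(n) FROM factorial;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

SELECT factorial(100);
```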
Cool. And it worked. I was curious. And the case would be if you have,
you could have an immutable function that's actually writing to temporary tables as part of its calculation. And as a result of that, it's not parallel safe, because it's actually performing its write. So you do need the combination of possibly being parallel unsafe. It's a little aside, but I found it interesting.
Cool. This was a fun fact for Michael Glazman. All right. So as you see, we're running our factorial functions. I still can't believe the 10,000 factorial executes that quickly. And let's see if we can even see a number. It's like the matrix. So like half a second, like holy shit.
That's pretty cool. So just for fun, here's the, let's look at a query plan. And again, we're not seeing much, because a lot of, I mean, one, we're essentially doing an order one operation, because we're doing a calculation. Well, really, it's not.
It's big O of calculating a factorial. But also, the magic is hidden within the function. So the query planner is like, all right, well, you're calling a function. And all right, good luck. Let's see what else we do.
All right. So I'm just going to hand wave this one. So remember the original call we had was, we basically returned the set of every factorial from 1 to n, where n was 100. We can actually turn that into a function as well. So basically, the big difference is that when we give our return type,
we say it returns a set of numeric (SETOF numeric). And I'll just write it real quick. It's like TLDR. It works. It's the same thing as before. Now it's functionized.
Now, there's another thing where we can actually return tables. Basically, we can define specifically what values are being returned in it. And this is a very basic example. But the reason I point this out is that Postgres has a data type you can return called record, which
is essentially a generic record or a generic row that you can return. So if I need to return tuples of information, which is actually very useful in some real world applications, including this one, you can do that. The problem is that set of record is not necessarily optimized, because it's basically saying every single time,
oh, I'm getting some generic information. I might have to basically build out this tuple every single time. Table allows you to specify what that tuple will look like. And the reason I wanted to point this out is, in some of the examples coming up with building our actual application, we are going to see that specifically using table.
So while this one performs very similarly to the set of function, I just want to point out that we can get specific about set returning, basically, table returning functions. And you can specify the attributes that are returned with it. So we're going to trust that this all works.
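A sketch of that table-returning flavor, with a hypothetical name; it's the same calculation, but every step comes back as a well-defined tuple:

```sql
CREATE OR REPLACE FUNCTION factorial_steps(int)
RETURNS TABLE (step int, result numeric)
AS $$
    WITH RECURSIVE factorial (n, i) AS (
        SELECT 1::numeric, 1
        UNION ALL
        SELECT n * (i + 1), i + 1
        FROM factorial
        WHERE i + 1 <= $1
    )
    SELECT i, n FROM factorial;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;

SELECT * FROM factorial_steps(10);
```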
And of course, I've lost my mouse. All right. Procedural languages. Sorry, I won't check the slider on, and I can't find my mouse. There we go. OK, procedural languages. OK, Postgres, the extensible database, can use languages beyond SQL.
Basically, it has this thing called the procedural language loader. It's been around since version God knows when. I think I was still in middle school when that came out. I'm a little bit older than I look, too. And basically, you can execute iterative languages or imperative languages or essentially any language that's been interfaced with the procedural language loader.
Fun fact. So Postgres has something called PL/pgSQL, which is similar to PL/SQL in Oracle, similar but different. But that was not the reason why it was written. It was basically written because Jan Wieck, who built the procedural language loader, wanted to run Tcl inside Postgres. So Tcl was the first language supported,
built in as a PL, then PL/pgSQL, then Perl and Python. There's also others you can add on. JavaScript is very popular. That's actually, you have to look up PL/V8. It actually executes scary fast. R, R is also very popular as well. And there's a bunch of others out there. There's actually a couple of new ones, JVM and Container.
LOLcode was very popular April 1st one year. But you can actually run LOLcode within your Postgres, like IM IN YR LOOP and all those fun things, and a bunch of others. This is awesome. In production, I have used PL/pgSQL and PL/Python. The examples in this talk will be using PL/pgSQL
because it's easy and let's all learn a new language anyway. But yeah, it's cool. Actually, how many people have used procedural languages in Postgres? Cool. Nice. So here's an example of PLPGSQL.
The first thing we have to do is create the, basically, load it into our database if we've not done so already, which we can use to create extension. So in this function, we're going to use PLPGSQL to implement our factorial function. So that way, we don't have to use recursion.
In fact, probably all the different language constructs in PLPGSQL are a tutorial in themself. Just going through it real quick, if you're using variables, you need to declare them in a declare block. Assignment is the dot equal syntax because equals is used to test for equality.
And you can do loops in it. You can do ifs. You can do a lot of things. You can return. But in this case, we've reduced our recursive query from that giant with select block to 4i and 1 up to n. Do the factorial and return it, just
to prove that it all works pretty fast.
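A sketch of that PL/pgSQL version, hypothetical name, showing the declare block, the := assignment, and a FOR loop in place of the recursion:

```sql
CREATE EXTENSION IF NOT EXISTS plpgsql;  -- typically installed already

CREATE OR REPLACE FUNCTION factorial_plpgsql(n int)
RETURNS numeric
AS $$
DECLARE
    result numeric := 1;
BEGIN
    FOR i IN 1..n LOOP
        result := result * i;
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;

SELECT factorial_plpgsql(100);
```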
Let's do the fun one pretty fast. And actually, if you remember, our recursive one ran in about 400 milliseconds. This one ran in 100 milliseconds. In this case, we don't need to warm up the cache because it's all compute.
So there are some cases where, yeah, PL/pgSQL will outperform SQL. And vice versa. You need to test all your different examples. Cool stuff. And next, I think, triggers.
All righty. Who has used triggers just in any database, not just Postgres? Cool. So we are familiar with what they are. But just for the purposes of definition, triggers execute functions that can be called before, after, or instead of an operation or an event.
And Postgres has basically two sets of triggers. There are data change triggers, which is probably what most people are familiar with, which are insert, update, delete. But there's also event-based triggers for things that can happen on the structure of your database. So creating tables, for instance, making a DDL change, or granting permissions
to a user in the system, a DCL change. And that's actually pretty cool because it's almost meta. Basically, I can track changes that are going on within my database and do something with them. A common example for that is auditing. I want to audit that Mr. Fetter is creating a bunch of tables in my database,
and then he teamed up with Michael, who is using immutable functions to put stuff into temporary tables. So I got all that audited now, and they're being sneaky. Triggers are atomic, which is really important. So if I'm running my code within a transaction, which by default Postgres, even if you don't explicitly say begin,
every single line is a transaction within Postgres. In order for that transaction to commit, the trigger must either execute cleanly or say, I don't want to execute this trigger. If there's an error in your trigger, which is probably the pain of most triggers,
the transaction will fail. Syntax-wise, when you're defining a trigger function, it must return either a type of trigger or event trigger. You can return null if you want to skip the operation. And a common trigger gotcha in Postgres
is dealing with old and new, which we'll see in a second. So old is basically what your record is before the change. Old is the state of the data before, particularly in an update. And new is the state of the data after the change.
A common gotcha is that people try to do a catch-all, and they return either new for every single thing or return old for every single thing. Well, old is not defined in insert because if you're inserting new data, your old data is null because it never existed. So if you return old, you cancel the trigger. The bigger gotcha is usually in delete
because in delete, there is no new. The old contains your old data before you delete. The new is null because you're deleting it from the database. Typically, you might get lazy and just return new regardless of the change. You can't do that with delete because your row might not actually delete if you're intending for it to delete.
But anyway, we're never going to delete our data because we know that we're just going to set it to inactive and put it into a partition. So trigger functions can actually execute once per modified row or once per SQL statement. Clearly, if you're able to get away with it, there's advantages to being able to execute it once per SQL statement. Unfortunately, in my line of work,
I was never able to take advantage of that. So I would have to execute one trigger per row. And what you're going to see is that this is going to come back and bite us later. If you have multiple triggers defined on the same table for something like an insert, they execute in alphabetical order, which if you need changes to occur in a particular order,
well, one, you probably should have them all in one trigger function anyway. But if you don't for whatever reason, know that gotcha because that could end up biting you hard. The cool thing is that you can use any PL language that interfaces with the trigger interface. So I know PL/pgSQL does support it.
That's what we'll be using. So I know Perl and Python definitely support it as well. I believe R does too. There's probably others. But that's nice. So if you're much more comfortable writing Python versus PL/pgSQL, write Python. Some common things to know.
So we discussed new and old. TG_OP. So first of all, these are variables that are available within your trigger functions when they're executing. So TG_OP basically tells you what operation you are using. TG_NAME, name of the trigger fired. You can get your table name, schema name.
Also, when it's being fired, is it before the changes, after the changes? That could affect your trigger behavior. And also, is it a row level trigger or a statement level trigger? Pretty cool. Cool. In most of my practitioner days, I used new, old, and TG_OP. I never had to use TG_WHEN. Again, part of this is your architecture design
for how you do it. TG_OP I found very useful because, particularly if I wrote a trigger for changes on a table, I was able to keep everything in one function, which is much nicer, particularly when debugging. So let's write some triggers. Let's do an example trigger. Sure.
So basically what I did is I created two tables. So basically, this is a completely contrived example. But I just wanted to show how triggers could work. So I have two tables. One where I store a name and a secret, which I believe is randomly generated.
And then a table B that whenever I make a certain kind of changes in table A, then I have these records in table B that also just have secrets in them. Like I said, it was a really stupid example because the fun examples are going to come shortly.
Let's create that. When I do post the slides, I probably should add a note that it's a really stupid example. But anyway, so here I write a trigger function. And basically what I want to do here
is I'm basically tracking changes that occur in table A. First thing I want to do is I want to be able to generate a secret every single time that I'm inserting a row in that table. And the secret is going to be a combination of the primary key concatenated to a last name that's inserted,
concatenated to the current timestamp, and then do a SHA-512 hash and encode it in hexadecimal. Why? I don't know. But the idea is that I don't want the application interfacing with this to be generating the secret. I just want to show that I can do it in PL/pgSQL, like I said, contrived example.
The one thing in the real world where I've done this is actually generating timestamp ranges. Basically, I've used triggers to say, well, I know I have a date. I know I have a time. I know I have a start time and an end time. Combining them all together into a timestamp, I insert it into the database.
The reason that I've done that is that sometimes it's difficult to create timestamp ranges based upon the driver you're using for your application code. So I break this up into two separate parts in the trigger. The first is if it's an insert, well, insert all these rows into table B that basically make a secret based upon i,
which I'm iterating over. I kind of borrowed the idea. I don't know. Like I said, it's stupid. And then whenever we update, again, we regenerate the secrets for B. So all right. Like I said, it's completely stupid,
but let's just create it. Now the thing is all we've done now is defined the trigger function. We've not defined any triggers yet. It's a two-step process. So we're going to create two separate triggers.
The first is a before insert trigger. And basically, it says before insert for each row, so it's a row-level trigger, execute the procedure, our sync trigger function. Easy enough. No conditions. Now the reason I separated it out is that on updates, I basically said, all right, in my application, I only need to update the secrets
if the last name changes. So I'm able to put a condition on the trigger, which says create trigger before update on the table for each row when the old last name is different from the new last name, execute the trigger. A very interesting function here, or sorry, operator here, is IS DISTINCT FROM.
So nulls. Nulls are very interesting in Postgres because a null is nothing. If you do a comparison, comparisons with nulls can break your application in weird ways. IS DISTINCT FROM is basically a fancy way of saying does not equal. But it accounts for dealing with equality comparisons with null.
Because as I said, long story short, null can do some really weird things. But if this is null and this is 'a', this will be able to say, oh, null is distinct from 'a'. Something has changed. Go ahead and execute this.
And that's actually really important in triggers because we often deal with nulls there. And we don't want to have to say, basically this simplifies saying, well, if old.lastname is not null and new.lastname is not null and old.lastname does not equal new.lastname, or old.lastname is null
and new.lastname is not null, or old.lastname is not null and new.lastname is null. IS DISTINCT FROM basically simplifies all of that into one comparison. Sorry, that was a big digression.
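The demo code itself isn't reproduced in the transcript, so here is a hedged reconstruction of its general shape: hypothetical table, column, and function names, with the pgcrypto extension assumed for the SHA-512 digest(). It covers the unconditional before-insert trigger, the conditional before-update trigger, and the IS DISTINCT FROM condition:

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- provides digest()

CREATE TABLE a (
    id         int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    last_name  text NOT NULL,
    secret     text,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE b (
    i          int NOT NULL,
    secret     text,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION a_sync_secrets()
RETURNS trigger
AS $$
BEGIN
    -- primary key || last name || current timestamp, SHA-512 hashed, hex encoded
    NEW.secret := encode(digest(NEW.id::text || NEW.last_name || now()::text, 'sha512'), 'hex');

    IF TG_OP = 'INSERT' THEN
        -- seed a handful of secrets in table b, one per value of i
        INSERT INTO b (i, secret)
        SELECT i, encode(digest(NEW.id::text || i::text || now()::text, 'sha512'), 'hex')
        FROM generate_series(1, 4) AS i;
    ELSIF TG_OP = 'UPDATE' THEN
        -- regenerate every secret in table b (contrived, so no WHERE clause)
        UPDATE b
        SET secret     = encode(digest(NEW.id::text || b.i::text || now()::text, 'sha512'), 'hex'),
            updated_at = now();
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER a_insert_secrets
    BEFORE INSERT ON a
    FOR EACH ROW
    EXECUTE PROCEDURE a_sync_secrets();

CREATE TRIGGER a_update_secrets
    BEFORE UPDATE ON a
    FOR EACH ROW
    WHEN (OLD.last_name IS DISTINCT FROM NEW.last_name)
    EXECUTE PROCEDURE a_sync_secrets();
```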
So we created some triggers. So first off, we have nothing in table A, nothing in table B. We're going to insert my last name into table A.
So we see that the trigger fired. And it generated this really long secret. And we also see that the trigger inserted data into table B with a bunch of secrets with a bunch of the i values. Cool.
Now let's see if that condition on the trigger actually works. Let's say I'm writing an update and I say, update my last name to the same value. Does it update the timestamps? The answer is no. It's still the 11 AM and 31 seconds timestamp. So it did not fire the trigger.
Or sorry, it did fire... no, actually it did not fire the trigger, because our trigger condition said only fire if the last name changes, and the last name did not change. But let's update the timestamp. So we explicitly update the timestamp.
And well, it did change in table A. But because we didn't pass that trigger condition, it didn't update the timestamps in table B, which is part of what we did in our update clause in the trigger function.
But if we do update the last name, sure enough, all right, well, it did update. Timestamp did update. In table B, it might be hard to tell because it's all a bunch of mishmash.
But all the secrets updated. And we can see that all the timestamps did update. So our trigger does work. Yay. Yeah, exactly. Cool. So let's build something real.
So recall our data structure. We all remember this, right? But just to give a brief recap, so we have our rooms. We have stuff to generate when our rooms are open. And we have stuff to generate when our rooms are booked. And then we ultimately aggregate that in a table called calendar so our users can quickly look up the availability of a given room.
So the goal is that we got to keep all of these things in sync. So first things first, we need to have a trigger for, so I should also say we're going to build this up with all the tools we've learned over the past hour.
So first we need to have our room, whenever we insert a room, we need to be able to insert all of the information about its default operating hours, which is that they're closed. And we're going to do that over the course of 52 weeks because that's what we decided as our default. So ultimately, if we scan this code real quick,
that's exactly what it does. And maybe one of the key trigger specific things is that we have this new.id here. So the new.id, that's basically the primary key that has now been generated for our room record being inserted. And we can basically associate that with the room ID foreign key
in our calendar table. And the rest of it is basically around our defaults. You can see that we have our generate series function. So we can generate a year's worth of availability for that given room. We increment by one day. So we're actually inserting 365 rows.
And anything else special? This is that default availability tuple. So let's create that function.
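A sketch of that trigger function, with hypothetical names, assuming the calendar table sketched earlier; NEW.id is the freshly generated primary key of the room being inserted:

```sql
CREATE OR REPLACE FUNCTION room_insert_default_calendar()
RETURNS trigger
AS $$
BEGIN
    -- one "closed" block per day, midnight to midnight, 52 weeks out
    INSERT INTO calendar (room_id, status, calendar_date, calendar_range)
    SELECT NEW.id,
           'closed',
           d::date,
           tstzrange(d::date, d::date + 1, '[)')
    FROM generate_series(CURRENT_DATE,
                         CURRENT_DATE + interval '52 weeks',
                         interval '1 day') AS d;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```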
And then we're going to create that trigger. And again, we're only going to do this on insert. We're not going to do it on update because really, as you saw in our room table, we only have two things. We have an ID, and we have our room name. And none of our availability is affected by the room name. Of course, if we changed the ID, it would be affected by it.
But we're not modifying our primary keys. And probably if we really want to be thorough, we'd have something to account for that. But we're not because we're not. And we don't need to do anything on delete because on our foreign keys, we did the on delete cascade. So if we deleted the room, automatically everything is going to delete that's associated with that room.
It's going to have this cascading effect. Very quick way to delete everything in your database, by the way. So now that trigger is installed. We're not going to test it out just yet because first, we're going to start. There's a few more things we need to insert into the table
before we test it. Otherwise, we're going to get a weird state of data. Now, availability rules. So recall the availability rule table is going to help us generate, when we're open, x days into the future. We give it a set of days of the week, like Monday, Tuesday, Wednesday, and a set
of operating hours, like 8 AM to 8 PM. We also say generate it 52 weeks into the future. That's our default. So what do we do here? So the most important thing, well, we're going to do two things. One, we're going to need to define a function for inserting and a function for updating.
And again, we're not going to need to define a function for deleting because it's being handled by the cascading foreign key. So in our insert function, basically we say, OK, well, start from our current date, or actually
starting from the beginning of the week for our current date, up until how many weeks into the future, which we're going to be getting from the parameter we're passing in. So actually, let me back up a second. So this function is a helper function. This is not actually the trigger function itself because you can see I'm returning void, not the trigger type.
The reason I'm doing this as a helper function is me as the application architect deciding to do it this way. So sometimes I might just want to be able to call this function and generate arbitrary availability. So again, sorry, getting back to down here. So we have generate series. Generate up until how many days into the future.
So the parameter that we're passing in is an actual row from availability rule. So you say, OK, sorry, so there's
two parameters, the availability rule and the day of the week. And the day of the week is defined as the ISO day of the week. So from here, we say, starting from the beginning of all these weeks, take our room ID, take the ID of the availability rule.
And so the available date, we're basically able to calculate the offset of it by saying, all right, well, what's today's date? Oh, I think I might have a bug in here. OK, so basically, this is supposed to be, we're able to calculate the offset of the date. Actually, no, there's no bug in here. So basically, we say, take the beginning of the week,
add the ISO day of the week and subtract 1, because if we're at the beginning of the week, and we need that first day of the beginning of the week, we don't want to increment it by one too many. This is preventing an off by one error.
We then create the range here. And basically, here's that one real world check I talked about in terms of time wrap around. It basically says, well, if we see the end time is earlier than the start time, meaning the rule wraps past midnight, then we need to increment the end of our timestamp range by one day. Otherwise, we do it as normal to generate that block.
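The function body isn't reproduced in the transcript, so this is only a hedged sketch of a helper along those lines: a hypothetical name, taking an availability_rule row plus a single ISO day of the week, and assuming the availability table sketched earlier:

```sql
CREATE OR REPLACE FUNCTION availability_rule_bulk_insert(ar availability_rule, dow int)
RETURNS void
AS $$
BEGIN
    INSERT INTO availability (room_id, availability_rule_id, available_date, available_range)
    SELECT ar.room_id,
           ar.id,
           d.available_date,
           tstzrange(
               d.available_date + ar.start_time,
               CASE
                   -- wrap-around rule (e.g. 10 PM to 1 AM): the end lands on the next day
                   WHEN ar.end_time < ar.start_time
                       THEN (d.available_date + 1) + ar.end_time
                   ELSE d.available_date + ar.end_time
               END,
               '[)')
    FROM (
        -- one date per week: start of the ISO week, offset to the requested day of the week
        SELECT (date_trunc('week', CURRENT_DATE) + w * interval '1 week')::date + (dow - 1)
                   AS available_date
        FROM generate_series(0, ar.generate_weeks_into_future - 1) AS w
    ) d;
END;
$$ LANGUAGE plpgsql;
```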
So I think the easiest way to finish explaining this function is to see it execute. But first, we'll need some availability rules around that.
All right, so now we're going to actually build out the trigger function for this. So we're building out the trigger function for generating the availability into the future. So the way we do it is that, recall that we have the array of days of the week on the availability rule function. So that could be anywhere from one through seven. So let's assume for now we're just doing Monday through Friday.
So we could have an array that says integers 1, 2, 3, 4, 5. Basically, we loop through each day of the week, and then we perform this availability rule bulk insert function that we see, which is how we're able to insert basically all of our operating hours for the next year.
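A sketch of that trigger function for the insert path, looping over the integer array with FOREACH and handing each day of the week to the hypothetical bulk-insert helper above:

```sql
CREATE OR REPLACE FUNCTION availability_rule_insert()
RETURNS trigger
AS $$
DECLARE
    dow int;
BEGIN
    -- one pass per ISO day of the week stored on the rule, e.g. '{1,2,3,4,5}'
    FOREACH dow IN ARRAY NEW.days_of_week LOOP
        PERFORM availability_rule_bulk_insert(NEW, dow);
    END LOOP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```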
And we need to be able to handle the update case, too. What happens if I update an individual availability rule record? Well, that takes a little bit more work. The insert is nice because the insert says, OK, well, I changed the availability rule.
Sorry, I inserted an availability rule to just generate all the opening times over the course of that period. Update, we've got to do a little bit more work. And actually, as you see in the comments, it gets a little bit tricky if you change the days of the week. I decided to be very naive about it and say, well,
if we've detected that we've changed the days of the week, let's just delete all the availability that we generated. Because otherwise, we've got to essentially do a merge type thing, and that would make a lot more code. And just regenerate all the rules from scratch.
So clearly, from an architectural standpoint, this is a little bit less performant, but it's easier to understand. Otherwise, if we updated anything else, we can actually use some fairly basic math to generate the ongoing availability or the ongoing openings.
And pretty much the two things that can change are your start time and your end time. So essentially, it allows you to reset your ranges. And we have that guard if, for whatever reason, our rule is, we want to be open from 10 PM to 1 AM.
Now, the one thing about this code example is that to create this function in the database, I actually need to go over here.
And we have basically the trigger that fires on inserts or updates,
because we don't need to do deletes, as I noted. All right, so now we can generate our default closed times.
Now we can generate all of our opening times. But we need to start being able to generate the calendar, because that's the key of all of this. So I actually created two helper functions that are able to generate all of the calendar information. This is the first of one of them.
Obviously, it's very legible, because there's a lot going on here. So we're going to break it down a little bit. So for this experiment, just to help keep things in context, we're going to have two availability rules. We're going to be open every day from 8 AM to 8 PM. And we're going to be also open every day from 9 PM to 10:30 PM
and try to keep it simple. So what we're going to do, since we've already created some of our data, is we're going to essentially insert those rules. First, we create a room. And then we're going to create two availability rules for that room. And I'm also, in this example, I'm assuming
that the ID of the room that's going to be generated will be one. So as you see, those two rules have been created. Actually, just to have a little bit of fun,
we see all those specific availability chunks have been created as well. I had it show up in totally random order. So so far, everything's working. We can't generate the calendar yet, but that's OK.
So let's look into these two helper functions that will actually help us generate our calendar. And actually, that's the thing I'm thinking of.
I'm going to clear out the table. One second.
We're going to come back and insert those rules later. OK, so the helper functions. So first, so I created two helper functions. Again, this is more for a matter of convenience than anything specific.
So the first helper function is basically going to generate what the calendar view would look like just with my closed blocks and my available blocks. Then I'm going to use the output of that function to generate what the calendar view would look like if I have my booked blocks and to combine them all together. Again, architectural decision.
So this function basically takes two inputs: the room ID we want to generate this for, and the range. And the range could be any block of time. It could be a given day, it could be a given month, whatever it is. And basically, this is a table-returning function that returns a status, which in this first function
will either be closed or available, and the range for which that status is valid. We could return the room ID as well, but we already know the room ID because that's one of the inputs of the function. So the first part of the function is a recursive common table expression.
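As a signature-only sketch, it could look something like this, with the body (the recursive CTE we're about to walk through) elided; the function and column names here are assumptions:

    -- Returns the closed/available view of one room over a requested range.
    CREATE OR REPLACE FUNCTION calendar_generate_available(
        room_id int,
        during tstzrange
    )
    RETURNS TABLE (status text, calendar_range tstzrange) AS $$
        -- Placeholder body: the real implementation is the recursive CTE
        -- discussed next. This stub just reports the whole range as closed.
        SELECT 'closed'::text, during;
    $$ LANGUAGE sql STABLE;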
And this actually looks incredibly complicated, but it's not. Well, it is, but we're going to break it down. So the first thing that we're going to do is we're going to generate a bunch of dates, individual dates, from the range that we pass in. So let's pretend that the range we're going to pass in is just today, May 30th.
So it would be May 30th to May 31st. All we'll do is generate a single date here with the set-returning function and say, OK, for May 30th, let's see if we have any availability, any opening times, that were set from May 30th to May 31st in DMS 1150.
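That date expansion on its own can be seen with a quick query; this is a sketch of the idea (presumably generate_series or something very much like it), not the talk's exact code:

    -- Expand the range [2018-05-30, 2018-05-31) into individual days.
    SELECT d::date AS calendar_date
      FROM generate_series(
               lower(tstzrange('2018-05-30', '2018-05-31')),
               upper(tstzrange('2018-05-30', '2018-05-31')) - interval '1 day',
               interval '1 day'
           ) AS d;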
Great. So if we have any availability rules, what will happen is we have this gigantic three-headed beast here. And basically it says, well, it says we actually
have some pictures that will explain this. But basically it says that, see if I have any overlaps. If I have any availability overlapping with my close times, we're going to need to be able to split up these ranges. Remember those pictures we saw before where we have cut our ranges into various pieces? This is what's letting us do that. And that's what all of this code says. And I think the best way to be able to look at this
is to be able to explore it visually, which I have a few slides ahead for this, because going through this line by line right here is going to be very overwhelming, and I think I'm going to lose people. So ultimately what this select list says is, all right, recall in our example that we
know that our operating hours are from 8 AM to 8 PM, which means that we need to split up our default closed values, which are from midnight to midnight. So we have two sets of operating hours, 8 AM to 8 PM,
and 9 PM to 10:30 PM. And our original closed hours are this giant chunk. So basically what all those select-list CASE statements do is split up the two chunks. So we'll have one row that says, all right, well, I'm closed from midnight to 8 AM, and I'm closed from 8 PM to midnight.
And then I'm open from 8 AM to 8 PM. Then the second one says, OK, well, I'm closed from midnight to 9 PM, then I'm open from 9 PM to 10:30 PM, and then closed from 10:30 PM to midnight. So those are our two rows. And that's what those gigantic case statements do.
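You can see the basic splitting operation with plain range operators; this is an illustrative sketch rather than the talk's exact expressions (the * operator is range intersection):

    -- Split the default closed block around one set of opening hours.
    SELECT tstzrange(lower(closed_block), lower(open_block)) AS closed_before,
           closed_block * open_block                         AS open_during,
           tstzrange(upper(open_block), upper(closed_block)) AS closed_after
      FROM (VALUES (
               tstzrange('2018-05-30 00:00', '2018-05-31 00:00'),  -- closed all day
               tstzrange('2018-05-30 08:00', '2018-05-30 20:00')   -- open 8 AM to 8 PM
           )) AS t (closed_block, open_block);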
Now we have the recursive case, which is where we actually use UNION rather than UNION ALL, because there will be some duplicate rows returned here. So the key thing I want to point out in the recursive case is the FROM clause.
So first, we're selecting from our recursive table, which we call availability. Then we're joining it against our availability table to see if we have any overlaps with any openings. So the first thing we check is room DMS 1150, or whatever room
we're in, that date range that we set up, which is our availability on today, May 30, from midnight to midnight. The key part is we want to make sure that we're comparing our center range. So our center range, recall, is when we're open.
We want to see if there are any other overlaps between that and our recursive set, because if there are, we might need to further chunk our data. Ultimately, if you come back to this slide, our goal is not a single range; in our view, we want to be able to say, all right, here's
my minimum closed area. Here's my minimum available area. Here's a closed area. Here's an available area. Here's a closed area. The way I like to think of it is we just need to smush all these ranges together to get that result.
So this part of the query is helping us get to our smush function, though we're not quite there yet. But this part, and that's the part I wanted to highlight, basically finds all the overlaps we've not seen yet, based upon what we've recursed
versus what's existing in our database. For instance, one of those overlaps is that we have that 8 AM to 8 PM availability chunk, but we also have that new closed chunk that was created by the other opening chunk. So there's an overlap here, and we need to be able to split this up.
And that's essentially what this part does. It says, OK, well, this part of the query has found this new overlap that we didn't realize existed before because we had no idea, because we've just started iterating on it and trying to divide up these chunks. And this part actually does the chunk division.
So as you see from here, the difference from what we had over here is that now we've chunked it up and said, OK, from midnight to 8 AM, I'm closed. From 8 AM to 8 PM, I'm open. And from 8 PM to 9 PM, I'm closed.
That's in the context of this row; we do know our closing time is different overall, but we have to do it row-wise in order to chunk it up. So looking at the full picture, and I probably should have made this clearer, we now have four rows that we're dealing with.
We have those two original rows from the base case of the query that are standard overlaps. And now we've created two more chunks based upon what we originally were chunking together and the recursive query that we've been doing now.
So the last part of the query is we're done with the recursion. Eventually what's going to happen is, assuming there's no infinite cycles with, actually, there will not be any infinite cycles here. Well, in fairness, in the first version of my query, I did have infinite cycles.
So you are seeing the pain that I've gone through. The last part is what I call the smush part of it, where basically we say, all right, we have all these things here. On our left sides and our right sides, we have the times where we are closed. In our center, we have the times that we are open.
And we want to find, essentially, the smallest chunks. So the final part of this aggregate basically says, find me the chunks that cannot contain any other chunks. So basically, if I have a closed block, find me
a closed block that doesn't contain any other closed block, because then I know that's my smallest unit of a range that I can build. Basically, the TL;DR of what this part of the query does,
is it basically says, OK, well, this closed block here can contain other closed blocks, so I'm going to eliminate it. But this closed block here can't contain any other closed blocks. Well, I mean, these two are equivalent, but I'm going to eliminate that with my aggregate. Same thing here, this available block, these two available blocks
are equivalent, so I can eliminate those. This closed block here can contain other closed blocks, so I'm going to eliminate it. This closed block cannot contain any other closed blocks. And same with this one. And since we can keep those, then we can eliminate the ones that look just like it.
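In rough SQL terms, that elimination can be sketched like this; the sample data and column names are stand-ins for the real recursive result, not the talk's exact query:

    -- Keep only the smallest pieces: rows whose range does not properly
    -- contain any other row's range with the same status.
    WITH availability (status, calendar_range) AS (
        VALUES ('closed',    tstzrange('2018-05-30 00:00', '2018-05-30 08:00')),
               ('closed',    tstzrange('2018-05-30 00:00', '2018-05-31 00:00')),
               ('available', tstzrange('2018-05-30 08:00', '2018-05-30 20:00')),
               ('closed',    tstzrange('2018-05-30 20:00', '2018-05-31 00:00'))
    )
    SELECT a.status, a.calendar_range
      FROM availability a
     WHERE NOT EXISTS (
               SELECT 1
                 FROM availability b
                WHERE b.status = a.status
                  AND a.calendar_range @> b.calendar_range
                  AND a.calendar_range <> b.calendar_range
           )
     ORDER BY lower(a.calendar_range);

The all-day closed row contains the two smaller closed rows, so it drops out, and we're left with closed, available, closed. Again, that's just a sketch of the idea, not the exact query from the slides.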
That's more or less what this part of the query does. And so that's why I'm going to make the slides available. And sure enough, with the aggregate, it all smushes together. And we have our close times and our opening times for the given week. Cool, how are we feeling?
I'll tell you, like I said, I think the TL;DR for all of this is, one, of course, you can do this all with an SQL function. But two, even though that SQL looks very scary, and I can tell you the first time I wrote it, I was very scared of it, when you start breaking it down pictorially,
you find it's actually not that intimidating. Of course, once you've written it, it's written, and it should hopefully work forever. And that really was a lot of the purpose of it, was just to demonstrate what this function is. So what about unavailability? Because we just did closed and open. We didn't do booked.
So like I said, there were two helper functions. This is the second helper function. Here's the good news: it's actually very, very similar to that first function. So if you're able to understand that one, you're actually in pretty good shape. So we're going to hand-wave it, also because I see we've been sitting for a while.
All right, so now we can basically take all this work and use it to manage the information on the calendar. So basically, that second helper function is the calendar generation
function. It combines the output from that first one, takes all the times that are booked, and does one more giant smush. And by the way, here's another really cool kicker too. Even though that looks like a lot of code and there are a lot of expressions and operations going on, it does execute relatively quickly. The operative word being relatively.
And we're going to see a few examples in a second. But that's the key: I had a fairly large data set to deal with, and it still operated very efficiently. It did not bottleneck the system in the new method that we'll see. But yeah, it's actually nice.
I'll put it that way. So this is just, again, one last helper function to manage the information going into the calendar. Then we have our triggers that we're going to put on the availability table and the unavailability table, which will basically
keep the calendar in sync whenever we change our availability and whenever we book something. And the reason we have that helper function is that both calendar management functions are the same regardless of whether you're dealing with availability or unavailability. We're basically saying, if we change something on a given date for our opening times or our booked times,
just regenerate the calendar for that given date, because we'll be able to do that quickly. And of course, we create the actual trigger definitions. Cool. And we're done. Actually, we're not done, because we actually need to create all of it. And the easiest way to do this is... anyway, nothing up my sleeve.
That's the example.
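For reference, the calendar-sync trigger definitions in that file might look roughly like this; the table and function names are my assumptions for illustration:

    -- Refresh the calendar for the affected room/date whenever availability
    -- or bookings change (calendar_manage_from_trigger is a hypothetical name).
    CREATE TRIGGER availability_manage
        AFTER INSERT OR UPDATE OR DELETE ON availability
        FOR EACH ROW
        EXECUTE PROCEDURE calendar_manage_from_trigger();

    CREATE TRIGGER unavailability_manage
        AFTER INSERT OR UPDATE OR DELETE ON unavailability
        FOR EACH ROW
        EXECUTE PROCEDURE calendar_manage_from_trigger();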
So we should probably actually test it to see if it really works. So, actually, we have about 30 minutes left, and let's see how many more slides I have.
Actually, I have about 30 slides left. So do we want to test it now and then take a quick stretch, or do we want to keep going? Keep going? Let's do it. Yeah, I love it. OK. All right. So I've done this live demo a few times now.
I'm not scared. That's what I tell myself before I do a live demo. So all right. Yes. So we turned on timing.
And again, we're going to hand wave timing a little bit. This is just to give us general guidelines. So first, we're going to create a room. So we run this function. We're going to insert what we call the room DMS 1150, and it inserts. You probably noticed that that's actually technically invalid SQL syntax, and that's correct. It's because I basically echoed the SQL, and I probably
should have echoed it properly. So let's look at our calendar for today. So we run the function. Select star from calendar, where calendar date equals today. Order by lower calendar range. Lower calendar range is this part.
And basically, we're ordering by the timestamp. Because now we're in a real application, we've got to start using ORDER BY again. So basically, we see that our default data is there. We've inserted the calendar information for today, and we're just closed. That is that big, white, closed chunk.
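As plain SQL, that lookup is roughly the following; the column names are assumed from the walkthrough:

    -- Today's calendar, ordered by the start of each range.
    SELECT *
      FROM calendar
     WHERE calendar_date = CURRENT_DATE
     ORDER BY lower(calendar_range);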
And also, things look like they execute very quickly. Like that insert took about 10 milliseconds. Even though it inserted an additional 365 rows, it was still pretty quick, which also makes sense, too, because it's an empty database. It should be fast. All right, so this room, DMS 1150, only allows bookings from 8 AM
to 1 PM and 4 PM to 10 PM on Monday through Friday. Are we good with that? So we're open from 8 AM to 1 PM and 4 PM to 10 PM. And in between, we have a three-hour siesta. So the statement we use to generate this information is this insert.
We get the room ID for DMS 1150. And again, I wanted to show off LATERAL, so I created a table containing these two tuples and expanded them out.
And we have Monday, Tuesday, Wednesday, Thursday, Friday as our days of the week, and we're getting our primary key ID from DMS 1150. So this took about a second to insert. Let's keep that in mind and do the math in a moment.
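A sketch of that insert, using VALUES expanded via LATERAL; the exact table and column names are assumptions:

    -- Two daily time blocks, Monday through Friday, for room DMS 1150.
    INSERT INTO availability_rule (room_id, days_of_week, start_time, end_time)
    SELECT r.id, ARRAY[1,2,3,4,5], t.start_time, t.end_time
      FROM room r,
           LATERAL (VALUES ('08:00'::time, '13:00'::time),
                           ('16:00'::time, '22:00'::time)) AS t (start_time, end_time)
     WHERE r.name = 'DMS 1150';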
If we look at our calendar now, a few things have happened. We now see that everything has been chunked up. We have a closed block, an open block, a closed block, an open block, a closed block, which is what we expect. We basically ran the smush function and were able to divide everything up.
But notice the way we ran it: we ran it using triggers. And with triggers, basically everything has to commit in order for the statement to be successful. And when we inserted the availability rules, keep in mind, we had five days of the week times 52 weeks,
times two sets of times. So 10 times 52: we inserted 520 rows. And keep in mind, as you saw from those other triggers, every single time we insert a row into the availability table, we execute a trigger to refresh the calendar for that given day. So then we had an additional 520 calls on top of that.
So keep in mind, roughly, we saw that when we inserted the closed blocks for 52 rooms, that was, sorry, for the one room, the 365 closed blocks, that was roughly 11 milliseconds. If we extrapolate a little bit, we can see that even though that calendar manage function is relatively quick,
if we're doing it 520 times, it is going to start adding up over time. So let's keep this in mind when we're thinking about how we manage, how we continue to manage the calendar. So let's see what happens if we extend our, so remember we have that closed block, sorry, that open block
from 4 PM to 10 PM. Let's extend it to 11 PM and see what happens. So we've gone from, I was being a little bit lazy with some of the times, but we've gone from 10 PM to 11 PM. And if we look, so notice it took about 600 milliseconds
because, in addition to all the other triggers, we executed the calendar manage function every single time. But we do notice that on our given date, the time did update correctly: it's showing 11 PM. And the closed block has changed from being from 10 PM to midnight
to 11 PM to midnight. So our triggers are working. This is more just to show, hey, it's working, and also that it takes a little bit of time. If we remove all the rules, well, that's the ON DELETE CASCADE foreign key doing its thing. Remove all the rules, and we're back to our calendar just saying that we're closed all day.
And if we delete the room, we're back to our calendar just having no information in it. So we're keeping everything in sync. This is cool. We did build this real time application. So I'm going to restore us back to our original data. Let's say we book an event from 9 AM to 12 PM.
So here, I was able to remember how long the tutorial was actually supposed to be. So I got the correct time in there. And scrolling up. So we insert the booking. And actually, this is really fast, because even though we need to call that calendar refresh function, we're not cascading as much
information together around it. And if we see, we look at our calendar for today, closed, available, unavailable from 9 AM to 12 PM, and available from 12 PM to 1 PM, it's still working. It's in sync. We have this whole calendar.
It's spit out. And look how fast the look up was. This is sub-millisecond. This is cool. It works. So let's say we book a second event. So let's say we booked a second event from 6 PM to 8 PM.
Yeah, here we go, 6 PM to 8 PM. I'm sorry. Yeah, it's still fast. And here's the cool part. So remember, when we're doing this lookup, we're actually taking advantage of the multi-column index.
That was the room ID, and then followed by the date. And what's nice is that, particularly if you're dealing with a much higher volume system, you can quickly generate your full availability calendar for a given month. And that's really powerful.
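The index behind that lookup is an ordinary multi-column B-tree; a sketch with assumed names:

    -- Room first, then date, so per-room calendar lookups (a day, a month)
    -- can be satisfied directly from the index.
    CREATE INDEX calendar_room_id_calendar_date_idx
        ON calendar (room_id, calendar_date);

    -- Full availability for a room over a month:
    SELECT *
      FROM calendar
     WHERE room_id = 1
       AND calendar_date BETWEEN '2018-05-01' AND '2018-05-31'
     ORDER BY calendar_date, lower(calendar_range);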
So let's say the second event needs to start one hour earlier. Lo and behold, now it's starting at 5 PM. Everything updates correctly. Cool. How long did the update take? Seven milliseconds. You know, not bad. Let's say the first event cancels.
We perform a delete. That's pretty fast. Everything updates. Cool. So the application is working. Exactly. Famous last words, right? What did we learn? Again, from a little bit of a hand-wavy test. So first, always test your live demos.
The availability rule inserts took some time. And we see it's because for a given day time block, we're going to be doing 52 inserts, which also means we then do 52 calendar refreshes from a non-trivial function.
So this is going to start adding up really quickly. The updates on individual availability chunks, like just the one-off, one-day availability rule, those are pretty fast. As are the inserting of booking. Those are pretty fast too. But again, keep in mind, they still call that non-trivial calendar refresh function.
But if you're doing it just once, it probably won't hurt too badly until your data set starts growing. The nice thing is that no matter what, the lookups are fast, thanks to our indexing scheme. And ultimately, the people who are the end users of the system we're managing are going to be very happy with that. But you can see what I'm getting at: for those of us who are managing Desmarais itself,
this might start slowing down very, very quickly. So let's try to go to web scale. Let's see if we have enough time to do web scale. I always like to say web scale as a tongue-in-cheek reference to MongoDB. So let me show you what's going on.
So some of this is going to give the answers while we're running. The first thing is we generate 100 rooms within Desmarais, which could actually conceivably have 100 rooms. This building is gigantic.
And I'm from New York, and I'm saying this building is gigantic. OK, it's not gigantic. So the first thing that happens is we select the count from the calendar. And we see that if we insert our default closed rules, there are 36,500 entries. Then this generates a set of availability rules for every single room.
We basically said, OK, there are 100 rooms; we're just going to do 8 AM to 8 PM, Monday through Friday, so we only insert 100 of these rules. But when I did this run, it took about 60 seconds. Because, keep in mind, as you see, we've ballooned from 36,500 entries
in the calendar to 90,000. Just keep in mind, every time we insert an availability rule, that can generate 52 weeks times however many days of the week worth of availability blocks, which means we have to do that number of calendar refreshes.
And keep in mind, because triggers are atomic, everything has to execute before we can commit the transaction. We can't necessarily distribute this load. Now everything will be correct, assuming that we coded correctly, which we did. But we're starting to hit this performance penalty. And if we generate a bunch of unavailability,
let's say we're inserting, in this case, 36,000 rows, the time on that is trying to creep up too. Because as we get more and more data, those refreshes get more and more costly to run. And at the end of the day, as I hear my computer starting to heat up, the lookups are still pretty fast.
Let's see if we're done generating it. Yeah, my computer is still working on that initial availability rule generation function. Here's the good news. The lookups are still fast. But what's not fast is managing the calendar.
And that sucks, because if you're designing that internally for your company, those are your teammates who are dealing with some painfully slow updates. And this comes from a real world example. We basically, at my old company, we had a calendar where you could look things up very quickly.
But updating any of the availability rules was taking like 40 or 45 seconds, just to make a single update on a single rule. And the system became unusable. We essentially had to remove that feature from the customers who were interfacing with that system, and we would only manage it internally. And that just sucks. That sucks for everyone.
And of course, as an engineer who takes pride in this, I was miserable. I don't like people having a bad time. And yes, it was all my fault. But like the phoenix out of the ashes, we can do better. And that's what we're going to try to do.
So I'm giving up on this demo, because it's just taking too long. All right, let's do it in 20 minutes. And here's the long story short: even with only 100 more rooms and a few sets of rules, the rule generation time just ballooned.
It was unusable. Lookups are still fast. So we're going to do better. And this is where we're going to introduce logical decoding. How many people are familiar with logical decoding? Well, how many people are familiar with logical replication? Cool. So logical decoding is essentially the basis
of logical replication in Postgres. What it does is it replays every single change that's going on in your database. So every single write. So if you insert something, it streams a change. You update something, it streams a change. Delete something, it streams a change. So this is really cool. Cool.
And like I said, this has been available since Postgres 9.4. Logical replication itself was added in version 10. But we've had this feature for a while. And it was actually good, because when I started implementing this at my old company, we were on Postgres 9.4.
There are a few things to note when using logical decoding. First off, to use it, you need a logical decoder. The one that's built into Postgres is something called test_decoding. The test_decoding plugin uses a format that is unique to Postgres, because it was built to test how logical decoding works. There are actually people who have written tools that parse the test_decoding output and run it in production.
I'm going to reserve judgment on that. But fortunately, there's that. Recall several hours ago, we talked about wal2json. Basically, that's a decoder that takes each change being streamed in, converts it to JSON, and makes it usable by pretty much every programming
language under the sun. And that's what we're going to use. The way you set it up is you create something called a logical replication slot in your database. So replication slots, in general, were added in Postgres 9.4 and basically say, all right, somebody wants to connect to my database and stream all the changes from it, be it physical replication that's
been available in Postgres 9.0 or, ultimately, logical replication. And it basically says, I'm going to keep track of where I am in the database with my changes. So I will always make sure that I'm storing all the write-ahead logs or all the changes up until that time.
Once whoever's connecting to my slot has acknowledged those changes, I'm going to start getting rid of them and basically make sure my disk doesn't get too big. The caveat there is that if you never acknowledge those changes, Postgres will keep holding all those logs and all those changes until it runs out of disk, which is a bad time.
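As an aside, you can keep an eye on your slots, and how much WAL they are holding back, straight from SQL; pg_current_wal_lsn() is the Postgres 10 name (it's pg_current_xlog_location() on 9.x):

    -- Which slots exist, and how much WAL each one is still retaining.
    SELECT slot_name,
           plugin,
           active,
           pg_size_pretty(
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
           ) AS retained_wal
      FROM pg_replication_slots;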
But basically, the interfaces to the logical decoder allow you to say, OK, I've acknowledged these changes, and you keep streaming them in. Only one receiver can connect to a slot at a given time, and this allows you to maintain the ordering
and consistency of the changes for a logical slot. If you do need to have multiple people listening to changes, you need to create multiple slots. But this is not a way to parallelize the changes that are streaming in. There are other ways to do that, which we'll touch on very briefly at the end. And it makes sense, too: you need to know the order in which the changes are occurring,
so you don't misapply them with whatever you're using. One important thing that bit me very early on when I basically thought this feature didn't work and was stupid is that your tables must have primary keys for the logical decoder to work. If you do not have a primary key, the logical decoder does not work.
And then you have a bad day, because you're like, why are these changes not streaming in? I don't get it. So I don't know if that's going to be changed at some point. This is actually the conference to talk to people about that, to see if that's even possible. I love PGCon for that reason. All right. We talked about test_decoding.
We're going to move on. So there are a few different logical decoding plug-ins out there, or things that use logical decoding. wal2json is written in C. jsoncdc is written in Rust; last time I checked, it did not work with Postgres 10, but maybe that has changed. Debezium is actually a project
I call the ORM of logical decoding, because it works with multiple databases, not just Postgres. It's an interesting project to watch. I last looked at it in, I think, January. Something to keep an eye on; I've been hearing about it more and more in the industry, with people evaluating it for various things.
So which drivers support it? Well, libpq, which is in C, as well as the pg_recvlogical binary that comes with Postgres. Postgres functions can interface with it too: there are various functions that allow you to replay changes off of the logs and acknowledge them. That's also available through libpq.
Python, as I mentioned, has psycopg2, which is what we're going to use. And JDBC has supported it since version 42. There might be more now that support it; I actually do need to look for that. But of course, if you hear of anything, please let me know.
So how do we use it? We've got to do a little bit of configuration. I've gone ahead and done this configuration in my database. But the key things that you need to know are in your postgresql.conf, which is your main configuration file: you need wal_level set to logical. WAL just stands for write-ahead log, and that's basically how we track all the changes that
are coming through in the database. I have max_wal_senders set to two and max_replication_slots set to two. The WAL senders are basically the processes that let you send the WAL changes over the stream, and replication slots are what we mentioned before, the things that let you connect to it.
I chose two; you basically choose as many as you need. Basically, it needs to be non-zero. In your pg_hba.conf file, you need to have a user that has replication permission. People who have replication permission do have a lot of permissions in your database, so please be aware of that.
A rough way of putting it: it's kind of like superuser-light. So be very wary with that. The one thing, and I'm stressing that this is development mode, is that I'm using the trust authentication method. If you give anyone trust in your system, they basically have full access to your system. And for purposes of development, if you have access
to my laptop, then there are probably a lot more serious issues going on. So I'm OK with having trust there. In the database, you also need to create the logical replication slot. There's a function called pg_create_logical_replication_slot. It takes two parameters. One is the name of the replication slot.
The second is the name of the decoder you want to use. In this case, I'm giving it a name of schedule, and I'm using the wal2json decoder. I guess I should be creating this now. So actually, one thing real quick.
Sorry, I'm just going to skip. There's one thing I just need to prep, one second.
This is a little bit of the punch line, but I want to prep it since we're here.
Cool. So as you see, we created the logical replication slot. The function returns two items. One is the name of the slot; the other is the LSN. The LSN is basically where you
are in terms of tracking all of the changes. So essentially, when you're running a decoder, you're going to be sending up your LSN to say, all right, I've acknowledged this change. Keep incrementing it, and tell the write-ahead log that it's free to get rid of those changes if nobody else needs them.
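Putting the setup steps together, a minimal sketch might look like this; ALTER SYSTEM is just a convenience here (editing postgresql.conf by hand and restarting works the same), and the pg_hba.conf line is shown as a comment:

    -- Requires a restart: logical WAL, plus at least one sender and one slot.
    ALTER SYSTEM SET wal_level = 'logical';
    ALTER SYSTEM SET max_wal_senders = 2;
    ALTER SYSTEM SET max_replication_slots = 2;

    -- pg_hba.conf (development only; trust is wide open):
    --   local   replication   postgres   trust

    -- Create the slot using the wal2json output plugin.
    SELECT * FROM pg_create_logical_replication_slot('schedule', 'wal2json');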
So first, let's test to make sure it works. So I have a little Python class that I wrote that basically says, hey, every time I get a change, decode the JSON and print it.
That's a long story short. Again, I'll make these examples available. Basically, we'll keep consuming changes from the stream until we do Control-C. Then we'll stop, and we'll close, and we'll be good.
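If you just want to poke at the slot from psql instead of Python, the SQL interface works too; peek leaves the changes on the slot, while get consumes (acknowledges) them:

    -- Look at pending changes without consuming them:
    SELECT lsn, xid, data
      FROM pg_logical_slot_peek_changes('schedule', NULL, NULL);

    -- Consume them, moving the slot forward:
    SELECT lsn, xid, data
      FROM pg_logical_slot_get_changes('schedule', NULL, NULL);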
All right, so now I've got to find that other window. There we go. OK. So I have this code living in this file here.
So first, that's just a warning from psycopg2 about some stuff happening. All right, let's start making some changes. So first, let's create a table.
So notice we create the table, and hey, look, we got something. Doesn't have much, but apparently there's been a change. Crap, I lost the window. There we go. So let's insert an x. Hey, look, there's the change.
It's there. It streamed it. It's pretty cool.
It streamed it again. Cool. Let's say we stop it, and we insert a few more things. Hey, it kept it and replayed it. Now notice it actually replayed another change, like one change too many.
That is a bug; I can't remember whether it's in wal2json or psycopg2. OK. Just because that came up, I wanted to point it out. Basically, we did not acknowledge that last change, so it got replayed again. So basically, TL;DR, make sure your changes get acknowledged.
But here, here's the long story short, is we are replaying all these changes from the database. And we're doing it through this external tool, and it's atomic. We're basically seeing all the changes in order. This is important.
So in this example, basically, this was my original example to show the changes streaming across the wire. And if I didn't delete the entire database, we would see that as we inserted all of this,
not only would we see these changes being inserted, but we'd actually see all the triggers being played out as well. All of the changes the triggers make are also being inserted into the database, because essentially, what the logical decoder does is replay all the changes. So even if you're executing triggers, it's going to show the changes that are coming through the triggers as well. Of course, if you have triggers that insert additional rows
within a transaction, that's all going to come over as one chunk over the wire. So we're not necessarily getting any benefit from using triggers with the logical decoder. We're going to have to think a little bit differently in order to get the benefit of using the logical decoder.
So here's the thing. We know that our most costly function in this is regenerating the calendar. So we know that it probably would be nice to offload that a little bit. But we also need to make sure the changes always propagate in as close to real time as possible. Because we want to make sure that when our users are viewing the calendar, they're getting the most up-to-date changes.
And we also want to make sure the managers who are managing the calendar are able to do the changes quickly, too. So they're not sitting there for 40 seconds while everything is refreshing, because that is horrible in so many different ways. So we're going to replace the triggers. We're basically going to use the same data model as before. You probably saw that, because I already pre-installed everything.
And we'll use the same helper functions. But we're not going to use the triggers, well, there's a little bit of a lie there. There are some delete triggers for one specific case, and that has to do with when we delete an opening
or when we delete a booking. The reason is that we need some follow-up information to successfully refresh the calendar, particularly the date, which is not available in the logical decoder. We only have the old primary key; we don't have any of the other information. So we can't fully get off of the triggers in that case.
There are certain places where you might be able to. But for how we're regenerating the data, we cannot do that. What we can do is basically get rid of all the other triggers and move the workload over to our logical decoding listener to handle the work of refreshing the calendar.
So as you recall, we're still going to be using that calendar manage function, as before, that calls those gigantic functions. We've already looked at that.
But the main modification that we make is it looks pretty much the same. So what's going to happen is that we're going to build a Python script. We're going to have a pre-built Python script that we look at real quick. And basically, we're going to see it read from a logical replication slot.
And if it detects a relevant change, we're going to take an action. So basically, it's going to emulate what a trigger does. But we're going to be controlling this from how we're interpreting the data stream on the Python side. And the key is that this actually moves the work outside from that initial transaction, which sounds a little bit scary because we do want atomicity when we're doing these giant updates.
But it's OK, because we know that we're getting these atomic work units coming in from the logical decoder. So it'll be OK. We won't have everything committed all at once. But remember, if the logical decoder fails, that means we never acknowledged that we received that change, which means we can replay it and try it
again until we get it to succeed. And that's really cool. We're still being able to process our data safely. So this is nice. I think we might actually get everything we need to ensure everyone has a pretty good user experience. So how do we read in the changes?
So what I did was I created basically a hash table that lists all the different table names where the changes can come in. And based upon what operation occurs, if it's an insert or an update, there's some SQL I run to follow up. So this is kind of like a trigger table where we have essentially all the follow-up actions
that we want to take on everything. As you see, these are similar functions, or similar SQL statements, to what we ran before in our triggers. And then there's some logic here, and again, I'll hand-wave it a little bit. But essentially, it's very similar to that logical streaming example
that we were running before. The difference is there's some more business logic in it. Basically, this part says we were able to connect everything.
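For reference, the kind of follow-up statement the listener fires for a relevant change might look like this; calendar_manage and its arguments are assumptions based on the walkthrough, not the talk's exact code:

    -- Hypothetical follow-up run by the listener when it sees an insert or
    -- update on the availability table: regenerate just that room's
    -- calendar for the affected date.
    SELECT calendar_manage(1, DATE '2018-05-30');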
OK, let's test it. All right. Like I said, the punchline is like 10 seconds long. So I have this running in something called Demo 5. And as I mentioned before, this
is going to be near real time. So what's fun is I do have a script that plays out all the different changes. It's actually this one. So what's funny is I would run this script and basically try to show doing the selects on the calendar
as they go along. But because this script runs a little bit faster than the logical decoder can process its near-real-time changes, it's better to run the commands manually to see what happens. So first things first, let's insert our room. Remember, there are no triggers here. Everything's being managed by this Python script.
So I insert the room. As we see, if we look from here, we have our room. And if we look at our calendar, voila, it's inserted.
No triggers. I can prove that there's no triggers because if we look at the room table, there's no triggers. Easy.
What did I say about C programming before? All right. So let's insert those availability rules. And remember, availability rules, let's make sure we have timing on. Remember, this was that function that took a while to insert. We run it. Four milliseconds. Yes.
Like, fast. So this is probably going to scale pretty well. As you see from our debug output, like, pardon my French, a lot of shit goes into here. And if we look up the calendar, let's see exactly how we did the calendar. Are we going to get our availability rules?
It's working. Go for it. Did it update only once? Maybe.
Nope. Calculating time. Yep. So yeah, but the thing is, I'm at least doing it out of band. And right now, this is also being done serially. So one could say, yes, this has the chance to back up. It does have the chance to back up, but it's going to do a little bit of work. It's going to catch up. And there is a way to even offload it further
and potentially avoid the backups as well, which we will touch very, very briefly, because we are at 12 and I know lunch is important to people. But the key thing is, this is working. The lookups are still fast. Skip ahead to inserting more. Actually, let's delete everything first.
Sorry. All right, so deletes are still slow. And remember that we do still have some triggers on deletes. In our real application, though, we're never going to delete anything, because we're just going to have an active flag and set that to false.
And then we can update it quickly. So let's insert some more rules. So those, again, insert quickly. And they're updated fairly quickly. And we see that they're there. So let's insert some bookings.
We see that it's there. So it's working. We're taking advantage of the logical decoder. We're able to read the streaming changes. And everything appears to be well. And I think this is where we can start bringing it all home now.
So the lesson from this exercise is that logical decoding does allow the bulk inserts to occur significantly faster than just doing it with triggers within a transaction. And that was the goal, bless you. And in this application, we're able to get away with it, because for our users, we don't need full real time.
We just need close enough to real time. And this gives us close enough to real time. And for a lot of applications, just close enough to real time is good enough. There's certain things where you probably do want real time. We could list those out. There's definitely critical applications that need that. But for us, this is good enough. Deletes are still a little bit tricky,
which is why I say never delete anything, just archive; I know I'll keep saying that. But if you just need the primary key, it's fine. Like I said, in this case, we need the date to do that calendar regeneration, so that doesn't work. And as was being pointed out, we're applying all these changes serially.
So there's a potential bottleneck for long-running queries. And actually, I talked about a query before where I switched between UNION and UNION ALL. I was actually running into that bottleneck in our production system, and I think that was one thing I did to optimize it. The other thing was I had a stupid SQL error in it. And with the combination of those two, things
sped up by like 100x. I mean, the bottleneck went away. Everything was going pretty quick. There are some wonderful distributed streaming tools out there. And particularly, there's Kafka. You can use that as a way to perform your follow-up queries and further distribute the load.
The nice thing is that you're able to at least acknowledge the changes on your Postgres stream: OK, well, I put this into my Kafka queue, so I'm going to acknowledge that I put it into the Kafka queue. And you probably want to store your transaction IDs and ordering as you put that into Kafka, so that way you ensure you don't go out of order. I call this shifting the problem elsewhere.
Because you do have potential bottlenecks that you can run into with that method. But at least, one of the key problems that I don't want people to run into is running out of disk because your WAL grows too big. I have actually nearly run into that myself. I think it was some application like this where
it got into one of those infinite restart cycles. I forget the exact reason why, but I was not the person who was managing it that day. And the person who was was like, hey, the disk is at like 90% full. Well, that shouldn't be, because the other day it was like 10% full. So there was basically an infinite loop
that was causing things not to get acknowledged. So again, it's one of those things where with great power comes great responsibility. Always monitor. Conclusion. Postgres is robust. We covered a lot of features. And there are definitely some eye-glazing parts, because it's like, how do we explain all this in three hours
and get to everything? But one, that's why I'm very happy to make the slides and the examples available. And again, we just scratched the surface on some of these features as well. I mean, PL/pgSQL can be a tutorial in itself. Triggers could be a tutorial in themselves. Just data types alone; I actually have a tutorial just on data types.
Like, that's ever expanding as well. Triggers. You see a lot of hate for triggers. But look, there are certain things triggers are very good for. I think dealing with some simple rewrites, like I said, converting dates and times into timestamp ranges, actually I found very useful.
Like, as a BEFORE trigger, the overhead is very low. There is some overhead to using triggers unless you're really writing them in C, in which case it's going to be very minimal, but who wants to do that? There are times to use them, but you've also got to know when it's not good to use them, too. And I think something like logical decoding is a tool that helps you bypass triggers
but still get the atomicity you need at a relatively good speed. Utilizing logical decoding, and now logical replication, can eliminate some of this overhead; basically you're transferring compute elsewhere. You're taking that load and you're distributing it. You're also
freeing up your users to do other things, or ultimately your database sessions to do other things and keep pushing work through your system. And of course, this is not a panacea. You still need to use good architectural patterns. And one thing I said in going through is that some of these were the architect's decision. You might decide that you want to chunk your calendar in terms of a week.
That could be a way of doing it. That might make more sense. Like I said, it's up to you. And this is the art of programming. The science is learning all the tools that you have available. The art is putting it all together. And with that, go build some applications. There's a lot to think about. And I'd love to see other applications that
make use of this concept as well. And just by the way, another term for this logical decoding approach is change data capture; that's probably the more familiar term. I should have thrown that in earlier. So I'll break for questions. Yes, Dave.
A lot of pain. Like in the script, a lot of pain.
So this is called repeating your code, echoing, prompting, echoing, and having the same exact thing.
No, because I care. Yeah, I spend a lot of time on this.
OK, that's OK. I haven't written in a while, too, because it's been the month of many conferences, because everyone schedules their conferences in April and May. And my role causes me to travel to a lot of them.
So I will put that in the backlog. In the queue. I would actually say I feel like I'm a terrible psql user, but backslash d is so powerful. Actually, one of my favorite ones,
which I don't know where they got this abbreviation from. Well, let me get to it. So backslash df lists all your functions available in your default schema. Let's do this one. So if you want to inspect the function easily,
it's backslash sf, which has saved me so many times. But I don't know why it's backslash sf. I might actually ask people while I'm here. What? Show of hands? So OK, I do have a comment about that. But yeah, this has saved me so many times, being able to, oh, that's what it's doing here. Oh, I didn't apply the right version of the function here.
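For anyone following along in psql, the meta-commands in question look like this; calendar_manage is just an assumed function name here:

    -- List functions matching a pattern, then show one function's source.
    \df calendar*
    \sf calendar_manage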
But yeah. So fun fact. By the way, in the upcoming Postgres 11, normally to quit the terminal, as you see, you have to do something like backslash q, which for people who are newer to Postgres, that is not that obvious. Postgres 11 introduces quit and exit to get out.
It is a big deal. And actually, in the beta 1 press release, I made sure to add that. In fact, I argued with people to add that, because I'm like, this is obviously not a prime thing to talk about. But from a usability standpoint, I actually think that's important, because it shows that the community has listened to the feedback
of newer users, saying, this is really frustrating. And the feedback of that has been like, oh, wow. Thank you for listening. And I think that's really important, especially in an open source community. There are a lot of smart people working on Postgres 11 and other open source projects. But you also need to hear how people are using them. And actually, some of these things that we talked about today are the result
of listening to seeing how people are using applications. And I think it is very important to have the two-way communication around that. So sorry if that's a whole side thing, but. It's like multiple releases, he would say, removed it. Nice. Cool. Any more questions?
I don't want to keep people from lunch. Cool. Well, thank you for coming. So. So.