Temporal Data Management in PostgreSQL: Past, Present, and Future
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Alternative Title |
| |
Title of Series | ||
Number of Parts | 20 | |
Author | ||
Contributors | ||
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/19033 (DOI) | |
Publisher | ||
Release Date | ||
Language | ||
Producer |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
PGCon 201213 / 20
1
3
6
7
9
11
12
14
15
16
19
20
00:00
VideoconferencingMetropolitan area networkData managementTable (information)ImplementationSign (mathematics)Point (geometry)Single-precision floating-point formatPort scannerPrice indexInterior (topology)AreaDigital filterSet (mathematics)Extension (kinesiology)FrequencySineRange (statistics)MultiplicationKey (cryptography)Mathematical singularityExclusive orTemporal logicInformation systemsTuring testZoom lensData Encryption StandardIcosahedronLogical constantMereologyExt functorEvent horizonPartial derivativeExecutive information systemBit rateBinary filePC CardExistenceLogarithmRevision controlTabu searchArithmetic logic unitWordInformationFunction (mathematics)Algebraic closureDuality (mathematics)CAN busHacker (term)MereologyTemporal logicOnline helpRange (statistics)Form (programming)Slide ruleInfinityCASE <Informatik>BitSemantics (computer science)Multiplication signPairwise comparisonPhysical systemTable (information)Theory of relativityConstraint (mathematics)Fisher's exact testQuery languageData managementWebsiteFuzzy logicRow (database)Limit (category theory)DataflowTheoryExclusive orResultantMathematical optimizationNumberReliefInstance (computer science)TimestampMathematicsStress (mechanics)Complex (psychology)WordInterface (computing)Standard deviationSinc functionGoodness of fitRight angleOperator (mathematics)Object (grammar)QuicksortRepository (publishing)Strategy gameDifferent (Kate Ryan album)Array data structureImplementationPower (physics)Term (mathematics)Group actionOrder (biology)Similarity (geometry)Partition (number theory)State of matterDomain nameTrailModal logicProcess (computing)Run time (program lifecycle phase)Computer configurationNatural numberData typeMetadataLevel (video gaming)System callPlanningWater vaporError messageData structureField (computer science)ChainBlock (periodic table)INTEGRALTime travelLoginPoint (geometry)FreewareCartesian coordinate systemAngleSpacetimeWindow2 (number)Expected valueSubject indexingTime zoneEvent horizonGame controllerNear-ringSingle-precision floating-point formatEntire functionMessage passingDatabase transactionInformationRevision controlIntegrated development environmentDatabaseSoftware testingExtension (kinesiology)Dependent and independent variablesOperating systemAnalytic setAttribute grammarEqualiser (mathematics)Nichtlineares GleichungssystemLoop (music)Software developerMultilaterationKey (cryptography)FrequencyConnected spaceMatching (graph theory)ExistenceLogicScheduling (computing)Normal (geometry)Projective planeCore dumpBound stateInclusion mapElectronic mailing listRootDirection (geometry)VarianceMathematical analysisBenchmarkoutputOpen setSheaf (mathematics)Exception handlingVideoconferencingLine (geometry)Data storage deviceNetwork topologyDecision theoryAlgorithmCausalityPerspective (visual)Condition numberMechanism designPattern languageFunctional (mathematics)MultiplicationHacker (term)Lattice (order)Basis <Mathematik>Bit rateDesign by contractService (economics)Algebraic closureDegree (graph theory)CountingForcing (mathematics)Context awarenessCategory of beingProgrammschleifeInclined planeAnalytic continuationFunction (mathematics)SubsetAnalogyFigurate numberArmEquivalence relationRegular graphCoefficient of determinationCodeRepresentation (politics)Scripting languageWeb 2.0PhysicalismFilm editingVector potentialSocial classPointer (computer programming)TupleArithmetic meanSign (mathematics)Raster graphics3 (number)IntegerUniqueness quantificationIntrusion detection systemNP-hardDampingWeb pagePreprocessorReal numberCross-correlationXMLUMLComputer animation
Transcript: English(auto-generated)
00:05
Okay, hello everyone, my name is Jeff Davis. I work for Astor Data, a part of Teradata, and I will be talking about temporal data management in Postgres, where we were a long time ago, where we're getting to today, and
00:22
where we're going in the future. So first I'll start off with kind of a really simple, or what seemed like a simple problem a long time ago. I was just trying to devise some kind of a, you know, sort of audit trail,
00:41
an auditable log of what changes had been made to the database over time. So you could imagine, in this case, it was a system involving fixed assets. So if an asset was reassigned, you know, from inventory to a person or from one person to another, you wanted to keep track of that,
01:01
not only the action itself, or where the asset happened to be at that time, but also who may have done the reassignment in the database, so you could figure out what happened. It turned out to be a little bit more complicated than I thought, because of the limitations of the system at the time.
01:23
The, you know, primary query for any database problem, there's kind of two sides to it. One is, you know, the representation of the information, and the other is actually being able to query it in a useful way. So Postgres was able to represent those changes pretty readily with triggers, but doing the query itself,
01:45
there weren't a lot of good mechanisms for that. And this isn't the whole problem, but it's just kind of what got me started down this path, is that I felt like this problem seemed like a typical problem, and seemed like something that should be able to be handled efficiently,
02:01
but it really wasn't. At every step, I felt like it was awkward. In particular, queries seemed awkward, and they didn't perform well. So, you know, one of the things about them, I mean, if you've done this kind of query, you know, trying to, you know, match up, you know, what happened to records in the past,
02:25
and trying to, you know, specify date ranges over which that was valid, that kind of thing. How many people have written queries like that out in the field? Yeah, pretty much everyone. So, you know a couple things. One of them is that you need to be really careful with less than, greater than.
02:43
And you quickly come around to a convention, usually, to kind of keep track of it better. But you have to be pretty careful. You know, there's a tendency to use nulls for one side of the equation, but then those don't work with the greater than, less than comparisons you really,
03:02
as you would want them to. So just specifying null on one side can break, you know, the queries that were working before for you. And so yeah, you've got to be careful of nulls, and then there's also other awkward things like just trying to represent single points of time, or empty
03:23
ranges of time. And then poor performance. I'm glossing over this this a little bit, because this this talk is more about, you know, the entire problem space, but if you, you know, look at how a query like this might have worked,
03:44
you know, quite a while ago, say in 8.3 or 8.4, then you know, you look at this and basically, you know, I just, when trying to write this example, I actually tried to get it to choose, you know, a different plan that was able to do a bitmap and, because a lot of times that can be a reasonable
04:02
plan to execute this query, but you know, for that even it's a little bit tricky to get it to choose that plan. Oftentimes it'll just use one index, even if there's an index available on both the kind of from time and, excuse me, and the to time.
04:22
So if you have a query like this, does this kind of a query look familiar to most people in the room? Yeah, it's a typical temporal query to to try and find, you know, when you have, you know, how it might have looked at some point in time when somebody made a decision.
04:43
And so this plan is just not not really workable. And, you know, even even if the optimizer is able to choose a slightly better plan, that plan isn't really a great plan either.
05:01
And from here basically just gets worse. So I mean this wasn't the end of the use case. I think that if this was kind of the end of the problems that that you really want to solve with temporal data management, then I think most people can get by. You know, there's usually kind of ways you can work around it and manage the performance problems in other ways, but
05:22
we'll see the problems get a little bit more complex, and there's more you want to do and eventually becomes unmanageable. So I tried to improve the situation through a few projects. Quite a while ago I started, I just wrote an extension, and it was to make a
05:46
data type called period. There was a period of time, basically only operated with one sub data type, and that was a timestamp with time zone, which is the typical out of the temporal types offered by Postgres, this is the one you'd ordinarily choose as a point in time.
06:07
So I wrote a data type around that that held both the beginning point and an end point in time. And that allowed a couple of advantages, of course. It made the queries simpler, and it made them indexable because of Postgres's extensibility system. Having those two things together in one data type and
06:26
being able to specify the necessary index support around it allowed it to perform perform better. I jumped ahead a little bit. I meant to kind of list out a few of these other things, but this is kind of just a list
06:41
you know where we started, some features that have been added, some features that are in the process of being released, and where I intend to go in the future. But for the period data type, this is kind of where I started, and this provided kind of a first cut that
07:03
solved the immediate problems, like I said, making the queries faster and allowing the index infrastructure to be used. It obviated the
07:20
confusion around greater than, less than signs by holding the inclusivity or exclusivity of the bounds internal to the type. And this was actually superseded by range types, which I'll go into in a little bit more detail. Since this was superseded, I don't want to describe the entire use pattern for this data type, of course.
07:47
But this is where it started. This was making use of Postgres's extensibility to get us there. Then something that couldn't be done with Postgres's extensibility system at the time
08:01
was also a seemingly pretty simple problem. And so no matter you know how much I tried to make an extension work to solve this problem, I couldn't. So I decided that we needed a core feature to provide an answer for avoiding schedule conflicts.
08:25
Out of all the features that I saw at the time that I really needed to make temporal work, this was the one thing I thought, okay, there's no way of avoiding it. This must be in the core. So that prompted me to start writing some changes to core Postgres.
08:43
You can think of this kind of like a unique constraint, but it's more flexible in that it can allow you to actually prevent schedule conflicts. So I'll show an example here in a second. But to describe it, essentially unique means that
09:02
if you have one tuple and then that means that there's no other tuple with an equal value for that attribute anywhere in the table. And that's declarative. So that means the system enforces it and you don't need, there's no other logic you have to follow, just the existence of a tuple with a value of five
09:22
means that no other tuple has a value of five for that unique attribute. Exclusion constraints allows you to kind of change that operator away from just equality to an operation that's very important for temporal, which is overlaps.
09:41
So let me move on to the example here. I think it'll be more clear. So you can create a table with the definition like this. And I'll get to this in a second here. This, it was hard to develop a kind of simpler example and also not use the range types from
10:03
previous, from later developments, which I'll get on to later. But this here, this TSTZ range, that's what supersedes the period data type and that's what gives you that beginning and end and the index ability. And this one in particular
10:22
is, you know, the same based on timestamp with timezone. So this, this create table has these two extra constraints down here that operate on this value in a special way, on this attribute in a special way.
10:43
And the key is that we're treating, you know, room and speaker just with equality. So if the room matches and the time period you've got the room overlaps with somebody else,
11:02
then there's a conflict. And that's what this constraint is describing. So with equality, you have to say only if they're equal is there a conflict. In this example you can say, well, if the room is equal and during overlaps, then you have a conflict. Again, this is declarative, so the system enforces it
11:23
internally in an efficient way, and you don't have to do anything more. You're just declaring this to be true, and then it works. Of course, neither, you can't have the same room booked by different speakers for overlapping periods of time,
11:41
but you also can't have the same speaker booked at two overlapping times in different rooms. So really you have two separate constraints here. That's why you have this exclude twice. One is saying that the room can't be double booked for overlapping periods of time. The other is that the speaker can't be double booked for overlapping periods of time.
12:05
That's a pretty common pattern, pretty common constraint. So another common pattern would be the equals for one attribute and the overlaps for another. One, you know, the room in this case indicates, you know, one resource, which is the room.
12:24
And this is the time attribute here. And that indicates that's this complex value that has a beginning and an end time. So I went through that a little bit quickly. Are there questions on this feature here, or why this is important?
12:47
Okay, so this was new in 9.0, but as you can see, I used the example with 9.2 because it was just so much easier to illustrate.
13:00
You see I used extensions, which make this easier to actually just go create this directly from my slides because you don't have to, I don't have to show you how to install B-Tree, just you install it using the package system for contrib on the operating system, and then you can just do this, create extension.
13:22
And that simplifies things. Yes? Does that mean you could literally use this code? Well this, in 9.0, this could be used, there are two things from 9.2 that I used here.
13:41
One of them is create extension, and that simplifies it because it makes the example more more self-contained, right? The other thing from 9.2 I used is this range type here, which I'll get to those in a second, but you can think of this like the period data type that I just described. So in 9.0, you'd have to use the period type instead.
14:03
The only reason I didn't put it in this slide is because I didn't want people kind of copying and pasting and using the period data type or thinking that they needed to. This is the newer way to do it, and it's more flexible, and so we can see that. So as you can tell, I was a little bit torn on the order in which to put these features.
14:23
One of them was, you know, ordered by the time they were developed. But yeah, so that's slightly confusing, and I apologize about that. This is basic, so this is the attribute with a constraint like unique.
14:45
You just need a list of column names, so you would have unique on user ID, right? Or let's say unique on, you know, user ID, you know, comma,
15:03
some other, you know, user ID, and then a room ID, you know, to match them up or something like that. And so you just have a list, but with this, in this case, you have something a little bit more because room, you're matching with equality, and this one, you're matching with overlaps.
15:22
So by adding the flexibility to use other operators aside from equality, then we need a way to specify that. So with is a keyword, and it's just a way to associate the operator with the attribute, and that describes the constraint.
15:44
9.0. Yeah, yeah, so 9.0, you can do this kind of thing. And so if you actually specified everything as equals, then it would be a unique constraint. If you specified everything as equals, then it would have the same
16:01
behavior as a unique constraint with nulls, which, if I remember correctly, is just that basically anything that evaluates to null does not cause a conflict. So for instance, with a unique constraint, you can have multiple null values for the same attribute. Okay, so yes, yeah, exactly.
16:36
This is actually because you're using a GIST index.
16:51
It's actually because of both of these, because you need you need some index, you need some indexable operator in order to enforce the constraint.
17:04
It's kind of an internal implementation detail a little bit more, but you need the index type that can support these search conditions. So GIST can support this overlaps. With this extension, GIST can also support equality.
17:24
B-Tree can always support equality, but in this case you can't mix and match, so I needed one index type that could support both equality and overlaps, and then by installing this, then GIST was able to accomplish that.
17:43
Mm-hmm.
18:01
Yeah, if it's possible, if this is a, you know, common use case, I mean this is not necessarily my decision, and this at this point, it's not really a matter of technology as much as whether we decide to move it in or not. If it gets the point where enough people are confused by having to install the extension, then we would probably
18:21
make that change, but then again, we also want people to get comfortable using extensions because they're going to be pretty important. you know, but yeah, it's not really a technical matter, but it's become very simplified by making just one line now.
18:40
And that's Dimitri Fontaine's work, that he did the extensions work. Back there? If you have everything as equality, so the semantics are the same, then the performance on, you know, some microbenchmark that I did that was
19:02
you know, basically just trying to measure that exact one piece of the system. I saw about a 20% difference, something in that range, and that's due to a minor difference in the implementation. It's actually very close in the implementation as well
19:22
as the semantics. But there is a minor difference due to the need to do this kind of one extra index probe after you've inserted it. It essentially doesn't get one optimization that the B-tree index gets.
19:40
So there is a difference there, but it's pretty minimal. And when you have a larger case, it would ordinarily be a small part of the overall time. It would be my expectation.
20:00
There's no semantic difference. They're a performance. I would have to think about that a little bit. There shouldn't really be a difference. They wouldn't be expected to be identical because the way a Gist index is built is a little bit complex and nuanced.
20:22
But yeah, for performance questions, actually, I'd like to kind of punt those to the end a little bit just because I was I'd like to move through some more of these features. But here's an example of the exclusion constraints in action. So you have room one, two, three. I didn't have much room to even put the whole error message here.
20:41
But the error message is pretty nice, and it tells you what actually conflicts, and it gives you all this nice information so that you can track down where the problem is. And so this here is kind of a straightforward example of, you know, double-booking the room for different speakers.
21:03
Of course, if the speaker was the same, it could fire either of the exclusion constraints because it would violate both. So this is a simple way to avoid schedule conflicts. And it performs, as far as I know, it's the best
21:21
performing way to accomplish it because it's doing the least work and has the finest-grained locking. It won't block things unless there is a real potential for conflict, which is nice. Similar to a unique constraint. If you're inserting five and six, it doesn't matter if they're on the same B-tree page or not.
21:41
One won't block the other because there's no way five and six can conflict in a unique constraint. And it's less error-prone. I really recommend avoiding, if you aren't using this mechanism, I really recommend avoiding trying to kind of work around it and improvise a solution with triggers. These solutions are pretty error-prone in a number of ways. It's easy to think you've got it right when it's wrong.
22:06
It's easy to think that, okay, well, I'll take a big lock for now and not care if it performs and then later decide to optimize it because you actually do care and then it's wrong again. So there's a bunch of pitfalls, so I would just recommend against that.
22:24
And then range type. So this is what gave us kind of the power to use that TSTZ range in the last example. So this is a generalization of the period data type. So now we no longer have to, for one thing, we don't have to go out and use this other extension.
22:41
But also that other extension for the period data type was somewhat limited in a number of ways. It only worked for one data type. So range types are a generalization that can work for many data types. Providing, again, the key aspect is that it's got a beginning and an end and
23:01
then a bunch of operators and then the index support necessary. So as long as you've got an ordered data type, you can use a range. Some are built into the system, and if not, you can easily create your own range type out of some other data type, whether that subtype is built in or whether that itself is an extension you get from somewhere.
23:30
So this replaces the, just to be clear, this replaces the period data type. So it also offers a lot of these built-in range types. So now rather than only using timestamp with time zone as the period data type had,
23:45
you can also use it for dates, regular timestamps without a time zone, or non-temporal types. And then if you have some new type or a new temporal type or a
24:01
slight modification of how you'd like the range of time to work, then you can create your own range type there. So it offers a lot more flexibility, and then when you define a new function that operates on ranges, then it'll work for any of the range, the range types that you have. And when you define a new range type, all of the functions that you define will work on all of those.
24:26
So in other words, if you have a date range and a timestamp TZ range, and you define a function that works on a range of timestamp TZ, you automatically get it to work on date. And so you don't end up with that kind of explosion of operators and so forth.
24:43
You can do it a lot more simply manage many temporal types. So kind of a simple example of this is, I mean, sometimes you have you know, say a hotel reservation might be by day,
25:02
check-in, check-out, and then a conference room reservation might be for a very specific beginning time, like May 17th at 2 30 p.m. So in that example, you'd want a timestamp with time zone range. Logically, they're very similar,
25:21
so you don't have to try and manage these two different data types and worry if you have all the operators for both. You get them both here, and you can manage them in the same way, but just at a different granularity. And then for these cases, since these are reservations, of course you'd want exclusion constraints here as well,
25:42
but I just didn't have room on my slides to include that. So questions about range types? Yeah, I'm sorry.
26:03
Yes, you can define float ranges. Floating point timestamps, if I recall, there are maybe a few caveats there because, well, for a number of reasons, I guess, but I'd have to think through,
26:21
you know, for some uses it's probably fine. I'd have to think about that. I think it's probably okay, but I'd be a little bit worried using float values with, say, an exclusion constraint or some other hard requirement like that. But if you're okay with the fuzziness of floats enough to use them, and you're okay with just running queries on them and not trying to define declarative constraints,
26:47
I think you're okay. And that you'd accomplish by creating a range type with float8 as the subtype. And then if you did that, you would get all the operators available for the other range types
27:04
automatically, and you'd be able to create indexes on them automatically. Oh, yeah, and the range types, it's got,
27:23
it's not represented with null, and it does not use null semantics. It's treated as, you know, an unbounded range where you just don't specify anything. It's kind of treated more like infinity than like a null. So, although
27:40
ostensibly nulls are supposed to kind of mean that you don't have anything there, then there are kind of some strange semantics defined in the standard involving comparisons, so you don't, we don't want to use those semantics. So we treat it more like infinity on one or either side. Yeah, yeah, that's a good way to describe it. Okay, so
28:24
this, by the way, range types will be available in 9.2, so that's you know, well on its way to being released, but it's not quite out yet. So these things I'll be talking about from here on out are future ideas.
28:42
So these are fairly hand-wavy at this point and can and likely will take various different forms than you might see them on the slides, but it's, you know, problems that I'd like to solve that I think are important that would, you know, help
29:02
make temporal data management easier in a number of ways or faster to use. So in this case, range keys and range foreign keys. So the idea here is you know, there's a concept of a foreign key and a primary key on a table for these, this equality, but for ranges, as
29:26
we were saying before, there are other operations that are more important for range values. So a range key would be, for all the range values in there, you'd specify
29:43
overlaps as part of an exclusion constraint, and for all non-range attributes, you would specify equality. And so that would give you, you know, if we kind of move back to this earlier example here, then
30:00
it would kind of simplify this because since this is a common pattern, we'd want to be able to just define, you know, range, key, room, comma, during, with maybe some annotation to tell it that during is a range or maybe it can pick that up on its own, but the idea is that that allows you to just,
30:21
you know, do the the common case thing just becomes a little bit of syntactic sugar, and then under the hood it would use the exclusion constraint. Then if you need more flexibility, of course, you could go back to the exclusion constraints.
30:45
Yeah, that would be a good example of, I was going to move on to range foreign keys, that would be a good example of range foreign keys. So here I was just talking about a range key where you're specifying in a table that may be referenced or may not be, but
31:01
just to avoid the necessity of, you know, the these are a little bit verbose just because they're flexible, right, whereas this common case here would be you could just specify with a range key and it would automatically just be kind of some syntax sugar is what I'd like to have there.
31:21
And then something that offers a slightly newer thing is so range foreign key would do more like what you were suggesting, which is in the
31:40
referenced table you'd have a range key, and then in the referencing table you would have a range foreign key that would mean that every value in the referencing side would need to be contained in the range on the referenced side.
32:04
So this would be kind of the range equivalent of a foreign key, the range analog I should say of a foreign key, and that would allow you to very simply, you know, if you have a, you know, in one table some
32:23
you know, trip that you're taking that might be for a conference, then all the activities that you have in some other table must be contained within the time that you're in that in that city. So you could imagine a system like that. Yes? Are you intending to support individual values on the referencing side, or just range it?
32:45
I would, and either one is definitely makes sense from a logical standpoint. How that's specified, you know, that might be something that you could, you know, choose so if you had this is this is a little bit open to how you specify that exactly, but absolutely either could be done, and
33:06
yeah. Right, yeah, so that's that's a good point. I mean from a
33:24
it would certainly be supported, right? That's kind of the easy, it's a subset of the other case, so I mean if you really needed to you could actually just make a singleton range anyway of that time point, which is one of the reasons why I wanted to support singleton range as well in the range case,
33:44
is to make those those cases more interchangeable. Then three and four in integer ranges, and you had a foreign key as two threes or a bound of both, that would be invalid?
34:06
That that would be my inclination would be toward invalid, and then on the referencing side if you had say another non range attribute, I mean at the typical case you'd have a non range attribute along with the range, right? So you'd have ID 123 on the referenced and on the referencing side.
34:26
So you know in that case really on the referencing side what you want is you want to combine those those two tuples because you you want ordinarily, and I mean this you know you can kind of do whatever you want, but my inclination would be that on the referencing side
34:43
you would want to collapse those ranges into one two three four you know one comma four inclusive to be a larger range. Oh
35:03
Yeah, I think I misspoke, but either way you'd want those two ranges that are adjacent if if the other attributes of interest are the same you want to collapse those into one range and that would be my inclination although perhaps it could be done the other way
35:23
but that would be my inclination is that you'd want to kind of make make some assumption that you know that the referenced side was these kind of contiguous not overlapping
35:42
ranges Yeah, so if if they're the exact same ranges you could actually use an ordinary foreign key yeah, yeah, but You'd want I think that that contiguous not overlapping is you know an interesting property maybe
36:10
whether we want to require that or not I'd have to think about how difficult it would be to not require it or whether it's too confusing to not require it or require it
36:21
But that might that's been brought up also as another potential constraint that we could enforce That can't be enforced currently, but you know the idea that you know you can't You know you can't add something that's adjacent and So that might be another constraint that you'd want to enforce and add something that's adjacent
36:46
With all the other attributes of interest being the same so other questions about the range foreign keys or range keys
37:08
So another feature that I'd really like to see and I'll show kind of an example query that applies here And this is this is important. I think Joins right now that equa joins are
37:24
Are well supported in the system right there's three different ways to do an equa join Maybe more than that if you count some of the variants and That's That's very well supported, but for joining ranges. I'll show a query here in a second
37:42
But the idea is that you match up based on the portions of the range that overlap So rather than Joining on equality. This is kind of a spatial join actually, but the idea is you're matching on the portions that overlap and this will be important for billing queries or
38:02
Actually many different kinds of queries, but I think that the most straightforward is a billing query like this where You you have maybe some kind of rates in one table and Usage in another for some service and
38:20
So somebody might start or stop using a service at an arbitrary point in time At I should say at arbitrary points in time and The rates might change At arbitrary points in time as well or perhaps based on the contract Maybe the rates change, you know an hourly or or a daily basis or something like that
38:43
So then the usage could span You know multiple multiple rates periods or a rate period could Encompass you know many users So you want to do a join and kind of take the portion that falls into that rate period and prorate it
39:06
and multiply it by the rate so That you're able to actually get a useful bill And there are many problems like this I just thought that this was the clearest case, but you're doing this kind of matching based on the portion that overlaps on each side
39:26
You know another example of this might be if you're doing kind of a fuzzy join And you want to kind of take, you know, say a log message versus some event that happens That's observed somewhere in the real world and you want to kind of correlate a logged event with you know
39:42
That's recorded with the system's time versus Another event recorded with some other systems time and you want to say okay. Well Put a window of five seconds on each one and now you have you know, these two events That are actually ranges now and you could do joins on that as well
40:02
Then you could do kind of a fuzzy join in that in that way. So the idea I mean really I call it range merge join, but the goal is I just want that query to be passed So that's that's how far I've gotten in the design Approximately I do have some ideas That I've you know proposed and some kind of pre proposals how this might work
40:25
but it turns out there's quite a few different ways to approach the problem and we'll have to see which ways look the most promising and You know, perhaps all of them have merit in which case, you know, maybe multiple
40:42
Approaches would be implemented But the goal here is to make those joints faster because I think that those are going to become more common in these temporal applications Sure Yes in in what form I guess would be would be my question
41:06
How's what kind of partitioning going to work out Oh I would like that very much. Yes. I'm not working on that But I think that especially since range types are in the in the core system now, I think that
41:27
I think that's a very good idea to kind of hold metadata about The partition ranges maybe Right, we may be even able to use a an exclusion constraint to make sure they don't overlap
41:43
You know, but yeah for a range partitioning, I think that that makes a lot of sense I You know, I really like that that idea I haven't gotten much further than that than just thinking okay This would be a good way to represent the metadata There are a few other problems with partitioning of course that we haven't settled but that might make
42:02
That change alone might make partitions scalable to more more scalable because we'd be able to exclude faster much faster than doing the kind of The constraint exclusion that we're doing now which
42:21
I'm sorry for the terminology there, but constraint exclusion is it's the partitioning terminology for excluding Based on a constraint, but it's different than exclusion constraints so Anyway, but yeah, I like that that approach and I'm
42:41
It is not here Nobody has written any code to do that that I know of I believe that some people are working on that problem But I don't have much more information than that Then I like it. I'd like it to be done, but I you know
43:01
I'm not planning to work on that personally, but I might obviously be involved somehow Okay So I was that related to merge join or was that Okay, yeah, so so back to the range merge join then I I think the idea here is to you know again
43:26
This is one of those cases. I was talking about in the beginning where to do this You know in in the system a long time ago would be very expensive It is able to thanks to range types. It's a little bit better. It's able to use maybe an index nested loop
43:43
a nest loop with an index on the inner side, but If not for that it would have to actually use a Nested loop just like for a non equa join where it's is a very, you know slow Kind of a process. So I think the range types did help a little bit with this this query here
44:03
but we really are going to need a Arrange merge join to make those useful for kind of larger Larger kinds of problems or more complex problems That you know get a little bit more into analytics
44:22
So Another feature that I'd like to see kind of going back to the original example This can be done, but I'd kind of like to see this standardized I mean, I think a lot of people have some very similar ideas what they'd like to get out of historical table log You know a lot of times that's really as far as the requirements go is we need a historical table log
44:45
There may be a few specialized requirements like whether the user is included in in Whoever created or deleted that record and other things like that that a trigger might record along with the data itself and the time
45:01
But oh got a question. I Believe the time travel option you're referring to is based on transaction IDs
45:21
And maybe associating those with With physical times. I am a little bit unclear on how that option actually worked but it was related to transaction IDs and when they were committed and then Kind of you were able to do kind of a snapshot back in time
45:40
Right, yeah, so so this is a little bit higher level than that I mean for one thing people rarely want to do that for their entire database Unless it's a very Strict environment. Usually it would just be a few tables that they really care about that have something to do with money or physical assets Or something like that So the other thing I would say is that you're able to collect actually more information this way because you have information about the user
46:06
that actually performed the operation whereas If transaction ID snapshots are all that you're using then You know, you don't have information about users and higher level, you know access control related information like that
46:23
so I think that You know this I'd like to see kind of a you know solution that's easier You know that can take kind of this simple requirement with maybe a few options and just do it have the system take care of it and I think that that would Make it much more common for people to do it for one thing
46:43
It would make the use case would be obvious. So I think it would kind of Make people more aware that it can be done rather than needing to use a trigger or something on their own And it would kind of standardize the way it's done and that would you know kind of develop
47:00
You know some expectations so You know One person wouldn't forget to include some critical piece of information Because it would be kind of one of the obvious options available to include And People would kind of know what to expect from a performance standpoint and other and other angles It would just be kind of a little bit more standardized
47:25
Other thoughts on this or questions about having it a table log. Is this something that People here would value as is kind of a quicker way to establish an audit trail for a table
47:53
Oh The You'd probably block changes to the historical log. That would be one of the things I haven't worked out all the permissions yet, but
48:08
You'd probably block changes to the history itself I
48:25
Don't think it's Going to be a perfect solution there that's going to kind of magically I mean there's you know, you've got a history there. So you want to keep the old columns Maybe you could add new columns, you know that might I don't see a reason why that would be
48:40
You know prevented But you know Maybe you'd need to kind of record the fact of exactly when that column was added so that that way you knew that Maybe some historical records not only is it just null but that they never even had the option at that point There's something so there might be some extra information you'd want to keep
49:01
But I think you could still allow adding of columns By just the nature that you're keeping all those old records means you really don't want to remove columns There's a lot of people who want to put it on the table Priority
49:20
They want to make sure that they raise something That I would call that kind of a special requirement I mean the idea, you know along with you know, when you offer something declarative to add a history log
49:45
then Personally, I think it would be a bad idea to then Also allow a way to subvert that Because then when you go back in time, you're actually not seeing the state as it was at that particular time, right? You're not it's it's false. So maybe if you had special requirement you could do that
50:04
But I don't know if I'd include that in the in the video Probably a trigger I Mean I unless there is some reason, you know
50:23
Not to I mean I think I think a lot of things could be accomplished with a trigger here. I Mean did is there did you have something in mind? Let me kind of rephrase This is there some limitation of triggers that you think would
50:51
I mean, I mean it's it's an implementation detail in this case I think the you know part of the hesitation to use triggers is that they're not declarative right there
51:01
But if we're able to offer something declarative Using a trigger as an implementation detail I think that at least some of the more some of those concerns would go away because They'd understand that it wasn't just doing some crazy thing that it had kind of a method to
51:48
I Mean I I think I think that's kind of the business requirements if you want historical pricing You either have to store the whole record and kind of you know, do some sort of
52:03
denormalized, you know kind of pull that All that information out from the other table and stick it in the original one or just log both And then that way you can do a historical kind of a join
52:27
It I think it may in in quite a few circumstances in this particular case I think what you'd be able to do is kind of do a snapshot query on both tables to get the old historical data That could be done maybe with an index kind of a scan or get that snapshot somehow
52:44
And after you have that you've highly filtered it then at that point you've got the snapshot All you care about is the equi join of those two sub results coming back together so I think you could do that with an ordinary echo join although There would be very similar queries that I think range merge join could be very useful for so I think it's very close
53:07
To a case where you'd want range merge join, but I think you could get away in that case with an echo join in the existing system
53:30
Yeah, maybe I if you wanted to do a join where you got Results from many Points in the historical time rather than taking snapshots of each
53:43
I guess as soon as you you do the snapshots of each at a particular point in time, then you've at that point Kind of made it non-temporal, but if you're not doing the snapshots if you want like a kind of whole history of Things going on and you want that joined result then yeah, I believe I believe that's true
54:26
Yeah Yeah Okay, yeah I think you could do so you're saying do it do a range join on the two history tables to come up with Kind of a new history table of the of all the records
54:44
Yeah Yeah, yeah Meaningful results Yeah, I think but that gives me kind of a better way of describing I think to Chris's question
55:03
is that you know a more intuitive way of looking at it is you could take the two history tables and join them and end up with a history table of As though you had just one table originally in the history of that So you could kind of
55:24
The the join of the history tables is the same as the history of the joined table You know I I think I think that might be a better way of describing it and see I think I have
55:40
one more kind of Idea to throw out here while I've still got a couple minutes So this is kind of a little bit a little bit out there But it's been requested quite a few times, and I think that there's a there's some Merit here, it's just that I think that we need to kind of understand the semantics of this and the use cases for those
56:03
Semantics in enough detail to really you know to really find out what we want here, but the idea is that you know you have Many different ranges you're joining them together or unioning them or aggregating them in a in a union fashion and
56:24
Obviously you can't have in the existing range types You can't have these multiple disjoint ranges that have a gap in between that just isn't representable today In range types. This is essentially saying well Yeah, you can even have gaps in there, and it's kind of like a bunch of ranges that are all
56:47
ordered from Left-to-right none of which overlap and none of which are adjacent
57:18
Yeah, yeah, I mean I'm still thinking about that one a little bit
57:22
And I don't know if I can quite respond to that one on my feet off the yeah
57:46
Right yeah, if you want to aggregate a bunch of ranges just to kind of repeat What he said we already discussed this very briefly, but if you're you have many ranges and you're aggregating Then where the aggregation operator is union
58:02
Of the range so you can imagine rather than like some aggregating on plus this would be This other range union aggregation You know Combining the ranges with the union then you don't want this kind of one range That's kind of out You know out there someplace to throw an error because you can't union the ranges because they're not touching
58:26
It was about similar yeah
58:40
Right yeah, yeah, so you can't subtract cut out the middle got out a little hole in the range So this this would allow you to do those kinds of operations I think that Like I said, I think I think that there's something here, it's been brought up several times But I would need to understand the semantics and the use cases that need those semantics a little bit better to
59:05
You know to really do a good job at something like this
59:24
I'm sorry. I can't hear you very well
59:43
Right yeah, but then we need a date new data type for those to return as well right because now that the domain Has expanded right? That's That's a good question
01:00:00
I mean, so part of it is, you know, with the array, the array doesn't have the constraint preventing, say, overlapping or misordered ranges, right? I mean, you'd ordinarily want them to be ordered and not overlapping, and so that constraint would be difficult. There's another problem with that, if I recall, but, I mean, essentially, you know,
01:00:28
it's similar to that. That's also been a reason why it hasn't been implemented already, is because we have this array of ranges, and so that can solve some of these use cases.
01:00:42
Even if it's user-aware, is there a way you can cast a range and then a set of ranges? Is there a way of, what's left over? Is there a schedule system? What hours open, type of? Yeah, that would be an interesting operation.
01:01:02
So, yeah, I think probably the way this will go, this is out of the ideas I've presented. I think this is the most, you know, kind of at least from a semantic standpoint, the least understood. So I think one way we might kind of wait to see how people use arrays of ranges to
01:01:21
solve these problems and then see where that is weak and then see if we can make a use case for multi-ranges from that. Yes? It's scary in places, and I think we specified, think about having a web interface, trying
01:01:51
to represent a date range is pretty easy, a date here, a date and time here, date and time here, and say set those two things.
01:02:02
That feels okay. You start having to have a recursive object to control it. Well, if you're talking about from the user's standpoint, maybe it's better to represent it as a picture.
01:02:26
Well, then the script can worry about the nested data structures. So yeah, I think that's a very good point, right?
01:02:42
I mean, there's not all that many interfaces that offer that much, and that kind of indicates the complexity of the problem, you know, and therefore kind of limits the use cases. So I think that I was responding a little bit tongue-in-cheek, but ... This is like the strict relational theory system.
01:03:05
Your database consists of ... So this may be the same sort of thing.
01:03:21
You probably don't very often want to have multi-ranges as part of the original input, or as part of a table, but for this to be part of the interim result is perfectly reasonable. I mean, I would say from a theoretical standpoint, there's nothing that prevents you from having
01:03:50
complex types, and those complex types can be anything. So I would say that I wouldn't reject it on theoretical grounds, but I think that from
01:04:01
a best practices standpoint, if you find yourself using lots of multi-ranges and empty ranges in your base tables, then there's probably some design flaw there. There's probably a schema design problem, and that they are more typically useful as
01:04:22
intermediate results because of the closure. So I would agree partially with what you're saying.
01:04:43
So yeah, I think there's a lot of connection between what I've been working on and the spatial data types and work that those two guys have been working on, and I'm hoping
01:05:03
to take advantage of some of that, and they've showed an interest in some of that.
01:05:22
You're talking about the range merge join now? I think so. I was thinking about that a little bit during their talk. I would need to think a little bit more again. I'm not quite sure if I can respond to that on my feet. I would like to spend some more time thinking about that, because if we're able to do those
01:05:46
similarity joins more efficiently, because they're using an index nest loop essentially, and so if they had a better way of doing that, I think it would work with some of their similarity searches and some of their strategies.
01:06:06
I think you can do it faster with some kind of a join than with the index nest loop kind of approach that they were using. I've got another question back there.
01:06:38
My proposal that I submitted to hackers involved the presorting phase, but that is
01:06:48
not the only approach, so that's one of the things that we'll need to discuss. So the algorithm was kind of presorting based on the lower bound of the range, and then
01:07:03
doing kind of a merge strategy. Yeah, I'm hesitant to say there's always some kind of pre-processing of the data at run time, right? So if you have like, it's going to be involving some kind of a temporary index or a sort order,
01:07:24
or something along those lines. So I'm hesitant to say that it wouldn't do anything, because in a way creating a temporary index is kind of sorting in a way. So anyway, but I think you see where I'm going, and there's a, right.
01:07:58
Yeah, I mean, I wouldn't think of it as very free, because I mean,
01:08:02
this is not an order that would be semantically useful in some other way. So it's not like we happen to have it sorted already because that's the order they put it in, or we happen to have it sorted this way because it happens, the user wanted it in that order anyway. You know, because this would be kind of a strange order, right, by lower bound.
01:08:22
And not that it would be useless, but just that I expect those opportunities to be less than what the query planner is geared toward finding already. But I mean, it's all kind of open for discussion, and I think, you know,
01:08:40
to that extent, you know, I think that if the planner can find this sort order already, and if it's already there, I think that this strategy that I proposed could have a lot of promise. Oh, Kevin. You know, I've just been thinking about disjoint ranges, and it seems like that
01:09:00
would be a good solution for certain classes of scheduling. I'm not sure if you can think of any other solution. For example, in court, you know, you have two attorneys and a judge, and they want to schedule a subsequent, you know, meeting. They've got to say, within this date range, what times are available
01:09:25
where all three people don't have something scheduled. And that seems like you could take, you know, a date range, a range of, well, disjoint set of working hours within a date range and
01:09:41
exclude scheduled activities and get a result that would be very useful. So this sounds like multi-range kind of use case, potentially. Okay, so that, I think, I like that use case. I mean, I think that that makes sense.
01:10:00
You'd be able to kind of cut pieces out of the schedule and then search within there. Yeah, I think that's a good use case. I'll have to think about it some more, but I think that's a good direction to go.
01:10:22
Okay, thank you everyone. Thank you.