
Lies, damned lies, and statistics


Formal Metadata

Title
Lies, damned lies, and statistics
Subtitle
A journey into the PostgreSQL statistics subsystem
Number of Parts
19
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Before executing an SQL query, Postgres needs to decide on an execution plan. While there are multiple steps involved in generating a plan, this talk focuses on the main source of inputs for the query planner machinery, namely the statistics subsystem. Detecting problems with statistics is often a crucial step towards finding the reasons for bad plans and slow queries, so it's important to understand exactly what gets tracked and how it is used. As the universal rule of Garbage In, Garbage Out teaches us, without some degree of knowledge about the shape of the queried data, it is impossible to get a reasonable execution plan. PostgreSQL employs a number of ways to maintain up-to-date statistics about table sizes and the distribution of data inside them, which in turn inform the query planner. In this talk we'll explore what statistical information is tracked by Postgres, how it is calculated, where the server stores it, how the operator can query it, and finally how to tweak the whole system. Maintaining up-to-date statistics needs to balance performance overhead with information quality: for a large database you can't just read everything and calculate some ratios, so the system employs a number of clever algorithms, which we will examine. Another concern is that some data types require statistical information tailored to the query patterns typical for that type. A column holding an array is rarely queried for exact matches, but it is often queried with a "contains" operator. We'll cover special cases such as arrays, full text search vectors and ranges, and talk in more depth about how statistics for them are gathered. Finally, we'll look at some of the weaknesses of the statistics subsystem and areas where it could still be improved.
Transcript: English (auto-generated)
So let's start. For the people who want to follow the slides on your laptops, you can get them from there — it's w-u-l-c-e-r dot org, slash, and the name of the talk, dot pdf.
They'll be in the schedule afterwards, but if you just want to follow along, you can get them from there. I'm also going to be showing them here, so you don't need to. So, my name's Jan, I work at a place called New Relic, and I'm going to be talking about statistics
and the statistics subsystem of PostgreSQL. So the talk is going to be structured like that. We're going to start with just an overview and ask ourselves a question, why do we even need statistics? Then we'll see what statistics get calculated.
So what actual statistical information does PostgreSQL track? And then how can you look at them, how can you explore them, so you know where they live and what they look like. Yes? Say again? The URL — oh, yeah, sorry, I'm just going to switch to that.
And then hopefully, I remember the outline of my talk. Then the second part of the talk is going to be the nitty-gritty. So we're going to see what's the actual math behind calculating those stats, what algorithms do we use. And then for people who survive the math part,
we're going to go a bit into configuration. And then we're going to talk about some of the more advanced features that the PostgreSQL system offers, and specifically some of the improvements to the statistics subsystem in version 10.
So let's kick it off with why do we need to gather statistics. It's a database. We know it stores data. You can query it with SQL. But why do we need statistics? So a short refresher, a SQL query gets executed in three steps. It gets parsed.
So we send in text. It gets parsed into something that the machine can understand. It gets planned. So the system decides, I'm going to execute it in a certain way. And then it gets executed, which is where stuff gets read from disk and processed and calculated, computed. So statistics are used in the second phase
of this, which is planning. To figure out a good way to run a query, we need to have some information about the data, the table, before we actually run it. So the planning algorithm needs some input — like every algorithm, it needs inputs
to produce different outputs. And the inputs come from the statistics subsystem; what feeds the planner is the statistics. So for people who were here for the previous talk, and for people who were at Alex's talk about adaptive query
optimization, the problem of cardinality is one of the fundamental problems when you do query planning. So one of the basic things the planner has to know or has to figure out before it runs a query is, how many rows will this thing return? And if it's a top-level thing,
it doesn't really care, because it's just going to return those rows to the client. But for all the parts of the plan below that, which feed the higher-level parts of the plan, it needs to know how many rows those things will return, so it knows how to decide on various things. Like, for instance, when you join two things and say
you want to do a nested loop — you're not going to blindly scan one thing and, for each row in it, execute the whole scan on the other side; you'd like to get the order right. So you'd like to re-scan the small thing a lot of times
and not re-scan the big thing a lot of times. So you need to know which of the things is bigger. You need to also decide whether it's worth it to use an index to read through a table or if the table is small enough that you can just read it all. You need to be able to decide if you need to materialize or if you can materialize part of the result
or whether it will just be too big to fit in memory and you'll lose out if you try to materialize it. So it's very important to know how many rows an expression will return. It's a simple problem statement. So how do you figure out how many rows something will return? Well, usually rows come from tables, so you need to know, for a table,
how many rows there are in it. That's straightforward — you just need to know it, and we'll see how. And then you need to know how many of these rows will not make it to the output because of WHERE clauses — so, restrictions on that particular table scan.
So those are the two things. And whereas the total number of rows is conceptually easy, just have to know how many rows are there, estimating the selectivity of a where clause might be much more difficult. And even conceptually, it's not that obvious how the hell we can even do that.
So here are some examples. And they're increasingly complex for the planner to figure out what to do. If you just do a select star, then if you know the number of rows, then you know how many rows you'll get back. That's a simple case.
If you want to get all the stores that are in Alberta, it gets a bit more tricky, because you suddenly need to know how many of those stores are in Alberta. And if you don't have the whole table in memory, you can't just go and check — it's not that simple to do.
How would you go about figuring out whether it's all of the rows, some of the rows, or maybe none of the rows, because it's a database of stores in Brazil and none of them is in Alberta? That's one of the things you have to figure out. Then you have things like select star from places
where population is greater than something. And that's certainly more difficult, because with stores in provinces you might somehow get those stats and say, oh, I know this fraction of the store rows is in Alberta. But population values are spread out — every place has a different population.
So you can't just say, oh, I'm just going to calculate how many of those have population two million, how many have two million and one, two million and two — you just can't do that. You have to have some other kind of statistics. Then you have multiple conditions. Say you know how many things are in a given province
and you know how many things are in a given zip code, but you need to combine those and figure out how many rows satisfy both constraints. So you have to somehow build up that knowledge from the basics. And even those are sort of the simplest cases,
because then you get joins, and when you get joins, it starts getting more and more complex. If you have a join like that, it's not just that you have to know how many places you have in a given province and how many stores in a given province — you have to know how many of them match,
what the product of those will be when you match on province. And then it gets more and more complex as you tack on more clauses. And this is simple SQL — your typical application will generate much more complex SQL. So it's not that simple, but it's something we need in order to be able to choose reasonable plans.
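To make that concrete, the kinds of queries being described might look something like this — the stores and places tables, columns and values here are purely hypothetical illustrations:

    -- 1. Plain scan: the table's row count alone answers "how many rows come back?"
    SELECT * FROM stores;

    -- 2. Equality condition: needs the fraction of stores that are in Alberta
    SELECT * FROM stores WHERE province = 'Alberta';

    -- 3. Range condition: needs to know how population values are distributed
    SELECT * FROM places WHERE population > 2000000;

    -- 4. Two ANDed conditions: the two selectivities have to be combined somehow
    SELECT * FROM stores WHERE province = 'Alberta' AND zip = 'T5J';

    -- 5. A join: the planner needs to estimate how many rows actually match up
    SELECT * FROM stores s JOIN places p ON s.province = p.province;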
So those are the kinds of problems that the planner comes up against. And how does it get the information to solve them — what information does it have?
Obviously you can't gather too much information, because then it would just be slow; you'd have to store a lot of information that's not really pertinent to what the user is storing. So the overhead has to be balanced. Let's start with table statistics — stats that are kept for every table in the system.
Obviously the number of rows in the table is something that's useful, so that gets tracked. Now, it's obviously not precisely the number of rows, because you don't want to incur the overhead of precisely tracking how many rows are in each table.
So it's an estimate. And it's also not quite as useful as you'd think: it's useful for estimating selectivity and how many rows something will produce, but it doesn't really tell you how expensive it is to scan a table, because everything gets scanned block by block.
So disk block by disk block. And it's the same whether you just want one row from that block or all of the rows — it's more or less the same cost. So another thing that gets tracked about every table is how many disk pages the table uses, and this determines how expensive it will be to scan the whole table.
What's interesting is that those stats are also kept for every index, because indexes look sort of like tables: there's a file on disk, there are tuples in them, and they sit on disk pages. So it's not only for tables. Anyway, those were the stats for the whole table.
And then every column in every table has its own individual statistics calculated. So actually most of the planner stats are calculated on a per-column basis, and we have this information for every column. So, starting with the ones that are more obvious,
if you think, oh, what information would be cool to have about every column? The fraction of null rows. That's useful because a lot of the operations will only affect non-null rows, a lot of joins. You won't get any rows that match up with nulls. So it's just plain old useful.
Even if you don't have any information when you do a select blah, blah, where something equals 3, you know that 3 will never match up with a null. So you can straight off the bat discard the fraction that's null. So that's useful information. The width of the column in bytes is sort of useful.
You need to know: if I'm going to fetch this, how much memory will it take? If you have a text column where you store big chunks of text, it'll be big; if you have a Boolean column, it's going to be small. The number of distinct values
is something that the system tracks. And it uses a little trick. The number that's stored as a statistic can be positive or negative. If it's positive, it's just the number of distinct things. So if you have stores in provinces, there's only so many provinces in Canada.
So it's just going to be a number. You can just say there are — well, here I am, I think there's 11, but then there's the territories — so it's going to be some small number. But sometimes there are just too many of them.
Take the populations of cities in Canada: that's probably distinct, or almost distinct, because no two cities are going to have exactly the same number of people in them. But you can't always say, I know there are exactly 33 million different populations.
So if the number is negative, the system interprets it as the fraction of rows that are distinct. So if you have a million rows and only 500,000 different values in them, then more or less every value is shared by two rows —
for every row, there's another row with the same value — and this number is going to be -0.5. It's a trick to compress that information into one floating point field.
Then there's stuff that's yet more involved. So it does keep an array of the most common values that appear in the table. So even if there's a lot of different values, some of them
might be much more frequent than the others. It's going to keep those; they're going to be called the most common values. And there's a companion array, which is the frequency of those most common values. So if you have your typical distribution where one thing is much more common than the long tail,
you're going to have that thing in the most-common-values array, and its frequency in the companion frequencies array. So it's two companion arrays for every column. Then you have a histogram of the values
if they can be ordered. Most of the types in Postgres, and in general, can be ordered — numbers, strings — they have an order. And if you have an order, you can put the values in that order and then figure out which values divide them
into buckets of equal size. And this will be your histogram. If you think about it, this is useful for queries like SELECT ... WHERE population > something, because you can look at your histogram and say, oh, this value falls into this bin.
So I know all the values I'm going to get are in those last two bins — if you have 100 bins, I know I won't get more than 2 out of 100 of the rows. So the histogram is a way to estimate the selectivity
of things that are continuous, like numeric fields. And finally, it stores a correlation number: the statistical correlation between the physical order
of the things on disk and the logical order of those things. So if you have a timestamp column in an events table and you're always writing at the current time, the data on disk is going to be laid out in the same order as the timestamps in the tuples.
So this means if you scan it in index order, which means you go timestamp by timestamp, it's going to more or less be the same as if you would scan it in sequential order. Because the index is going to hand you pages in disk order.
Whereas if you just have random stuff in the table, an index scan means you'll be jumping around, reading from different parts of the table. And the planner uses that to decide whether an index scan is going to incur a lot of random I/O or not.
And those are the things that get tracked. Yep? So the question is, why does the histogram exclude the most common values? It's just a trick to get even more information.
When you calculate the histogram, it actually excludes the stuff it puts in the most common values and calculates the histogram over the rest, because for those values you know the exact frequency — you don't need a bin in the histogram for them. So it's just an optimization to make the histogram more precise. And does the planner look at both?
Yeah, it looks at both. If it finds the value in the most common values, it uses that; if not, it uses the histogram, keeping in mind that some values have been excluded because of the most common values. So that's sort of the stock thing that you get.
For every column, in every table of your database, that gets calculated. It's actually a pretty handy thing to look at as a DBA if you want to explore your data: you have this pre-calculated for you with reasonable precision, and you can just go and look at it. So some of this is pretty handy.
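A sketch of what that exploration looks like — the column names are the real pg_stats columns, the table name is hypothetical:

    -- Per-column planner statistics, as exposed by the user-friendly pg_stats view
    SELECT attname, null_frac, avg_width, n_distinct,
           most_common_vals, most_common_freqs,
           histogram_bounds, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'stores';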
Now, some data types have their own statistical needs. So because a lot of this stuff is pluggable, you can actually, with your data type, provide statistical routines to calculate your own things.
If you say, oh, I don't need a histogram for my data type. I need something else. Then you can plug this in. And then because all these stats are stored in a single table, the schema for that table has to be pretty flexible because you have to be able to put
into the same table information about a floating point column, a text column, an inet address column, everything. So the schema is pretty flexible. And so if you plug something in, you can also take advantage of that and use the flexibility of that schema.
We'll see what the schema looks like in a moment. And then there's a million other features that Postgres has that have to interact with this, like foreign tables. If you write a foreign data wrapper, you can provide your own statistics, implementation. So if you have a Redis foreign table and you want to have some stats about your Redis keys,
you can do that, but then you have to write the code that does that. But you need to be able to do that to make the planner work because this is the input for the planner, and it's most of the input for the planner. So it's garbage in, garbage out. If the stats are bad, the plans will be bad, necessarily.
And another neat thing that you need to be aware of is that those per-column stats are also gathered for expression indexes. In the statistics system, there's a lot of interweaving behaviors that come from different features of Postgres that
need to plug into this. For an expression index, you will need the histograms and things like that, because you need to know how selective the expression will be. For a regular index, you don't really need that, because you can use the column statistics. But an expression can change the distribution.
If you index on lower(something), you will get fewer distinct values than if you had indexed the actual text. So expression indexes also get this statistics treatment. Excuse me — yep? The question is whether the statistics for a foreign table come from the foreign server. You have to write your own code.
So just write your own C code, like you can for your own data type. For a foreign table, you can write your own implementation, and you're on the hook for doing that. Yep.
So, to repeat the questions: the previous question was whether FDWs fetch their own statistics, and the answer is that you have to write that code — it's your responsibility. The question now was whether expression index statistics look like column statistics, and yes, they have histograms, most common values, frequencies, the whole thing.
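As a hedged sketch of that last point, with a hypothetical table and index — the expression's statistics should show up in pg_stats under the index's name rather than the table's:

    -- An expression index gets its own statistics entry after an ANALYZE
    CREATE INDEX stores_lower_name_idx ON stores (lower(name));
    ANALYZE stores;

    SELECT tablename, attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'stores_lower_name_idx';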
OK. So how do you actually go and see what's in your stats tables? pg_class keeps the table-wide statistics — that's where information about tables lives, so pg_class looks like a good place to put it. There are two columns there: relpages is the number of disk pages.
This is actually a pretty exact number, because the storage manager pretty much knows how big the tables are. reltuples is the number of rows. It's approximate, but it's a pretty darn good approximation, so a lot of monitoring systems could just
use that instead of running count(*) on the table, which is costly. Just look at pg_class and get reltuples — it's more than enough for trending, for seeing long-term how big your table is. And pg_class also has an entry for every index, like we mentioned.
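For example, a cheap approximation of table and index sizes, instead of a costly count(*) — the names in the WHERE clause are hypothetical:

    -- relpages is close to exact; reltuples is an estimate, good enough for trending
    SELECT relname, relkind, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('stores', 'stores_pkey');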
Now, the per-column stuff lives in a catalog called pg_statistic. And as we said, there are two kinds of those statistics. There are the ones that are data-type independent: any data type can be null, any data type has a width, and any data type that has equality —
which is almost all of them — has a notion of distinct values, so it's a concept you can have for pretty much any data type. And the remainder of the table is exactly five slots, where a slot is a series of columns:
an integer code that defines what kind of thing lives in that slot; an operator — if it's something that has to do with ordering, like a histogram, you need to know which operator was used to calculate it, because you can have different ways of ordering things;
an anyarray for the values of the column — this is where the flexibility comes in, because an anyarray can hold anything, so each row of the same pg_statistic table will have an array of different things depending on which column it's for; and an array of floating point numbers,
which is the actual measurements. So, say, most common values and their frequencies will be stored like this: you'll have a code that says this is the MCV, most-common-values, slot; you'll have the equality operator, I think; then you'll have the values that are the most common things — text
if it's a text column, floats if it's float, ranges if it's a range — and the actual frequencies as real values. So it's a pretty tortuous way of modeling that; this would get you an F in Relational Design 101.
You have a code that says what this thing is, and then you have five of those hard-coded slots. But it's flexible enough to keep everything we have in one table, so it does the job. And then there's the thing that we actually want to use, which I saved for last, which is pg_stats. It's a view that's built on top of pg_statistic
that's more user friendly: instead of OIDs, it has actual table names; instead of slots that are just called stanumbers1, stanumbers2 and so on, it has actual names for those slots; and, what's most important, it's readable for everyone,
whereas pg_statistic is superuser only. pg_stats makes sure you can't look at table columns you don't have access to, and pg_statistic has to be clamped down to superuser because you don't want users to even be able to infer or see the statistics of columns they can't access.
It actually shows some data from inside the table. So pg_stats is what you would use to explore the data, look at what's in it, and play with it. So, going back to the example queries: the first one will just use reltuples to estimate.
The second one will also use reltuples, and then, if Alberta is in the most-common-values array — as it probably will be — the planner just knows the frequency: it multiplies the number of rows by the fraction of non-null rows, and then multiplies that by the fraction of rows
that have Alberta in them, because it has that stat. You have an array of the most common values — these are the values you think are the most common —
and you know the size of the array; it's a Postgres array, so the planner knows its size as it reads it from the catalog pg_statistic.
So the third one is going to use a histogram. The fourth one is probably going to use two: statistics for the zip code and statistics for the province, and it will multiply those selectivities, and so on, and so on. I don't want to go into how the planner works, because I want to focus on how the statistics subsystem
really works — the planner is a bigger subject — but we can see how this feeds it that information. At least now we have some info to work with. Right, so we know what lives in those tables and what it means. How does it get there? Well, we all know analyze is the thing
that updates statistics. This is sort of common knowledge. Interestingly, vacuum updates some of the statistics. But analyze is the thing that actually does most of the job. So just a plain analyze will loop over all the tables and update the stats. So we're actually going to focus
on analyzing a single table, because it's the same thing — analyze on the whole cluster, or the database, will just loop and run the same code for every table. And you can actually tell analyze to analyze a specific column only, because the same code more or less gets applied to every column.
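Concretely, that's just the following, with a hypothetical table and column:

    ANALYZE;                     -- the whole database
    ANALYZE stores;              -- a single table
    ANALYZE stores (province);   -- a single column of that table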
So it does the table-wide thing, and then it goes over every column and does the per-column thing. Vacuum also updates the pg_class stats, because vacuum has a pretty good idea of how many blocks a table has and how many rows are there: it knows, from the stuff it has seen,
that the rows are this wide, so that many fit in a block, and the number of blocks is x, so it can just multiply and store that as its reltuples estimate. There are a couple of other commands that have to scan the whole table, so they take advantage and update pg_class
so you have more up-to-date info. So, analyze internally: let's focus on a single analyze of, say, a single column. How does it work? What are the steps? Well, first and foremost, before it does anything, it switches to the security context of the table owner. So everything that runs after, all the code
that runs after gets run in the context of the table owner. And that's pretty important, because there's at least three vulnerabilities that came out of not doing that correctly. It's not that simple to do. Analyze will call index functions, for instance.
And an index function can do whatever. Can write into tables. You can write index functions in procedural languages. So you have to make sure that analyze will not run in the context of the user who actually typed out the command, but rather the owner of the table.
Then it figures out how to analyze every column. Then it determines how many rows it needs to fetch, fetches them, does the math, and stores them in the table. So there are two things that are hard here. How many rows do we need to fetch? Because we obviously don't want to scan the whole table.
We want to fetch some portion of the rows. And how do we actually, from that sample, extrapolate and calculate stats about the whole table? So that's where we get into the math. Now, you don't normally need to run analyze yourself. The autovacuum worker will run it for you
when it notices that a given fraction of the rows has changed. So there are configuration variables that say: if 20% of the table rows have changed, this table needs an analyze, and it'll have an analyze scheduled on it. If you do a bulk load or, as was said earlier at the conference,
if you do an upgrade with pg_upgrade, you will have no stats. And if you have no stats, your plans will be super bad — having no stats basically means you can't use your database. So after you run a bulk load, it's important to run an analyze as part of your loading process
so you can actually start using the data. Otherwise, Postgres will give you very bad plans. And of course, if you disable autovacuum, you won't have stats. So it's not only vacuum that's on you when you disable that. It's also running analyze. So just as an aside, if you look at Postgres,
there's a process that's called the stats collector. It doesn't have a lot to do with this. It tracks what I — I don't know if there's a good term for it — call runtime stats: how many times the table has been scanned, how many times an index has been scanned, how many blocks have been fetched from it. It's more user and admin stats, not planner statistics.
So those are different things, and people sometimes get confused when they have problems: my stats collector process is eating my CPU, I need to run analyze less often. No — those are different systems.
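For contrast, those runtime stats live in the pg_stat_* views rather than in pg_statistic; something like this (hypothetical table name) shows them:

    -- Activity counters from the stats collector, not planner statistics
    SELECT seq_scan, idx_scan, n_live_tup, n_dead_tup,
           last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'stores';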
So let's jump into the math. First order of business is: how many rows do we need to fetch? The calculation of the number of rows to fetch starts with the histogram size that we want. So if you want a histogram of 100 buckets, this is the input variable for the algorithm that decides
how many rows to fetch. Histograms are those values that divide the space of the values in the column into bins of equal size. Then there's the concept of a relative bin size error: how much does the worst bin you generated
in your histogram deviate from what you'd get if you had actually scanned the whole thing and had perfectly sized bins — we'll see an example in a moment. So this is how skewed, how off, your histogram is: the relative bin size error. And the error probability in those calculations
is the chance that, if you assume a bin size error, you won't actually stay within it. So if I say I don't want my bin size error to be bigger than foo, the error probability is the chance that I will actually get something worse than foo. So let's look at a simple example.
If you have a column — yes? Is the target histogram size the statistics target? Yes. So the question was whether the target histogram size is actually the statistics target, and it is: that number gets translated into the histogram size.
So if you have a column that has 1,000 values, from 1 to 1,000, and you want a histogram of four bins, you will get five boundary values, and they should be 1, 250, 500, 750 and 1,000 — that divides the values into equal-sized bins. If instead you calculated boundaries of 1, 300, then 550, and so on,
your biggest bin is of size 300, while the ideal bin is 250. That's the relative bin size error: (300 - 250) / 250 = 0.2, so in loose terms you're 20% off.
It makes sense as a measure of how bad my histogram is. So what you want to do is bound that value: you want to make sure your histogram is not worse than something. This is how you arrive at the number of things you want to sample, and there's a formula that tells you
how many rows you need to fetch if you don't want your bins to be off by more than something, with the probability of something else. So Postgres just substitutes values in. It says: bin size error 0.5 — I don't want my histograms to be more than 50% off
from what they should be. It says the number of rows in the table is 1 million — it doesn't actually know that, but it just assumes it, and we'll see that that's not really too bad. And then it says: I want a 0.01 chance that I'll get something worse than 0.5.
So I want to be pretty sure that I won't get something worse than 50% off. You put the numbers in, and you get about 300 rows per bucket. So if you want a histogram of size 100, you need to fetch 30,000 rows. That's what the math tells you, assuming the table has 1 million rows.
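For reference, the bound being described here (from the Chaudhuri, Motwani and Narasayya paper that Postgres's analyze code cites) looks roughly like this, with k the histogram size, f the relative bin size error, gamma the error probability, and n the assumed table size:

    r \ge \frac{4\,k\,\ln(2n/\gamma)}{f^2},
    \qquad
    r = \frac{4\,k\,\ln(2 \cdot 10^6 / 0.01)}{0.5^2} \approx 306\,k

which is where the roughly 300 rows per histogram bucket comes from.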
And it's an assumption that's pretty reasonable, because the n there is under a logarithm. So even if it's much bigger, the overall effect is not that pronounced. So it's kind of fine to say, let's just say 1 million. If it's 10 million, it's not throwing us off too much.
So 300 — we have a number; that's more or less what we just said. Now we know how many rows. It shouldn't matter which rows you pick — it's a statistical sample —
but for a histogram you should pick a random subset, so it has to be properly random. In a moment we'll talk about how you get a properly random subset; for now you just know that to have the error bounded by those limits, you need that many rows. That's what the math tells you.
So can't you just take whatever rows are easiest to get? No, it has to be random, because otherwise you're just going to count your hot set, so it's not really going to be representative. It needs to be properly random. So that's the next step: how do you get those rows?
How do you decide, from your 3-million-row table, which rows you get? There are two steps. First you sample blocks — you decide which blocks you're going to sample, because reading blocks is the expensive part — and then from the blocks you take the rows. So we actually sample that many blocks, even though every block has a lot of rows.
If our number is 300 times 100, we will sample that many blocks. And we're going to use an algorithm called Knuth's Algorithm S, which he gives in one of his books, where you basically go through the blocks
and, with some probability, take each one into the sample. The probability gets adjusted as you go further into the table, and the math works out such that, when you've gone through the whole thing, you will have sampled a random subset.
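Roughly, the rule in Knuth's Algorithm S is: if you still need m blocks and M blocks of the table remain to be considered, take the next block with probability

    P(\text{take the next block}) = \frac{m}{M}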
And if you see you don't have enough blocks to fill your sample, you just take all the rest of them. And so this is how you decide which blocks to take. Now, you read the blocks. And then every block has a lot of rows. So we need to go reading rows from the block and figuring out which rows to take,
because you don't need that many blocks — you need that many rows, and each block can have a lot of rows in it. Now, for rows you can't use Algorithm S, because it requires you to know the total number of things you're sampling from: you know how many blocks a table has, but you don't know how many rows there
will be in those blocks, because rows are of variable size. So we can't use Algorithm S. Something a bit more involved is used, and it's called Algorithm Z; that's by Vitter. It's a variation of something called
reservoir sampling, which people might have heard of. So reservoir sampling is a way of sampling stuff that you don't know the size of. And algorithm Z, all of these sound like Tesla model names, but they just call them like that. I don't know why. So algorithm Z is a tricky way of doing
that, one which doesn't require you to generate a lot of random numbers, because getting random numbers is expensive. The naive version works simply: you take the first n rows and say, this is my sample. Then for each subsequent row, you either pick it or skip it, with probability equal to the sample size
divided by the row number, so it gets less and less probable that you will pick it. And if you do pick it, because your sample is full, you choose one thing from your sample at random, throw it away, and put the new row in its place. That's more or less how the naive version works; Postgres does it more smartly, so that it doesn't have random numbers
generated for every single row. So we have rows. Once we have the rows, it's mostly simple: fraction of null values, easy; histogram, easy — you just look at the sample and pick the histogram bounds. The one thing that's a bit more difficult is distinct values, because you just have a sample,
and you need to have an estimate of how many distinct values there are in total in the table. The heuristic here is pretty simple. If every row in the sample is different, we assume the column is unique — if we didn't get any duplicates, we just assume it all has to be unique.
If every value in the sample is duplicated, we assume we have seen all the values there are; we say these are all the values in the table, and that's the n_distinct. So if you have a Boolean column, in your sample you'll have a bunch of trues and a bunch of falses,
and you'll just figure out, oh, it has to be those two things. If neither of those holds — some values are unique and some are duplicated — there's a formula; again, papers, math, publications tell you that this estimator is sort of the thing you should use
as the number of distinct things. So that's the mathy part. And personally, it's really cool to see that all this math is actually used for something practical. You can download a paper, read it, put it in code, and it gives you better plans. This is actually pretty satisfying.
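For the curious, the estimator used in that in-between case — the Haas–Stokes estimator referenced in Postgres's analyze code — looks roughly like this, where n is the sample size, N the estimated total row count, d the number of distinct values seen in the sample, and f_1 the number of values seen exactly once:

    \hat{d} = \frac{n\,d}{\,n - f_1 + f_1\,n/N\,}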
So, configuration — there's not a lot of configuration. You can set default_statistics_target, which translates to the histogram width, and from this everything else follows: how many rows you sample, and so on and so forth. You can tell a table how often it needs to be analyzed —
what fraction of rows need to be updated before an analyze is due — and there's a minimum threshold saying: if it's a small table and 20% of it got updated but that's fewer than, say, 10 rows in total, don't bother analyzing, it's fine. That's the whole config;
the rest is math and magic, same thing. So it's not super configurable. But sometimes, if you get a bad plan, you can raise the statistics target for the table. It'll take a bit longer to analyze and a bit more storage to keep the stats, but you'll have better selectivity estimates. So it might be something to do if your plan is bad.
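As a sketch, bumping the target for one column and re-analyzing might look like this — the table, column and numbers are hypothetical; the analyze-frequency knobs are the autovacuum storage parameters:

    -- Raise the statistics target for a single column, then refresh its stats
    ALTER TABLE stores ALTER COLUMN province SET STATISTICS 500;
    ANALYZE stores (province);

    -- Analyze this table once 5% of its rows change, with a floor of 100 rows
    ALTER TABLE stores SET (autovacuum_analyze_scale_factor = 0.05,
                            autovacuum_analyze_threshold = 100);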
Just bump it and analyze. So let's get into some of the meatier stuff, which is specific data types. For arrays, for instance, it doesn't make much sense to keep most common values,
because no two arrays are going to be exactly the same. What you actually want is what the distinct values inside those arrays are — we need to unpack them and look at counts of the things inside them. That's where those type-specific analysis functions come into play: they unpack those arrays,
look inside, and construct histograms of different things. tsvectors are sort of like arrays, but for text — again, you need specific stats calculations for that. Ranges are another case; we're not going to go into detail there, but you can imagine that for a range
you don't really care how many exactly identical ranges there are in the table — you need to know about the bounds of the ranges, so you need to unpack the data and get the stats from those bounds. And so finally, a sneak preview of what's coming in 10.
So you have this query, and to estimate selectivity the planner will say: I know that many stores are in zip code Y0A and that many are in Yukon, so I'll multiply those and assume that's how many rows I'll get.
That's wrong, because if something is in Y0A, it's in Yukon for sure — but the planner doesn't know that. This is correlated data, and it's a problem you run into with queries like that. It's been a problem for a long time; there have been two talks already that touched on this. The solution that you get in 10.0
is that you can explicitly tell the planner those two things are correlated: keep statistics about their correlation, and then use them when you have that WHERE with the two ANDed conditions. You have to do it explicitly, because it's too much overhead to calculate this
for every combination of two columns — and you can have three columns or more. But it is now possible in 10 to tell the planner, hey, those two columns depend on one another. So the zip code determines the province — the province is functionally dependent on the zip code.
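In PostgreSQL 10 syntax, declaring that dependency might look like this — the table and the statistics object name are hypothetical:

    -- Ask the planner to track functional dependencies between the two columns
    CREATE STATISTICS stores_zip_province (dependencies) ON zip, province FROM stores;
    ANALYZE stores;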
And so we can't just say, oh, this determines this. And that's it, because sometimes you have input errors, user errors, corner cases. So what actually gets stored is how strong the dependency is. So maybe there is one store that
got the wrong zip code attached. So we can't just say it's always definitely going to be in Yukon, but we can say it's pretty much always going to be in Yukon. So there's a number that stores that. So, is it meaningful for the end user to interrogate these values?
Is this maybe an estimate of how good his data is? Sure. So the question is whether it's useful to use that information to know how much garbage, how much bad data, you have. Sure, it's useful. If you create a correlation statistic between two columns,
you can then query it from pg_statistic_ext, and you will have a number. And if that number is not 0.99 but more like 0.3, then you know that something's wrong in your data set. And there's a similar thing for distinct,
because that's another typical problem. If you group by, you won't get every province-zip combination; you'll get fewer distinct things, because everything in one zip code will be in the same province. So you can also tell the system: hey, calculate the number of distinct values for a pair,
not just for each column — do this for two columns together. We're sort of running out of time, so I can take one last question, then we'll have to wrap it up. I just wanted to make sure: by default there's nothing, it has to be specified manually? Yeah — it's always been the case that there were no correlation stats, just per-column stats, always.
There was no way to tell the system, hey, be aware of this correlation, even. Now there is, but it has to be explicit, because there are just too many combinations to calculate them all the time. Another gotcha there is that if you query with a zip code
and a province that don't actually match, the system doesn't know that they don't match. The system knows that the columns are correlated; it doesn't know anything about the values. So this will give you a bad estimate — a worse estimate, even — because it will say: oh, you constrained the zip code and the province, but that's not really that strong a constraint,
because one determines the other. So if you do a query that's not consistent with the dependency, you will get a bad estimate, which is something you have to be aware of. If your system generates a lot of queries like that, you might actually end up worse off.
Yep? Does it assume any specific data distribution, or does it work for any distribution? So the question is whether the math makes sense
for any data distribution. It should — yeah, it should. If the data is badly skewed, you might miss things, and if you're just unlucky you might miss things too, because it all works in probabilities. But the math tells us that there are hard bounds on the errors you can get.
Sometimes the error — well, 50% is a big error, and if it makes you choose the wrong plan, if it makes your plan tree flip around, you can blow up your app. But yeah, independently of the distribution, this tries to work for all distributions — badly skewed things will just mess you up.
So we have to end. If there's more questions, you can just come down. You can find me afterwards.