
Lies, damned lies, and statistics


Formal Metadata

Title
Lies, damned lies, and statistics
Subtitle
A journey into the PostgreSQL statistics subsystem
Number of Parts
19
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Before executing an SQL query, Postgres needs to decide on an execution plan. While there are multiple steps involved in generating a plan, this talk focuses on the main source of inputs for the query planner machinery, namely the statistics subsystem. Detecting problems with statistics is often a crucial step towards finding the reasons for bad plans and slow queries, so it's important to understand exactly what gets tracked and how it is used. As the universal rule of Garbage In, Garbage Out teaches us, without some degree of knowledge about the shape of the queried data, it is impossible to get a reasonable execution plan. PostgreSQL employs a number of ways to maintain up-to-date statistics about table sizes and the distribution of data inside them, which in turn inform the query planner. In this talk we'll explore what statistical information is tracked by Postgres, how it is calculated, where the server stores it, how the operator can query it, and finally how to tweak the whole system. Maintaining up-to-date statistics needs to balance performance overhead with information quality: for a large database you can't just read everything and calculate some ratios, so the system employs a number of clever algorithms, which we will examine. Another concern is that some data types require statistical information tailored to the query patterns typical for that type. A column holding an array is rarely queried for exact matches, but it is often queried with a "contains" operator. We'll cover special cases such as arrays, full text search vectors and ranges, and talk in more depth about how statistics for them are gathered. Finally, we'll look at some of the weaknesses of the statistics subsystem and areas where it could still be improved.
Transcript: English (auto-generated)
So let's start. For the people who want to follow the slides on your laptops, you can get them from there — it's w-u-l-c-e-r dot org, slash, and the name of the talk, dot pdf.
They'll be in the schedule afterwards, but if you just want to follow along, you can get them from there. I'm also going to be showing them here, so you don't need to. So, my name's Jan, I work at a place called New Relic, and I'm going to be talking about statistics
and the statistics subsystem of PostgreSQL. So the talk is going to be structured like that. We're going to start with just an overview and ask ourselves a question, why do we even need statistics? Then we'll see what statistics get calculated.
So what actual statistical information does PostgreSQL track? And then how can you look at them, how can you explore them, so you know where they live and what they look like. Yes? Say again? The URL — oh, yeah, sorry, I'm just going to switch to that.
And then hopefully, I remember the outline of my talk. Then the second part of the talk is going to be the nitty-gritty. So we're going to see what's the actual math behind calculating those stats, what algorithms do we use. And then for people who survive the math part,
we're going to go a bit into configuration. And then we're going to talk about some of the more advanced features that the PostgreSQL system offers, and specifically some of the improvements to the statistics subsystem in version 10.
So let's kick it off with why do we need to gather statistics. It's a database. We know it stores data. You can query it with SQL. But why do we need statistics? So a short refresher, a SQL query gets executed in three steps. It gets parsed.
So we send in text. It gets parsed into something that the machine can understand. It gets planned. So the system decides, I'm going to execute it in a certain way. And then it gets executed, which is where stuff gets read from disk and processed and calculated, computed. So statistics are used in the second phase
of this, which is planning. To figure out a good way to run a query, we need to have some information about the data, the table, before we actually run it. So the planning algorithm needs some input — like every algorithm, it needs inputs
to produce different outputs. And the inputs come from the statistics subsystem; what feeds the planner is the statistics. So for people who were here for the previous talk, and for people who were at Alex's talk about adaptive query
optimization, the problem of cardinality is one of the fundamental problems when you do query planning. So one of the basic things the planner has to know or has to figure out before it runs a query is, how many rows will this thing return? And if it's a top-level thing,
it doesn't really care, because it's just going to return those rows to the client. But for all the parts of the plan below that, which feed the higher-level parts of the plan, it needs to know how many rows those things will return, so it knows how to decide on various things. Like, for instance, when you join two things and say
you want to do a nested loop — you're not going to blindly scan one thing and, for each row in it, execute the whole scan on the other side; you'd like to get the order right. So you'd like to re-scan the small thing a lot of times
and not re-scan the big thing a lot of times. So you need to know which of the things is bigger. You need to also decide whether it's worth it to use an index to read through a table or if the table is small enough that you can just read it all. You need to be able to decide if you need to materialize or if you can materialize part of the result
or whether it will just be too big to fit in memory and you'll lose out if you try to materialize it. So it's very important to know how many rows an expression will return. It's a simple problem statement. So how do you figure out how many rows something will return? Well, usually rows come from tables, so you need to know, for a table,
how many rows there are in it. That's straightforward — you just need to know it, and we'll see how. And then you need to know how many of these rows will not make it to the output because of WHERE clauses — so, restrictions on that particular table scan.
So those are the two things. And whereas the total number of rows is conceptually easy, just have to know how many rows are there, estimating the selectivity of a where clause might be much more difficult. And even conceptually, it's not that obvious how the hell we can even do that.
So here are some examples. And they're increasingly complex for the planner to figure out what to do. If you just do a select star, then if you know the number of rows, then you know how many rows you'll get back. That's a simple case.
If you want to get all the stores that are in Alberta, it gets a bit more tricky, because you suddenly need to know how many of those stores are in Alberta. And if you don't have the whole table in memory, you can't just go and check — it's not that simple to do.
How would you go about figuring out whether it's all of the rows, some of the rows, or maybe none of the rows, because it's a database of stores in Brazil and none of them is in Alberta? That's one of the things you have to figure out. Then you have things like select star from places
where population is greater than something. And that's certainly more difficult, because with stores in provinces you might somehow get those stats and say, oh, I know this fraction of the store rows is in Alberta. But population values are spread out — every place has a different population.
So you can't just say, oh, I'm just going to calculate how many of those have population two million, how many have two million and one, two million and two — you just can't do that. You have to have some other kind of statistics. Then you have multiple conditions. Say you know how many things are in a given province
and you know how many things are in a given zip code, but you need to combine those and figure out how many rows satisfy both constraints. So you have to somehow build up that knowledge from the basics. And even those are sort of the simplest cases,
because then you get joins, and when you get joins, it starts getting more and more complex. If you have a join like that, it's not just that you have to know how many places you have in a given province and how many stores in a given province — you have to know how many of them match,
what the product of those will be when you match on province. And then it gets more and more complex as you tack on more clauses. And this is simple SQL — your typical application will generate much more complex SQL. So it's not that simple, but it's something we need in order to be able to choose reasonable plans.
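To make that concrete, the kinds of queries being described might look something like this — the stores and places tables, columns and values here are purely hypothetical illustrations:

    -- 1. Plain scan: the table's row count alone answers "how many rows come back?"
    SELECT * FROM stores;

    -- 2. Equality condition: needs the fraction of stores that are in Alberta
    SELECT * FROM stores WHERE province = 'Alberta';

    -- 3. Range condition: needs to know how population values are distributed
    SELECT * FROM places WHERE population > 2000000;

    -- 4. Two ANDed conditions: the two selectivities have to be combined somehow
    SELECT * FROM stores WHERE province = 'Alberta' AND zip = 'T5J';

    -- 5. A join: the planner needs to estimate how many rows actually match up
    SELECT * FROM stores s JOIN places p ON s.province = p.province;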
So those are the kinds of problems that the planner comes up against. And how does it get the information to solve them — what information does it have?
Obviously you can't gather too much information, because then it would just be slow; you'd have to store a lot of information that's not really pertinent to what the user is storing. So the overhead has to be balanced. Let's start with table statistics — stats that are kept for every table in the system.
Obviously the number of rows in the table is something that's useful, so that gets tracked. Now, it's obviously not precisely the number of rows, because you don't want to incur the overhead of precisely tracking how many rows are in each table.
So it's an estimate. And it's also not quite as useful as you'd think: it's useful for estimating selectivity and how many rows something will produce, but it doesn't really tell you how expensive it is to scan a table, because everything gets scanned block by block.
So disk block by disk block. And it's the same whether you just want one row from that block or all of the rows — it's more or less the same cost. So another thing that gets tracked about every table is how many disk pages the table uses, and this determines how expensive it will be to scan the whole table.
What's interesting is that those stats are also kept for every index, because indexes look sort of like tables: there's a file on disk, there are tuples in them, and they sit on disk pages. So it's not only for tables. Anyway, those were the stats for the whole table.
And then every column in every table has its own individual statistics calculated. So actually most of the planner stats are calculated on a per-column basis, and we have this information for every column. So, starting with the ones that are more obvious,
if you think, oh, what information would be cool to have about every column? The fraction of null rows. That's useful because a lot of the operations will only affect non-null rows, a lot of joins. You won't get any rows that match up with nulls. So it's just plain old useful.
Even if you don't have any information when you do a select blah, blah, where something equals 3, you know that 3 will never match up with a null. So you can straight off the bat discard the fraction that's null. So that's useful information. The width of the column in bytes is sort of useful.
You need to know: if I'm going to fetch this, how much memory will it take? If you have a text column where you store big chunks of text, it'll be big; if you have a Boolean column, it's going to be small. The number of distinct values
is something that the system tracks. And it uses a little trick. The number that's stored as a statistic can be positive or negative. If it's positive, it's just the number of distinct things. So if you have stores in provinces, there's only so many provinces in Canada.
So it's just going to be a number. You can just say there are — well, here I am, I think there's 11, but then there's the territories — so it's going to be some small number. But sometimes there are just too many of them.
Take the populations of cities in Canada: that's probably distinct, or almost distinct, because no two cities are going to have exactly the same number of people in them. But you can't always say, I know there are exactly 33 million different populations.
So if the number is negative, the system interprets it as the fraction of rows that are distinct. So if you have a million rows and only 500,000 different values in them, then more or less every value is shared by two rows —
for every row, there's another row with the same value — and this number is going to be -0.5. It's a trick to compress that information into one floating point field.
Then there's stuff that's yet more involved. So it does keep an array of the most common values that appear in the table. So even if there's a lot of different values, some of them
might be much more frequent than the others. It's going to keep those; they're going to be called the most common values. And there's a companion array, which is the frequency of those most common values. So if you have your typical distribution where one thing is much more common than the long tail,
you're going to have that thing in the most-common-values array, and its frequency in the companion frequencies array. So it's two companion arrays for every column. Then you have a histogram of the values
if they can be ordered. Most of the types in Postgres, and in general, can be ordered — numbers, strings — they have an order. And if you have an order, you can put the values in that order and then figure out which values divide them
into buckets of equal size. And this will be your histogram. If you think about it, this is useful for queries like SELECT ... WHERE population > something, because you can look at your histogram and say, oh, this value falls into this bin.
So I know all the values I'm going to get are in those last two bins — if you have 100 bins, I know I won't get more than 2 out of 100 of the rows. So the histogram is a way to estimate the selectivity
of things that are continuous, like numeric fields. And finally, it stores a correlation number: the statistical correlation between the physical order
of the things on disk and the logical order of those things. So if you have a timestamp column in an events table and you're always writing at the current time, the data on disk is going to be laid out in the same order as the timestamps in the tuples.
So this means if you scan it in index order, which means you go timestamp by timestamp, it's going to more or less be the same as if you would scan it in sequential order. Because the index is going to hand you pages in disk order.
Whereas if you just have random stuff in the table, an index scan means you'll be jumping around, reading from different parts of the table. And the planner uses that to decide whether an index scan is going to incur a lot of random I/O or not.
And those are the things that get tracked. Yep? So the question is, why does the histogram exclude the most common values? It's just a trick to get even more information.
When you calculate the histogram, it actually excludes the stuff it puts in the most common values and calculates the histogram over the rest, because for those values you know the exact frequency — you don't need a bin in the histogram for them. So it's just an optimization to make the histogram more precise. And does the planner look at both?
Yeah, it looks at both. If it finds the value in the most common values, it uses that; if not, it uses the histogram, keeping in mind that some values have been excluded because of the most common values. So that's sort of the stock thing that you get.
For every column, in every table of your database, that gets calculated. It's actually a pretty handy thing to look at as a DBA if you want to explore your data: you have this pre-calculated for you with reasonable precision, and you can just go and look at it. So some of this is pretty handy.
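A sketch of what that exploration looks like — the column names are the real pg_stats columns, the table name is hypothetical:

    -- Per-column planner statistics, as exposed by the user-friendly pg_stats view
    SELECT attname, null_frac, avg_width, n_distinct,
           most_common_vals, most_common_freqs,
           histogram_bounds, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'stores';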
Now, some data types have their own statistical needs. So because a lot of this stuff is pluggable, you can actually, with your data type, provide statistical routines to calculate your own things.
If you say, oh, I don't need a histogram for my data type. I need something else. Then you can plug this in. And then because all these stats are stored in a single table, the schema for that table has to be pretty flexible because you have to be able to put
into the same table information about a floating point column, a text column, an inet address column, everything. So the schema is pretty flexible. And so if you plug something in, you can also take advantage of that and use the flexibility of that schema.
We'll see what the schema looks like in a moment. And then there's a million other features that Postgres has that have to interact with this, like foreign tables. If you write a foreign data wrapper, you can provide your own statistics, implementation. So if you have a Redis foreign table and you want to have some stats about your Redis keys,
you can do that, but then you have to write the code that does that. But you need to be able to do that to make the planner work because this is the input for the planner, and it's most of the input for the planner. So it's garbage in, garbage out. If the stats are bad, the plans will be bad, necessarily.
And another neat thing that you need to be aware of is that those per-column stats are also gathered for expression indexes. In the statistics system, there's a lot of interweaving behaviors that come from different features of Postgres that
need to plug into this. For an expression index, you will need the histograms and things like that, because you need to know how selective the expression will be. For a regular index, you don't really need that, because you can use the column statistics. But an expression can change the distribution.
If you index on lower(something), you will get fewer distinct values than if you had indexed the actual text. So expression indexes also get this statistics treatment. Excuse me — yep? The question is whether the statistics for a foreign table come from the foreign server. You have to write your own code.
So just write your own C code, like you can for your own data type. For a foreign table, you can write your own implementation, and you're on the hook for doing that. Yep.
So, to repeat the questions: the previous question was whether FDWs fetch their own statistics, and the answer is that you have to write that code — it's your responsibility. The question now was whether expression index statistics look like column statistics, and yes, they have histograms, most common values, frequencies, the whole thing.
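As a hedged sketch of that last point, with a hypothetical table and index — the expression's statistics should show up in pg_stats under the index's name rather than the table's:

    -- An expression index gets its own statistics entry after an ANALYZE
    CREATE INDEX stores_lower_name_idx ON stores (lower(name));
    ANALYZE stores;

    SELECT tablename, attname, n_distinct, most_common_vals
    FROM pg_stats
    WHERE tablename = 'stores_lower_name_idx';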
OK. So how do you actually go and see what's in your stats tables? pg_class keeps the table-wide statistics — that's where information about tables lives, so pg_class looks like a good place to put it. There are two columns there: relpages is the number of disk pages.
This is actually a pretty exact number, because the storage manager pretty much knows how big the tables are. reltuples is the number of rows. It's approximate, but it's a pretty darn good approximation, so a lot of monitoring systems could just
use that instead of running count(*) on the table, which is costly. Just look at pg_class and get reltuples — it's more than enough for trending, for seeing long-term how big your table is. And pg_class also has an entry for every index, like we mentioned.
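For example, a cheap approximation of table and index sizes, instead of a costly count(*) — the names in the WHERE clause are hypothetical:

    -- relpages is close to exact; reltuples is an estimate, good enough for trending
    SELECT relname, relkind, relpages, reltuples
    FROM pg_class
    WHERE relname IN ('stores', 'stores_pkey');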
Now, the per-column stuff lives in a catalog called pg_statistic. And as we said, there are two kinds of those statistics. There are the ones that are data-type independent: any data type can be null, any data type has a width, and any data type that has equality —
which is almost all of them — has a notion of distinct values, so it's a concept you can have for pretty much any data type. And the remainder of the table is exactly five slots, where a slot is a series of columns:
an integer code that defines what kind of thing lives in that slot; an operator — if it's something that has to do with ordering, like a histogram, you need to know which operator was used to calculate it, because you can have different ways of ordering things;
an anyarray for the values of the column — this is where the flexibility comes in, because an anyarray can hold anything, so each row of the same pg_statistic table will have an array of different things depending on which column it's for; and an array of floating point numbers,
which is the actual measurements. So, say, most common values and their frequencies will be stored like this: you'll have a code that says this is the MCV, most-common-values, slot; you'll have the equality operator, I think; then you'll have the values that are the most common things — text
if it's a text column, floats if it's float, ranges if it's a range — and the actual frequencies as real values. So it's a pretty tortuous way of modeling that; this would get you an F in Relational Design 101.
You have a code that says what this thing is, and then you have five of those hard-coded slots. But it's flexible enough to keep everything we have in one table, so it does the job. And then there's the thing that we actually want to use, which I saved for last, which is pg_stats. It's a view that's built on top of pg_statistic
that's more user friendly: instead of OIDs, it has actual table names; instead of slots that are just called stanumbers1, stanumbers2 and so on, it has actual names for those slots; and, what's most important, it's readable for everyone,
whereas pg_statistic is superuser only. pg_stats makes sure you can't look at table columns you don't have access to, and pg_statistic has to be clamped down to superuser because you don't want users to even be able to infer or see the statistics of columns they can't access.
It actually shows some data from inside the table. So pg_stats is what you would use to explore the data, look at what's in it, and play with it. So, going back to the example queries: the first one will just use reltuples to estimate.
The second one will also use reltuples, and then, if Alberta is in the most-common-values array — as it probably will be — the planner just knows the frequency: it multiplies the number of rows by the fraction of non-null rows, and then multiplies that by the fraction of rows
that have Alberta in them, because it has that stat. You have an array of the most common values — these are the values you think are the most common —
and you know the size of the array; it's a Postgres array, so the planner knows its size as it reads it from the catalog pg_statistic.
So the third one is going to use a histogram. The fourth one is probably going to use two: statistics for the zip code and statistics for the province, and it will multiply those selectivities, and so on, and so on. I don't want to go into how the planner works, because I want to focus on how the statistics subsystem
really works — the planner is a bigger subject — but we can see how this feeds it that information. At least now we have some info to work with. Right, so we know what lives in those tables and what it means. How does it get there? Well, we all know analyze is the thing
that updates statistics. This is sort of common knowledge. Interestingly, vacuum updates some of the statistics. But analyze is the thing that actually does most of the job. So just a plain analyze will loop over all the tables and update the stats. So we're actually going to focus
on analyzing a single table, because it's the same thing — analyze on the whole cluster, or the database, will just loop and run the same code for every table. And you can actually tell analyze to analyze a specific column only, because the same code more or less gets applied to every column.
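Concretely, that's just the following, with a hypothetical table and column:

    ANALYZE;                     -- the whole database
    ANALYZE stores;              -- a single table
    ANALYZE stores (province);   -- a single column of that table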
So it does the table-wide thing, and then it goes over every column and does the per-column thing. Vacuum also updates the pg_class stats, because vacuum has a pretty good idea of how many blocks a table has and how many rows are there: it knows, from the stuff it has seen,
that the rows are this wide, so that many fit in a block, and the number of blocks is x, so it can just multiply and store that as its reltuples estimate. There are a couple of other commands that have to scan the whole table, so they take advantage and update pg_class
so you have more up-to-date info. So, analyze internally: let's focus on a single analyze of, say, a single column. How does it work? What are the steps? Well, first and foremost, before it does anything, it switches to the security context of the table owner. So everything that runs after, all the code
that runs after gets run in the context of the table owner. And that's pretty important, because there's at least three vulnerabilities that came out of not doing that correctly. It's not that simple to do. Analyze will call index functions, for instance.
And an index function can do whatever. Can write into tables. You can write index functions in procedural languages. So you have to make sure that analyze will not run in the context of the user who actually typed out the command, but rather the owner of the table.
Then it figures out how to analyze every column. Then it determines how many rows it needs to fetch, fetches them, does the math, and stores them in the table. So there are two things that are hard here. How many rows do we need to fetch? Because we obviously don't want to scan the whole table.
We want to fetch some portion of the rows. And how do we actually, from that sample, extrapolate and calculate stats about the whole table? So that's where we get into the math. Now, you don't normally need to run analyze yourself. The autovacuum worker will run it for you
when it notices that a given fraction of the rows has changed. So there are configuration variables that say: if 20% of the table rows have changed, this table needs an analyze, and it'll have an analyze scheduled on it. If you do a bulk load or, as was said earlier at the conference,
if you do an upgrade with pg_upgrade, you will have no stats. And if you have no stats, your plans will be super bad — having no stats basically means you can't use your database. So after you run a bulk load, it's important to run an analyze as part of your loading process
so you can actually start using the data. Otherwise, Postgres will give you very bad plans. And of course, if you disable autovacuum, you won't have stats. So it's not only vacuum that's on you when you disable that. It's also running analyze. So just as an aside, if you look at Postgres,
there's a process that's called the stats collector. It doesn't have a lot to do with this. It tracks what I — I don't know if there's a good term for it — call runtime stats: how many times the table has been scanned, how many times an index has been scanned, how many blocks have been fetched from it. It's more user and admin stats, not planner statistics.
So those are different things, and people sometimes get confused when they have problems: my stats collector process is eating my CPU, I need to run analyze less often. No — those are different systems.
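For contrast, those runtime stats live in the pg_stat_* views rather than in pg_statistic; something like this (hypothetical table name) shows them:

    -- Activity counters from the stats collector, not planner statistics
    SELECT seq_scan, idx_scan, n_live_tup, n_dead_tup,
           last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname = 'stores';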
So let's jump into the math. First order of business is: how many rows do we need to fetch? The calculation of the number of rows to fetch starts with the histogram size that we want. So if you want a histogram of 100 buckets, this is the input variable for the algorithm that decides
how many rows to fetch. Histograms are those values that divide the space of the values in the column into bins of equal size. Then there's the concept of a relative bin size error: how much does the worst bin you generated
in your histogram deviate from what you'd get if you had actually scanned the whole thing and had perfectly sized bins — we'll see an example in a moment. So this is how skewed, how off, your histogram is: the relative bin size error. And the error probability in those calculations
is the chance that, if you assume a bin size error, you won't actually stay within it. So if I say I don't want my bin size error to be bigger than foo, the error probability is the chance that I will actually get something worse than foo. So let's look at a simple example.
If you have a column — yes? Is the target histogram size the statistics target? Yes. So the question was whether the target histogram size is actually the statistics target, and it is: that number gets translated into the histogram size.
So if you have a column that has 1,000 values, from 1 to 1,000, and you want a histogram of four bins, you will get five boundary values, and they should be 1, 250, 500, 750 and 1,000 — that divides the values into equal-sized bins. If instead you calculated boundaries of 1, 300, then 550, and so on,
your biggest bin is of size 300, while the ideal bin is 250. That's the relative bin size error: (300 - 250) / 250 = 0.2, so in loose terms you're 20% off.
It makes sense as a measure of how bad my histogram is. So what you want to do is bound that value: you want to make sure your histogram is not worse than something. This is how you arrive at the number of things you want to sample, and there's a formula that tells you
how many rows you need to fetch if you don't want your bins to be off by more than something, with the probability of something else. So Postgres just substitutes values in. It says: bin size error 0.5 — I don't want my histograms to be more than 50% off
from what they should be. It says the number of rows in the table is 1 million — it doesn't actually know that, but it just assumes it, and we'll see that that's not really too bad. And then it says: I want a 0.01 chance that I'll get something worse than 0.5.
So I want to be pretty sure that I won't get something worse than 50% off. You put the numbers in, and you get about 300 rows per bucket. So if you want a histogram of size 100, you need to fetch 30,000 rows. That's what the math tells you, assuming the table has 1 million rows.
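For reference, the bound being described here (from the Chaudhuri, Motwani and Narasayya paper that Postgres's analyze code cites) looks roughly like this, with k the histogram size, f the relative bin size error, gamma the error probability, and n the assumed table size:

    r \ge \frac{4\,k\,\ln(2n/\gamma)}{f^2},
    \qquad
    r = \frac{4\,k\,\ln(2 \cdot 10^6 / 0.01)}{0.5^2} \approx 306\,k

which is where the roughly 300 rows per histogram bucket comes from.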
And it's an assumption that's pretty reasonable, because the n there is under a logarithm. So even if it's much bigger, the overall effect is not that pronounced. So it's kind of fine to say, let's just say 1 million. If it's 10 million, it's not throwing us off too much.
So 300 — we have a number; that's more or less what we just said. Now we know how many rows. It shouldn't matter which rows you pick — it's a statistical sample —
but for a histogram you should pick a random subset, so it has to be properly random. In a moment we'll talk about how you get a properly random subset; for now you just know that to have the error bounded by those limits, you need that many rows. That's what the math tells you.
So can't you just take whatever rows are easiest to get? No, it has to be random, because otherwise you're just going to count your hot set, so it's not really going to be representative. It needs to be properly random. So that's the next step: how do you get those rows?
How do you decide, from your 3-million-row table, which rows you get? There are two steps. First you sample blocks — you decide which blocks you're going to sample, because reading blocks is the expensive part — and then from the blocks you take the rows. So we actually sample that many blocks, even though every block has a lot of rows.
If our number is 300 times 100, we will sample that many blocks. And we're going to use an algorithm called Knuth's Algorithm S, which he gives in one of his books, where you basically go through the blocks
and, with some probability, take each one into the sample. The probability gets adjusted as you go further into the table, and the math works out such that, when you've gone through the whole thing, you will have sampled a random subset.
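Roughly, the rule in Knuth's Algorithm S is: if you still need m blocks and M blocks of the table remain to be considered, take the next block with probability

    P(\text{take the next block}) = \frac{m}{M}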
And if you see you don't have enough blocks to fill your sample, you just take all the rest of them. And so this is how you decide which blocks to take. Now, you read the blocks. And then every block has a lot of rows. So we need to go reading rows from the block and figuring out which rows to take,
because you don't need that many blocks — you need that many rows, and each block can have a lot of rows in it. Now, for rows you can't use Algorithm S, because it requires you to know the total number of things you're sampling from: you know how many blocks a table has, but you don't know how many rows there
will be in those blocks, because rows are of variable size. So we can't use Algorithm S. Something a bit more involved is used, and it's called Algorithm Z; that's by Vitter. It's a variation of something called
reservoir sampling, which people might have heard of. So reservoir sampling is a way of sampling stuff that you don't know the size of. And algorithm Z, all of these sound like Tesla model names, but they just call them like that. I don't know why. So algorithm Z is a tricky way of doing
that, one which doesn't require you to generate a lot of random numbers, because getting random numbers is expensive. The naive version works simply: you take the first n rows and say, this is my sample. Then for each subsequent row, you either pick it or skip it, with probability equal to the sample size
divided by the row number, so it gets less and less probable that you will pick it. And if you do pick it, because your sample is full, you choose one thing from your sample at random, throw it away, and put the new row in its place. That's more or less how the naive version works; Postgres does it more smartly, so that it doesn't have random numbers
generated for every single row. So we have rows. Once we have the rows, it's mostly simple: fraction of null values, easy; histogram, easy — you just look at the sample and pick the histogram bounds. The one thing that's a bit more difficult is distinct values, because you just have a sample,
and you need to have an estimate of how many distinct values there are in total in the table. The heuristic here is pretty simple. If every row in the sample is different, we assume the column is unique — if we didn't get any duplicates, we just assume it all has to be unique.
If every value in the sample is duplicated, we assume we have seen all the values there are; we say these are all the values in the table, and that's the n_distinct. So if you have a Boolean column, in your sample you'll have a bunch of trues and a bunch of falses,
and you'll just figure out, oh, it has to be those two things. If neither of those holds — some values are unique and some are duplicated — there's a formula; again, papers, math, publications tell you that this estimator is sort of the thing you should use
as the number of distinct things. So that's the mathy part. And personally, it's really cool to see that all this math is actually used for something practical. You can download a paper, read it, put it in code, and it gives you better plans. This is actually pretty satisfying.
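For the curious, the estimator used in that in-between case — the Haas–Stokes estimator referenced in Postgres's analyze code — looks roughly like this, where n is the sample size, N the estimated total row count, d the number of distinct values seen in the sample, and f_1 the number of values seen exactly once:

    \hat{d} = \frac{n\,d}{\,n - f_1 + f_1\,n/N\,}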
So, configuration — there's not a lot of configuration. You can set default_statistics_target, which translates to the histogram width, and from this everything else follows: how many rows you sample, and so on and so forth. You can tell a table how often it needs to be analyzed —
what fraction of rows need to be updated before an analyze is due — and there's a minimum threshold saying: if it's a small table and 20% of it got updated but that's fewer than, say, 10 rows in total, don't bother analyzing, it's fine. That's the whole config;
the rest is math and magic, same thing. So it's not super configurable. But sometimes, if you get a bad plan, you can raise the statistics target for the table. It'll take a bit longer to analyze and a bit more storage to keep the stats, but you'll have better selectivity estimates. So it might be something to do if your plan is bad.
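As a sketch, bumping the target for one column and re-analyzing might look like this — the table, column and numbers are hypothetical; the analyze-frequency knobs are the autovacuum storage parameters:

    -- Raise the statistics target for a single column, then refresh its stats
    ALTER TABLE stores ALTER COLUMN province SET STATISTICS 500;
    ANALYZE stores (province);

    -- Analyze this table once 5% of its rows change, with a floor of 100 rows
    ALTER TABLE stores SET (autovacuum_analyze_scale_factor = 0.05,
                            autovacuum_analyze_threshold = 100);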
Just bump it and analyze. So let's get into some of the meatier stuff, which is specific data types. For arrays, for instance, it doesn't make much sense to keep most common values,
because no two arrays are going to be exactly the same. What you actually want is what the distinct values inside those arrays are — we need to unpack them and look at counts of the things inside them. That's where those type-specific analysis functions come into play: they unpack those arrays,
look inside, and construct histograms of different things. tsvectors are sort of like arrays, but for text — again, you need specific stats calculations for that. Ranges are another case; we're not going to go into detail there, but you can imagine that for a range
you don't really care how many exactly identical ranges there are in the table — you need to know about the bounds of the ranges, so you need to unpack the data and get the stats from those bounds. And so finally, a sneak preview of what's coming in 10.
So you have this query, and to estimate selectivity the planner will say: I know that many stores are in zip code Y0A and that many are in Yukon, so I'll multiply those and assume that's how many rows I'll get.
That's wrong, because if something is in Y0A, it's in Yukon for sure — but the planner doesn't know that. This is correlated data, and it's a problem you run into with queries like that. It's been a problem for a long time; there have been two talks already that touched on this. The solution that you get in 10.0
is that you can explicitly tell the planner those two things are correlated: keep statistics about their correlation, and then use them when you have that WHERE with the two ANDed conditions. You have to do it explicitly, because it's too much overhead to calculate this
for every combination of two columns — and you can have three columns or more. But it is now possible in 10 to tell the planner, hey, those two columns depend on one another. So the zip code determines the province — the province is functionally dependent on the zip code.
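In PostgreSQL 10 syntax, declaring that dependency might look like this — the table and the statistics object name are hypothetical:

    -- Ask the planner to track functional dependencies between the two columns
    CREATE STATISTICS stores_zip_province (dependencies) ON zip, province FROM stores;
    ANALYZE stores;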
And so we can't just say, oh, this determines this. And that's it, because sometimes you have input errors, user errors, corner cases. So what actually gets stored is how strong the dependency is. So maybe there is one store that
got the wrong zip code attached. So we can't just say it's always definitely going to be in Yukon, but we can say it's pretty much always going to be in Yukon. So there's a number that stores that. So, is it meaningful for the end user to interrogate these values?
Is this maybe an estimate of how good his data is? Sure. So the question is whether it's useful to use that information to know how much garbage, how much bad data, you have. Sure, it's useful. If you create a correlation statistic between two columns,
you can then query it from pg_statistic_ext, and you will have a number. And if that number is not 0.99 but more like 0.3, then you know that something's wrong in your data set. And there's a similar thing for distinct,
because that's another typical problem. If you group by, you won't get every province-zip combination; you'll get fewer distinct things, because everything in one zip code will be in the same province. So you can also tell the system: hey, calculate the number of distinct values for a pair,
not just for each column — do this for two columns together. We're sort of running out of time, so I can take one last question, then we'll have to wrap it up. I just wanted to make sure: by default there's nothing, it has to be specified manually? Yeah — it's always been the case that there were no correlation stats, just per-column stats, always.
There was no way to tell the system, hey, be aware of this correlation, even. Now there is, but it has to be explicit, because there are just too many combinations to calculate them all the time. Another gotcha there is that if you query with a zip code
and a province that don't actually match, the system doesn't know that they don't match. The system knows that the columns are correlated; it doesn't know anything about the values. So this will give you a bad estimate — a worse estimate, even — because it will say: oh, you constrained the zip code and the province, but that's not really that strong a constraint,
because one determines the other. So if you do a query that's not consistent with the dependency, you will get a bad estimate, which is something you have to be aware of. If your system generates a lot of queries like that, you might actually end up worse off.
Yep? Does it assume any specific data distribution, or does it work for any distribution? So the question is whether the math makes sense
for any data distribution. It should — yeah, it should. If the data is badly skewed, you might miss things, and if you're just unlucky you might miss things too, because it all works in probabilities. But the math tells us that there are hard bounds on the errors you can get.
Sometimes the error — well, 50% is a big error, and if it makes you choose the wrong plan, if it makes your plan tree flip around, you can blow up your app. But yeah, independently of the distribution, this tries to work for all distributions — badly skewed things will just mess you up.
So we have to end. If there's more questions, you can just come down. You can find me afterwards.