Lies, damned lies, and statistics

PGCon - PostgreSQL Conference for Users and Developers

Urbański, Jan

Formal Metadata

Title

Lies, damned lies, and statistics

Subtitle

A journey into the PostgreSQL statistics subsystem

Title of Series

PGCon 2017

Number of Parts

Author

Urbański, Jan

License

CC Attribution 3.0 Unported:
You are free to use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, as long as you credit the author/rights holder in the manner they have specified.

Identifiers

10.5446/48976 (DOI)

Publisher

PGCon - PostgreSQL Conference for Users and Developers

Release Year

2017

Language

English

Content Metadata

Subject Area

Information and Documentation

Genre

Conference/Talk

Abstract

Before executing an SQL query, Postgres needs to decide on an execution plan. While there are multiple steps involved in generating a plan, this talk will focus on the main source of inputs for the query planner machinery, namely the statistics subsystem. Detecting problems with statistics is often a crucial step towards finding the reasons for bad plans and slow queries, so it's important to understand exactly what gets tracked and how it is used. As the universal rule of Garbage In, Garbage Out teaches us, without some degree of knowledge about the shape of the queried data, it is impossible to get a reasonable execution plan. PostgreSQL employs a number of ways to maintain up-to-date statistics about table sizes and the distribution of data inside them, which in turn inform the query planner. In this talk we'll explore what statistical information is tracked by Postgres, how it is calculated, where the server stores it, how the operator can query it, and finally how to tweak the whole system. Maintaining up-to-date statistics means balancing performance overhead against information quality. For a large database, you can't just read everything and calculate some ratios, so the system employs a number of clever algorithms, which we will examine. Another concern is that some data types require statistical information tailored to the query patterns typical for that type. A column holding an array is less likely to be queried for exact matches, but it is often queried using a "contains" operator. We'll cover special cases such as arrays, full text search vectors and ranges, and we'll talk in more depth about how statistics for them are gathered. Finally, we'll look at some of the weaknesses of the statistics subsystem and areas where it could still be improved.
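
The abstract's remark that a large table cannot simply be read in full can be illustrated with a small sampling sketch. The Python below is not PostgreSQL's implementation; it is a generic one-pass reservoir sample (Algorithm R), followed by a rough estimate of the most common values in the sample. The function names `reservoir_sample` and `estimate_column_stats`, and the example column, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return a uniform random sample of up to k items in a single pass."""
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def estimate_column_stats(
    sample: List[T], top_n: int = 3
) -> Tuple[List[Tuple[T, float]], int]:
    """Return (most common values with sample frequencies, distinct values in sample)."""
    if not sample:
        return [], 0
    counts = Counter(sample)
    mcv = [(value, count / len(sample)) for value, count in counts.most_common(top_n)]
    return mcv, len(counts)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical column: roughly 90% 'a' and 10% 'b'.
    column = ("a" if n % 10 else "b" for n in range(100_000))
    sample = reservoir_sample(column, k=1000, rng=rng)
    print(estimate_column_stats(sample))
```

The sample size stays fixed no matter how many rows are scanned, which is the trade-off between overhead and information quality that the abstract describes: the frequencies are estimates, and rare values may be missed entirely.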