Beyond EXPLAIN: Query Optimization From Theory To Code
Formal Metadata

License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/48453 (DOI)
PGCon 2016 (28 / 34)
Transcript: English (auto-generated)
00:01
Okay, I'm going to start. Thank you for coming. Today my colleague Ryoji Kawamichi and I, Yuto Hayamiz, are presenting. In this talk, we are going to cover the theoretical fundamentals of query optimization
00:21
and the implementation basics of the Postgres optimizer. So first, let's start by looking back at the history of database systems. Before the relational database system, or before the relational data model, querying was a physical activity. Physical means you have to know where data is stored,
00:44
how data can be retrieved, and you have to navigate yourself by programming. But the invention of the relational data model changed the situation: the relational model separates the logical data view
01:04
from the physical data organization, and querying became a logical activity. The physical aspects of the database system are a black box, and all users have to do is declare what they want.
01:23
So in order to build a working relational database system, this gap-filling technology is the key enabler, and that is query optimization. So imagine if we had a perfect query optimizer.
01:41
If the query optimizer perfectly fills the gap between logical and physical, we do not have to care about the physical behavior of the database system anymore. That means we do not need EXPLAIN anymore, at least for users.
02:02
But reality is tough, and as everybody here understands, the query optimizer is not perfect, and sometimes it's far from perfect. Query optimization is an inherently difficult problem: it is theoretically difficult to solve perfectly,
02:24
and also practically difficult to implement. So in relational database systems, although a large part of the physical aspects is a black box, users still need to take care of the physical behavior of the database system.
02:42
So this is why EXPLAIN is run so much all around the world. But why don't we go further? With EXPLAIN, we can see only the final output of the optimizer. But if you understand query optimization more deeply,
03:22
it enables you to control your database system better. So in this talk, we are going to present the theoretical fundamentals of query optimization, and also cover the implementation basics of Postgres,
03:22
especially focusing on basic scan and join optimization. We also present experimental results with the Postgres optimizer on the TPC-H benchmark. Okay, so let's start with the theory of query optimization.
03:43
For query optimization frameworks, there are two main categories: cost-based and rule-based. In cost-based optimization, the optimal plan is selected based on cost estimation, and today most modern database optimizers are cost-based.
04:03
In rule-based optimization, the query plan is selected based on heuristically ranked, manually made rules. Rule-based optimization is very easy to implement and easily reproduces the same result, but because the optimizer developer
04:25
has to manually tune the rules, it's very hard to handle a wide variety of query plans. Today, the main topic is cost-based optimization, and the main challenges of cost-based optimization
04:40
are these three problems. The first challenge is cost modeling: estimating how much a CPU operation costs, how much an I/O operation costs, and so on, which is a very difficult problem. The second challenge is cardinality estimation: how many tuples are output by scans,
05:03
joins, and aggregations. The optimal strategy of query execution depends on those sizes. Even for a simple scanning query, if there is a complex selection condition, cardinality estimation is very hard.
05:21
And if joins, aggregates, or subqueries are involved, it's a much, much more difficult problem. The third challenge is join ordering search. When joining multiple tables, there are many possible patterns of join orderings, and exhaustive search of that search space
05:41
is known to be NP-hard, so we need a much smarter way. To solve these three challenges, the initial technology direction was established by the System R optimizer. System R is one of the earliest
06:01
relational database implementations, built at IBM's research laboratory, and the System R paper defined the standard framework of cost-based optimization. Of course, recent query optimization technology is much, much more sophisticated than System R,
06:21
but the basic technology direction is still unchanged from the System R optimizer, and modern optimizers are usually described as System R style. So let's look at how the System R optimizer works. In the System R optimizer,
06:42
the cost of query execution is calculated from the estimated page fetch count and the storage API call count; the latter is the CPU cost, actually. For estimating the page fetch count and storage API call count, statistics including the cardinality of each relation,
07:03
the number of pages in each relation, the number of distinct keys in each index, and the number of pages in each index are collected. This is very similar to Postgres, and this fundamental framework for cost modeling
07:24
is still used in many database systems. So this foundation was established about 40 years ago. And System R introduced the bottom-up plan search algorithm.
07:40
For optimizing a single-table scan, just select the cheapest access path; this is very easy, selecting a sequential scan or an index scan, something like that. But for optimizing a multiple-relation join, there can be many, many possible join orderings.
08:01
In bottom-up plan search, the join ordering search starts from single-relation optimization. For each base relation, pick the cheapest plan, and then find the optimal join orderings
08:22
of every possible two-relation join, then find the optimal join orderings of every possible three-relation join, and iterate this process. Let's look in more detail with an example of a four-relation join.
08:41
At the first step, the cheapest plan for each relation is picked; this orange plan should be the cheapest one. At the next step, possible plans for two-relation joins are generated from these cheapest plans,
09:02
and the cheapest one is picked up for the later steps. We do the same thing for every possible combination of two-relation joins, and repeat the same process. Finally, we get the four-relation join plans
09:22
and select the cheapest plan. Well, this is the bottom-up plan search algorithm, and this algorithm is widely used in modern optimizers like Postgres. And here, I'd like to introduce another important optimization approach
09:41
introduced by Volcano and Cascades. The Volcano paper introduced a top-down, transformational plan search algorithm. This optimization algorithm is not as well known as the System R optimizer, but it is widely used in practice, for example in Microsoft SQL Server,
10:01
Apache Hive, and Greenplum Orca. Another interesting point of the Volcano paper is its extensibility. Volcano is actually not a query optimizer, but a query optimizer generator: it defines a general framework for query optimization,
10:21
like GiST defines a general framework for indexes. So Volcano is not limited to the optimization of relational queries. Users of Volcano, in this case the optimizer developers, define the actual implementation
10:41
by giving definitions of logical operators and physical algorithms. In the context of relational databases, a logical operator means a relational algebra operator, like scan, select, project, or join. A physical algorithm means a specific
11:02
scan or join method, like index scan, sequential scan, hash join, and so on. So let's look at the top-down transformational search in Volcano. The search starts from the initial logical plan.
11:20
A logical plan means a tree consisting of only logical operators and no physical algorithms. For example, in this case the tree consists of select, project, and join operators, with no physical algorithms.
11:41
Starting from this initial logical plan, alternative plans are generated with these three types of transformation. The first is logical operator transformation: the logical operator tree is transformed into an equivalent form, for example by changing the join ordering
12:02
or pushing down a projection. The second transformation rule is physical algorithm selection, like selecting a hash join for this sub-tree, or selecting an index scan for this operator, something like that.
12:20
The third type of transformation is enforcing, for example sorting: by enforcing sorted order at these two points, the merge join algorithm can be used for this join operator.
12:40
So in the top-down search of Volcano, possible plan candidates are generated using these transformation rules, and Volcano applies the rules recursively to search the possible plans. In the bottom-up
13:00
search algorithm, all the intermediate results are necessary to compose the final result, so it is very difficult to reduce the search space. But in top-down search, it is possible to intentionally limit the search space. For example, the well-known branch and bound algorithm
13:23
can be used to prune unpromising candidates, or we can limit the search space with a user-defined optimization time deadline, something like that.
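The three transformation types can be illustrated with a toy rule engine (a minimal Python sketch; the plan representation, rule set, and per-node costs are all invented for illustration and are not Volcano's actual API):

```python
# Toy top-down transformational search in the spirit of Volcano/Cascades.
# A plan is a nested tuple; leaf strings like "A" are base relations.
# All rule names and costs below are made up.

def alternatives(plan):
    """The three transformation types, applied to a logical Join node."""
    if isinstance(plan, tuple) and plan[0] == "Join":
        _, l, r = plan
        yield ("Join", r, l)                           # logical: commutativity
        yield ("HashJoin", l, r)                       # physical algorithm choice
        yield ("MergeJoin", ("Sort", l), ("Sort", r))  # enforcer: sort, then merge

NODE_COST = {"HashJoin": 3.0, "MergeJoin": 1.0, "Sort": 1.5}

def cost(plan):
    if isinstance(plan, str):                  # base relation scan
        return 1.0
    return NODE_COST[plan[0]] + sum(cost(c) for c in plan[1:])

def executable(plan):
    """True if no purely logical operator remains in the tree."""
    if isinstance(plan, str):
        return True
    return plan[0] != "Join" and all(executable(c) for c in plan[1:])

def search(initial):
    """Enumerate the closure of the rules; keep the cheapest executable plan."""
    frontier, seen, best = [initial], set(), None
    while frontier:
        p = frontier.pop()
        if p in seen:
            continue
        seen.add(p)
        if executable(p) and (best is None or cost(p) < cost(best)):
            best = p
        frontier.extend(alternatives(p))
    return best

best = search(("Join", "A", "B"))
# HashJoin(A, B) costs 5.0; MergeJoin over two Sorts costs 6.0.
```

With a single join node the closure is tiny; a real Cascades-style optimizer memoizes equivalent sub-plans in groups and prunes with branch and bound instead of enumerating exhaustively.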
13:40
Okay, so far two major cost-based optimization styles were covered: System R style and Volcano style. System R style is bottom-up optimization with cost modeling, and Volcano is top-down transformational search with an extensible optimizer generation framework.
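The bottom-up, System R style search can likewise be sketched as a small dynamic program over subsets of relations (illustrative Python: the cardinalities, the join cost, and the output-cardinality model are made up, and only left-deep plans are enumerated, as in System R):

```python
from itertools import combinations

# Toy bottom-up (System R style) join ordering search over four relations
# with made-up cardinalities. The cheapest plan for each subset is kept
# and reused when building larger subsets.
CARD = {"A": 10, "B": 100, "C": 1000, "D": 50}

def bottom_up(relations):
    # best[S] = (cost, plan, cardinality) of the cheapest plan joining S
    best = {frozenset([r]): (0.0, r, CARD[r]) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for r in subset:                     # left-deep: (subset - r) JOIN r
                lcost, lplan, lcard = best[subset - {r}]
                cost = lcost + lcard * CARD[r]   # made-up per-join cost
                card = lcard * CARD[r] // 10     # made-up output cardinality
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, (lplan, r), card)
    return best[frozenset(relations)]

cost, plan, card = bottom_up(["A", "B", "C", "D"])
```

Real optimizers additionally cost each join method for each pair and keep "interesting orders" (sorted alternatives) alongside the cheapest plan per subset.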
14:06
Okay, so this is the end of the theory part, and I switch back to Ryoji to move on to the implementation of the Postgres optimizer.
14:54
As mentioned before, the Postgres optimizer is System R style,
15:00
and its bottom-up plan search scheme is basically the same as on the previous slide. For cost estimation, Postgres models execution costs on a per-operation basis. Maybe most of you know these five cost parameters.
15:24
Actually, there is a new parameter, parallel_tuple_cost, in the upcoming 9.6, but we want to focus on basic cases this time, so it's not covered.
15:45
These parameters represent the computational overhead of a single operation, and the number of each operation in a query is estimated from statistics and from the cost formula of each plan node type.
16:05
This is the cost formula: it consists of multiplications of single-operation costs and operation counts.
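The five parameters with their default values, and the overall shape of the estimate, can be sketched as follows (illustrative Python; the example operation counts are made up):

```python
# The five classic cost parameters with their default Postgres values,
# and the general shape of a cost estimate: sum over operation types of
# (per-operation cost) x (estimated operation count).
PARAMS = {
    "seq_page_cost": 1.0,           # one sequentially fetched page
    "random_page_cost": 4.0,        # one non-sequentially fetched page
    "cpu_tuple_cost": 0.01,         # processing one heap tuple
    "cpu_index_tuple_cost": 0.005,  # processing one index tuple
    "cpu_operator_cost": 0.0025,    # one operator/function evaluation
}

def plan_node_cost(counts):
    """counts maps a parameter name to the estimated operation count."""
    return sum(PARAMS[p] * n for p, n in counts.items())

# e.g. a scan reading 1000 sequential pages and 200000 tuples:
est = plan_node_cost({"seq_page_cost": 1000, "cpu_tuple_cost": 200000})
# 1000 * 1.0 + 200000 * 0.01 = 3000.0
```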
16:25
Postgres has a lot of plan node types, so it's very hard to cover all of these types. In this talk, we would like to focus on only basic cases and look into the details of cost modeling.
16:44
Let me start with these two fundamental scan methods. Sequential scan is the simplest method, just scanning the whole table sequentially.
17:02
On the other hand, index scan selectively scans only the interesting data. The execution cost of an index scan is proportional to the selectivity of the query here,
17:21
and if you select only a small fraction of records, index scan is much cheaper than sequential scan, in this range. But if you select a large portion of the data, sequential scan is more efficient.
17:47
The optimizer will choose an index scan at lower selectivities, and will choose a sequential scan at higher selectivities.
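This crossover can be sketched numerically (illustrative Python with made-up table sizes and a deliberately crude index scan model; the real Postgres formulas follow later in the talk):

```python
# Sketch of the seq scan / index scan crossover: sequential scan cost is
# flat in selectivity, index scan cost grows with it. All numbers here
# are invented for illustration.
PAGES, TUPLES = 10_000, 1_000_000
SEQ_PAGE, RANDOM_PAGE, CPU_TUPLE = 1.0, 4.0, 0.01

def seq_scan_cost(selectivity):
    # always reads the whole table and filters every tuple
    # (selectivity is unused on purpose: the cost does not depend on it)
    return PAGES * SEQ_PAGE + TUPLES * CPU_TUPLE

def index_scan_cost(selectivity):
    # crude model: random page reads and tuple costs scale with selectivity
    return selectivity * (TUPLES * RANDOM_PAGE / 100 + TUPLES * CPU_TUPLE)

# find where the optimizer would switch from index scan to seq scan
crossover = next(s / 1000 for s in range(1, 1001)
                 if index_scan_cost(s / 1000) >= seq_scan_cost(s / 1000))
```

Below the crossover the index scan plan is cheaper; above it, the flat sequential scan wins, which is exactly the shape of the two curves on the slide.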
18:00
Let's move on to how the costs of these scans are modeled. First, sequential scan. The cost formula of sequential scan is rather simple: just count the number of pages read for the IO cost,
18:22
and the number of tuples for the CPU tuple cost. The CPU operator cost depends on the complexity of the WHERE clause. The complexity means the variety of operations,
18:43
such as operators, functions, subqueries, or combinations of these. All the operator cost factors are registered in the Postgres catalog, and these values are used
19:03
to calculate the CPU operator cost. Next, index scan. Index scan is a bit more complex than sequential scan, and basically consists of these five cost components.
19:20
Let's check them one by one. The first component is the B+-tree search. The internal nodes of the B+-tree are frequently accessed, so they are assumed to be always cached in buffer memory.
19:42
So the CPU cost of searching down the B+-tree is counted not as IO cost, but as CPU operator cost. Then we arrive at the leaves of the index.
20:01
The executor scans index tuples in leaf pages. Scanning leaf pages is the second cost component, and it is counted as CPU index tuple cost.
20:22
This can be calculated from the number of leaf pages, the average number of index tuples per page, and the selectivity of the query.
20:41
For the IO cost of index leaf pages, we need to consider the cache effect. For this purpose, there is a famous estimation function developed by Mackert and Lohman, and Postgres uses this function.
21:03
This is the line with no cache effect, and this is the line affected by the Mackert and Lohman function.
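The Mackert-Lohman function estimates how many distinct pages are actually fetched given a buffer cache; roughly as follows (a Python paraphrase of the approximation used in Postgres's costsize.c; treat the exact form as approximate):

```python
def index_pages_fetched(tuples_fetched, pages, cache_pages):
    """Mackert-Lohman style estimate of the pages fetched when reading
    tuples_fetched tuples from a table of `pages` pages through an index,
    with `cache_pages` of buffer cache available (sketch, not the exact
    Postgres source)."""
    T = max(pages, 1.0)
    N = float(tuples_fetched)
    b = max(cache_pages, 1.0)
    if T <= b:
        # table fits in cache: fetches approach T asymptotically
        return min(2.0 * T * N / (2.0 * T + N), T)
    lim = 2.0 * T * b / (2.0 * T - b)
    if N <= lim:
        return 2.0 * T * N / (2.0 * T + N)
    # cache overflows: each further tuple costs a fraction of a page fetch
    return b + (N - lim) * (T - b) / T

# with a large cache, repeated fetches mostly hit cached pages:
few = index_pages_fetched(10, 1000, 10_000)
many = index_pages_fetched(1_000_000, 1000, 10_000)
```

With a cache larger than the table, even a huge number of tuple fetches costs at most one read per page; with a small cache, the estimate keeps growing past the table size because pages get evicted and re-read.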
21:22
The next cost component is the heap table scan. If the physical orderings of index tuples and heap tuples are totally uncorrelated, as in this case,
21:41
the IO pattern to the heap table will be random access, so the cost is counted as random page cost. On the other hand, if the physical orderings
22:02
of these pages are correlated, the IO pattern will be sequential. Postgres collects the correlation
22:20
for each index column in its statistics, and in the cost formula, sequential page cost and random page cost are weighted by the correlation factor. And the last cost component is the CPU cost of scanning heap tuples.
22:41
The calculation of this cost is almost the same as for sequential scan, except that the number of tuples is computed with the index scan selectivity. Next, we cover these two join methods,
23:04
hash join and nested loop join. Hash join is the most efficient join method for joining a large number of records, and it is usually combined with sequential scans. On the other hand, nested loop join
23:23
is generally better at joining a small number of records with selective scans, so it is usually combined with index scans or small sequential scans. First, let's look into hash join.
23:45
Hash join has two execution phases. The first phase is called the build phase. In the build phase, inner tuples are scanned
24:00
and hashed into hash table buckets. So the cost is the sum of the inner scan cost, the hashing CPU operator cost, and the CPU tuple cost.
24:23
After the build phase, the hash table is completed and the probe phase starts. In the probe phase, tuples in the outer table are fetched one by one,
24:42
and for each tuple a hash table lookup is executed. For estimating the CPU cost of hash table lookups, the number of buckets in the hash table matters.
25:05
Suppose 16 tuples are hashed into a hash table. If there are two buckets in the hash table, each bucket contains eight tuples.
25:27
So in a hash table lookup, four tuples are compared on average. If there are four buckets in the hash table, each bucket may contain four tuples,
25:40
so two tuples are compared on average. This means that as the number of buckets becomes larger, the CPU cost of hash table lookup becomes lower. In the Postgres implementation,
26:01
the number of buckets does not increase continuously, because it is doubled according to the number of hashed tuples. So the estimated cost of hash join looks like this zigzag curve.
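The bucket doubling behavior can be sketched as follows (illustrative Python; the target tuples-per-bucket constant is made up):

```python
# Why the estimated hash join cost is a zigzag curve: the bucket count is
# doubled as the inner relation grows, so the expected number of tuples
# compared per hash table lookup ramps up, then halves.

def nbuckets(inner_tuples, per_bucket=10):
    """Bucket count grows only in powers of two (made-up fill target)."""
    n = 1
    while n * per_bucket < inner_tuples:
        n *= 2
    return n

def avg_compared(inner_tuples, buckets):
    """On average, half of the matching bucket's tuples are compared."""
    return inner_tuples / buckets / 2

# The example from the talk: 16 tuples in 2 buckets -> 8 per bucket,
# 4 comparisons on average; with 4 buckets -> 2 comparisons.
assert avg_compared(16, 2) == 4.0
assert avg_compared(16, 4) == 2.0

# Per-lookup comparison count over growing inner relations shows the zigzag:
curve = [avg_compared(n, nbuckets(n)) for n in range(1, 200)]
```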
26:25
Next, nested loop join. Nested loop join runs the inner scan for each outer scan tuple, like this: outer tuple loop, then inner scan loop.
26:41
Repeated execution of the inner scan causes a caching effect and may make the inner scan cheaper. So Postgres distinguishes the first invocation of the inner scan from repeated invocations.
27:02
If the number of outer tuples is one, the inner scan is invoked only once, so no special consideration is needed. The cost is calculated by just summing up the cost of the outer scan, the cost of the inner scan,
27:26
the CPU tuple cost, and the CPU operator cost of evaluating the join conditions. If the number of outer tuples is greater than one,
27:44
the inner re-scan cost is calculated separately. As mentioned on the index scan slide, Postgres implements the Mackert and Lohman function for estimating the caching effect, and this function is also used for estimating
28:03
the caching effect of the repeated inner scan. The number of output tuples and the CPU operator cost of the join conditions are easily calculated by multiplying
28:21
the numbers of tuples from outer and inner. So far, we have covered the detailed cost formulas of the basic scan and join methods. Now, let's look at how it works with the TPC-H benchmark.
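Putting those pieces together, nested loop costing looks roughly like this (illustrative Python; the rescan discount and join selectivity are made-up constants, not the values Postgres derives from the Mackert-Lohman function and statistics):

```python
# Sketch of nested loop join costing: the first inner scan is charged in
# full, repeated inner scans get a caching discount, and join quals are
# evaluated for every outer x inner pair.

def nestloop_cost(outer_cost, outer_tuples, inner_cost, inner_rescan_cost,
                  inner_tuples, join_selectivity=0.01,
                  cpu_operator_cost=0.0025, cpu_tuple_cost=0.01):
    rescans = max(outer_tuples - 1, 0)
    scan_cost = outer_cost + inner_cost + rescans * inner_rescan_cost
    pairs = outer_tuples * inner_tuples        # join qual checked per pair
    output_tuples = pairs * join_selectivity   # made-up join selectivity
    return (scan_cost
            + pairs * cpu_operator_cost        # evaluating join conditions
            + output_tuples * cpu_tuple_cost)  # emitting result tuples

# one outer tuple: the inner scan runs once, no rescan discount involved
single = nestloop_cost(10.0, 1, 5.0, 3.0, 100)
# two outer tuples: one full inner scan plus one discounted rescan
double = nestloop_cost(10.0, 2, 5.0, 3.0, 100)
```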
28:46
Oh, sorry.
29:05
[Inaudible audience question.]
29:31
TPC-H is a data warehouse benchmark, and it defines a schema, data generation rules,
29:41
and 22 queries, from simple scans to complex joins and sub-queries. Maybe many of you know DBT-3 and pg_tpch.
30:02
They are open-source implementations of TPC-H. In this experiment, we used a 100-gigabyte data set. This is our experimental environment. The server has two Xeon processors and 24 disks.
30:25
The Postgres version is 9.5, and in order to observe the default behavior of Postgres, we used the default cost parameter values. When measuring the cost estimation
30:41
and execution time of sequential scans and hash joins, only the enable_seqscan and enable_hashjoin parameters were set to on, and the parameters for the other methods were disabled.
31:02
And for measuring index scans and nested loop joins, only those two parameters, enable_indexscan and enable_nestloop, were set to on, and the others were disabled. The first result is the simplest case.
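The measurement setup above can be scripted. The `enable_*` names below are real Postgres planner parameters; the helper function itself is a hypothetical convenience, not part of the talk's tooling:

```python
# Real Postgres planner GUCs used to isolate one scan and one join method.
ALL_METHODS = ["enable_seqscan", "enable_indexscan",
               "enable_hashjoin", "enable_nestloop"]

def isolation_statements(enabled):
    """Return SET statements that leave only the given methods enabled."""
    return ["SET {} = {};".format(p, "on" if p in enabled else "off")
            for p in ALL_METHODS]

# Configuration for the sequential-scan + hash-join measurements:
for stmt in isolation_statements({"enable_seqscan", "enable_hashjoin"}):
    print(stmt)
```

Note that `enable_* = off` does not forbid a method outright; the planner penalizes it heavily and will still use it if no alternative plan exists, which is why this works cleanly only for simple queries like the ones measured here.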
31:24
This is TPC-H query number one: just a scan of the largest table, lineitem. To sweep the selectivity space, we modified the WHERE clause condition. In this case, we changed the selectivity
31:42
by varying the selected range of the shipping date column in the lineitem table. This graph shows the estimated cost of each query plan
32:00
according to the selectivity of the query. As you can see, the cost of the index scan is almost proportional to the selectivity, and it is cheaper at small selectivities.
32:25
On the other hand, the cost of the sequential scan is almost constant across all selectivities, because a sequential scan always scans the entire table.
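These two trends can be captured with a toy cost model. The shapes and default-like constants below are assumptions for illustration, not the actual formulas in Postgres's costsize.c:

```python
def seqscan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Flat in selectivity: the whole table is read regardless."""
    return pages * seq_page_cost + tuples * cpu_tuple_cost

def indexscan_cost(selectivity, pages, tuples,
                   random_page_cost=4.0, cpu_tuple_cost=0.01):
    """Roughly linear in selectivity; pages are fetched at random-I/O prices."""
    return selectivity * (pages * random_page_cost + tuples * cpu_tuple_cost)

def crossover_selectivity(pages, tuples):
    """Selectivity at which the two toy cost lines intersect."""
    return seqscan_cost(pages, tuples) / indexscan_cost(1.0, pages, tuples)

print(crossover_selectivity(1000, 100_000))  # ~0.4 with the defaults above
```

Raising `random_page_cost` steepens the index-scan line and moves the break-even selectivity down, which is exactly the calibration suggested next for a machine where sequential I/O vastly outpaces random I/O.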
32:45
This graph shows the execution time. As you can see, Postgres is good at estimating the cost trend of each scan method:
33:01
the increasing cost of the index scan, and the flat cost of the sequential scan. But there is an error in cheapest-plan selection.
33:22
In this range, the index scan is estimated to be less expensive than it should be.
34:19
This is because our server has rich I/O bandwidth,
34:26
and sequential I/O is about 100 times faster than random I/O, so tuning the random_page_cost parameter should improve the accuracy of the estimation.
34:48
Next, let's look at a join query. TPC-H query number three is a three-way join query with simple aggregation.
35:05
The result looks similar to query number one. The cost of the nested loop with an index scan is almost proportional to the selectivity, and the cost of the hash join with a sequential scan
35:22
is nearly flat. Again, the cost trend estimation is good for each join method, but in this range, the nested loop join takes several thousand seconds,
35:46
while the hash join finishes in about a hundred seconds. So cost parameter calibration matters a lot.
36:02
Let's look at more complex queries. Query number four is a semi-join query. As you can see, the execution time of query number four was unstable because of plan switching for the nested loop join.
36:25
These are the three plans: plan one, plan two, plan three. The Postgres optimizer thought that the cost values
36:42
of these three plans were continuous, but the actual execution time was discontinuous. Query 22 is an anti-join query, and its cost estimation was
37:03
flat across the whole selectivity range, but the actual execution time changed according to the selectivity. Estimating the cardinality of semi-joins and anti-joins accurately is difficult, because it needs complex statistics
37:23
such as multi-column correlations, and there is no general solution. Let me summarize the Postgres implementation part. In this part, we visited the cost formulas of some basic scan and join methods,
37:42
and observed how the Postgres optimizer worked with TPC-H benchmark queries. For simple scans or multi-table join queries, the Postgres optimizer is good at estimating the overall trends of each scan and join method.
38:04
The results also indicated that Postgres could choose a non-optimal plan without cost parameter tuning, especially of the I/O cost parameters in our case. For complex queries like semi-joins or anti-joins,
38:23
even the overall trend estimation was not accurate. This does not mean the implementation of Postgres is poor; rather, cost estimation for these kinds of queries is inherently difficult.
38:41
This means that a deep understanding of how query optimization works is critical to having a good command of your Postgres database.
39:22
Before concluding this talk, let's overview some cutting-edge technologies for further improvement of query optimization. Traditionally, query optimization has been a closed problem. "Closed" means the query optimizer is required to convert an
39:40
input SQL statement into an optimal plan in a short time, with limited knowledge of the actual data and with an approximated cost model. In this problem setting, it is inherently difficult to find the optimal plan for the complex queries employed in real-world applications.
40:03
We saw this in the last TPC-H experiments. Because of this, rethinking the problem setting itself is an active research area. For example, not limited to general statistics collection,
40:25
researchers are also utilizing previous execution history to improve the cost estimation of specific queries, or dynamically integrating query optimization and query execution. Here are some interesting examples.
40:42
Mid-query re-optimization. This is the pioneering work in the integration of optimization and execution. This method detects the sub-optimality of an executing query plan by collecting runtime statistics and incrementally improving the cost estimation.
41:02
When the running query plan is detected to be sub-optimal, its execution is canceled mid-way and the query plan is modified to run faster. There are several modification strategies: the simplest one is just discarding the current result
41:23
and starting a new execution. Another, smarter strategy is modifying only the sub-trees of the query plan that have not started yet, or saving the partial execution result and generating a new SQL statement
41:42
that uses the partial execution result. Another approach to runtime plan switching is plan bucketing. In this method, at query optimization time, a set of optimal plan candidates is generated, one for each selectivity range.
42:02
For example, for this selectivity range P1 is the best, and for this range P3 is the best. Initially, the optimal plan under the initial selectivity estimate is used for execution.
42:20
Then runtime statistics are collected to improve the selectivity estimation, and when the initial selectivity estimate turns out to be wrong, the execution plan is switched to another plan during execution. There is another interesting piece of research.
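A minimal sketch of this selectivity-range plan switching; every name and data structure here is invented for illustration:

```python
import bisect

class PlanBuckets:
    """Map selectivity ranges to plans chosen ahead of time."""

    def __init__(self, upper_bounds, plans):
        # upper_bounds: sorted upper limit of each selectivity range.
        assert len(upper_bounds) == len(plans)
        self.upper_bounds = upper_bounds
        self.plans = plans

    def pick(self, selectivity):
        """Return the precomputed plan for the range containing selectivity."""
        return self.plans[bisect.bisect_left(self.upper_bounds, selectivity)]

# Three ranges: (0, 0.01], (0.01, 0.1], (0.1, 1.0], one best plan each.
buckets = PlanBuckets([0.01, 0.1, 1.0], ["P1", "P2", "P3"])
initial = buckets.pick(0.005)   # optimizer's initial selectivity estimate
refined = buckets.pick(0.3)     # estimate corrected by runtime statistics
if initial != refined:
    print("switch plan mid-execution:", initial, "->", refined)
```

The point is that the expensive part, enumerating a good plan per range, happens once at optimization time; at run time, switching is just a cheap lookup against the refined estimate.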
42:42
This is also a theoretical approach to improving plan selection accuracy. In this work, the uncertainty of the cost estimation is analyzed. For example, a cost estimation error in a base relation of the join tree
43:04
will be amplified and result in a large cost estimation error for the entire plan. So the optimality of plan selection is very sensitive to this kind of error.
43:20
In this method, such error-sensitive parts are identified, and those parts of the plan are partially executed during the optimization phase to improve the accuracy of the estimation and reduce the uncertainty of the total cost estimate.
43:40
Okay, finally, let me summarize our talk. In this talk, we covered the theoretical fundamentals of cost-based optimization, including System R-style optimization and Volcano-style optimization. Then we visited the implementation of the Postgres optimizer and showed some results with the TPC-H benchmark,
44:03
and also presented an overview of some cutting-edge technologies for further improvement of query optimization. That's all for our talk, thank you. Do you have any questions?
44:39
Currently, we don't have such tools,
44:42
but among other database systems, maybe Oracle has something like that, and implementing such a tool would be an interesting task for Postgres.
45:03
Pardon? So that's very difficult, huh?
45:24
Let me repeat the question: is there any way to manually tune things to reduce this error? Practically, it's a very difficult problem,
45:43
but for simpler cases we can use a calibration query: a very easy query whose cardinality can be estimated exactly, so we know there is no cardinality error, and any remaining error must come only from the cost parameter values.
46:03
By constructing such cases and running calibration queries, maybe you can tune the parameters manually; that method is presented in an ICDE paper. Oh, wait... there we go.
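The calibration idea can be sketched as a least-squares fit. This is an assumed procedure, not the cited paper's exact method: measure scans whose cardinalities are known exactly, so all remaining error is in the cost parameters, then solve for the per-page and per-tuple times.

```python
def fit_cost_parameters(samples):
    """samples: (pages, tuples, seconds) triples from calibration scans
    with exactly known cardinalities. Solves seconds ~= a*pages + b*tuples
    for a and b via the normal equations (no intercept term)."""
    s11 = sum(p * p for p, t, y in samples)
    s12 = sum(p * t for p, t, y in samples)
    s22 = sum(t * t for p, t, y in samples)
    r1 = sum(p * y for p, t, y in samples)
    r2 = sum(t * y for p, t, y in samples)
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det   # seconds per page
    b = (r2 * s11 - r1 * s12) / det   # seconds per tuple
    return a, b

# Synthetic "measurements" generated with a=0.001 s/page, b=1e-5 s/tuple.
samples = [(1000, 50_000, 1.5), (2000, 80_000, 2.8), (500, 10_000, 0.6)]
a, b = fit_cost_parameters(samples)
print(a, b)
```

Fitted per-page times from sequential-only and random-only calibration scans would then give a measured ratio to plug into seq_page_cost and random_page_cost.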
46:30
I understand you can measure the time, but how would you tell Postgres which query plan to choose, so you can plot them on the graph? Usually, Postgres will pick a query plan,
46:40
and then you're stuck with it, and you can't draw the two lines and their intersection. So the question is how to draw the nested loop join line and the hash join line separately, right? Yes, that's true in the general case; we have only limited liberty
47:03
to choose the query plan. But for this kind of query, by setting the enable_* parameters, we can limit the plan selection, and since this query is very simple, it works in this way.
47:30
Could you speak more loudly? Oh, sorry.
47:45
Sorry, I can't catch your question. Maybe we can take it offline.
48:01
Okay, sorry, yes. When it comes to parallel queries right now,
48:22
for parallel query, it's the actual execution times that are observed. So your question is: did I try parallel execution and check the cost modeling of parallel execution? The answer is: currently, no.
48:42
But we are trying to test parallel execution performance in our laboratory. The cost model of parallel execution is very difficult to estimate, so it's not as simple as this case.
49:03
Did you think about machine learning? Hmm? Machine learning stuff. So, machine learning is useful for improving the final
49:23
estimation accuracy, but an analytical understanding of cost modeling is the foundation of a query optimizer, so we covered those fundamentals this time. Okay, time is up? Okay, thank you so much.