Let's talk database optimizers
Formal Metadata
Title 
Let's talk database optimizers

Title of Series  
Author 

License 
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2018

Language 
English

Content Metadata
Subject Area 
00:00
Histogram
Software engineering
File format
Software developer
Software
Database
Mathematical optimization
00:34
Table (information)
Sequel
Multiplication sign
View (database)
Mereology
Number
Order (biology)
Cache (computing)
Query language
Subject indexing
Information
Mathematical optimization
Software developer
Planning
Bit
Instance (computer science)
Statistics
Cache (computing)
Subject indexing
Number
Query language
Personal digital assistant
Factory (trading post)
Order (biology)
Quicksort
Table (information)
Mathematical optimization
Optical disc drive
Resultant
02:40
Table (information)
View (database)
Set (mathematics)
Planning
Mereology
Area
Order (biology)
Query language
Order (biology)
Table (information)
Optical disc drive
Resultant
Mathematical optimization
Condition number
03:56
Functional (mathematics)
Planning
Neuroinformatik
Revision control
Order (biology)
Query language
Operator (mathematics)
Order (biology)
Buffer solution
Revision control
Table (information)
Resultant
Condition number
Stability theory
05:12
Group action
Equaliser (mathematics)
1 (number)
Set (mathematics)
Mereology
Total S.A.
Number
Order (biology)
Ranking
Summierbarkeit
Partition (number theory)
Mathematical optimization
Condition number
Information
View (database)
Total S.A.
Stack (abstract data type)
Sequence
Window function
Partition (number theory)
Arithmetic mean
Query language
Personal digital assistant
Function (mathematics)
Order (biology)
Condition number
Ranking
Optical disc drive
07:53
Partition (number theory)
Subject indexing
Trail
Personal digital assistant
Order (biology)
Condition number
Stack (abstract data type)
Window function
Row (database)
Stability theory
08:47
Group action
View (database)
Equaliser (mathematics)
Range (statistics)
Total S.A.
Total S.A.
Local Group
Query language
Order (biology)
Statement (computer science)
Summierbarkeit
Table (information)
Summierbarkeit
Mathematical optimization
Condition number
10:14
Group action
Link (knot theory)
Price index
Mereology
Local Group
Subject indexing
Order (biology)
Personal digital assistant
Intrusion detection system
Order (biology)
Summierbarkeit
Table (information)
Condition number
11:08
Pairwise comparison
Group action
View (database)
Planning
Stack (abstract data type)
Window function
Local Group
Revision control
Partition (number theory)
Latent heat
Query language
Personal digital assistant
Order (biology)
Phase transition
Right angle
Table (information)
Mathematical optimization
Condition number
Stability theory
12:29
Slide rule
Asynchronous Transfer Mode
Table (information)
Link (knot theory)
Price index
Bit
Stack (abstract data type)
Group action
Local Group
Partition (number theory)
Cube
Function (mathematics)
Query language
Condition number
Process (computing)
Mathematical optimization
Window
13:26
Order (biology)
View (database)
Cube
Query language
Function (mathematics)
View (database)
Price index
Condition number
Stack (abstract data type)
Cartesian coordinate system
Mathematical optimization
Local Group
15:27
Query language
Collaborationism
Information
Statistics
Mathematical optimization
Point cloud
00:05
okay so let's get started thank you
00:09
everybody for coming my name is Vicenta I am a software engineer for the MRED B foundation and today I'm going to give you a brief overview of the new developments in optimizers we've already heard oysters talk about histograms so that's something that Murray DB has in a different format in a 101 and onwards and now we're going to compare a few other things that optimizers can do so
00:35
first of all what is the goal of the query optimizer so we want to get given a certain sequel query the fastest time possible when executing it and we have different tools to do that we can choose to either cache something some data before we actually compute the full result we can do some index lookups and get your that data faster this way or we can try to rewrite your query in a more efficient manner that we can query it faster that so now problem is the number of possible plans grows exponentially with the number of tables it's actually a factorial number of tables but let's say that it's exponential and in a
01:21
perfect world the query optimizer should be able to give you the best result all the time unfortunately that's not the case for many queries it does happen but there are certain instances where it works quite badly we're not going to be too concerned about those for now but any optimization that we do we try to reduce the number of possible queries which behave poorly so in order to explain what kind of optimizations we're doing in recent developments I need a bit of background so first of all a derived table in my skill is any sort of table which you can find as part of a subquery in the from clause whenever I say they're derived table it's either this or a view as part of the from clause so a view is basically the same thing here's our query example we want to get all our customers which had orders with which had large orders so greater than 1 million entries in their order there's only a few customers so that's why there are VIP customers and the orders have to be from October start to end 2017
02:41
how would you execute this in a naive fashion well we first compute the result in this table here so this is October orders we just do a filter on the orders on the date and then we try to join this table with VIP customers using the join condition the amount must be greater than 1 million so here's the set of October orders a headmounted a 1 million that part will enter the join with the VIP customers and you get your result however we can do better than
03:19
that and this is something at my skill and Marie DB and obviously percona says there based on my skill also do we rewrite the spiri because we can and this data this condition here it's only based on orders and if we move this condition outside of the subquery and just replace the custom table with the orders table we're going to get this query so now the optimizer has a more general view of how we actually want to query the data we have all the conditions in one place and this leads to a better query plan if you do explain
03:58
you will see that we only have two tables visible VIP customers in orders we don't see October orders anywhere in the explain plan and the execution goes
04:08
as follows we get the orders then we get the October orders and the amount there's an intersection happening so there's not there are orders that are not in October which have greater than 1 million amount only the intersection of all these conditions will enter the join operation and then this gets joined with VIP customers so instead of first computing everything here we just do one table scan over orders and one over VIP customers there's there are some tricks to buffer results but I'm not going to go into too many details details here so conclusion is that merging is good it simplifies the query and as I said it works in all stable versions of mysql or mariadb however there's a problem when we when aggregation comes into play we can't use it when aggregation happens because if you rewrite the query which has an aggregate function in the sub query results are not what you expected so
05:13
here's an example we want the october totals so all the the number of orders the number of items ordered per customer in october if we were to join this query inside the main one you would actually start aggregating stuff in this one and you don't want to do that you actually want the totals for one customer in this case which is customer ID equal one but we have a group by customer ID so we can make use of this here all right so we have lots of customers we don't want to get all of all the totals just for one customer so you make use of this group by column here and we identify this condition customer ID and we push this condition all the way to the where clause here because we have this group by column here so now when aggregated we only aggregate one customer so we get our orders we get only customer one orders and then we sum those up so instead of doing aggregation over the full data set we just aggregate one customer this optimization is available in MariaDB 1010  and this tactic also works for window functions if you are here from the start if you've seen that we have window functions in more ddb 10th 10  and there is also window functions coming in MySQL 8 0 so we can push the same condition if you have a window function with a partition by clause here so partition by means that you compute this rank sequence for each customer you can see here customer ID equals 1 has orders 1 2 3 or the ranks same for customer 2 and so on and we don't want to have to compute this rank column for all of the table if we're only interested in customers and customer ones top 3 orders so we only want this here and because we have this equality condition here and there is this partition by clause here you know what we know that we can simplify the query and push this part of the condition all the way up here in the where Clause of this query so here so if
07:47
you if you push this you only get to compute this part and this information is not not even accessed okay so if you
07:56
compare how execution would happen currently in Mirae db10 2 which is the stable release minus QL it's 0 which has window functions my escalate David computes all free top top free orders for all customers and then you would get the customer the rows for customer ID equals 1 Emory DB 10 free and also in pause Bears we've also tested how possible asset you will only compute the top 3 orders for customer ID equals 1 and in some cases this can be much faster because we can make use of indexes to identify the rows for that particular customer make sense so far
08:40
okay either I'm very clear or nobody's understanding anything at all so final
08:49
optimization I think the name for this is not quite fortunate so we're probably going to change it so for now we call it split grouping for derived so what happens here where we have a similar query to the one before we have a group by statement but because the there is no equality we don't filter by one customer ID we have a range of customers we have customer name in John and Bob but we don't have customer name inside orders we have customer ID ok so so we have the join condition here customer ID equals October totals customer ID and we filter by customer name so we can't do condition push down immediately because there is no column customer name here however we can be smart about this and
09:42
actually let's first look at how we are not smart so when we're not smart we do a sum of everything so for all customers we get the sum and then we this means we get the October totals and then for each customer we filter only Bob and John for both and join them so we have to do aggregation over the full orders table again not good this gets big very fast as you see we don't need customer X when
10:15
we're smart we actually look at customers first we see which customer IDs we are allowed to use and we only aggregate those customers so we can compute a sum for Bob and we can compute a sum for John and this way we completely ignore this part of the table as well as this part of the table there
10:41
is some requirements for this though in order to actually do this we have to have a joint condition here so we have to have something that links orders to customers in this case customer ID is linked this is the column that links the two tables and we also must have a index on the group by column to be able to quickly check if the customer that we are looking for is actually in the table so here we want to have an index to
11:10
check if it's Bob in orders but Bob is Bob Bob's ID in orders if you don't have this this optimization won't actually work right now it's still in the beginning phases so we haven't really optimized this for all cases we only apply it when we're definitely sure that it will bring a speed up otherwise we are very we don't use it because we're not sure if it's going to produce a better query plan and to look up to see
11:37
how features stack up well we have the right table view merge this the first thing we talked about in all stable versions condition pushdown is a Marida be only specific feature window functions are coming into MySQL h0 and I haven't personally check but I believe merge actually works with cities too in my skill and also condition pushed down for partition by and split table grouping is only available in Murray to be 10 feet so we're hoping that MySQL can also implement this because if more more users get optimizations it's better for everybody obviously this is not a comprehensive comparison I'm not trying to show that MySQL doesn't have any features or we that we have so a lot more this is just on the features which I talked about during this talk okay this is actually a
12:34
overview of how things work this is based what I talked about if you want to read we read this somewhere it's it's available I will make I'll post this talk on on the FOSDEM link so you can download it there okay that's all I had
12:53
thank you very much yes if there's an I actually have more slides possibly but it's a bit difficult so if people are interested I can talk more about it offline yep go ahead
13:35
so is it this this one so I'm not really sure I get your question so let me try and rephrase it so the question is why don't we have a where clause inside the view view here right well for example if you try to make an application for users which which works for multiple you say say you're in a company and you want this view to work for multiple departments maybe don't want filter by a specific user for that department so that's why well maybe the example is not the example is to showcase the optimization possible it's not to show that this is the way you should actually write all your queries okay yes you can optimize it manually but yeah the idea from the start of the talk is this so in a perfect world any query
15:31
should run fast okay so that's what we're trying for even if you shoot yourself in the foot okay thank you