
Let's talk database optimizers


Formal Metadata

Title: Let's talk database optimizers
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Subject Area: histograms, software engineering, file formats, software development, software, databases, mathematical optimization
Okay, so let's get started. Thank you, everybody, for coming. My name is Vicențiu; I am a software engineer at the MariaDB Foundation, and today I'm going to give you a brief overview of new developments in query optimizers. We've already heard Øystein's talk about histograms; that's something MariaDB has in a different format, from 10.0 onwards. Now we're going to compare a few other things that optimizers can do.
First of all, what is the goal of the query optimizer? Given a certain SQL query, we want the fastest possible execution time. We have different tools to do that: we can cache some data before we actually compute the full result, we can do index lookups to reach the data faster, or we can try to rewrite your query in a more efficient form so that it executes faster. Now, the problem is that the number of possible plans grows exponentially with the number of tables (it's actually factorial in the number of tables, but let's say it's exponential). In a perfect world, the query optimizer should be able to give you the best plan all the time. Unfortunately, that's not the case: for many queries it does happen, but there are certain instances where it works quite badly. We're not going to be too concerned about those for now, but with any optimization we add, we try to reduce the number of possible queries which behave poorly.

In order to explain what kind of optimizations we've been doing in recent development, I need a bit of background. First of all, a derived table in MySQL is any sort of table which you can find as part of a subquery in the FROM clause; whenever I say derived table, it's either that or a view used in the FROM clause (a view is basically the same thing). Here's our example query: we want to get all our customers which had large orders, greater than 1 million, in their order amount. There are only a few such customers, which is why they are VIP customers, and the orders have to be from October 2017, start to end.
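In SQL, the example might look like the following sketch. The schema and names (`vip_customers`, `orders`, `october_orders`) are my own illustration, not the speaker's exact slides, and it is run here on SQLite just to show the shape and result of the query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vip_customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     order_date TEXT, amount INTEGER);
INSERT INTO vip_customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES
  (10, 1, '2017-10-05', 2000000),  -- October and over 1M: qualifies
  (11, 1, '2017-09-20', 3000000),  -- over 1M but not October
  (12, 2, '2017-10-15',  500000),  -- October but too small
  (13, 3, '2017-10-07', 9000000);  -- would qualify, but customer 3 is not a VIP
""")

# The subquery in the FROM clause is the derived table ("October orders").
rows = con.execute("""
    SELECT c.name, o.amount
    FROM vip_customers c
    JOIN (SELECT * FROM orders
          WHERE order_date BETWEEN '2017-10-01' AND '2017-10-31') AS o
      ON c.customer_id = o.customer_id
    WHERE o.amount > 1000000
""").fetchall()
print(rows)  # [('Alice', 2000000)]
```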
How would you execute this in a naive fashion? Well, we first compute the result of the derived table, October orders: we just filter the orders on the date. Then we try to join that table with VIP customers, using the join condition, and the amount must be greater than 1 million. So of the set of October orders, only those with an amount over 1 million enter the join with the VIP customers, and you get your result. However, we can do better than
that, and this is something MySQL and MariaDB (and obviously Percona Server, being based on MySQL) both do: we rewrite the query, because we can. The date condition is only based on orders, so if we move this condition outside of the subquery and just replace the derived table with the orders table itself, we get the merged query. Now the optimizer has a more general view of how we actually want to query the data, we have all the conditions in one place, and this leads to a better query plan. If you run EXPLAIN,
you will see that only two tables are visible, VIP customers and orders; we don't see October orders anywhere in the EXPLAIN plan, and the execution goes
as follows: we scan the orders, applying both the October date filter and the amount filter. There's an intersection happening here, because there are orders that are not in October which have an amount greater than 1 million; only rows matching all of these conditions enter the join, and then this gets joined with VIP customers. So instead of first materializing the derived table, we just do one table scan over orders and one over VIP customers. There are some tricks to buffer results, but I'm not going to go into too many details here. The conclusion is that merging is good: it simplifies the query, and as I said, it works in all stable versions of MySQL and MariaDB. However, there's a problem when aggregation comes into play. We can't use merging when aggregation happens, because if you rewrite a query which has an aggregate function in the subquery, the results are not what you expect.
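The merge rewrite described above can be sanity-checked by hand. This sketch (same assumed schema as before) runs both the derived-table form and the merged form; SQLite's optimizer is not MariaDB's, so this only demonstrates that the rewrite is semantically equivalent, not the plan change:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vip_customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                     order_date TEXT, amount INTEGER);
INSERT INTO vip_customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES
  (10, 1, '2017-10-05', 2000000), (11, 1, '2017-09-20', 3000000),
  (12, 2, '2017-10-15', 500000),  (13, 3, '2017-10-07', 9000000);
""")

# Original form: the date filter lives inside the derived table.
derived = """
    SELECT c.name, o.amount
    FROM vip_customers c
    JOIN (SELECT * FROM orders
          WHERE order_date BETWEEN '2017-10-01' AND '2017-10-31') AS o
      ON c.customer_id = o.customer_id
    WHERE o.amount > 1000000
"""
# Merged form: the subquery is replaced by the base table, and its
# condition moves into the outer WHERE clause.
merged = """
    SELECT c.name, o.amount
    FROM vip_customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date BETWEEN '2017-10-01' AND '2017-10-31'
      AND o.amount > 1000000
"""
a = sorted(con.execute(derived).fetchall())
b = sorted(con.execute(merged).fetchall())
assert a == b  # the rewrite changes the plan, never the result
print(a)       # [('Alice', 2000000)]
```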
Here's an example: we want the October totals, so the number of orders and the number of items ordered per customer in October. If you were to merge this subquery into the main query, you would start aggregating across everything, and you don't want that; you actually want the totals for one customer, in this case customer_id = 1. But we have a GROUP BY on customer_id, so we can make use of that. We have lots of customers and we don't want to compute all the totals just to get one customer, so we identify the equality condition on the GROUP BY column and push this condition all the way down into the WHERE clause of the subquery. Now, when aggregating, we only aggregate one customer: we fetch only customer 1's orders and then sum those up. So instead of doing aggregation over the full data set, we just aggregate one customer. This optimization is available in MariaDB 10.2.

The same tactic also works for window functions. If you were here from the start, you've seen that we have window functions in MariaDB 10.2, and there are also window functions coming in MySQL 8.0. We can push the same kind of condition if the window function has a PARTITION BY clause. PARTITION BY means that you compute this rank sequence for each customer separately: customer 1 has its orders ranked 1, 2, 3, and the same for customer 2, and so on. We don't want to compute this rank column for the whole table if we're only interested in customer 1's top-three orders. Because we have an equality condition on the partitioning column, and there is this PARTITION BY clause, we know that we can simplify the query and push that part of the condition up into the WHERE clause of the subquery. If you push it down, you only compute the one partition you need; the rest of the data is not even accessed.
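The GROUP BY pushdown can be checked for semantic equivalence with a toy example (assumed schema; SQLite does not perform this rewrite itself, so the pushed-down form is applied by hand here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount INTEGER);
INSERT INTO orders VALUES
  (1, 1, 100), (2, 1, 200),   -- customer 1: 2 orders, total 300
  (3, 2, 400), (4, 2, 800);   -- customer 2: 2 orders, total 1200
""")

# Original: aggregate every customer, then keep only customer 1.
original = """
    SELECT * FROM (SELECT customer_id, COUNT(*) AS n, SUM(amount) AS total
                   FROM orders GROUP BY customer_id) AS t
    WHERE t.customer_id = 1
"""
# Pushed down: because customer_id is a GROUP BY column, the equality
# may move into the subquery's WHERE, so only customer 1 is aggregated.
pushed = """
    SELECT customer_id, COUNT(*) AS n, SUM(amount) AS total
    FROM orders
    WHERE customer_id = 1
    GROUP BY customer_id
"""
assert con.execute(original).fetchall() == con.execute(pushed).fetchall()
print(con.execute(pushed).fetchall())  # [(1, 2, 300)]
```

The pushdown is only safe because the filtered column is in the GROUP BY list: every output group has a single customer_id value, so filtering before or after aggregation selects the same groups.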
Now, compare how execution happens in the current releases. MariaDB 10.2, which is the current stable release, and MySQL 8.0, which has window functions, compute the top-three orders for all customers and only then pick out the rows for customer_id = 1. MariaDB 10.3, and also Postgres (we've tested how PostgreSQL handles it too), will only compute the top-three orders for customer_id = 1, and in some cases this can be much faster, because we can make use of indexes to identify the rows for that particular customer. Does it make sense so far?
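The two execution strategies can be sketched in plain Python (hypothetical data; the per-customer ranking below is a stand-in for RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC)). Because PARTITION BY isolates each customer's rows, ranking only the filtered partition gives the same answer as ranking everything and then filtering:

```python
orders = [  # (customer_id, order_id, amount)
    (1, 10, 500), (1, 11, 900), (1, 12, 700), (1, 13, 100),
    (2, 20, 800), (2, 21, 600),
]

def ranked(rows):
    """Assign 1, 2, 3... per customer by descending amount (ties ignored)."""
    out = []
    by_customer = {}
    for row in rows:
        by_customer.setdefault(row[0], []).append(row)
    for cust, cust_rows in by_customer.items():
        for rank, row in enumerate(sorted(cust_rows, key=lambda r: -r[2]), 1):
            out.append((cust, row[1], row[2], rank))
    return out

# Strategy A (MySQL 8.0 at the time): rank every partition, then filter.
a = [r for r in ranked(orders) if r[0] == 1 and r[3] <= 3]
# Strategy B (MariaDB 10.3): filter customer 1's rows first, rank only those.
b = [r for r in ranked([r for r in orders if r[0] == 1]) if r[3] <= 3]

assert sorted(a) == sorted(b)
print(sorted(a))  # [(1, 10, 500, 3), (1, 11, 900, 1), (1, 12, 700, 2)]
```

Strategy B never touches customer 2's rows at all, which is where the index-based speedup comes from.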
Okay, either I'm being very clear or nobody is understanding anything at all. So, on to the final
optimization. I think the name for this one is not quite fortunate, so we're probably going to change it, but for now we call it split grouping for derived tables. What happens here: we have a similar query to the one before, with a GROUP BY statement, but because there is no equality, we don't filter by one customer_id; we have a set of customers, customer_name IN ('John', 'Bob'). But we don't have customer_name inside orders; we only have customer_id. So we have the join condition customer_id = october_totals.customer_id, and we filter by customer_name. We can't do condition pushdown immediately, because there is no customer_name column in the subquery. However, we can be smart about this, and
actually, let's first look at how we are not smart. When we're not smart, we do a sum over everything: for all customers we compute the sum, which gives us the October totals, and then we filter to keep only Bob and John and join them. So we have to do aggregation over the full orders table again. Not good; this gets expensive very fast, and as you can see, we don't need customer X at all. When
we're smart, we actually look at customers first: we see which customer IDs we are allowed to use, and we only aggregate those customers. So we compute a sum for Bob and a sum for John, and this way we completely ignore the rest of the orders table.
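The "not smart" and "smart" strategies can be sketched like this (assumed schema; here the split is done by hand in application code to show the two strategies agree, whereas MariaDB 10.3 performs the equivalent rewrite inside the optimizer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
CREATE INDEX idx_orders_customer ON orders (customer_id);  -- required for the trick
INSERT INTO customers VALUES (1, 'John'), (2, 'Bob'), (3, 'Carol');
INSERT INTO orders VALUES (1, 100), (1, 200), (2, 50), (3, 999), (3, 999);
""")

# Not smart: aggregate every customer, then join and filter on the names.
full = con.execute("""
    SELECT c.name, t.total
    FROM customers c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM orders GROUP BY customer_id) AS t
      ON c.customer_id = t.customer_id
    WHERE c.name IN ('John', 'Bob')
""").fetchall()

# Smart (split grouping): find the matching customers first, then
# aggregate one group at a time via the index on orders.customer_id.
split = []
for cid, name in con.execute(
        "SELECT customer_id, name FROM customers WHERE name IN ('John', 'Bob')"):
    (total,) = con.execute(
        "SELECT SUM(amount) FROM orders WHERE customer_id = ?", (cid,)).fetchone()
    split.append((name, total))

assert sorted(full) == sorted(split)  # Carol's rows are never aggregated in `split`
print(sorted(split))  # [('Bob', 50), ('John', 300)]
```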
There are some requirements for this, though. In order to actually do it, we have to have a join condition, something that links orders to customers; in this case customer_id is the column that links the two tables. We also must have an index on the GROUP BY column, to be able to quickly check whether the customer we are looking for is actually in the table. So here we want an index to
check whether Bob is in orders (well, whether Bob's ID is in orders). If you don't have this index, the optimization won't actually work. Right now it's still in the early phases, so we haven't really optimized it for all cases; we only apply it when we're definitely sure that it will bring a speedup, because otherwise we're not sure it would produce a better query plan. To see
how the features stack up: we have derived table and view merging, the first thing we talked about, in all stable versions; condition pushdown is a MariaDB-specific feature; window functions are coming in MySQL 8.0; and I haven't personally checked, but I believe merging actually works with CTEs too in MySQL. Condition pushdown for PARTITION BY and split grouping are only available in MariaDB 10.3. We're hoping that MySQL can also implement these, because if more users get these optimizations, it's better for everybody. Obviously, this is not a comprehensive comparison; I'm not trying to show that MySQL doesn't have any features that we have, or that we have a lot more. This covers only the features which I talked about during this talk. Okay, this is an
overview of how things work, based on what I talked about; if you want to read it somewhere, it's available, and I'll post the slides on the FOSDEM link so you can download them there. Okay, that's all I had.
Thank you very much. Yes? I actually have more slides, possibly, but it's a bit difficult, so if people are interested, I can talk more about it offline. Yep, go ahead.
So, is it this one? I'm not really sure I get your question, so let me try and rephrase it. The question is: why don't we have a WHERE clause inside the view here, right? Well, for example, say you're in a company and you want this view to work for multiple departments; maybe you don't want to filter by a specific user for one department. But really, the example is there to showcase the possible optimization; it's not to show that this is the way you should actually write all your queries. Okay, yes, you can optimize it manually, but the idea from the start of the talk is this: in a perfect world, any query
should run fast. That's what we're aiming for, even if you shoot yourself in the foot. Okay, thank you.