Faster Analytics with MariaDB 10.2

Video in TIB AV-Portal: Faster Analytics with MariaDB 10.2

Formal Metadata

Faster Analytics with MariaDB 10.2
Advanced SQL features - CTEs and Window Function
Title of Series
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
MariaDB 10.2 has brought two new important querying features, Common Table Expressions (CTEs) and Window Functions. Both features provide greater expressibility to queries, thus opening up opportunities for the optimiser to provide speedups not previously achievable. With a focus on analytical queries, we will see how to improve query performance sometimes by an order of magnitude compared to regular SQL.
Keywords Databases

Related Material

Video is cited by the following resource
Point (geometry) Sequel Multiplication sign 1 (number) Staff (military) Basis <Mathematik> Window function Type theory Computer animation Query language Personal digital assistant Energy level Table (information)
Trail Server (computing) Group action Run time (program lifecycle phase) Vapor barrier Open source Sequel State of matter Code View (database) Multiplication sign Direction (geometry) Function (mathematics) Student's t-test Mereology Product (business) Medical imaging Frequency Estimator Different (Kate Ryan album) Selectivity (electronic) Mathematical optimization Condition number Identity management Multiplication Inheritance (object-oriented programming) Software developer Graph (mathematics) Projective plane Data storage device Database Bit Degree (graph theory) Process (computing) Exterior algebra Computer animation Query language Personal digital assistant Linearization MiniDisc Right angle Table (information) Freeware Arithmetic progression Resultant Spacetime
Computer chess Point (geometry) Group action Server (computing) Sequel View (database) System administrator Real number Function (mathematics) Parameter (computer programming) Mereology Regular graph Rule of inference Computer programming Formal language Power (physics) Number Alphabet (computer science) Data structure Recursion Partition (number theory) Exception handling Email Information Key (cryptography) Projective plane Electronic mailing list Complete metric space Sequence Window function Word Computer animation Query language Personal digital assistant Network topology Order (biology) Video game Object (grammar) Table (information) Resultant Library (computing) Row (database)
Point (geometry) Dataflow State observer Game controller Sequel Multiplication sign Set (mathematics) Function (mathematics) Mereology Regular graph Perspective (visual) Element (mathematics) Number Neuroinformatik 2 (number) Population density Average Semiconductor memory Negative number Data structure Information security Position operator Chemical equation Electronic mailing list Bit Database transaction Database Line (geometry) Cartesian coordinate system Benchmark Sequence Timestamp Frame problem Window function Subject indexing Data mining Computer animation Personal digital assistant Logic Query language Order (biology) Website output Right angle Figurate number Ranking Resultant Row (database)
Group action Digital electronics Multiplication sign Equaliser (mathematics) 1 (number) Set (mathematics) In-Memory-Datenbank Function (mathematics) Parameter (computer programming) Mereology Disk read-and-write head Neuroinformatik Different (Kate Ryan album) Stability theory Social class Area Block (periodic table) Electronic mailing list Maxima and minima Bit Database transaction Window function Order (biology) Right angle Ranking Quicksort Resultant Row (database) Point (geometry) Laptop Slide rule Overhead (computing) Service (economics) Sequel Virtual machine Similarity (geometry) Average Time zone Multiplication Physical law Content (media) Database Cartesian coordinate system Timestamp Subgroup Subject indexing Radius Computer animation Personal digital assistant Query language Table (information)
Revision control Default (computer science) Category of being Distribution (mathematics) Computer animation Repository (publishing)
Computer animation
and types of my name is beech into I think you for coming my for staff Roscommon so
I'm not quite familiar with the audience and the technical level of everybody so if there are any questions from doing my talk feel free to interrupt me there should be plenty of time for discussion to come so 1st of all just get a general idea how many of you here know of my OK uh how many of you actually a sequel on a day-to-day basis for the should OK so uh about familiar that's good but feel free to ask me any questions if there's something which is unclear that's the purpose of this talk is to show you how you can make queries run faster especially analytical ones and using the new features which are now in our latest stable release which is maybe be tend to syringe talk about common table expressions our window functions how these to work together and need to go for a few practical use cases of all of these are actually problems that I had to solve at 1 point so would be examples are from
production and and also to go to a bit of details about what the development status is for these features what's coming up in the picture releases hassles start with common table expressions so common table expressions are a new way to express and what's my skill and maybe used to call the right tables so whenever you wanted to do a they select n in a from clause you would have to create a direct table well here we can define this direct table outside of the current selected so here's the in red I've highlighted what is the new syntax were cities we start with the keyword with then there's the a common table expressions name in in this case engineers and then we have what's going to be inside the state so after this definition you can use the engineers have keyword as if it were a table in the database that and then you can select from engineers and potentially filter out other using other conditions here and this is where you use it so wouldn't from clause have engineers is the identical query that you had to write before so here we have the from balls which has the subselect in the nice part is that you can use multiple cities and multiple cities can reference each other so here we start with engineers and then we go to Europe only engineers by filtering the engineers previously defined in Table if you want to do this with derived tables you'd have to write the query within a query and it gets pretty hot track because you have this nested versus the linear view of cities so from a readability standpoint cities are obviously very with but that does more to this Due to how cities work there is actually room for optimizations that um the query optimizer can do to give you faster running queries and choose other example right here we're trying to select and to compare how products did it in this current here was the previous year and here we are using the same table twice so 1st we that we define now all the sales for all products in a given year and then we just join them together to see which products have performed better this year as opposed to last of it you have so so far you can identify the using the with keyword and the estimate to derive tables and that you can actually write most of the time uh identical queries with direct tables and they can provide you more efficient so on code and I'm going to go into details on why this is so in view of the optimization that we support so how do we
execute cities so why the the just execute the B a query which defines the the store in temperate table the and whenever you need it you do it again and again and this is the the basic it it always works but if you have multiple references like here you have too great to temperate tables which is not optimal however when you do find that you are using a twice it is a good idea to great only 1 temperate table for a distinct ICT expression and this is 1 optimization and it helps on now has my by saving the space and execution time however when you do this you cannot perform other optimizations so it's either this 1 or something which from Moretti has done with views before and it's something we call city merging so here we have 80 select we're selecting engineers from on from the table that we have already and we can think of this like of you the trick is that since the city is used in a joint but there is no group by or distinct clause and inside the city you can actually merge these degrees together to perform just 1 a simple select so you can transform from the city expression to a and single select query and in this case the query optimizer has uh vision of across all the tables involved it can perform all the possible up images of this is I can so by rewriting the query get on you give the optimize a chance to do as much as possible and this is similar to how we use work already this is not something young this just that due to how we implemented cities it's if this came in a natural now here's a different optimization and this is a alternative to when you don't have the conditions surprised by merging cities so here we have 2 problems we have a group by in the city so if you want to merge this In the parents of query you would change the results complicated and you also have some conditions in the work was in the outer solar what we have obviously cannot merge however maybe make sense to push the condition from the other query into the city because they're all you're still going to filter out things of and if you filter out them before computing temperate table this will save on the disk space and execution time so in this case you get smaller emperor tables and in this world potentially speed up your queries when when you cannot do city merging in the 1st place this feature was action plemented by student and this is something we are very proud of that we get to work in the community of all so all this was contributed in a free month period if you've heard of Google Summer of Code it's a very interesting project and if you're in charge of an open source of project I do suggest you try and apply it's something that's very worthwhile the you comparing maybe to my skill and also was graphs and sequel server so what when we try to implement cities as well as when the functions coming later we tried to do as bistable as a similar job as the of the database vendors so both ostensible server supports at and we do condition pushed down but we don't do reuse and the reason why we don't to reuse is because whenever you choose to to reuse of the U and all the other all possible optimizations so that's why we currently do not support use but we are thinking of adding have metastasized whether to reuse cities or not this has little world from 1 of so this data is from 8 0 this is from of I think April this year in the I have looked if there is there have been the differences the the from my understanding condition push down was not in in the my skill plans when asked the developers so it's Province they're not there I OK so for oppose press they don't do either merging or condition pushed out maybe somebody is more familiar they can explain the reason for this I don't think the only from a simple able together is that progress considers cities as optimisation barriers so the optimizer can only work within the city or after the city but not mixing them together another interesting
feature is that cities can be accessed in a recursive manner I'm not going to go into details on how you can do this but the interesting part is that this action makes this sequel language during complete so you can write any program which would run in a in a crate in a different language and with the cities and interesting example is if somebody has actually written a chess library for with recursive cities the so you can actually chess with that I think was aggressor server in this case because cities are more same used for hierarchical Koreans so you can have a tree structure you can actually query within the tree and this makes her life a lot easier when you have to get a list of the parts from for an object or a list of employees which reports directly or indirectly the someone OK so enough about cities the and the main take from here is that they are essentially views but they are local to a query you don't have to create them separately you will get better performance if you see these wisely and with recursive because it if you if you have a lot of power to express on complicated courts any questions so far can either nobody understood anything or everything's crystal clear and this is trivial information OK so let's move on to window functions and is the project I personally would come and it's something that can can be very difficult to explain to somebody but it's something can really speed up queries when used properly so when the functions acts like a regular functions as well as aggregate functions so the trick is that they can access multiple rose from the current yield that the computing the value on which means that with this you can actually try to eliminate self joins so whenever you are joining the same table itself you can potentially get rid of that kind of query and this can sometimes be queries that take 10 hours for large datasets to the past 1 minute and how examples of this perhaps selective said when the functions
are computed over a sequence of rooms like simple like aggregate functions such as some but be resolved gets output that the once per each row in the result set not once per group you can find them using the over clause and this is a go for a few examples to see make my point on this so he would have a simple table with if users they have an e-mail a 1st name last name and there's split up by either regular or admin users if we forgot to add primary key to this table we can generate 1 of the real number when the function so remember here takes no parameters but it has to have the school keyword right here with any records of words there is a reason reason for that and it will generate an increasing sequence of numbers from 1 to 5 this is very similar to a function a a regular function but the trick is that but it has to have an order so if you you if you look at this number here this order is not guaranteed the currently we will make an exception for this kind of queries in a future release but but as it stands this order is not deterministic we can also get this this order so we scramble than the and ah yes I have an order by e-mail and I'm on the rules but you know there are in alphabetical order but number doesn't know which border the rows actually should be looked at In order to fix this we have to specify an order by for the row number so we've we've we're adding an order for the Rose for that window function now what happens if you want to sequence of numbers 1 for admins and 1 for regular users well you can split this up with a mother syntax partition by its very similar to group by all the rows are screwed up by the account so if you can look at it as if these were separate tables and recomputing row number across admins and regular users separately make sense so far I have another example and I think that 1 would clear up how petition by an order by work however was the
the so Greg users are there they're
free regularization the 1 2 3 users and 1 2 and uses yes it so it starts a different sequence based on account so that's a get um how the how this behaves like an aggregate function and that think this will make sense how order by works so same have this sensor data here they would have something which recorded so value and it puts a timestamp and the value in the database have time and value is sure to plot this figure it is this is how it look like know what what if we want to smooth this figure out so notes noisy but we can use average as a window function to perform a moving average over the data the so here we do average all the values and now we have to order the values by time so that the average knows which part of the data to smooth out and it will smooth out 3 rows before the control and 3 rows after the but these rows are ordered by time that is so for each point here in the red line the value is computed by 3 data points before repeated the points after and 1 data point to the current show itself we can increase this by making a larger moving window the it and what do they questions about this but but that's about how this moving window works in the numerical example so it's the same example as before and what is that of average I have something just so they don't have to divide by the number of elements in the freight in we have all the rows which take part in the computation red is the role which we are computing the value for and in practices that have the expression which is actually run behind the seats so 1st there was some up over between 1 proceeding in 1 polling and then 2 preceding and 2 following the 1st row doesn't have any preceding rose so just add the current flow in the right after going back going 1 step further we now have 1 preceding row so we're adding the previous 1 but we also have an extra neural so now we're adding but the previous the 1st the next 1 and the current itself if we move 1 step further we start losing arose from the 1st example there take an hour so adding the 1st row itself we we have to move and it just keeps on going like this 1 more step you can see that even in this case we start losing rose from the top and here's an interesting observation whenever you advance the window to the next row you have to do 1 removal in the previous computation and add 1 extra to get the new value of which means that we can actually compute this by just in constant time so whenever we need to go to the next row we just add 1 value and remove on that and this is constant time which is what actually makes when the functions very fast this is the for example so you can see here that pretty symmetrical we get to add values and values that has each row go as as a true comes around OK I'm not going to go into the top 3 practical examples which I have this promise and I have to think it should be that should depend upon so out and then have some benchmarks for these results the they are it's their run on my computer so they're not real benchmarks as in various strict but this should give you a rough idea of how performance improves with window functions so what we have here is the bank example we have a set of transactions for various customers and we want to get their role in account balance base how much money you have in your account after each transaction and want to do this for each individual customer we have a customer I. D. a transaction ID and I'm the transactions and then there's the amount with either we when you pay or when you get money deposited into the account either positive or negative value the way you would do this with a regular figure 6 a sequel query without window functions you would a sub-query we would sum up all the transactions from me at the beginning of time front the current OK so here we are find all the transactions for the current customer and then we get all the transactions which our previous for grandfather and all these are summed up with window functions we it's a colorful cleartext express this so what do we do we have a son or the amount with the data by customer we ordered by time span and we want to add all the rows from the beginning of time unbounded proceeding up until the current row it all these are ways to specify how the frame moves when you advance through and it gives the same result nonsense performance the figure sequel no indexes I see all that all that the database can do without using any additional on data structures however this based increases in an n squared fashion 10 times more rows input implies a hundred times more computation time With sequel uh but with an index this gets improved but this will impact and In performance because you have to update the index as well within the functions this becomes a linear query with a slight offset here because the bigger security turns from memory to this and we have a site cannot afford this axis here as well but some from something that takes 4 thousand seconds to 5 100 seconds improvement how literary frequent great is to get the top N values of a certain data set so in this case we don't get the top 5 earners by the part what are the top 5 salaries and who gets the so here's the thing that we have 2 departments sales and engineers each with their own salary and complaining if you want to get the top 5 learners you need to count how many people are before you with a that learn more than you and you only want to add the to select the values which have has at most 5 people that earn more than you so how do we do this well we would have to count how many people are not myself so it must not match my name it must be in the same department as me and it must have a salary greater than mine they and we were has it has less than 5 people like and that's matched the perspective and then they are 1 of the top 5 donors in the park no logic is 1 if I also want rank people want them to have a rank had and 1 2 3 or or so well if you want to do this you have to face you repeat the query but have in the select list as well so it's the same thing as a noun counting them and and plus 1 so that they get the ranking correct but notice that there is a then there is a around here that matches so we have to give up on the same amount this causes problems because if you want dense fragments of bit more it's a bit complicated but let's see how we do this with a window functions that we had the rank function which is only works as a window function you can petition you and we just express what we want to compute so we partition by department and then we order by salary and the ordered descending because that's we want the top 5 greatest but implicitly it's it starts from the lowest value
to highest value the and now you would you would be tempted to get to filter out the so that the the top 5 you would want to do where ranking is less than 5 OK but this does not work you're not allowed to use when the functions in the where clause think the reason for this is that window functions are computed after everything else in the selected is computer so everything is every gets computed as if there is no in the function and then we add the column afterwards this has to happen because otherwise you would get ambiguous results things and this is where the city is coming we just track that creating a city and and feel proud this city expression so this is how window functions work with and cities where we can also use the year the right tables we can have a select inside the from class if you want but of which it is thought more readable and that's a good performance this 1 is 1 of big ones so here with regular simple once you get up to 20 million rows of this ever finishes not In the last part of the better this laptop and which the with radius equal and indexes it gets better but it still takes a long time after a certain point within the function of this actually from come complete almost instantly because we're just going through the the table once after we sort so just is just 1 sort and then 1 go for that some of the this is a sorry I did not understand the question the the head heading for you yes we will have to I'm so it kind of thing and maybe and I don't understand during it it the way it was wrong Nina and the noble soul how would you use having in this and if you're referring to this square right where you this this is a list with how and so on all having is computed this rank is computed after having and you can only use the we cannot reference here we can only use them when the functions in the order BY clause and in end the select clause that does just because it doesn't exist before our when having is getting a circuit the July need to wrap this create great intercity and then get the hell out there also filtered OK and my final example and so this example areas and is again from a similar use case would submit a bid filtered out but it is that we have a set of of search of machines and whenever as a technician goes and fixes this machine we put a timestamp in the database that service has happened for this machine at this particular time and if you're interested to see what is the average time between services so let's try to find out which machines break more often the shows that they can pass them question a simple now what we want to do in order to get the average time between services is to get the difference between consecutive services for each machine doing this would trigger a sequel is a bit of a pain I gave it is to join the table itself and match the same action these and then look for the time to look for the transactions that have a time before the current position and from those transactions get the maximum 1 so that you actually get the most recent 1 that is before the current 1 and then just do a difference between the current time and the previous time this is 1 way to do it there's problem multiple ways to do it but I think this 1 is for the simplest 1 I could come up with that could actually fit into 1 slide and we have to group by matching so that we get this value for each machine ID for each transaction the now we can do this a lot easier with window functions we have a nice function called lag which is able to access previous rose from the uh from the table so here what do we do we get the much ID time and then we we want the value that was before the current row so let's just looks 1 role before you can take multiple parameters again and we also look at 10 rows before it just it it's a function X had a 1 parameter or 2 parameters and of lag is getting computed by machine ID and we ordered by time because that's how we want the values to be sorted the and then you can just you an average from the time difference between the previous time and the current stack make sense I think this is where way simple to comprehend than the previous huge block of so and why is this is also a pastor the law index this unfortunate with an index it's not so bad but at the same time for you a year on connects improvement in speed of annex increase in data size you get a 10 x increasing time in competition time however when the functions on campus content times faster because you only have to do 1 select instead of having to do I think it was this looks like it's yeah so so having to do a joint then is an index lookup to find which values the maximum 1 you're just able to scan for the table once after you've sorted you can see again the penalty when you're were dropping from that in-memory database to this axis this is normal and 1 thing point out that we go back to the 1st example here if there is a penalty with small datasets that OK so there isn't a small overhead when you start computing With this quickly becomes negligible once you get to a certain point in datasets so this is an extra step that the query uh execution has to perform but if this step at speeds up the previous query quite a lot then is this will if you in big improvements in performance OK well that's about all I have a summary is that you can use subgroup begins window functions to eliminate self joins and this will then she make agrees more readable like we've seen with flag or with the the average example and in a few cases you can get faster queries this is not something which is free so you don't get this speedup without right rewriting of queries but is something that can desk improve if you're willing to rewrite your application to make use of open the functions right thank you very much the only just a few examples of our more when the function is to support the this the full list we don't support group concat but this is something that we are planning to do is just in a technical difficulty in getting it to work correctly and this a few things that we don't support these are on if you features which are planned for the next release entry so probably all of these will be implemented in the next stable release right thank you very much
and I haven't questions of science was only the crowd this is maybe 10 to it's not the default version which you find in indigenous to have to get the custom package but we do have a custom repositories that that make installation of 10 to in Linux distributions simple then so if there is no questions you can still find around so property thing so thank you very much for coming