Merken

The Secret Life of SQL: How to Optimize Database Performance

Zitierlink des Filmsegments
Embed Code

Automatisierte Medienanalyse

Beta
Erkannte Entitäten
Sprachtranskript
this is the kind what here and
and are going to go and get started I think you everyone for coming and my name is Brandon name and I am an engineer and get home I work on the platform dated team which among other projects is responsible for reviewing every single migration yeah that comes the get get so we usually see at least a couple a day and were also responsible for improving the performance problems that we have and with an application the so what this talk about and tumor and go over some best practices that aren't always best practices after all that I so quick overview and 1st and foremost gonna talk about database indexes so what are they and how they work and database indexing is fundamental to performance and then will dive into some of the Common rules of indexing and querying and data modeling and talk about what those are when and why do they break down what are the tools you can use to understand what's going on and having move forward when you can rely on those rules anymore so when even is an index and 1 index is a copy of selected columns of data from a table that can be searched quickly and efficiently and the reason why they can be searched more quickly is because an index sorted and index as many rows of the table so maybe that's something you don't realize that there is 1 and X 1 row in the index for every row in your table and so you can really think of indexes and tables as the same thing and but you can get to the data you need an index faster and then look up whatever else you need in the main table knowing exactly where to look
so I think understand indexes majors they take a step back and think of this in terms of a real life example and so
hypothetically I am a cat owner and I have this great big book called ultimate cat book which contains a ton of great information about cat but when the my cat get sick and I'm pretty sure my cat is having some sort of an allergic reactions so not knowing where to go in this book to find information about cat allergies and might have to flip through every single page in this book and checking whether or not that it has the information I need about allergies but luckily this book has an
index and this index is alphabetically sorted so I can go to the A. section and see that the cat allergies chapter starts on page 88 and with that information that can flip directly to that page and ultimate cat book read up on cat allergies and save my dying cat yeah cell and on a very basic level this is the same way that database indexes work so indexes are used to
filter data and for most queries we just 1 a subset of the data in a table so we often use an index i to get that information and index is indexes can be used to sort data so as I mentioned indexes are sorted and indexes can also be used to retrieve records so sometimes the index will have all of the information you need and you actually don't have to reference back to the main table to get information and that at the end of the day on what indexes provide us is they just give us a way to get to the smallest information this this that so let's take our database example this time we have a repository stable here that has an idea which is the primary key the name of the repository and the idea of the owner of the repository and we have a sample query where we want all the repositories where the owner ideas greater than 500 so without any index would have to do a full table scan so what that looks like is we have to scan through every
record in the table asking whether or not the owner ID is greater than 500 and that would take a very long time and this is exactly what we're trying to avoid with their usage of indexes and looking at
another example we have the same repositories stable the same query but now we have an index over owner ID so as you'll notice and the index is sorted based on owner ID which is the leftmost column in the index in this case the only column the index you also might notice that the ideas appended onto the right of the index and so when this index we can go directly to the part of the indexes were on idea becomes larger than 500 and we can look back in the main table to get that information knowing exactly where to look so this is what makes an index faster so essentially a repository still becomes kappa and index over on our idea is just like the index of that book yeah so looking at 1 more
example in this case we just want the name of the repositories where owner ideas greater than 500 and now we have an index over both on idea name you'll notice that the index is still sort of based on the idea but it also includes the name field as well as the idea appended onto the end In the same way we can go directly to of the index that has the information we want but this time since the index includes the name field we have everything we need and we don't have to reference back to the main table and so this is a situation in which an index can be used to retrieve data and so if you have a situation like this this will be super and some a few more things that
indexes and index counters at contents are use left to right and so as I mentioned excess of sorted based on the leftmost column however if you have duplicate values in that column and it will then sort those records based on the next column in the index so we have to add to our repositories of an honor ideas 745 those are then ordered based on the names and name is also included in the index and we example here where we wanted the repositories where the name is rails and name is not the leftmost common indexed so it's really not sorted based upon the names we actually continues index here to resolve that query so now that we have a better understanding of how indexes work and were in a move on to some common rules of indexing querying and data modeling talk about what they are 1 and why they break down where some tools you can use to understand what's going on and how do you move forward so 1st rule is any columns involved in query should be covered by an index so let's take a look at a sample
query here we want all the labels I with the label repository ideas here for 5 or 6 so looking at the square your 1st thought might be to add an index of a repository D and that way we can get to the records faster since that index will be sorted based on repository nothing instinct sword and go ahead and add an index over repository the and a bit later on we might add to this query and say we want the labels with a repository idea for 5 or 6 and with the name of feature and so your 1st instinct might be now to add an index over repository idea name and once again that's a great thought but now a index of repository D is redundant and so what is a redundant index it's
basically an index that rt contained within another index so as you can see here on both of these indexes sort based on repository ideas that's the leftmost column each of those indexes so any query that's only concerned with the repository D can actually use the index of repository Indian name to satisfy the query and we no longer need and that's that's over repository AD only so when we breaking this role and again our rule was any columns involving query should be covered by an index we're not really breaking and we're just adding the caveat of unless there is already a covering index so when this matter and selected so that the super fast the index you know that you didn't use 1 or the other out what's the harm well indexes take up space and that's something we always wanna be aware of and most importantly adding an index is going to slow down at the updates inserts and deletes because if you think about this logically sister storing that information in more than 1 place at that information changes it's going to be need it's going need to be updated in all those places as well so even now we get a performance improvement on selects and you know over indexing can lead to performance problems for other operations the and we care so much about this I have that we have a challenge command built into our slack and so that command is my sequel that indexes and it just takes the name of a table and it will tell us that there are any and redundant or unused indexes for that table so a
role was any columns involving query should be covered by an index just make sure you're avoiding on user redundant indexes so on an island that role and talk about index prefixes so sometimes it's actually better are necessary to use an index prefix as opposed to an index yeah so what's index prefix and so here we have an index over repository name and now we have an index prefix over repository name and you might notice that that index only includes the 1st characters of the name of the repository so an industry prefixes is exactly that it just prefix is a subset of data within a column so let's take a look at another
sample query sign is a bit more complex and don't you worry about understanding it but it's just asking for the reason code reviewers for a certain code paths so looking
at this query on the columns that stand out to us are repository did and created that so once again our 1st instinct would probably be to add an index over all of columns so before doing that it's important to analyze the table by adding an index over so once again a bit of a puppet shout out so we have this built-in command called bicycle table and then it will give us some great information about uh table that we have in our database including the size of the columns included in any indexes we only have and then looking at and the columns in a table we can see that the path column is of the data type a binary 1024 so that's like really scary and it's a really large column and were actually not gonna be able to have an index of the that that's going to violate my limitations for the size of the index yeah
so when we can break this rule and rule is any Collins involving query should be covered by an index were actually not gonna cover them with the regular and extra going to cover them with an index the effects and so in this situation we have a longer datatypes on another situation in which you might wanna use an index prefix is when querying the data works well with the prefix so maybe you have a query and give me all the user names that start with a more we really don't care about the rest of the username and we just care about the 1st character of the user so we could put an index over that 1st character reference back to the main table and get everything we need the and so how do we know how long to make the prefix degree question and essentially just 1 may get long enough to differentiate values within this field so if we think back to our sample query and the common question was passed and that's just basically a file path so every single file your repository sites of lives slash in the index prefixes only over the 1st 4 characters and that column that index is not gonna do us any good so we need to make a long enough to differentiate the vise of in the field and it's a good idea to be off of real data if possible the so for that query in question and pure was open to add an index of that column and the developer said that he chose 255 bytes as the prefix because the longest path and get get is 199 by so he took a pretty large repository on and evaluated data and came up with the length of 255 and he was able to add that and extroverts right into the migration to add the index and then we got the coverage so we needed it so when we gain from using an index prefix well 1st off once again space less space is required for the index and we gain the ability to index larger data types that perhaps we weren't indexing before and we can also see a comparable performance improvement as a full index at the length is calculated properly and were smart about it so a rule was any columns involving
query should be covered by an index just make sure you're looking out for redundant use indexes and on using an index extroverts when it's appropriate or necessary that so a 2nd rule is used in always written records satisfying 1 or more of several conditions so I can guess that a lot of you have some queries with or causes in them in your application so take a look at another sample
query and we want the max ID from users of for users battery the spamming users or the users are suspended as we can see this query takes about a minute to run so that makes this really bad were defined in just be that so general the 1st thing I do when I see a slow query is I run and explain on that query
and and what that will do is it'll give me some great information about how this query being executed and the columns and ending attention to are possible keys which just is possible indexes and that this is will include any indexes that and involve column in your query and then keep the key column itself is telling you which index was actually chosen to be used for the query itself the rows is the number of rows and needed to be scanned and extra I will give you some extra information like using we're using index using files were things like that not like out that we
have this built and I get help cell and any developer anyone really can run and explain on a query and slack and and get the explain out and so this really encourages all developers that are necessarily on the database team to really understand what the queries are doing so they don't come to us 1st the help they can look into that themselves and so as we can see here and the key is in all so it's not using any index for this query whatsoever so so let's let's take a look at what with the possible indexes that might have been used so we have an index over spammy and we have an index over suspended at so those are the exact columns question of a query however there and the use of that water is is limited because my C will can only use 1 index per table during a query so if we were going to use 1 or the other we will be excluding rows that should be returned that to resolve this query and so when we break this rule about using an or well 1st off the table being great is small so full table scan is performance so maybe you have a like there's and it runs fast enough and so this is an image per issue for you but I would guess that's not the case for most of the tables and databases and and in this situation the use of the wars preventing any index from being used so instead of adding an index on a query great Britain easy
union and that's going to basically split the query up into 2 queries that can each exploit the right index and then you know that those results and then grabbed the max ID from there so this query run 11 ms way better and were a lot happier that and 1 thing to note put stress has an index word feature so I'm your query might be able to recognize that it has to index is that it can use merge them and then use that so maybe this is fast already and 1 thing to consider them as this is opener will require more will more work in the process of solving the query so it actually might be more efficient to use of the Union anyway so a rule was used in order to return rocket records satisfying 1 or more of simple conditions I'm except when the laws preventing an index from the news and you have to do a full table scan and it might make it might make sense to tune your grade use union thank so at the rule it is if there is an index over all the fields in your queries you're all set have a great day there's something on the and so it's signal and
another fund sample query here and we want all the commit comments for this repository ordered by ID once again we can see this query is running pretty slowly it's taking about 40 seconds and so will run and
explain and examine what's going on under the hood when this query run and we can see that and we have some possible index is that we're going to the potentially use but the Query planner is actually using choosing to use the primary index and so why not using this index over repository D user ID and if you think about this query all we really care about is you care about is the repository so you would think it would go to that and that's bad that's sorted based on repository ID which is the leftmost column and then after against those records I do a sort on them afterward it's not doing that I'm and so this makes us really mad and what's happening is that since we have an order by ID cause at the end of our query my sequel things better to scan the primary key and that's just so that the rows are generated and water and it doesn't have to do a sort this is a valid but it's just the wrong decision and so we go from being mad too sad because we did everything right are queries sounds we can't really change at and we have an index and it's just still slow so the situation is such that my
sequel is choosing to use an index that is in performance or no index at all when 1 is available so we don't only Siegel out
and so a lot of parameters so the only thing we really can do now is just provide a helping hand and so we're going to use an index
and so in this case for using a force index which is basically just gonna tell my sequel exactly which index used to resolve this query and this runs in MS and its way faster so we're happy so were happy and lexical is happy and have some other options here and you could also use a use index which is more of a hands on you could also use a ignore index which explicitly tells my sequel what index not to use but then it let's let's and make the decision afterward so what's the next best thing there are some blotches here and
so future-proofing is a concern so say you added an index using a specific and dance and then later on that index is deleted immediately have to be aware that and and was 1st doesn't have any index himself sorry of so I rule as if there is an index over all the fields in your query all set except when the Query Planner doesn't know to use the index in which case help it up so a 4th and final rule and this is a big 1 is to avoid redundant data across tables so this is something that were
tied now in terms of data modelling and data storage and it's a good idea to have all your tables follow the single responsibility principle the only contain information in regards to the subject matter that table and so we a sample database here a small 1 just support request table a repositories table and adjustable we also have a user stable which only contains user information and you'll notice that there's a common here called spinning and not just to notes whether or not a user is a spam user so in order to satisfy some really fundamental requests add to get home but we have to join across multiple tables so for example if we wanted a group of poor class for example and we need 1 major excluding any clockwise open by spammy users so we have to do a join across these 2 tables were basically reading information from 2 places what if we needed repositories except those on by same user once again that information is in 2 places we have to do a complex joint but those 2 tables or if we just need to add the just for a user but we need to make sure that users understand user once again we need to do a joint across these tables so this is really adding up performance wise so when you break this rule
about not having redundant data across tables while additional Reza joined the causing noticeable performance degradation so this was a really adding up and we're seeing a really slow queries just to satisfy these really fundamental questions the we also have a higher ratio of reads right and this is a really important point so and we basically needed to check for spamming is on almost every request to get i . com for content like this and so that number is really high and conversely the amount of time that were marking users family is much slower in relation to that read number so we have a higher ratio of reads writes in this situation so we're going to decide to have redundant data across tables organ and analyzes data so we're going to have a column to all of these tables and that to so no studies a spam user we actually had added suspended onto that and so now we only have to check for 1 place to get this information and it's much faster so a query that is solid like that
the complicated had a left outer join had an or clause in there and is highly simplified and much much faster so this is the performance improvement that we started in online do normalizing this data and and I can tell you tell you we did it across a lot of tables not just the 3 dimensions and so this is just the rate of get requests for the poor question repositories the REP API and as you can see the response rate is way better so once again nothing comes for free and there are some trade off that you have to consider before making a decision about to do normalized data like this and 1st and foremost and there's going to be a lot of changes to your database are right away so we can have migrations to add those columns for all those tables and don't forget to add indexes for those columns as well and we also had to do a lot of data transitioning to backfill that data for existing records another thing to think about is data quality so now the restoring that information in multiple places and that information needs to be kept up to date inconsistent and so the way that we handle this was with some background jobs and nightly conduct to resolve any mismatches and so forth but I'm get you to think about the pros and the cons before making a big decision like this so role was avoid redundant data across tables that's generally a good rule by when you join in on another table for almost every request it's getting costly you have a high volume of reads to rates for the data in question and it's probably a good idea to consider data generalization so
recaps and are for our 1st rule was any columns involving query should be covered by an index and make sure you're avoiding redundant our unease indexes and sometimes an index prefix is either necessary or a good enough but and use in order to return record satisfying 1 or more of several conditions of a union might be necessary to exploit the indexes and a 3rd rule was that there is an index over all the fields in your query are all set and if my sequel making the wrong Chinese how they make the right 1 and and avoid redundant data across tables and except when read the slow 3 try ratio is high and you can think about normalizing so some takeaways an index the done over index and nothing comes for free indexes takes space and the slow down on inserts updates and deletes so that something you want to think about to your query to exploit the indexes that give you the best results and so sometimes it's not about adding an index it's about changing your query to use the right 1 but there are tools to help you and I encourage you to run explain on your queries and try to understand how they're being processed and and make sure you examine existing columns in existing tables in existing indexes before you make any changes you can
do everything right and still have performance problems so this is kind of the point of this talk is that you can follow these rules and still see problems you need to really understand what's going on under the hood and and get created in certain situations yeah yeah so and we are hiring on that when did him and if any of these interest still have to experience that I'd love to talk to you you can use that link if you wanna check out the job posting thank you do you nurse if
and SHT uh
Tabelle <Informatik>
Automatische Indexierung
Datenhaltung
Datenmodell
Indexberechnung
Einfache Genauigkeit
Schlussregel
Kartesische Koordinaten
Systemplattform
Computeranimation
Datenhaltung
Datensatz
Automatische Indexierung
Migration <Informatik>
Projektive Ebene
Tabelle <Informatik>
Automatische Indexierung
Videospiel
Reelle Zahl
Dynamisches RAM
Information
Computerunterstützte Übersetzung
Term
Quick-Sort
Computeranimation
Homepage
Automatische Indexierung
Stabilitätstheorie <Logik>
Dokumentenserver
Dokumentenserver
Datenhaltung
Zellularer Automat
Abfrage
Computeranimation
Homepage
Übergang
Teilmenge
Datensatz
Automatische Indexierung
Stichprobenumfang
Information
Quick-Sort
Computerunterstützte Übersetzung
Schlüsselverwaltung
Tabelle <Informatik>
Automatische Indexierung
Dokumentenserver
Dokumentenserver
Abfrage
Kappa-Koeffizient
Computeranimation
Portscanner
Datensatz
Rechter Winkel
Automatische Indexierung
Mereologie
Information
Tabelle <Informatik>
Automatische Indexierung
Dokumentenserver
Dokumentenserver
Datenmodell
Content <Internet>
Indexberechnung
Abfrage
Schlussregel
Quick-Sort
Computeranimation
Datensatz
Datenfeld
Automatische Indexierung
Stichprobenumfang
Information
Inhalt <Mathematik>
Meta-Tag
Tabelle <Informatik>
Tabelle <Informatik>
Automatische Indexierung
Nichtlinearer Operator
Bit
Einfügungsdämpfung
Dokumentenserver
Raum-Zeit
Dokumentenserver
Relationentheorie
Mathematisierung
Indexberechnung
Abfrage
Schlussregel
Fortsetzung <Mathematik>
Quick-Sort
Raum-Zeit
Computeranimation
Schlussregel
Datensatz
Automatische Indexierung
Information
Persönliche Identifikationsnummer
Operations Research
Tabelle <Informatik>
Automatische Indexierung
Bit
Dokumentenserver
Dokumentenserver
Relationentheorie
Indexberechnung
Abfrage
Code
Computeranimation
Schlussregel
Teilmenge
Abfrage
Automatische Indexierung
Vorzeichen <Mathematik>
Code
Stichprobenumfang
Disk-Array
Retrievalsprache
Web Site
Bit
Dokumentenserver
Regulärer Graph
Sierpinski-Dichtung
Raum-Zeit
Computeranimation
Typentheorie
Migration <Informatik>
Datentyp
Stichprobenumfang
Inverser Limes
Softwareentwickler
Soundverarbeitung
Automatische Indexierung
Dicke
Dokumentenserver
Raum-Zeit
Datenhaltung
Mathematisierung
Abfrage
Indexberechnung
Schlussregel
Paarvergleich
Elektronische Publikation
Schlussregel
Datenfeld
Minimalgrad
RIS <Medizin, Informationssystem>
Automatische Indexierung
Information
Tabelle <Informatik>
Automatische Indexierung
Relationentheorie
Abfrage
Indexberechnung
Schlussregel
Kartesische Koordinaten
Computeranimation
Inverser Limes
Schlussregel
Datensatz
Abfrage
Automatische Indexierung
Konditionszahl
Stichprobenumfang
Konditionszahl
Tabelle <Informatik>
Retrievalsprache
Automatische Indexierung
Schlüsselverwaltung
Prozess <Informatik>
Wasserdampftafel
Datenhaltung
Indexberechnung
Zahlenbereich
Abfrage
Zellularer Automat
Befehl <Informatik>
Schlussregel
Nichtlinearer Operator
Elektronische Publikation
Computeranimation
Datensatz
Automatische Indexierung
Information
Softwareentwickler
Schlüsselverwaltung
Hilfesystem
Bildgebendes Verfahren
Tabelle <Informatik>
Resultante
Retrievalsprache
Prozess <Physik>
Gesetz <Physik>
Computeranimation
Gradient
Datensatz
Stichprobenumfang
Ordnung <Mathematik>
Finite-Elemente-Methode
Zwei
Abfrage
Indexberechnung
Schlussregel
Schlussregel
Inverser Limes
Portscanner
Datenfeld
Offene Menge
Automatische Indexierung
Konditionszahl
Wort <Informatik>
Ordnung <Mathematik>
Normalspannung
Portscanner
Dokumentenserver
Physikalischer Effekt
Wasserdampftafel
Abfrage
Indexberechnung
Fortsetzung <Mathematik>
Quick-Sort
Computeranimation
Inverser Limes
Schlussregel
Portscanner
Datensatz
Rechter Winkel
Automatische Indexierung
Ordnung <Mathematik>
Ordnung <Mathematik>
Schlüsselverwaltung
Parametersystem
Automatische Indexierung
Konfiguration <Informatik>
Abfrage
Indexberechnung
Fortsetzung <Mathematik>
Computeranimation
Entscheidungstheorie
Konfiguration <Informatik>
Inverser Limes
Schlussregel
Forcing
Automatische Indexierung
Retrievalsprache
Relationentheorie
Einmaleins
Klasse <Mathematik>
Gruppenkeim
Term
Computeranimation
Datenhaltung
Ganze Zahl
Stichprobenumfang
Endogene Variable
Speicher <Informatik>
Dokumentenserver
Datenhaltung
Abfrage
Einfache Genauigkeit
Indexberechnung
Schlussregel
Schlussregel
Datenfeld
Offene Menge
Automatische Indexierung
Information
Ordnung <Mathematik>
Innerer Punkt
Tabelle <Informatik>
Punkt
Selbst organisierendes System
Dokumentenserver
Relationentheorie
Hausdorff-Dimension
Mathematisierung
Familie <Mathematik>
Zahlenbereich
Computeranimation
Spezifisches Volumen
Datenhaltung
Multiplikation
Datensatz
Ganze Zahl
Prozess <Informatik>
Migration <Informatik>
Endogene Variable
COM
MIDI <Musikelektronik>
Spezifisches Volumen
Inhalt <Mathematik>
Ordnung <Mathematik>
Widerspruchsfreiheit
Tabelle <Informatik>
Beobachtungsstudie
Addition
Informationsqualität
Dokumentenserver
Datenhaltung
Relativitätstheorie
Abfrage
Wiederkehrender Zustand
Schlussregel
Migration <Informatik>
Bitrate
Schlussregel
Entscheidungstheorie
Rechter Winkel
Systemtechnik
Information
Wärmeleitfähigkeit
Tabelle <Informatik>
Resultante
Retrievalsprache
Einfügungsdämpfung
Punkt
Relationentheorie
Mathematisierung
Fortsetzung <Mathematik>
Raum-Zeit
Computeranimation
Datensatz
Prozess <Informatik>
Konditionszahl
Tabelle <Informatik>
Automatische Indexierung
Abfrage
Indexberechnung
Schlussregel
Binder <Informatik>
Schlussregel
Auswahlaxiom
Datenfeld
Abfrage
Automatische Indexierung
Rechter Winkel
Konditionszahl
Ordnung <Mathematik>
Tabelle <Informatik>
COM

Metadaten

Formale Metadaten

Titel The Secret Life of SQL: How to Optimize Database Performance
Serientitel RailsConf 2017
Teil 63
Anzahl der Teile 86
Autor Knight, Bryana
Lizenz CC-Namensnennung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben.
DOI 10.5446/31293
Herausgeber Confreaks, LLC
Erscheinungsjahr 2017
Sprache Englisch

Inhaltliche Metadaten

Fachgebiet Informatik
Abstract There are a lot of database index and query best practices that sometimes aren't best practices at all. Need all users created this year? No problem! Slap an index over created at! What about this year's active OR pending users, sorted by username? Are we still covered index-wise? Is the query as fast with 20 million users? Common rules of thumb for indexing and query crafting aren’t black and white. We'll discuss how to track down these exceptional cases and walk through some real examples. You'll leave so well equipped to improve performance, you won't be able to optimize fast enough!

Ähnliche Filme

Loading...