We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Advanced Django queries optimization

00:00

Formal Metadata

Title
Advanced Django queries optimization
Alternative Title
Django queries optimization
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Collecting data from multiple Models is a common flow in Django development. In most of the cases prefetch_related and select_related do all of the job for optimizing the queries. When the models structure become complicated enough and we've put our logic for collecting these items in models' properties we suddenly cannot use prefetch_related or select_related anymore. Our View 's queries count depends on selected items count with high complexity. The problem is in the selection logic: we've implemented an algorithm which calculates something and we cannot prefetch or select all of the objects. So this talk's purpose is: To show how to use Django ORM to move the selection logic, the calculations, the aggregations over group of objects from our python code to our database and moving all of the logic for this in one place - the queryset and test it easily.
35
74
Thumbnail
11:59
Customer relationship managementQuery languageDigital rights managementObject (grammar)LengthData modelTotal S.A.Task (computing)Table (information)Group actionSummierbarkeitDialectInheritance (object-oriented programming)IntelModul <Datentyp>Function (mathematics)ExpressionComputer-generated imageryReal numberRegulärer Ausdruck <Textverarbeitung>Software testingSeitenbeschreibungsspracheLTI system theoryCurvatureAsynchronous Transfer ModeLimit (category theory)Mountain passFirst-order logicFunction (mathematics)ExpressionSequelQuery languageStudent's t-testDatabaseLimit (category theory)Parameter (computer programming)SummierbarkeitGroup actionField (computer science)Instance (computer science)Fraction (mathematics)Object (grammar)Category of beingType theoryResultantSocial classOperator (mathematics)Selectivity (electronic)Electronic mailing listLengthSoftwareSet (mathematics)CodeProjective planeSoftware testingMoment (mathematics)Endliche ModelltheoriePresentation of a groupTotal S.A.CountingWritingComputer programmingIntegerCASE <Informatik>Shared memorySelf-organizationReal numberTheory of relativityAreaCartesian coordinate systemBlock (periodic table)Musical ensembleMereologyWeb pageData storage deviceService (economics)Classical physicsLevel (video gaming)Interior (topology)Multiplication signPoint (geometry)MultiplicationWebsiteExistenceLine (geometry)Pattern languageFerry CorstenProcess (computing)Order (biology)MetadataRow (database)Moving averageSystem callFunctional (mathematics)Message passingInterface (computing)Table (information)Hidden Markov modelDifferent (Kate Ryan album)Wrapper (data mining)Raw image formatTerm (mathematics)MappingIntrusion detection systemBuildingStatisticsSource codeInteractive televisionSingle-precision floating-point formatSinc functionMehrplatzsystemWeb-DesignerWeb applicationPlastikkarteMathematicsSoftware architectureError messageMilitary baseGoodness of fitUnit testingBlogArmObject-relational mappingView (database)Cache (computing)Gastropod shellVideo gameProper mapAsynchronous Transfer Mode2 (number)Computer virus
Transcript: English(auto-generated)
Okay. Hello everyone. Uh, I'm here today to talk to you about, uh, advanced jungle queries optimization. Uh, I'm coming from this small country, so far away from here, called Bulgaria. Uh, if you hear about this for the first time, no
Sofia, which is the capital city of Bulgaria. Um, I'm a last year student. I'm also a
jungle and, uh, react web developer. Um, I was very lucky to find great, great people and friends who really loved their job to build software that solves
a real life problem. And I'm still working with them. Uh, so we are Hacksoft. We are outsourcing company, which built, uh, software for business. And, uh, we make programming courses in Bulgaria for, well, in Python and Ruby. Uh, we organize
a big programming conference in Sofia of, uh, 15th, 16th of September. So we'll be very happy to see you there. But that's about us. Let's talk about jungle. Uh, my aim during this talk is to tell you about three types of
problems that we've hit while developing jungle applications. And, uh, we've faced while using the, the ORM to interact in the database. But first,
let's make two assumptions. The ORM's main idea is to make the database interaction easier. We want to avoid writing raw SQL and we, we just can
write a single Python class which represents the database table. And we can call methods like .all that will fetch all the, the raw from the database. And my second assumption is that Python is not the new SQL. What
I mean is that SQL is great when you want to make database operations. But using ORM makes you, uh, makes you think that you can, uh, do the work for example with simple Python filter which is not a good idea. So
the first problem, too much SQL queries. Uh, on the project I, I work at the moment. We had an API which had like five or six thousand SQL queries
per request which was the real performance issue. So, uh, how we solved that? We know these two methods, select related and prefetch related. They came natively from the jungle ORM. So, how they
works? Let's say we have two models, user and shared account. So every user is related to a shared account and every shared account can, can have multiple users. So let's say we, we need to have all the users. We make user.object.all which under, under the hood every
query set is just a SQL query. So the, the query for this expression is select everything from user. So the problem with this query is that if we want to get the account for every user
we need to make an extra query for each object. So we can solve this with select related. You, you just tell the ORM that you need, you, it needs to, uh, join the table with the shared account so you get it with the same query. Uh, but if you
want to get the shared account object with, with their users you cannot join. You need to use, you need to get, uh, more
than one user for every account. So the way we do that is using prefetch related. We need to, to select all the shared accounts, get their IDs, then select the, the users which have these IDs in the, in class. And, uh, the one thing
that we don't, do not see here when we use that is, uh, you need to fire the first query then you force making a second query. Actually, prefetch related doesn't optimize your API. So you make second query anyway. So if you
don't need the users, it will be, you don't need to use the prefetch related. So, tip one, you can, you can, uh, make a simple SQL joins with select related and
prefetch related. It's easier. The second problem, too much data. We have an API which make, uh, hundred queries which is not very big count, but we had too many objects. Let's say we, let's see it in, in the example.
We have, we have a music service web application with three models. We have, uh, user, every user has related playlists and every, every playlist has, uh, related songs. So let's say we need to get the total length
of every playlist. We cannot store it in the playlist because as soon as we add a single song, it will change. So implement the property in the playlist model which will calculate the sum of all songs length. Uh, but
this is Python so we can write just like sum from list comprehension to save some lines of code. Uh, let's make the same thing for the user model. We need to get all the playlists and sum their length. But the
problem with this code is, uh, if you want to get all users with their length of all the playlists, nothing else, just that, and we have ten users, every, every user has ten playlists, every playlist has ten songs, then in order to calculate the length
for ten users, we need to make, we need to fetch a thousand, one thousand, one hundred and ten database rows, which will not be very slow, but if you have ten thousand, ten thousand database
rows, it will be a problem since there is still a physical limitations between the, the software architecture. So here is the second assumption. Python, Python is not the new SQL. Uh, SQL is made to, to make this thing easier. And what
we do here is just a sum over a group. It's a sum aggregation over a group of songs, grouped by playlist. And this aggregation looks like this.
It's a select clause from the playlist and, uh, the song's total length field is calculated as an aggregation over this. And here's our model
at the moment. So, how can we do that with an ORM expression? How can we generate this query with the plain Python? We can find a query set. The query set is the jungle way for generating SQL query. You can, you can write
custom method, which will select or prefetch something or annotate, which is the ORM syntax for S clause. Select subquery as something. So let's define a query set. Uh, in our query
set, we need to group all the songs, then make a, then we have to make a sum aggregation. So we define a method called collect, which annotates the song total length. And we modify our property in the model, so it can use the
de-annotated field. Now, the reason we do that is if we get the playlist from the song with the song.playlist, it, uh, we don't get it from the query set. So it doesn't know about this collection. We don't go collect there.
So we, we need the Python code anyway. But if we just need the playlist with their songs length, we can define the collect, we can annotate the songs total length, and we should put something there which is equal to the group, the group by class. So, the ORM
has two functions that will do the job. The first is subquery. Subquery will accept two arguments, query, which is the expression for the group by class, and output field,
since the database will know what to return as a result of this query, but the ORM doesn't know what type of value should expect. So, the output field should be integer field, since we some integers. And the query set
will be song object, uh, with values just the playlist ID. This is the, the field to group by. The next thing is we need,
from the group by, we need just the group for this playlist, not all the groups. So we filter the playlist ID with the outer ref, which is the, the ORM expression of get me, get me field from the outer query. And the last thing we should do is to make the aggregation itself. In our case, it's
just some, some of fields of the song. And, uh, when it, once we call playlist.objects.collect, it will produce the same query that we will do if we didn't use the ORM. We just generated it.
Okay. Okay. We, we want to do the same thing for the user model. We need to group all, all the playlists, get their length
that we calculated, and return the sum aggregation over them. So, we, we do exactly the same thing. The only difference is that, uh, we group the playlist by the user ID, filter them with the user ID,
and then we should sum over the aggregation that we implemented in the playlist query set. And we can get it in the first line of the property. So, the end query will look like this.
Okay. Tip two. If you have an API which calculates something over a group of objects, but you don't need the objects, you can use subquery and
autoref. And use the database function to calculate since they're really fast, and, uh, you will, will avoid fetching the objects in the application. And the third problem. We have too much queries, but we have too much data. And the problem, the
difference from the previous two is that you cannot use select-related, you cannot use prefetch-related, but, uh, let's see it in, in the example. We have three, these three models, the same at the beginning. Uh, but
the only difference is that the length of the song, the real length of the song is calculated as, as the, the length multiplied by 0.8. So, if we do the, the prefetch-related, if we,
if we do the same thing at the beginning, uh, it will select everything, but you need to calculate the song's real length anyway. So, you need all the objects anyway. Uh, okay. So, how
can we do that with Django ORM? Django ORM provides an expression wrapper, which is the way, uh, Django says you can use, you can implement this expression when passing it to this, this wrapper
and, uh, define the output field. So, what should be the expression? Uh, we have just the length multiplied by 0.8. You can get the field, the, the object field with the F expression, which will
be evaluated in the database while fetching the objects, not in the Python code. And you can pass Python value to the database and say, do these things with these values from the database rows and from the query. So, we
can just get the length and define the output field, which is integer field. But, once we do that, the previous group by class that we've written will not be correct since they should depend on the real length, not the length. And we should
go to, we should go to this query set, this query set and say values list from some aggregation of not the length,
but the expression equal to the real length. You should get it from the, the song, song query set. And the problem here is how we, how should we test that? We write an ORM expression
and we, we think it's correct, but we don't have the test for it. If it's a Python code, you can make a unit test for it and you can, you can test every case you have. But we said that we need to
have the Python code anyway because you, you can get the object as a relation of the other object. You can get the playlist as a song.playlist, not the playlist.object.collect. And if you had these
properties, we can write proper test for them, proper test like this. We have a song object, we have the real length expected and we just assert this. So the only thing
we can, we need to do is just to get the object again with the collect, with song objects collect, we get with the same ID and assert the same thing. This is the only change in the, in our test.
So the good thing here is if you have the proper test for every mode of property, you, with the minimal change, you can see if, if it works.
Thank you very much, you have asked, so we have plenty of time for questions. So maybe you would like to ask some questions how to do so I can give you the microphone. Good talk.
One question, you solved one of the problems there basically through code duplication in the example of multiplying the length of the song. Do you have any tips on reducing code duplication while maintaining efficiency? Reducing what? Reducing code duplication.
In your example, you multiplied the original database length of the song by 0.8. Oh yeah, let me turn it twice. Of course that's a contrived example. You mean the song total length with the
new property? No, by 0.8 to get real length in one of your later examples. In the query set? Not in the query set. You have one method on the song model where you multiply the stored length of the song by 0.8 to get the real length
and in your higher level method on the user you do the same multiplication operation. And that is code duplication. You're doing basically the same operation in two different places. Yeah, it's the same aggregation. Yeah, but that is considered
an anti-pattern, something generally to be avoided. In this case you're doing it for efficiency but it still carries some risk. Well, you need to you need to group your objects anyway because of the relation. Yeah, but is there a way to do it with less duplication
is what I'm asking. I'm not sure I understand you. That's okay. Thank you. Maybe you can
clear this up and then you look at the code together because you haven't found the whole place yet. Any other questions about queries, making them fast? Oh, yeah. Thank you for the presentation. I have one rather trivial question. You use the decorator
property. Would it be advisable to use the Django own decorator cache property? Would it bring some benefit here? Yeah, it will be really shorter and easier. The only problem that you can face if you use cache property
is, let's say you return money money field instance in the property. But the database doesn't support this type of this type of value. So you need to get the field collected by the query set and then pass it to the Python class money field.
And this is, I think this is the only case that you can cache problems with cache property. Thank you. Sometimes you have like the idea of
the query you want to write but then the things you end up writing with the Django REM don't look at all like the query you had in mind. So are there first like good results to understand what you should be using in the REM in order to get what
you need as SQL and is there a way to verify that the SQL actually is what you want it to be? Well, I think it's easier to check your REM expression while you develop when you see the real SQL query.
But you better write the REM expression since it's easier to maintain long term. And you you can use the query just to check like a science check. You want what you want you need to achieve
and you just need to see the real query. And you can easily print the query itself in the Python shell. Yeah, but the REM expression is definitely easier to maintain long term. And is there any
good results in like a simple place where we can get mapping and this query or REM expression gives this SQL like a building block or something like we have for the class base views, the classic class base views site that many people use. Is there the same kind of
resource for the REM? Well, I don't think so. Okay, thank you.
And with this you can get the actual SQL query which you which you will generate.
You mean the name of the expression of the annotation? Yeah, it's, let me check the query. Where is it? The query here, the sub-query here has an SQL the songs total length. It's the same name as the
let me check the annotation. The same as you you've written in the annotate method as a keyword argument and a Python function. You can extract it from there.
What were your experiences writing raw SQL from Django? You mean in Django application? Yeah, like using the .raw query set. Well, in our project we don't have the plain SQL run by your plain SQL
while interacting with the database. But this is very useful when you have, for example, service which integrates with your application and with your application database and calculates some statistics.
You need to put the the plain SQL queries there. And you can get it from the query set. You can make an ORAM expression, get the SQL query and put it there. And you have tests for it. This is the good part.
I think the general question was how to find out about the existence of outer ref sub-query and so on in Django because when you show this I think people are seeing it for the
first time, so maybe to help. This was general. How did you find out about the existence of outer ref and how to use it? Well, we actually have a really heavy API that we need to optimize. It was extremely important to optimize it since it was very used. So
in Django recommendations, there is too short explanation how exactly sub-query and outer ref actually works. Actually, there were a blog post that explains the
behavior of these two properties, but we found it while using it. We just make attempts then print the SQL query and see what happens. And that, for example, how we found
how... Let me check the query set. You need to make group by over the songs for a playlist,
but you need to tell the field that you want the group to be grouped by. You need to group the songs by playlist ID. So how should you say that? The only idea of sub-query and outer ref with aggregation like some is to make group by, but you don't have an interface for it.
It's not explained very well. So the way you do this is to call values or values leave before making an aggregation. So when you say song.objects.values from playlist ID, the select class
will look like select playlist ID from song. Nothing else. Just that. So, once you make values list with the aggregation, it will return group by this field. So this is the way you can actually do it.
Hello. Thanks for the great talk. So you basically found out how to make such queries by trial and error. That's what you're saying? I mean, there are... As I understood, there's no documentation
in Django about the... Yeah, that was the main thing. The other thing was Django documentation, of course. That's the basic and multiple blog posts for... with examples. Thanks.
So do you use MySQL or Postgres for this and do you know if this works the same on both of the bases because Django RM has some issues with MySQL that are not supported
some parts of it. You mean if you want to generate a query which is not supported by the RM or... Yes, this subquery and autoref part does it work on MySQL databases or only on Postgres? Do you know what... Actually, what database do you use on this project?
I think yes. It will work.
Okay, any more questions about
queries? If there are no queries anymore, then we stop. Thank you very much. Give a big hand to the speaker.