Efficient Django
Formal Metadata
Part Number | 05
Number of Parts | 169
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/21123 (DOI)
Transcript: English(auto-generated)
00:00
Hi, good morning. Thank you for joining us in this first session in the PyCharm room. Our first speaker is David Arcos, and his talk is titled Efficient Django. Thanks for coming. In this talk, I will speak about Efficient Django.
00:23
I will share some tips and best practices for avoiding scalability issues and performance bottlenecks, okay? The four main things that we will see are the theory, the basic concepts; then measuring, how to find bottlenecks; and finally, some tips and tricks.
00:42
The conclusion, of course, is that Django can scale. So hi, that's me, I'm David Arcos. I've been a Python developer since 2008. I'm co-organizer of the Python Barcelona meetup. I'm CTO at Lead Ratings.
01:00
Lead Ratings is a startup in Barcelona that does machine learning as a service. So we provide a prediction API so our customers can rate their sales leads and then improve their sales conversions. Looks difficult, but it's quite straightforward.
01:21
Okay, let's start with the basic concepts. Have you heard of the Pareto principle, the 80-20 rule? It says that, for many events, most of the effects come from a few of the causes.
01:41
And this happens in many different fields. In scalability, this happens too, okay? We can focus on optimizing 80% of the tasks and achieve very few results, or just focus on a few vital tasks, the 20%, and we will achieve most of the results.
02:03
The difficult thing here, of course, is to identify these few tasks. So if we want to improve the performance and the scalability of our platform, we need to identify the bottlenecks.
02:23
Basic concepts on scalability. Scalability is usually defined as the potential to grow a system just by adding more hardware, without changing the architecture, okay? It's recommended that you don't store the state
02:42
in the application servers, but on the database. If you keep stateless app servers, you can do load balancing, and then you can scale them horizontally, which means just add more hardware, and if the state is not shared, it's very easy to grow. But then we move the problem to the other side,
03:02
to the database. If the state is in a single point in the database, this will be difficult to scale. It depends on the database. Scaling Mongo, Postgres or Redis is not the same. Each of them has different characteristics.
03:23
To improve the database performance, this is quite obvious. On one hand, you have to do fewer requests, and on the other, you have to do faster, more efficient requests, and we will see how later. Doing fewer requests means that you have to do fewer reads and fewer writes. You can achieve this with caches.
03:43
For doing faster requests, you can do many things, so we will see how to index fields, and you can denormalize your models. Denormalizing means that you have some precalculated data inside the model, so you don't have to do expensive operations all the time.
04:01
About the Django templates, the standard templating engine is good enough. Jinja is a bit better, but anyway, you have to cache all the templates. Django has fragment caching. That means that you can cache just little blocks of the templates. You don't need to cache everything at the same time,
04:21
and you can go layer by layer, template by template, and do different caching at different spots. Of course, this depends on your system. If you are doing an API, you don't have templates, but if you are doing a normal web application, you will have a lot of code that can benefit from this.
04:44
The cache, this is one of the most important things. Of course, you can cache almost everything, so the most standard approach is to go layer by layer of your stack and try caching things. From the top, if you are using the,
05:01
if you are using Varnish, if you are using a CDN platform, the access to the database, the templates, sessions, everything. Django has very good cache documentation, and it's very powerful. And the problem here is the cache invalidation. How do you invalidate the cache?
05:23
Once a model is updated, you have to remove it. You can do it in many different ways. We will see how later. So cache everything. Bottlenecks, now we are moving to the interesting parts. You have to identify the bottleneck on your system.
05:42
The bottleneck is the place that makes your system slow. If you remove a bottleneck, your system will go faster. Then you will have another bottleneck. You have to identify that bottleneck, solve it, and rinse and repeat, okay? It depends a lot. Different systems will have different bottlenecks.
06:00
If your bottleneck is the CPU, the memory, the database, you can do different things. The thing is that first you have to fix the current bottleneck, and then move forward to the next one. So how do we find the bottlenecks? Okay, second part, measuring. You can monitor your application, see data, numbers,
06:23
and this can help you to find the bottleneck. As they say, you can't improve what you don't measure. So you measure your system to find the bottlenecks, you fix those bottlenecks, and then you verify, because you are measuring,
06:41
you verify that the bottleneck has been fixed. And you keep doing this until it's efficient and performant and scalable. Easy to say. So, from top to bottom: monitoring. You can monitor the system (load, CPU, memory) to check the basic stats.
07:01
The database, of course, is very important: queries per second, response time, even the size of the database. Same for the cache. The queue: when you have a system of workers, it's important to see how many tasks you have queued. If it's growing too fast,
07:21
then the bottleneck could be there. And also, custom metrics for your application. You can do profiling with the Python cProfile module, which is the standard module for profiling. And profiling allows you to run the Python code,
07:41
and it will return you some numbers like these: the number of calls, the running time, the time per call. These numbers are interesting for finding which is the slow call, the slow line, and which lines are being repeated the most.
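As a minimal sketch of the profiling step described above, the snippet below runs the standard `cProfile` module over two hypothetical functions (`slow_sum` and `handler` are made up for illustration, they are not from the talk) and prints the call counts and timings sorted by cumulative time.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive: builds a full list before summing it."""
    return sum([i * i for i in range(n)])


def handler():
    """Stand-in for a view: calls the slow helper several times."""
    return [slow_sum(20_000) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Sort by cumulative time and print the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(10)
report = buffer.getvalue()
print(report)
```

The `ncalls`, `tottime` and `percall` columns in the report correspond to the number of calls, running time and time per call mentioned in the talk.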
08:02
Because you can have an idea in your head on how the application is performing, but until you measure, it's just a hypothesis. timeit. The timeit module is another standard Python module that does what it says.
08:20
It times how long it takes to run your command. So you can use it to call a script or you can embed it into Python code. Here, it's calling just a method, and timeit runs this snippet many times, and calculates the average, the best, and well, this kind of metrics.
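A small sketch of embedding `timeit` in Python code, assuming a hypothetical `build_squares` function as the code under test. It repeats the measurement three times and keeps the best run.

```python
import timeit


def build_squares():
    """Hypothetical function under test."""
    return [i * i for i in range(1_000)]


# repeat=3 runs the whole measurement three times;
# number=1_000 calls the function a thousand times per run.
runs = timeit.repeat(build_squares, repeat=3, number=1_000)

best = min(runs)  # the "best of 3" value, in seconds per 1,000 calls
print(f"best of 3: {best / 1_000 * 1e6:.1f} microseconds per call")
```

Taking the minimum rather than the mean filters out runs disturbed by other activity on the machine.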
08:42
So the idea here, it says best of three. Usually, as a baseline, you want to use the best possible time, because in your system you have many different variables, and the best time is when you have the cache pre-populated, when the CPU is not doing other things,
09:00
and when you are not having network problems. So the best measurement works well as a lower bound for your system. ipdb. pdb is the Python debugger, and if you are using IPython, ipdb is the same thing for IPython.
09:22
So it has a few more features, like better tab completion, syntax highlighting, better tracebacks, introspection. You just use ipdb.set_trace(), and then when your code reaches that point, it will stop. It will give you a shell to keep executing Python.
09:40
Okay, so from a normal Django application example that you are running on your machine, you just put a traceback here, a breakpoint, sorry, a breakpoint, and then runserver will stop, and you can see all the variables that are there. You can keep running, you have a few commands to continue, to go step by step,
10:01
and this is very useful because when you detect a bug, you can just raise this and check it. No need to go through the tracebacks. Another very important tool, the Django debug toolbar. Django debug toolbar consists of a series of panels
10:22
and in those panels you can check things about everything, and you can add more panels, okay? So you can do the profiling here. You can see the SQL queries. You can select to explain the queries. You can see what's in the system right now.
10:42
You can see how much time it takes. Also, things about redirects, about the templates, about the cache usage. For me, this is the most useful tool for debugging things because when you have a theory, a hypothesis,
11:02
on how your system is working, but then the numbers don't make sense, you can go line by line, view by view, and check really what's happening. It's very modular, so you can add more modules. First, the Django debug toolbar line profiler embeds the Python profiler,
11:21
so you have a new toolbar panel, and then you can profile the views, the models, everything, it's very useful. And then Django debug panel, not Django debug toolbar, but panel. This is an extension for the Chrome browser because some calls don't return HTML. If we go back, this is, in this picture,
11:43
we can see, this is the result of a single page of your application, then you click on a button that says Django debug toolbar, and it opens all of this, okay? But this is an HTML view, and all of this is HTML and JavaScript. But sometimes you are not using HTML, you are doing an API or Ajax request
12:02
or non-HTML responses, and you are returning a dynamic JSON, whatever. In those cases, you cannot embed the HTML inside that view. So the Django debug panel allows you to use the browser, you have this little extension, and you can check all the requests you are making to the server,
12:21
you can check the same things as if you had the Django debug toolbar. This is very useful too. Okay, tips and tricks. Now that we know the basic concepts and how to measure and how to find the bottlenecks, we will see a few best practices
12:41
and a few possibilities on how to fix performance bottlenecks, okay? So first, the most important, databases. Databases are usually slow because the indexes are wrong. An index in a database, well, it's an index, it makes your queries faster.
13:00
But you need to have the right indexes. Databases are not as intelligent as they seem. You need to be very specific about what you want to index. So, for example, the primary key will always be indexed, okay? But then you can add indexes for single fields,
13:21
db_index, or composite indexes for more than one field, index_together. The first one is defined in the model, on that field. You just add db_index=True. And index_together is defined in the Meta of the model, and there you put arrays
13:41
of many fields, okay? So, for example, yeah. So you can see, this happened to me a few days ago. You can have your idea on how it's working, but then it's slow. You think it's using an index because it's a very simple query, okay? You are using a date-time field,
14:01
so you are ordering a list of rows by date, but it's very slow. What's happening? If you use the debug toolbar or any other of the toolbars, you will see at some point that the problem was in Postgres. In my case, it was not using the index.
14:20
Why? Thanks to the debug toolbar, I found that it was a multiple index. It was indexed by creation time and UUID. Okay? Why? No idea. This was inside the Django admin, so I understand that it sorts by time and by UUID,
14:41
but once I found it, fixing it was just adding an index, and it went from 15 seconds to three milliseconds. The difference is huge. And this table was very small, just three and a half million rows. So for bigger tables, it's very important to be sure that you are using indexes
15:00
for your most used queries, and also if you are using the Django admin, of course. What's the bad thing about the indexes? Why don't we add indexes to everything? Why isn't it automatic to have indexes everywhere? Indexes occupy space. Space is cheap, but space on the database, well,
15:23
it's problematic. And also, having indexes makes writes slower, because if you insert a row, it has to update all the indexes. Okay, if you have two indexes, it's okay. If you have 20 indexes, it will get more complicated. And you can do permutations for multiple indexes
15:41
of many, many fields, and it will get slow very fast. So use the indexes only when you need them, and be sure to profile to check that it's using the right index. The difference is huge. It's very easy to see that it's working as expected.
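The indexing story above (a query ordered by creation time and UUID that only becomes fast once a matching composite index exists) can be sketched with the standard `sqlite3` module instead of Django and Postgres. The table and index names are hypothetical; the point is only that the query plan changes once the index is there, which is the check the talk recommends doing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, created TEXT, uuid TEXT)")
conn.executemany(
    "INSERT INTO event (created, uuid) VALUES (?, ?)",
    [(f"2016-07-{(i % 28) + 1:02d}", f"u{i}") for i in range(1_000)],
)

QUERY = "SELECT id FROM event ORDER BY created, uuid LIMIT 10"


def plan(sql):
    """Return the query plan as one string (the last column holds the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(QUERY)
# Composite index on both ordering columns, the analogue of index_together.
conn.execute("CREATE INDEX event_created_uuid ON event (created, uuid)")
after = plan(QUERY)

print("before:", before)
print("after: ", after)
```

In a Django project the same check would be done with the SQL panel of the debug toolbar and its explain option, as described earlier in the talk.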
16:01
Okay, another tip for the databases. Doing bulk operations. For example, if you have to do an initial ingest of data, and you have thousands of rows, and you go one by one, it will be thousands of writes to the database, okay? You can use the bulk_create method,
16:22
and do bulk insertions of, I don't know, a thousand at the same time, or 10,000 at the same time. This goes much faster. The database has no problem adding 10,000 rows at the same time; it's just a bit slower, but the difference in number of queries is huge.
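To illustrate why batching reduces the number of queries, here is a sketch using the standard `sqlite3` module rather than the Django ORM. It sends one multi-row `INSERT` per batch, which is the general idea behind a bulk insert; the table, the helper `insert_in_batches` and the batch size are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)")

names = [f"city-{i}" for i in range(10_000)]
BATCH_SIZE = 500  # kept well below SQLite's per-statement parameter limit


def insert_in_batches(values, batch_size):
    """Insert values with one multi-row INSERT per batch; return the query count."""
    queries = 0
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        placeholders = ", ".join(["(?)"] * len(chunk))
        conn.execute(f"INSERT INTO city (name) VALUES {placeholders}", chunk)
        queries += 1
    return queries


query_count = insert_in_batches(names, BATCH_SIZE)
row_count = conn.execute("SELECT COUNT(*) FROM city").fetchone()[0]
print(f"{row_count} rows inserted with {query_count} queries")
```

Inserting row by row would have taken 10,000 statements; with batches of 500 it takes 20. Against a remote database, each statement saved is also a network round trip saved.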
16:40
Each query you do to the database has an overhead of going to the remote database and everything. Sometimes you test on your laptop and it goes very fast, but once it's in Amazon or it's in another provider, you will see the overhead very fast. So you can do bulk operations for creating,
17:02
you can do bulk updates, and you can do bulk deletes. Okay, instead of iterating over all the objects, all the models, all the rows, you can use these methods. Update is a bit more complex. Why, because usually when you want to update a field, okay, you know what you want to put into that field,
17:21
but if you want to update a query set of many fields, usually the field you want to update is dynamic, okay? Because setting the same value for all the fields, that's not a common use case. So you can use the F expressions that are for setting field values based on dynamic data.
17:44
Dynamic data, I mean things that are already in the database. For example, you want to increase a counter, so you could use an F expression to say, okay, give me that counter, plus this kind of things. By the way, these are links, and I will post the slides,
18:00
and you can check all the links. Most of them are going to the Django documentation, but others are going to other sources, okay? And delete is very easy, no parameters, you just delete a full query set in a single operation. Another thing to keep in mind is that when you do bulk
18:21
create, it's not using the save method. It's not sending the signals. Same for update. If your logic depends on Django signals on a given model to do something each time you add a row, this will not call the signals, okay? So you have to handle that separately.
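The bulk update with F expressions mentioned above boils down to letting the database compute the new value from the existing column in a single statement. The sketch below shows that statement with the standard `sqlite3` module; the `page` table and `hits` column are hypothetical. In Django, assuming a model named `Page`, the corresponding call would be `Page.objects.update(hits=F("hits") + 1)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page (id INTEGER PRIMARY KEY, hits INTEGER)")
conn.executemany("INSERT INTO page (hits) VALUES (?)", [(n,) for n in range(100)])

# One statement updates every row; the new value is derived from the value
# already stored, so nothing has to be loaded into Python first.
cursor = conn.execute("UPDATE page SET hits = hits + 1")
updated = cursor.rowcount

total = conn.execute("SELECT SUM(hits) FROM page").fetchone()[0]
print(f"{updated} rows updated, total hits now {total}")
```

Because the work happens inside the database, no per-object code runs, which is also why `save()` and the model signals are bypassed.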
18:43
Okay, another tip for the database. Getting related objects within the same query. Here we have two different use cases. They are very similar. If you go through foreign keys, or if you go through foreign key
19:00
fields, or many-to-many, okay? For foreign keys, it's easier, you just use the select_related method of the queryset, and you will have one model, and one object, and other related objects in the same queryset. So, for example, I want to get the model country,
19:22
and all the cities in that country. So, normally, I would use one query for the country, and then one per city. That's inefficient. I can do it with a single query, and tell the database, get me this country and all the cities at the same time. It's a bit slower than a single query,
19:41
but much faster than doing N queries, okay? And the second one, prefetch_related, is a bit more complex. It's for many-to-many fields, when the relationship is not only a foreign key, but you have more fields. This does an extra query. Before the normal query, this will do an extra query. This will get all the IDs of all the related objects,
20:02
and it will do the join in Python. This is important, because sometimes the databases are very slow doing joins. If you don't have the adequate indexes, or if it doesn't fit in memory, it has to go to the file system, or whatever. This makes sure that you will get all the related
20:21
many-to-many objects, with just an extra query. So you will do two queries instead of N. Next. Slow admin. I use the Django admin a lot. I usually extend the admin and add custom.
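Going back to the related-objects tip, the three query patterns (one query per object, a join in the database as with `select_related`, and two queries joined in Python as with `prefetch_related`) can be compared with the standard `sqlite3` module. This is a sketch of the idea, not of what the Django ORM generates; the schema is hypothetical and follows the foreign key from city to country.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT,
                       country_id INTEGER REFERENCES country(id));
    INSERT INTO country (id, name) VALUES (1, 'Spain'), (2, 'France');
    INSERT INTO city (name, country_id) VALUES
        ('Barcelona', 1), ('Madrid', 1), ('Paris', 2);
""")

# N+1 pattern: one query for the cities, then one more query per city.
cities = conn.execute("SELECT name, country_id FROM city ORDER BY id").fetchall()
n_plus_one = [
    (name, conn.execute("SELECT name FROM country WHERE id = ?", (cid,)).fetchone()[0])
    for name, cid in cities
]

# Join in the database (the select_related idea): a single query.
joined = conn.execute(
    "SELECT city.name, country.name FROM city "
    "JOIN country ON country.id = city.country_id ORDER BY city.id"
).fetchall()

# Join in Python (the prefetch_related idea): two queries, whatever the row count.
ids = sorted({cid for _, cid in cities})
marks = ", ".join("?" * len(ids))
countries = dict(
    conn.execute(f"SELECT id, name FROM country WHERE id IN ({marks})", ids)
)
prefetched = [(name, countries[cid]) for name, cid in cities]

print(joined)
```

All three approaches return the same pairs; they differ only in how many queries reach the database.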
20:44
And one thing I like is that the default value for the admin, well, it can have lots of fields. It doesn't scale very well. You can apply many of the tips you have seen. For example, list_select_related will do the select_related thing inside the ModelAdmin.
21:05
You can override get_queryset to do the prefetch_related. So the get_queryset method, you just extend it, and call prefetch_related with whatever fields you need. Ordering. The ordering field, make sure that it's using an index.
21:22
And the same for the search fields. If you are doing searches on unindexed fields, it will be very slow. Now, for foreign key and many-to-many fields, you can do two things. Read-only means that instead of, for example, we have a list of all the cities.
21:41
There are thousands of cities. And this means that it has to do an extra query to the database to get all the cities, and render it, and you will have a select box with a lot of things. It will be slow. Not on the database part, but on your machine, the browser will get very slow. So if you do read only fields,
22:00
it will not be a select field. It will not be editable. So you will have just the current value. And this can be useful because most of the time in the admin, you are not changing this kind of relation. But if you need to change them, then the next one, raw_id_fields. This is a different field that instead of listing
22:20
all the possible values in this foreign key, it will display, I should have put a picture here, it will display just the ID, okay? A little button for search and a little button to delete. So for example, we have a list of cities. We would have a field and say city 45. And that would do the relation without spamming
22:43
lots of HTML entities into the browser. The raw_id_fields is cool, but it's not very beautiful. It's better to use the Django Salmonella external application. It's like the raw_id_fields, but it tells you the name of the field
23:01
that you are using. It's a little more beautiful and more usable. Okay, so with this Django Salmonella, instead of seeing city 45, you will see city Barcelona. It's more usable by the end user.
23:20
Another little trick, extending the admin templates. In this case, if you extend the filter template: the filters are what you have in the admin, in the sidebar at the right. All the filters that you define in the ModelAdmin will be there.
23:40
If you have, for example, the city, and you have thousands of cities, it will take a lot of space. And it's slow in the browser. It's a lot of cruft. So you can extend this filter template and, instead of rendering the standard list, use a selector, an HTML select, a standard form.
24:01
In this way, it will occupy less space, and it will be just a normal form that, when you click, filters by this foreign key. Okay? Now, I talked a lot about the cache, and the cache is difficult because you have to invalidate things, and you have to know what to cache,
24:22
and you have to do many difficult operations. You may know the saying that in computer science there are two difficult things, cache invalidation and naming things, okay? Cachalot, it's not a joke, it's very good software. django-cachalot is a system for caching the ORM queries, so the database accesses,
24:41
and automatically invalidating them. This is a very cool project. There was another project called Johnny Cache, and this is from the same people, I think. It automatically manages the caching at the ORM level. It inserts itself in the middle of the ORM,
25:02
and it does caching at table level. This means that if the table doesn't change, then the cache is still there. Once the table changes, the cache is invalidated. What can happen here? Say you have a table that you are writing to all the time. This could be a problem, because you will be invalidating the cache all the time.
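To make the table-level idea concrete, here is a toy sketch in plain Python. It is not how django-cachalot is implemented; the `TableLevelCache` class and its methods are hypothetical and only illustrate that a single write to a table drops every cached query on that table.

```python
class TableLevelCache:
    """Toy query cache that is invalidated per table, not per row."""

    def __init__(self):
        self._results = {}  # (table, query) -> cached result
        self.misses = 0     # how many times the "database" was really hit

    def read(self, table, query, run_query):
        key = (table, query)
        if key not in self._results:
            self.misses += 1
            self._results[key] = run_query()
        return self._results[key]

    def write(self, table):
        # Any write to the table drops every cached query that touches it.
        for key in [k for k in self._results if k[0] == table]:
            del self._results[key]


def fetch_cities():
    # Stand-in for a real database query.
    return ["Barcelona", "Bilbao"]


cache = TableLevelCache()
cache.read("city", "SELECT name FROM city", fetch_cities)  # miss
cache.read("city", "SELECT name FROM city", fetch_cities)  # hit
cache.write("city")                                        # invalidates the table
cache.read("city", "SELECT name FROM city", fetch_cities)  # miss again
print(cache.misses)  # 2
```

The trade-off the speaker describes is visible here: the more often `write` runs for a table, the fewer reads are served from the cache.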
25:22
Anyway, I did some small tests, and even if you do that, having the database cache in the ORM improves your performance, because usually, inside the same request, you could be accessing the same database, the same row, many times, okay? If you are not caching that, just by caching it
25:40
inside the request, you can avoid a few extra queries. So, even if you are having a lot of writes, well, I would say that you have to measure whether it works better for your system, okay? This will take some space in the cache, of course, but you get an automatic system. This project has very good code coverage,
26:01
and well, it's the low-hanging fruit. You just install it, it's very easy to configure, and your application gets much faster in most of the usual cases. Of course, if you have some specific needs, you can use the low-level API of django-cachalot,
26:20
and do caching in specific places, or disable some tables, or adapt it to your own case. Queues and workers. Do the slow stuff later. Sometimes you have to do stuff that is slow. It could be CPU-bound, so the CPU is working a lot,
26:43
because, I don't know, you have to generate a PDF and put it inside a zip file, okay? These kinds of things take a lot of CPU. You don't need to do that synchronously, okay? That can be... oh, yes, thank you. Generating a PDF, and so on.
27:01
The standard in the industry is an asynchronous jobs system, where you queue the stuff and do it later. You will have your application servers, of course, but then also some workers, and these workers
27:21
will just run the tasks, okay? The task can be any kind of task, not only CPU-bound. Sometimes, I don't know, you have to go to a URL and do a POST, and that could be slow. If you put it into a queue, you don't have to wait for this blocking operation, so it can be done later, okay?
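The queue-and-worker pattern can be sketched with the standard library alone. This is a stand-in, under the assumption of a single process with one worker thread, for what a real job system does across machines; `generate_report` and `handle_request` are hypothetical names.

```python
import queue
import threading
import time

tasks = queue.Queue()
results = []


def generate_report(name):
    """Hypothetical slow task, e.g. building a PDF."""
    time.sleep(0.05)
    return f"{name}.pdf"


def worker():
    # Runs outside the request path and executes queued jobs one by one.
    while True:
        job = tasks.get()
        if job is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        func, args = job
        results.append(func(*args))
        tasks.task_done()


def handle_request(name):
    # The "view" only enqueues the slow work and returns immediately.
    tasks.put((generate_report, (name,)))
    return "accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("invoice-45")
tasks.join()     # only for the demo: wait until the job has been processed
tasks.put(None)  # ask the worker to stop
thread.join()
print(response, results)
```

The request handler never waits for `generate_report`; the waiting here happens only so the demo can show the result.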
27:42
And if you want to improve the performance, you have to identify this slow stuff and move it to another place. This is also a very basic tip. Cached sessions, this is easy. You just set this setting, in the Django settings,
28:03
and you have two options: non-persistent sessions or persistent sessions. By default, Django will store the sessions in the database, okay? That means that each time a user goes into your application, you will do a read from the database. It doesn't make sense. I mean, why?
28:21
You can have those in the cache, and that's it. So if it's non-persistent, it just keeps them in the cache, and once the cache is cleared, the user will be logged out of your application. But if you want persistence, it's very similar. It's caching the reads, but then it will write, okay?
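As a settings fragment, the two options look roughly like this. Both backend paths are Django's built-in session engines; the sketch assumes a shared cache (for example Memcached or Redis) is already configured in `CACHES`, which is not shown.

```python
# settings.py fragment (sketch)

# Non-persistent: sessions live only in the cache. If the cache is
# cleared, users are logged out.
SESSION_ENGINE = "django.contrib.sessions.backends.cache"

# Persistent alternative: reads are served from the cache, while writes
# also go to the database so sessions survive a cache flush.
# SESSION_ENGINE = "django.contrib.sessions.backends.cached_db"
```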
28:41
It doesn't write as often as with the default settings, but it will eventually write the session. Still, all the reads will be avoided. Persistent connections to the database, another Django setting,
29:01
that by default is disabled, and you have to enable it, okay? And this says that a connection to the database can persist for, I don't know, 60 seconds. Otherwise, it will close the connection and open it again, and close and open. You can also set it to keep connections open without limit, and then it's forever.
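In Django's settings this is the `CONN_MAX_AGE` key of each database entry, expressed in seconds. A sketch follows; the engine and database name are placeholders, not taken from the talk.

```python
# settings.py fragment (sketch)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",  # placeholder engine
        "NAME": "mydb",                             # hypothetical database name
        # 0 (the default) closes the connection at the end of each request,
        # None keeps it open without limit, a number is a lifetime in seconds.
        "CONN_MAX_AGE": 60,
    }
}
```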
29:22
But I think it's better to close the connection after some time, because if you are having connectivity issues, or issues with the database, or the app servers, or whatever goes wrong and keeps the connections open, you can have trouble, because other workers won't be able to connect, or other app servers won't be able to connect. So this should be set to, I don't know,
29:42
a minute, five minutes, something like that. The important thing here is not opening lots of connections all the time, thousands of connections in the same second. You want to avoid that. More things, okay? This is not performance, but scalability. UUIDs, these are the universally unique identifiers.
30:04
And by default, yeah, by default, Django uses normal primary keys, sequential IDs, so the first row will be one, the second will be two, third, fourth. UUIDs are different. Unique identifiers are not sorted,
30:21
are not ordered, are not incremental, so each time a UUID is generated, it's totally random. The chance of collision has been calculated, and it's negligible, so it will not collide. Even if it collides, you would get an error saying, oh, this key already exists.
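To put a rough number on "negligible", here is a small sketch using the standard `uuid` module and the birthday-bound approximation p ≈ n² / (2 · 2^bits). It assumes random (version 4) UUIDs, which carry 122 random bits; the `collision_probability` helper is hypothetical.

```python
import uuid


def collision_probability(n, random_bits=122):
    """Approximate chance of at least one collision among n random IDs."""
    return n * n / (2 * 2 ** random_bits)


new_id = uuid.uuid4()  # random UUID, no coordination with other servers needed
print(new_id, new_id.version)

# Even with a billion rows the estimate stays far below anything measurable.
p = collision_probability(10**9)
print(f"{p:.1e}")
```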
30:42
Advantages of using UUIDs. You guarantee the uniqueness, so you won't have collisions. What could happen here? If you have two application servers, and the database gets disconnected, or they are in different time zones, so the database gets split, or whatever, you could have a new user with ID 25,
31:01
and on a disconnected machine, another user with ID 25, the same ID. What happens then? You have a conflict, you have a collision, and that's not nice. Also, UUIDs index very well because they use native fields, they use hexadecimal values,
31:21
so it's not comparing strings, okay? It performs very, very well. So, using UUIDs from the beginning makes it very easy to do database sharding. If you don't do this, then later you will have to do a database migration to add UUIDs and remove the standard IDs
31:43
in other places, and in the foreign keys, and it's a pain going through all the foreign keys, changing them to UUIDs. So do this at the beginning of your project, and then when you want to shard the database, it will be much easier.
32:02
Okay, slow tests. Not a scalability issue, but this is important anyway. Slow tests used to be a bigger problem, because right now, since Django 1.8, we have the keepdb option, and since 1.9, we have the parallel option.
32:22
Before that, you had to do different hacks to avoid, first, the migrations. Each time you run the tests, you have to run all the migrations for all the apps. You can have tens or hundreds of migrations. In Django 1.7, consolidating the migrations into a single one was not working very well,
32:41
was not possible. In Django 1.8, it worked better, but running all the migrations makes the tests very, very slow. So, when you run the tests, just use keepdb, and it will not run the migrations again, okay? Run in parallel. This means that the test cases will be run in parallel. At the beginning, the test runner will create,
33:03
instead of one database, many databases. If you combine this with keepdb, it will be very fast, and in each of the databases, it will start running test cases. Also, for faster tests, you can disable things that you are not using, for example, middlewares. Middlewares are a usual suspect for bottlenecks,
33:23
because if you have custom middlewares doing lots of stuff, it will get slow. Middlewares usually go to the database, or whatever, and do stuff: validation, authentication, these kinds of things. Installed applications: it's not a big difference, but anyway, if you're not testing an app,
33:41
remove it from the installed apps. Password hashers, this is standard in the Django documentation. Use a faster hasher, MD5 for example; that's not valid for production, but for the unit tests it's enough, because you are not testing the password hashing. You are testing the user operations, for example. Also, logging: you can disable all the logging
34:01
with just one line. Also, use mocking whenever possible. Mocking means that instead of going to an external service, an external database, for example, or running a slow program, you write a mock that simulates this external call.
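Going back to the settings-level speedups (a cheap password hasher and logging disabled in one line), a test-only settings module might contain something like the fragment below. The module name `settings_test.py` is hypothetical; the hasher path and `logging.disable` are standard.

```python
# settings_test.py fragment (hypothetical test-only settings module)
import logging

# Fast hasher for tests only; never use MD5 for passwords in production.
PASSWORD_HASHERS = ["django.contrib.auth.hashers.MD5PasswordHasher"]

# The "one line": silence every log call at CRITICAL level and below.
logging.disable(logging.CRITICAL)

# The test run itself can then reuse the database and use several processes:
#   python manage.py test --keepdb --parallel
```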
34:20
So, for example, if you are connecting to Amazon S3 to upload files, and you do that, I don't know, a thousand times inside your unit tests, that will be slow. If you do a mock, and just keep those files on the local system, or in memory, or in /dev/null, whatever, it will be much faster, because you will not have the overhead of going to the internet all the time.
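A minimal sketch of that idea with `unittest.mock` from the standard library. The `Storage` class stands in for an S3 client and `save_report` for the application logic; both are hypothetical and not from the talk.

```python
from unittest import mock


class Storage:
    """Hypothetical wrapper around an external service such as S3."""

    def upload(self, key, data):
        raise RuntimeError("would go over the network")


def save_report(storage, name, data):
    """Logic under test: builds the key and delegates the upload."""
    key = f"reports/{name}"
    storage.upload(key, data)
    return key


# The mock has the same interface as Storage but never touches the network.
fake_storage = mock.create_autospec(Storage, instance=True)

key = save_report(fake_storage, "january.pdf", b"%PDF")

# The test checks our own logic: the right key and payload were passed on.
fake_storage.upload.assert_called_once_with("reports/january.pdf", b"%PDF")
print(key)
```

Using `create_autospec` rather than a bare `Mock` means a call with the wrong arguments fails in the test instead of passing silently.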
34:42
Also, for the philosophy of unit testing, it's better to test only your logic, not the external services that may or may not be working. So, after all of this, conclusions. The first thing you have to do is to monitor, to measure, to find the bottlenecks.
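One simple way to measure is the standard library profiler. The sketch below is an illustration only; `slow_part`, `fast_part` and `handle_request` are hypothetical functions, and in production you would more likely rely on monitoring tools than on `cProfile`.

```python
import cProfile
import io
import pstats


def slow_part():
    # Deliberately expensive: this is the bottleneck we expect to find.
    return sum(i * i for i in range(200_000))


def fast_part():
    return sum(range(100))


def handle_request():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Sort by cumulative time so the most expensive call paths come first.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```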
35:03
Once found, optimize only the bottlenecks. Go for the easier stuff. 20% of the lines take 80% of the time, so find those lines, go for those lines, and don't try to optimize everything, because if you want to optimize every line,
35:21
that defeats the purpose. And once you have fixed the bottleneck, you have fixed that 80%, okay, but now in the remaining 20%, 80% of that will be in another bottleneck, so you have to keep doing this again and again. Rinse and repeat. A few external resources.
35:43
The official Django documentation is awesome, and it has a section on performance, on database scalability, very good. A book, High Performance Django. This book is very good. It's very oriented to production systems, to, well, performance more than scalability,
36:02
but scalability things also. This is a must-have if you have Django systems in production. It tells you everything. In my talk, I have focused only on the Django things. In this book, you will read about other things, about using Nginx, a proxy, Varnish, external systems that you can use to make it faster.
36:22
So you don't scale Django only with Django things, but also with external things. A blog, the Instagram Engineering blog. Instagram, they say, is the biggest Django project deployed in production nowadays, and in all of history, I think. And they post a lot of use cases.
36:42
They posted how they scaled all their systems when they launched the Android application a few years ago, when Facebook bought them, and they are posting things all the time as engineers. And also, the data science blog is interesting too. They talk about scalability issues.
37:03
And this is a document. Yeah, you can click here or Google for this line: Latency Numbers Every Programmer Should Know. This is a link to a university, and they say, how much time does it take to go over a connection inside a local data center?
37:21
To go over a connection from Europe to the United States. How much time does it take to write one megabyte to the hard drive, to an SSD? To read from an SSD. To read from the memory of another machine in your data center, L2 cache, L1 cache, everything. Computing inside the CPU,
37:42
how long it takes to run an instruction, a hit, a miss, whatever. This resource is very important because it happened to me: I thought that, for example, going to the local hard drive would be faster than going to an external machine in the same data center, okay?
38:00
It's not true. Going to another machine over a network connection, if the machine has the data in memory, is much faster than going to the hard drive. So you have to get these numbers and play a bit and adapt to them, okay? And that's it.
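The comparison can be checked with back-of-the-envelope arithmetic. The figures below are the commonly quoted order-of-magnitude values from that latency list (roughly 2012 hardware); they are assumptions for illustration and vary a lot between machines.

```python
# Approximate latencies in nanoseconds (order-of-magnitude assumptions).
LATENCY_NS = {
    "main_memory_reference": 100,
    "datacenter_round_trip": 500_000,  # about 0.5 ms
    "disk_seek": 10_000_000,           # about 10 ms on a spinning disk
}

# Small read from the memory of another machine in the same data center.
remote_memory_read = (
    LATENCY_NS["datacenter_round_trip"] + LATENCY_NS["main_memory_reference"]
)

# Small read from the local spinning disk: dominated by the seek.
local_disk_read = LATENCY_NS["disk_seek"]

ratio = local_disk_read / remote_memory_read
print(f"remote memory is about {ratio:.0f}x faster")  # about 20x
```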
38:20
Thanks for attending. I will, no, the slides are already posted on SlideShare. And at Lead Ratings, we are looking for engineers and data scientists, so feel free to contact me, okay? And that's it. Now, if you have questions, anybody?
38:50
Okay, nobody? Anything? Yes.
39:28
I usually deploy often, so memory leaks are not a problem. I deploy often, so Celery is restarted, so memory leaks are not usually a problem for me. But yeah, that can happen, of course.
39:47
Sorry, again? Could be.
40:36
I have tested ZeroMQ, I liked it a lot. But usually, I go with the easiest option
40:43
and Celery was good enough. But of course, there are many different systems. Of course, if your jobs are not time critical, Celery is okay, but if you need more performance, there are better systems.
41:05
Okay, so if you have any more questions for David, just grab him during a coffee break or during lunch and he'll be happy to answer all of them.