
Taking Django Distributed


Formal Metadata

Title: Taking Django Distributed
Part Number: 34
Number of Parts: 48
License: CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
While some code happily lives on a single server forever, most big projects will have to cross the boundary into running both their application and storing their data across multiple systems. The basic strategies are well-known, but we’ll take a look at what to do as you cross the painful threshold where you can’t run your app as a monolith or store everything on a single database server. Among other things, we’ll look at how to split up business logic and application code to run on different servers, how to scale to handle different kinds of web traffic (read-heavy, write-heavy, and long-connections/WebSockets), when and how to make parts of your code not run inline with HTTP processing, strategies for storing data across multiple machines, and how to structure your engineering team to best cope with all these changes. We’ll also look at a few apparently innocuous decisions and the spiral of bad performance they lead to, and how to recognise some of these common problems so you can avoid them yourself in future.
Transcript: English (auto-generated)
As Adam just told you, I'm here to talk about taking Django distributed. But first of all, a little bit about myself.
And that means making the slides advance, there we go. So I, as previously mentioned, am a Django core developer. I am perhaps most famous for working on South and migrations in the past. I very gratefully handed off migrations to Markus a while back, and these days I work on channels, which we'll talk about in a second. My day job is a senior software engineer
at Eventbrite, the ticketing company. And I generally have a very bad tendency to run towards code on fire, rather than away from it. It's a problem, I know, I'm talking through it, it's fine. So I have some bad news, computers hate you. They really don't like you. And the second piece of bad news is this makes distributed computing very difficult.
And for a long time, I'm sure you will agree with me, like in my past, my reaction to this was very simple. I'm gonna build a monolith. It's gonna be beautiful, have nice clean edges, all the code in one place, we have one application deployed to one server, and we're gonna be very happy. And this does work for a long time.
Monoliths run some of the most successful sites in the world, and before I even start this talk properly, one piece of advice is don't necessarily move away from the monolith. This talk is for you if you want to, or if you think you have to, and there are definitely reasons you should not go into those, but it's not necessarily a bad thing up front. So if you are thinking about a monolith,
you're thinking, well, it's time to split things up a bit. Like, what do I do? I have all this code, and usually what would happen is you'll come in having an existing code base. This talk is for those who are coming in generally with a big existing code base. You have all the code in place. You've probably got a few years, or if you're Eventbrite, almost a decade of code
lying around that you want to sort of take and wrestle and split it up. And this isn't particularly easy. If you're starting from scratch, it's a little bit easier, but I also would temper starting from scratch in this way too, because there's a very big tendency to what I call overarchitect, to sort of take the best ideas and run with them. We can build an amazing sharded system
distributed with Kubernetes, and it serves five people. You don't want that. So there's three aspects to taking a site or a project distributed. The one we talk about most probably is code, or even databases if you're in that realm. But I'm also gonna talk about teams. One of the things I've learned over the last four years is that the way your team works
and the way you engineer software at scale is very important. You can't get anywhere with a team of 100 people if they don't talk to each other, if they don't understand what they're working on. So I'll cover some of that as well. So there is no one solution. I can't give you a magical solution that you can walk out of this room and go back to your project and implement it and scale forever.
If I could, I probably wouldn't be here. I'd be on a luxury yacht in the Caribbean, sunning myself with $100 notes. There is no magic solution. You can't do this. I'm here to try and give you both strategies that you could try and apply and also advice, things where like you want to look at the patterns and recognize things happening before you see them.
One of the things I try and do in talks these days is give you the pointers to where to go and learn. I can't cover a lot in 45 minutes, but I can try and give you the hint of what's out there, the things to go and look for in research if they catch your eye or if you start yourself finding yourself going down that path. And the reason there's no one solution is sites are very different.
There's all different kinds of load types and different kinds of implementation types. These are just some of the ideas of what you can have. For example, if you're scaling Wikipedia, Wikipedia is an incredibly read-heavy site. Most of the traffic to that site is people just going there, looking at an article and leaving again. A strategy that works well for Wikipedia because it's very read-heavy
does not work well for something like Eventbrite where we're very write-heavy. People come to us to buy tickets, to send us money. A lot of what people do is very transaction-heavy and involves a lot of sort of writes and updates. And so we really can't scale in the same way. It's true for the kind of load, too. You can have very predictable load, which I'm sure when you have Wikipedia or Google,
it's sort of a gentle curve, as you'll see later. You can have spiky load as well. You can have people going, oh, well, we're all gonna arrive in the next 10 minutes for this blog post you put on Reddit, say, which is a very common thing that happens. Or even, you know, you're a thing that hosts events every year, and so everyone's going to arrive in the 10 minutes before tickets go on sale to try and buy them.
And there's also things like the chatty. This is often more a problem in game development, but it's becoming more of a problem in the web. Previously, the idea of having a website that would sort of repeatedly send small messages backwards and forwards was unusual. And then Ajax came along, and Ajax helped a bit, but it's still quite bulky. And now with sockets and smaller Ajax and frameworks,
we're starting to see those problems of sites are very chatty, they talk backwards a lot. And if you're in an area with high latency, like, say, Australia, where I was last week, that becomes a real problem. But let's start with code. What can we do with our code, and how can we help deal with some of the problems with distribution? So first of all, you use Django.
You have apps. They're an amazing abstraction to use. For many years, I put all of my code in one app called Core. No matter what it was, all the models, all the code, all the templates were in one app. I basically ignored the entire app system of Django. This was fine for me as a single developer. My blog is still this way.
There's one app called Blog, with all the geotracking inside it and the place visiting and the talk track. It's all in one app. But as you become part of a bigger team, apps are a really useful line to draw boundaries across, not just for where to put models and code, but also to understand what your dependencies are, who's working on what, what the ownership is like.
And you want to really formalize those interfaces. And one of the problems with apps, and it's good in the short term, is that it's very easy to just call functions on models directly from another app. If I'm writing a polling app for my blog, I can just have the polling app reach in and find random posts in my blog's Post model.
This is fine at small scale. As you scale up, one of the biggest things with code is having very clean interfaces drawn between the different parts of your code. This is useful for splitting up, as we'll see in a second, but it's also useful just generally in terms of thinking and reasoning about the code. When a code base gets to a certain size, no one person can understand all of that code base.
And so you have to let yourself forget about pieces of the code and just think about them in the abstract. And that's only possible if you have abstract pieces that have good rules about them. And so these interfaces are very important for saying, I can reason about this piece of code without remembering what's inside it because otherwise my head will explode.
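As a concrete sketch of what one of those formalised interfaces can look like, using the blog-and-polls scenario from the talk (the module layout and function names here are illustrative assumptions, not a prescribed pattern): the blog app exposes one function, and the polling app imports that instead of reaching into the Post model.

```python
# blog/api.py -- the one module other apps are supposed to import from.
from blog.models import Post  # assumes a Post model with a `published` flag


def get_random_published_posts(count=5):
    """Public interface for other apps; hides how posts are stored or queried."""
    return list(Post.objects.filter(published=True).order_by("?")[:count])
```

```python
# polls/views.py -- the polling app never touches blog.models directly.
from blog.api import get_random_published_posts


def random_posts_for_sidebar():
    return get_random_published_posts(count=3)
```

The payoff is exactly the one described above: you can forget what is behind `get_random_published_posts` and still reason about the code that calls it.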
If you do have those interfaces and you do get to the right scale, you can then choose to split along them. And this is kind of where I say, you don't have to go this far. In many ways, staying back here with formalized interfaces is good enough for most big companies, but if you do want to go with separate machines, having them there
gives you the perfect place to stick the cleaver and split apart your code base at that point. As a sort of small example, you can imagine a site that had things like inventory and payments like a ticketing site does. Inventory is a name for having tickets by the way. And there's a very clear split there if you've built it correctly of saying,
oh, we can take the whole payment system which deals with banks and settling and all the sort of horrible stuff that goes on there and move that to one part. We can take the inventory system with things like, oh, you must sell exactly 30 tickets. There's seat maps, move that over here. And we can take the presentation layer, the rest of the logic and keep it separate. And even this sort of concept is very helpful.
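A hedged sketch of what that payments boundary might look like; the names and the dataclass are invented for illustration. The point is that inventory and the presentation layer only ever call this function, so its body can later become a call to a separate payments service without the callers changing.

```python
# payments/api.py -- the formalised boundary for everything payment-related.
from dataclasses import dataclass


@dataclass
class ChargeResult:
    succeeded: bool
    reference: str


def charge(order_id: int, amount_pence: int) -> ChargeResult:
    # Monolith version: talk to the gateway / ORM in-process.
    # Split-out version: swap this body for an HTTP or message-bus call;
    # callers in inventory or the presentation layer don't change.
    return ChargeResult(succeeded=True, reference=f"order-{order_id}")
```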
Again, you can read about them separately. You can say like, I don't want to know about payments because it's really complicated and takes a whole team, but I know that I can ask for a payment and get a confirmation back. And one of the big problems when you split is communication. And this is kind of the reason channels exists. Channels often comes across as,
oh, this is made for WebSockets. It's Andrew's idea to get WebSockets across. And that is true. The gap I first saw in Django was, oh, WebSockets need to happen. But the secret is that WebSockets are not that hard. There's plenty of good Python libraries to serve them. The problem is not sockets. The problem is making a system that lets you have sockets.
It's the idea of, well, we can have people talk to a server. As soon as we have two servers and they're trying to chat to each other, how do they talk server to server? That idea just doesn't exist. And that idea is the problem in bigger scale for services. Imagine you have three services and you're like, oh, okay. All we'll do is we'll have the three services
just be like Django apps on each one. They'll have a WSGI runner. You just call like HTTP endpoints and they'll give you JSON back. Very decent model. No problem with it at small scale. Definitely a good thing to go for. You have three to start with. You then get five. First of all, it's a bit of a pentagram, which is worrying.
But secondly, you can see it's gone from three to ten interconnections. And now what if you have 10 services? Oh dear, okay. And this problem gets bigger. I have heard, for example, that Uber have two and a half thousand services. You can imagine this model doesn't really work in that case.
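The arithmetic behind that jump, for anyone who wants it: with full point-to-point wiring, n services need n(n-1)/2 connections, so the count grows roughly with the square of the number of services.

```python
def point_to_point_connections(n):
    """Connections needed if every service talks directly to every other."""
    return n * (n - 1) // 2


for n in (3, 5, 10, 2500):
    print(n, "services ->", point_to_point_connections(n), "connections")
# 3 -> 3, 5 -> 10, 10 -> 45, 2500 -> 3,123,750
```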
And so the thing channels goes for, which is not necessarily always a good fit, but I think for most cases is, is a message bus or a service bus. This is where rather than having interconnection, all your services and all your individual pieces talk to a common bus, and that is how they collaborate and share information among each other. Channels is a very good medium for this.
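A minimal sketch of that bus style using a Channels channel layer; it assumes CHANNEL_LAYERS is configured in settings (for example with the Redis backend), and the channel name and message shape are made up for illustration.

```python
# Assumes CHANNEL_LAYERS is configured, e.g. with channels_redis.
from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

# Producer: any service drops a message onto a named channel on the bus.
async_to_sync(channel_layer.send)(
    "payments.requests",
    {"type": "charge.request", "order_id": 123, "amount_pence": 2000},
)

# Consumer: the payments worker reads from the same channel, wherever it runs.
message = async_to_sync(channel_layer.receive)("payments.requests")
print(message["order_id"])
```

Neither side needs to know where the other is running, which is the whole attraction of the bus model.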
Eventbrite's service infrastructure is now starting to be moved onto running, not on channels in terms of sockets, but on the channel layers, the underlying implementation of the communication stuff in there. This is a really good way of starting to have that code separated out. Now, I'll do some more tips at the end for how this works, but for now we're gonna go into databases. And this is kind of the hardest part.
And this is not just because migrations exist, but people often come to me and go, Andrew, why do you need migrations? We can just use Git for code. Can't we do the same for data? And the analogy is somewhat flawed. It sounds lovely. Like, oh, well, of course, there must be a similar solution for data,
as Git came along and did for code. And it's not quite true. The problem with data is that data turns out to be quite valuable, and you can't just delete it and recreate it willy-nilly. And also, it's very, very big. We could have a system like Git with full versioning of the whole database, but it would make it 10 or 100 times bigger. When you have 60 or 70 gigabytes of data,
that is not a feasible prospect, and that's kind of why we don't go there. And the same applies to scaling. You could think about the same kind of things you do with code, and we'll show those in a second, but there are very different strategies for different kinds of writes. And so the thing you might think about when you have code
is what we call vertically partitioned, which is a fancy name for, you give each table its own database. The idea here is that, say you have a couple of big tables. I have like a big users table and maybe a big images table and a big comments table, like a cheap Instagram, for example. And if they're all kind of the same size, you can very cheaply use one third of the space
on each machine by just putting one on each machine. The problem with this is that you can't split further than per table. If you have one giant table especially, this just starts falling down almost as soon as you look at it. And so the next strategy is a little bit different, and I'm sure most of you have heard of it,
which is having replication. In particular, having a single main database that you write to, having lots of replicas that you read from. This is a very common pattern in Django. There's lots of third party apps that let you do this. It does come with some caveats. I'll cover those quickly. The main caveat is that there's a replication lag, which means that when you write information
to the main database, it takes a little bit of time, usually under a second, hopefully under 100 milliseconds, for that information to go into the main database, to replicate to one of the replicas, and then be available for you to read when you query the replica. Now 100 milliseconds is not a long time, but it is if you're rendering a page.
And particularly one of the main problems people have is they will write to the main database, and then just read from a replica straight away, and not think about the lag, and you'll read back stuff that is old. Like you may have just saved a new comment, and you'll read back the page again, and the comment won't be in the replica, and you'll show the user a page without the thing they just submitted to you.
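To make that concrete, here is a rough sketch of a Django database router doing replica reads plus the per-request "pinning" fix described next. The database aliases, the thread-local flag, and the middleware are illustrative assumptions, in the spirit of the third-party apps mentioned above rather than any particular one; DATABASE_ROUTERS and MIDDLEWARE in settings would need to point at these classes.

```python
# routers.py -- send reads to replicas, but pin a request to the primary
# after its first write so users see their own changes despite replication lag.
import random
import threading

_local = threading.local()
REPLICAS = ["replica1", "replica2"]  # assumed aliases in settings.DATABASES


class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        if getattr(_local, "pinned", False):
            return "default"            # the primary
        return random.choice(REPLICAS)

    def db_for_write(self, model, **hints):
        _local.pinned = True             # pin the rest of this request/view
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True                      # all aliases are the same logical DB


class PinningMiddleware:
    """Clear the pin at the start of every request."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        _local.pinned = False
        return self.get_response(request)
```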
This is a very common problem, and the solution to it is called database pinning. What you do is you say, okay, well if somebody writes to a table inside one of our views, for the rest of that view, we will pin that view to read from the same main database as you wrote to, so that they get consistent information. That's great until you realize
if your site's write heavy, you can't actually do that for all the pages, because then you're never gonna use all the replicas. You're just gonna always be writing and reading from that main database. And so this is kind of one of those tricks, and this is where my favorite triangle, which is the C-A-P triangle, comes into play. I'm sure many of you have heard of the idiom
cheap, fast, and good, pick any two. This is the same for databases. You get partition tolerance, availability, and consistency. You get basically at most two. And if you're very lucky, you get at least one. Many databases give you maybe half of one of these if you look at it right. MySQL is somewhere on the one scale.
I have no bias, I have a bias. But in this problem, inconsistency is everywhere. You can imagine the idea that you might think, say Postgres, Postgres is in theory a partition tolerant and consistent database, in that it's not always available to read if one of the replicas fails,
but usually if you can read, you'll get a consistent answer from what you just gave. But that's not quite true with replication, because there is still a little bit of inconsistency there. It's only true in a single machine case. And non-consistency really creeps in to all aspects of distributed computing in general. And there's a very good reason for this, and this reason is physics.
This is a nanosecond of wire. The wonderful programmer Grace Hopper was famous for, in her lectures, holding up nanoseconds of wire. And the idea is this is the maximum distance electricity can travel in one nanosecond. Because the speed of light is a certain speed,
300,000 kilometers a second, so 300 million meters per second, I believe, don't quote me on that. And in copper, it travels at about two thirds of that speed. But even on fiber optics, it's only slightly longer than this. And if you think about a computer, a computer does what's called clocking. If you're not familiar, processors basically, they sort of, almost like a mechanical machine
where they do an operation and then clock to the next one, they just do more operations. And the clock is kind of what governs how everything synchronizes in a computer. It's almost like a mini version of a big computer system. And it turns out, if you want to go faster than one gigahertz, which, it turns out, means a clock cycle of one nanosecond,
you can't have components further apart than one nanosecond, or they cannot physically clock faster than one gigahertz. This is why we don't have big computers that run very fast, it's physically impossible. And this is kind of the microcosm of why distributed is hard. Think of this at a global scale. If I have a server in Australia and a server in West Virginia, they are at minimum 100 milliseconds apart
at the speed of light. I cannot beat that. It's physically, as far as we know, with the usual caveats, impossible to beat. And so if my goal is to have more than 10 writes a second consistently, I can't do it, because I cannot synchronize more than once every 100 milliseconds
between those two zones. And this is where all the problems of distributed computing come in. If you go back to the triangle, consistency is one of those things that is affected heavily by being physically distributed, and it really comes into play with databases. And if you think about this model, it works very well, and I won't dissuade you from it,
but there are more advanced things we'll come to in the fourth part of the presentation about sharding and so on that we've kind of moved towards. A microcosm of databases and code combined is load balancing. A lot of people don't think about this at first. I certainly didn't when I started out as an ops engineer, which was one of my first jobs. And it quickly hit me like a ton of bricks
or rather like a huge number of users clicking refresh at once. And the problem is you think websites are simple. You don't, you know better. People think websites are simple. People think, oh, I can just have a couple of servers scattered around, they're all equally balanced, and then everyone will have consistent load times,
and all my users are roughly the same. They're all sort of very cookie cutter generic people. If you have this, congratulations, you have a gem of a website, please keep it. No one else has this. What we have is loads of logic that runs at different speeds on different machines, and different pages have different amounts
of processor load. For example, like, oh, well, you know, this page is very easy to render because I can just show a template. This page, I have to thumbnail these 25 slides and show them in a grid view. And not only that, you have wildly varying users. One of my favorite interview questions to give is roughly, go and build Instagram. I sit down with a candidate and go,
okay, I'm not gonna ask you any sort of whiteboarding or technical questions or like how do I reverse a list? I want you to discuss whatever level you feel comfortable how you would build Instagram from the ground up. And usually, most engineers who are junior or senior will go a decent way along, have some tables and some good scaling ideas, and then you drop the bomb, figuratively,
which is that you say, okay, now one of your users has 10 million followers and nobody else does. And you have what's sometimes called the Justin Bieber problem, right? That this one individual user is so incredibly expensive, you can't split them up. They're a single atomic entity. And that really mucks with the way your load works. And not just that, as I said before,
load balancing in theory is like this. If you're in one country especially, you have a lovely curve during the daytime, my mom's awake, and a lovely relaxing curve when they're asleep in the evening. Perfect. Easy to scale for. You can draw a nice line about 20% above this line, say this is our maximum capacity. You have a little bit of spare capacity at all times.
If you're feeling particularly thrifty, you can launch servers in the morning and take them down in the evening. If you're a ticketing website, which I have some familiarity with, it looks more like this, where you're like, oh, it's lovely and relaxing, and then an event you didn't realize existed goes on sale. And they're very popular. And everyone in the world arrives at once
to try and get their free beer. That literally happened, by the way. And so, suddenly, everyone just slams into your servers onto one single page, and often into an order flow that's very complicated. And scaling for this and load balancing this is very difficult. You can't necessarily load balance equally across different servers,
because it might hit a certain set of servers differently. Like, this might hit your payment endpoints much more than your event view endpoints. So it takes a little bit of extra work. And this gets more complicated because of me. Because I did WebSockets. WebSockets are lovely things. They're beautiful, they're great. They're very good for game programming in the browser in particular.
But they have some problems for load balancing. And those problems are they're not like HTTP requests. This is one of the reasons that they don't fit into WSGI. Like, WSGI is you have a request, you serve it, you send a response, you're done. End of responsibility. Sockets aren't like that, they're clingy. You open a socket, it can be open for hours or days. It can just sit around.
You can't necessarily reroute it, because TCP doesn't work like that. Not only that, the set of tools that handle balancing sockets is very limited. Like, you have to handle them almost as raw TCP connections. But also, they're sort of HTTP, but some tools won't deal with them properly, and it becomes a bit of a mess. And even worse, they have four different kinds of failure.
Which you don't realize until you write a server that serves them, then you realize that they have four different kinds of failure. They can fail to open in the first place, they can close randomly, and all this kind of stuff. And the thing with WebSockets is, they're great, but they're an extra feature. When you design a site, you consider them a bonus. Like, if you have them, if your client's browser supports them,
if you can open them through a proxy they have, great, fantastic. But you should treat them as optional. And you should close them liberally and freely. If you want to re-load balance some sockets, just close them and let them open somewhere else. Don't design them to open forever. Design them so at any minute they might die. If you're on the London Underground, for example,
there is only Wi-Fi in stations, and not in the tunnels between stations. And so people regularly will appear for a minute, go away for two minutes, reappear for a minute, you've got to design for that case as well. Most sites I used to use in London didn't like the whole, I have internet now, wait, no it's not here again, wait, no it's here again now with a three second latency
and just started collapsing. Like some of the takeaway sites, I was like I'm trying to order food, did you not plan for this case? No you didn't. So that's one of the big problems. And thinking about those problems is when we come to teams. So as I said, teams are a really important part of designing distributed software
and big scale software in general. Engineering is a different discipline to programming. It certainly encompasses programming, but engineering is a more holistic thing. Engineering is about taking a set of programmers and designers and product people and making the best products you can, to serve the needs best. And a big part of this is how you use the people you have at your disposal. As I become more and more senior
and go through the industry more and more, I just see this. Like I used to be in the boat eight, nine years ago, like oh, I am the genius kid programmer, I can do anything, I understand everything. And as you get out of the zone of you're so incompetent you think you know everything, you get to the zone where oh, you realize you know nothing.
And that's kind of where I ended up at this point. Teams are very important. I don't know enough about managing people, that's not my expertise, but these are some of the things I've seen as a senior engineer who leads teams and leads projects that I think are useful. So the first thing is your developers are people too. Programming is an incredibly draining profession.
I'm sure many of you know this. Mentally and also emotionally sometimes as well. Understanding requirements and also trying to adapt them and deal with people having different ideas from you can be very difficult. You need to make sure there's time and space to plan for this stuff. In particular I see lots of plans like oh, we're gonna have these three features
developed in parallel, they're merged instantaneously, then we'll keep going. This is often born out of being a smaller company that's less distributed, where you have oh, we had one or two services, everyone knew the whole code base. When everyone's in one room and knows the whole code base, you can happily do that. You can work on two different pages and then merge them together
because you all know what's going on. But when you have 100, 200 people, you don't know everything that's going on elsewhere. You have to allow time both to spin up and understand what the context is, see if someone else has solved this already, and then to spin down and merge. And so it's a very important thing to think about. And part of this is technical debt
really can be poisonous. Again, I'm preaching to the choir here I'm sure, but it's very important that you build up some technical debt, especially when you're a startup. If you're doing the space shuttle, don't do this. But I presume you're mostly doing websites because you're at DjangoCon. And as a website, you can have some technical debt. It's fine, it's almost healthy,
requirements change, people change, user bases change. What you should do is be very, very cognizant, be very aware of where your debt is and when you need to pay it off. Because like normal debt, it accrues interest. As your code base grows, it drags down on you more and more and more. And so you need a little bit to compete,
but you need to keep ahead of it and keep managing it. And it's very easy to lose track of that technical debt as you go along. And then we get to the slightly more controversial question which most companies have not even solved internally, which is how many Git repositories do you have? This is one of the things with making a big distributed system
that you don't really get to think about until you get there, until you're sitting down at the computer and going, oh, okay, we've managed to split out our payments code base and the rest of it, fantastic. Where do we put it? If your answer is a single repository, then congratulations, that's great. You can have a single giant repository. Everyone's gonna have merge conflicts all the time. It's gonna be terrible.
If you chose multiple repositories, congratulations. You're now gonna have 300 repositories and no one knows where they all are. And you're gonna have 300 versions and your release manager is gonna have a hell of a time. It's gonna be terrible. There again is no good answer. Multiple repos often seems more attractive at first. It is, but it means that what you're doing
is pushing complexity away from programming and doing the merge conflicts. That problem isn't going away. It's just being pushed down to the release phase, and so you're giving your operations and release engineers all of your problems. It's kind of selfish if you ask me, so just be very aware of that. If they agree, that's fine, but don't think you're magically making work go away.
This then turns into, well, you have these repos or single repo. How do people code on that stuff? Do you have your teams structured around individual services or pieces of the code? Do you instead try and spread people across different services? Like, oh, this team works on different things and try and encourage people to have diversity
of knowledge and opinion inside the code base. This again is really difficult because often you don't have enough engineers because no one does. Not only that, but this also includes designers and UX researchers and operations engineers too. You never have enough people and working out how to arrange everyone
so that, again, no one knows everything. This is a very important thing, I can't stress it enough, but getting the right people to talk to each other can get really, really difficult. And this gets really problematic with ownership gaps. This is a problem I never saw coming until I moved to a big company: it's possible to have giant pieces of code
that no one knows about because they got written four years ago. They still work fine. They were written all right, but nobody, like even if the person who wrote it is still at the company, they've probably forgotten. Like, I've forgotten most of the code I've ever written. It's not uncommon to do so. And it's very easy to know what you're working on.
It's very hard to know what you're not working on. Like, at some companies, even working out what the feature set of the site is could take a team of senior engineers weeks to work out. Like, there are some sites that are so complicated that it just, no one, maybe the support team knows about it, actually. Like, one good tip is go and talk to your support people
because they get all the experience of all the weird little features. But basically, nobody knows what's going on. And until you run into one of these gaps, it's hard to know they're there. And so really think about like, if you're smaller or medium, just keep a rough spreadsheet even of like, oh, these are the rough features that we have and here's who knows about them best.
And then if somebody leaves and they're in columns by themselves, you go, ah, that person has left and nobody knows about this now. And so we should go and fix it. This happens in Django, too. Like, Django is a very big, complex project. We have specialists in a lot of Django areas. There are areas of Django there are no specialists in.
For a long time, we thought nobody knew about all the weird multi-form formsets and stuff. And then one of the core developers admitted to knowing about it, and went, ha ha! What a fool, yes, exactly. So that's the kind of thing that ends up happening. It's not just companies, it's big open source projects, too, and it's very easy to happen. So with all these sort of rough points,
I wanna go into some strategies. Like, how can you take these ideas I've been throwing at you and try and, what's my best advice, basically? Like, I can't give you single solutions, but what would I suggest? So, first of all, people love microservices. And to do this correctly, I have to go, microservices, like this.
Jazz hands are very important. Microservices are a big buzzword. They're very common, and they're very easy, and that's the problem. It's so easy to just ignore the other code and start a new service. It's a service version of, oh, we'll just delete it all and rewrite it from scratch. It's easy to start, and you know
that it's gonna be this wonderful thing when you go in. Most programmers, a few exceptions I know aside, do not like maintenance programming. If you do, congratulations, you should charge more. But if you don't, the temptation is to go, oh, we can do new services, and then we'll just join them all together. And again, you're not actually saving any work.
What you're doing is pushing all of the work later on in the process when you now have a thousand services and no one knows what they all do. Can you imagine, like, if you have a thousand services, I doubt any one person would understand what they all do or what their ideas are. And so you've just taken the same problem you had with a monolith and made it not only be in services,
but now it's on different machines and there's a consistency problem. At least with a monolith, it's all in one place, and you can sort of trace through the code in the debugger easily. You try tracing bugs through four different services and four different machines, it is not easy. Like, you can't just put a PDB trace in there. You have to sort of find logs and put logging in
and relaunch them all and then follow the logs through. It's just a pain. So that's one of the big things to think about. What I generally encourage is a moderate size of services. Usually five to 10 is a good start for most companies. Try and think of your big business reasons and focus around those. That often helps the team composition too.
If you have experts in certain areas in your team already, it's very sensible to cluster around them and give them some junior engineers to mentor as well. Just don't have a thing where every engineer has their own service, because at that point, they're always writing their own different code bases and you have a giant collaboration problem. Again, I'm a big fan of service buses.
I sort of wrote one in channels. They're a really nice way of doing inter-service and inter-machine communication. One of the big things, especially in this world of Docker and of containers, is that with a service bus, when you have a piece of code, you can just say, hey, the bus is over here. There's no need to say, oh, well, service one is here, service two is here, service three is here.
Oh, we restarted service two, its IP address has changed, so you have to go around updating everyone with the new IP, and it just gets to be a massive, complex nightmare. So that's a really big part of service buses. They make deployment and scaling a lot easier. They're not listening to any ports because they're just talking out. So you can just launch 10, 20 processes very easily. It's not for every design.
There are totally cases where you shouldn't have a bus or you should try and keep some of the traffic off of the bus. In particular, one good model I've seen is having a service bus for some stuff that is at most once, which means a message on the bus might get there and might not, and a separate thing called a fire hose, which is all the events happening in the site,
and they may happen once or twice or more, which is at least once. And different code problems can happen on different things. Cache invalidation is best done from a fire hose because if you invalidate the cache twice, it's not a problem. And so your cache invalidation thing should listen to a fire hose and every time it sees an ID,
oh, yep, this user's changed, invalidate their cache. It's a great way of doing that, and it shouldn't be on a bus. And then onto the consistency stuff. You are gonna have inconsistent data. It's basically impossible to avoid this. Even if you think you're a perfect monolith, as soon as you have computers in more than one data center that are physically separate, you're gonna have to deal with something like this,
even if it's like 10, 20 milliseconds. And a really powerful way of doing this is to look at the product. What are you making? Where can you allow old data? Sit down with your UX researchers and your product people and your designers and go, okay, we have to give somewhere, but where can we give in a way that doesn't hurt the experience?
A very common way, for example, is to give on content you didn't author. If I had, say, a site where I looked at images and comments, for all the things that I didn't own or comment on, we could happily serve you old data. It's a very good example of pinning. This applies everywhere. There's many cases in different pieces of software
where you can go along and go, yeah, we could just change, like if we remove the pagination, this is very common, by the way, if you've been to GitHub recently, GitHub does not show you how many pages there are because that's almost certainly what the slow bit is. And so by just showing you next page, most of the functionality is there, but that expensive pagination query has gone away. And so that's one of the compromise examples
you might go and do. Now sharding. Sharding is a very complex issue and one that would take an entire talk by itself to cover. But for those who aren't aware, sharding is the next step from basically the vertical partitioning. It's called horizontal partitioning. And the idea is that you have one table
on multiple servers. Say like my users table is on 100 different servers. And generally, you'd say, well, there's 100 different shards and each user is in a certain shard. It's very, very powerful. It works incredibly well for most patterns that people understand, but it has an incredible technology and person cost.
Having sharding in your code base makes everything slower to write, everything slower to run. Some people come along and design for it upfront. They go, oh, okay, we're gonna be amazing. We're gonna start off with our brand new startup. It's gonna be sharded from the beginning. It's gonna be fantastic. This is like the technical debt thing. Sure, you could do that,
but in the same way that you could not take on technical debt and have a perfect code base that passes every single test and has 100% coverage all the time, you're gonna be much slower. And almost certainly, when you're making your project, be it open source or commercial, you don't understand the full scope of your problem. You should expect things to change.
So in the same way you should have some technical debt, like I often won't write tests for the parts of my code that I think are gonna change, you probably shouldn't put sharding in there straightaway. You should probably go, hmm, well, we're gonna leave the touch points in here. And in particular, what I suggest is routing all of your queries through a single model
in Django or a single function outside of it so that when you have sharding, you can come along to that one single place and put it in. Like sharding often involves like we need to look at an ID, work out a hash of the ID, find a server and do that stuff. I see many sites that don't do this, that like directly query tables or do custom raw queries. That ruins this.
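A rough sketch of that single split point, with hypothetical names; today everything returns the default database alias, and the hash-the-ID-to-pick-a-server logic can be slotted into this one place later.

```python
# accounts/shards.py -- every user lookup goes through here, so sharding
# only ever has to be added in one place.
from django.contrib.auth import get_user_model

NUM_SHARDS = 1  # flips to e.g. 100 when you actually shard


def shard_for_user(user_id: int) -> str:
    if NUM_SHARDS == 1:
        return "default"
    return f"users_{user_id % NUM_SHARDS:03d}"  # assumed DATABASES aliases


def get_user(user_id: int):
    User = get_user_model()
    return User.objects.using(shard_for_user(user_id)).get(pk=user_id)
```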
So just try and keep the split point there, but don't split on it. And that generally works out pretty well. Again, with WebSockets and also long polls, anything that's like any kind of connection that's open and you send more than one thing down it, expect it to die, design for failure. If you design for failure, when things don't fail, it's a happy surprise.
You're like, oh, this is great, it's more efficient. If you don't design for failure, your pager will be going off, or, who has a pager these days, your mobile phone will be going off all the time. You're like, oh, it's not working. It turns out the internet's not perfect and doesn't stay open all the time. Who could have guessed? I've seen, like I used to write game servers in Twisted
like a few years ago now for Minecraft. And that's another case where like you expect that it would stay open forever and you can just send things down the socket and it's not, that's not true. You can't do that. So design for failure, not just in this, but many other things. I have other talks on designing for failure if you want to go and watch them. Teams. This is very difficult. I am still thinking about this problem.
The best thing I've seen so far is independent full stack teams. What I mean by this is essentially when your company gets big enough, treat it as lots of smaller startups. Each team has operations on it, has design on it, has product on it, has research on it. And they all talk together. They have that sort of small group feeling
and they communicate between each other. One of the common examples of this that's quite popular these days is the matrix organization. I haven't got a slide for this. It's the idea that like, oh, you have teams that are full stack, but the people who are specialists in each area sort of meet up across teams. Like all the ops people have like lunch together twice a week and discuss stuff.
I like this because it encourages you to think and move as a small company. I'm a big small company person. This is why I favor this, probably. It does have some overhead where your teams are gonna have to interact more formally. They'll have to request stuff of each other like big companies would. And there's gonna be almost like contracts and interfaces defined between them. But that gets you those system
and data model designs for free. Like if your teams are defining how they talk to each other, congratulations, you've got a free data model out of it. So this is generally what I prefer. Another common thing, and this is the last strategy, is that I see a lot of people not having software architects, and a lot of people having just software architects.
a lot of people having just software architects. Now software architecture is a very ill-defined term. I'm probably one, maybe, I don't know. But it's a person whose specialty is coordination and putting different pieces together. The person who would go and look at all those models in the abstract and look at how they interact.
It's very common to not have this, and especially if you're coming from a small startup where, again, everyone understands the whole code base. Because if everyone understands the whole code base, you don't need this person. It's not important. No one has to do that hard work of doing the information gathering. But once you get big enough, you need that person. And then all too often a big enterprise
will have a team of specialist software architects who sit in a big ivory tower at one end of the campus and think about things all day and go, hmm, this is very important and philosophical. But it's important to be practical. I value having practical knowledge of stuff. I think having people who specialize in architecture but still are involved in writing software is very important, like not losing that particular focus
or view on the world. And this kind of comes back to one thing, which is that monoliths aren't that bad, really. I've seen many big sites, and I've run a few big sites that were just a giant monolith. And they have problems, certainly,
but a lot of the problems you think you have with a monolith are just problems with your team and the way you code, and going distributed won't necessarily make them better. And you could, in some cases you should, if you have pages that are more expensive than others, you have a big thumbnailing thing, you have video rendering, these are good reasons to split up. But Django's app system is very good.
It's not a bad idea to have 100 Django apps because when you have 100 Django apps in one machine, it's much easier to version them all as one. Doing releases, you have one repo, you can just release the top version of that repo. There's no, well, we need version .36 of this and version .4 of this and version two of this, but version three of this will delete version two of this.
That's the problem you get with services. So maybe, just maybe keep the monolith, but do think about it. And with that, thank you very much.