
Solving your Big-Data problem before it arises, using Django

Video in TIB AV-Portal: Solving your Big-Data problem before it arises, using Django

Formal Metadata

Title: Solving your Big-Data problem before it arises, using Django
License: CC Attribution - NonCommercial 2.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Solving your Big-Data problem before it arises, using Django: How data sharding can make you perform better and faster
More and more websites are collecting huge amounts of data, and developers often don't think about this data wave when developing their apps or sites. In this talk I want to describe how thinking about sharding your data will not only make your app scalable, but also faster, and the code will be better.
This talk is structured in two parts. The first is about sharding and different strategies that can be used in solving a typical big data problem for various projects. The second part will focus on a Django implementation: how to implement a sharding technology and create a failover website without relying on any "cloud" providers.
We will make the argument that thinking about how your data will perform, and testing these assumptions, will make your code better and faster, even if you don't have too much data at the beginning. We will also show these assumptions on our real live data and describe how we shard our data and what motivated us.
Speaker: Didi Hoffmann
Event: FrOSCon 2014 by the Free and Open Source Software Conference (FrOSCon) e.V.
Keywords: Free and Open Source Software Conference, FrOSCon14
This talk is about solving the Big-Data problem with Django. It is quite a problem, and at some stage a lot of the work is actually operating databases. A short story about me: I was born in London, I have worked for a number of companies, and I am now in Berlin. I have been programming for a long time, and on this new project we ran into these problems, which I want to share with the world.
So what happens if your project actually does become a success? This is something nobody plans for, but it does happen once in a while. While you are coding you don't anticipate it: you use one database, your data explodes, and then you have a huge problem.
You discover that you are getting users, lots of users. You got an article on some big news site, and you're like, "Oh my god, what's going to happen?" So you need to scale. You need to scale your hardware: you buy more boxes. You need to scale your services: how are you going to run nginx, Apache and whatnot? You need to scale your analytics, something a lot of people forget; most of the time that doesn't scale infinitely. You need to scale your software, and all the stuff you have around it, like your support: if you have more users you get more support requests, so you need more people answering the phone, and so forth. And the network, which actually turned out to be one of the biggest problems.
But I'm going to focus on data today. That is the thing I'm most interested in, to be totally honest, and Big Data is enough for today.
This is how your data looks at the beginning of your project. You are happily hacking away, you've got ten to a hundred users, everything scales, everything works fine, and your page is fast, so everything is perfect. At some stage you notice that it is slowly picking up, and maybe this is the point where you still tell yourself that everything will be fine.
Accounting for the growth of your project is vital if it succeeds. Maybe you are happy with a rather small project that nobody knows about, but what we actually want is to succeed, especially if you are working at a startup. Even if you work in a company you might suddenly have a thousand users: at my old job we had a small project, suddenly it was company-wide, and everything went down. So account for growth early on.
There are several solutions for this Big-Data problem. The first is to buy a bigger server. I know a company in Berlin that sells goods online, and they just bought the biggest server they could get and housed it in a room, and if somebody touched it, somebody would be fired. The whole database ran on this one huge server, and at night they basically had to turn the database off, because all the batch jobs had to redo everything, and in the morning they put it back up. It is a stock-listed company now. That was their scaling solution, because nobody dared to touch it. After that you just buy even bigger database servers.
Then you say: OK, we might separate tables onto different servers. We don't need all our analytics stuff running on the same server, and user accounts might be on a different one. That is a fair approach, at least if all your tables are about the same size.
Then there is the write-your-own solution. Twitter, Google and so on wrote their own solutions, and that is valid because they have about 200 engineers who do nothing else but maintain this code. But pretty much nobody else will have the resources to write their own database; it is really hard.
I'm not going to go into cloud services much for now. Cloud services have one big problem: you don't know what's happening under the hood. Normally you use Amazon or Google or whatever service, and you run into a scaling issue. I have seen this many times: it just doesn't scale anymore, and you are writing to support asking why. Be really careful when you use cloud services, because if you run into an issue you are stuck with it and you have to find a way around. So I put it in brackets, because the experience I have had with these is that it scales really nicely for the first few thousand users, and at some stage you hit a wall, and it takes two weeks to figure out how to get around the wall and solve the problem.
So obviously the first thing is to split up your tables: a simple way is to put different tables on different machines. First you put everything in one database, and at some stage you figure out that you have to split it up. The problem is that one table is always bigger than the rest. Your user database is probably quite small, because it is just the name and so on, but your product database, or your account history, or whatever is going on there, is big. Then you've got the problem again: one huge server that you need to maintain, and all the small tables on other machines, so you don't really gain a lot. This is a very short-term solution. It works for months maybe, but in the long run you run into the same issue, because you are just taking a little bit of data off the one big server. Most projects I have seen have one big table; it's really weird.
So obviously you want to split the tables themselves across different servers. Take table 3: you have different shards on different servers holding all your data, so you have 3.1 here, 3.2 here, 3.3 here, and so forth.
Backups would be a nice thought, so ideally you would also want redundancy in this. Again, if you have one big database with one big table it is very hard to do backups, because you need the same machine again. If you want a hot-swap backup you need exactly the same massive machine, costing thousands and thousands of euros, as an exact mirror, and you can't get all these funky features with different locations and so forth.
So you shard like this, and this is the ideal solution: horizontal sharding of your data. You cut your Big Data up into little pieces. You have more and more databases, and through that you get faster queries: a full table scan on one shard is going to be ridiculously fast because the table is much smaller, and locking is far easier because you've got different shards. You can cache better, and you can also use a lot cheaper hardware.
It also helps the backup strategy: if one shard dies, the others keep running. You can grow with your data, ideally, if you follow the rules; I will come to some examples. We grew to 16 shards really fast, because we just added new servers and it would rebalance itself.
The other side is that sharding is really, really hard, and it messes you up. Once you do this it really messes up your life. I was shouting at people because it was so hard. Getting it right is really, really hard. So, enough theory.
Now, about Django. In some respects it is a fairly standard framework. Who knows Django? Who doesn't know Django? I'm not going to go into more detail. Django is a Python web framework with a model-view-controller basis: you can specify this is my database, these are my views, these are my controllers, and it takes care of all the nitty-gritty stuff.
It does the object mapping for you, so you don't really have to care about the database at the beginning of your project. You just say: here is my database setup. The example here is a SQLite instance. I say my database is my default database, which is SQLite, and this is the path where it lives, and Django takes care of all the rest. It creates the database for you, and if you update the model it will update the database for you. It does all that stuff for you, so it is really nice for a small project, like Ruby on Rails or one of the other alternatives.
But Django also supports multiple databases, and with this you can start sharding. This is a really simple logical shard: you've got the default database where all the stuff is kept, and you've got your users database somewhere else. This would be appropriate if you have a huge users database that you want to keep somewhere else, maybe on a different database backend, because for some operations you might prefer MySQL and for others Postgres. So you not only have support for multiple databases, you can even choose different types of databases.
For example, we have a logging database that is a totally different database from the others, because logging has different requirements than transactional data. For this you can simply say: I have the default database and a users database, and I save my models into the different ones. For all the Django people here it is very easy: you just run migrate, give it the database name, users, and it does it all for you. This is fairly out of the box, fairly standard.
This is actually copied out of our configuration. These things get ridiculously complex in the real world. We have these on different servers. Here you see that our logging database is on a different port, because it is a totally different database setup. Anybody who doesn't know the database URL package should have a look at it, because it is so much easier to express your databases like this and not have this huge dictionary. Basically it is easy: we have 16 shards, appropriately named by hex, we have our logging database, and we have another database for migration purposes.
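The configuration described above can be sketched as follows. This is only an illustration, not the speaker's actual settings: the host names, ports, database names and the `postgres` helper are all made up, and the shard aliases assume the "named by hex" scheme mentioned in the talk.

```python
# Hypothetical settings.py fragment: 16 shards plus a default and a logging database.
# All hosts, ports and names below are invented for illustration.

SHARD_NAMES = [f"shard_{i:x}" for i in range(16)]  # shard_0 ... shard_f


def postgres(name, host, port=5432):
    """Build one Django-style database entry (hypothetical helper)."""
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": name,
        "HOST": host,
        "PORT": port,
    }


DATABASES = {
    "default": postgres("main", "db-main.example.internal"),
    # The logging database runs as a separate setup on a different port.
    "logging": postgres("logging", "db-log.example.internal", port=5433),
}

# One entry per shard, generated instead of written out by hand.
for shard in SHARD_NAMES:
    DATABASES[shard] = postgres(shard, f"{shard.replace('_', '-')}.example.internal")
```

Generating the shard entries in a loop keeps the settings short; adding a shard then means changing one number rather than copying a dictionary block.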
I'm talking too fast. If you've got questions, stop me, because I tend to talk quite fast. So how does Django know where to save what?
Welcome to automatic database routing, which is a really nice feature. I picked out the two obvious things: read and write. You basically define a class, and in it you say what the database for reads should be and what the database for writes should be. This is a very simple example: for reads I want to read from my replica servers, and I want to write to my primary. Let's say you have a simple Postgres setup where two replicas replicate from the primary.
Obviously you want to write to the primary, which you ensure by saying that the database for writes is the primary server, the one you specified in your settings under the name primary. The replicas work the same way: you connect to the replicas and then randomly choose which replica you want to read from.
Is that sort of clear before we go on?
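A minimal sketch of such a router, assuming database aliases named `primary`, `replica1` and `replica2` exist in the settings (the names are assumptions, not taken from the talk). Django routers are plain classes, so no Django import is needed to define one; it would be activated by listing it in the `DATABASE_ROUTERS` setting.

```python
import random


class PrimaryReplicaRouter:
    """Send writes to the primary and spread reads over the replicas."""

    # Aliases assumed to be defined in DATABASES.
    REPLICAS = ["replica1", "replica2"]

    def db_for_read(self, model, **hints):
        # Pick a replica at random for every read.
        return random.choice(self.REPLICAS)

    def db_for_write(self, model, **hints):
        # All writes go to the single primary.
        return "primary"
```

Random choice is the simplest form of read load balancing; it assumes that slightly stale reads from a replica are acceptable.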
Routers can be chained, which is really nice. If one of these functions returns None, meaning "I don't know what to do with this", Django will chain through the database routers. Like this I can say specifically where a particular model goes, and otherwise fall back to a default setting, which writes to the default database and reads from the replicas. Like this you can add load balancing to the whole thing, and so forth.
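The chaining behaviour can be imitated in a few lines. The `resolve` function below is a simplified stand-in for what Django does internally when it walks `DATABASE_ROUTERS`, not Django's real implementation, and the router and model names are hypothetical.

```python
class StatisticsRouter:
    """Only has an opinion about the (hypothetical) Statistics model."""

    def db_for_read(self, model, **hints):
        if model.__name__ == "Statistics":
            return "logging"
        return None  # no opinion: let the next router decide

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)


class FallbackRouter:
    """Catch-all router placed last in the chain."""

    def db_for_read(self, model, **hints):
        return "replica1"

    def db_for_write(self, model, **hints):
        return "primary"


def resolve(routers, method_name, model, **hints):
    """Ask each router in order; the first non-None answer wins."""
    for router in routers:
        chosen = getattr(router, method_name)(model, **hints)
        if chosen is not None:
            return chosen
    return "default"  # nobody had an opinion


# Stand-ins for model classes.
class Statistics:
    pass


class BlogPost:
    pass


ROUTERS = [StatisticsRouter(), FallbackRouter()]
```

The order of the list matters: the specific router comes first, and the catch-all router comes last.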
Basically the most important thing with sharding is the right shard key. If you want to shard you need some sort of key to decide where each row should go. You've got all these rows in your table; how do you split up your table? You could shard by primary key, which is a pretty stupid idea, I suppose, because then you just fill the shards and scatter everything around, which doesn't really help. You want some sort of key that keeps everything together, that groups what people are going to query. This is vital; it is the most important thing when you are sharding: find the right shard key.
Most data looks like this in your database, so you really need to find a shard key that separates it nicely. If you don't find one, you could have one partition or shard being totally overwhelmed with queries while all the others are idling around, and you actually lose performance. So you really have to figure out which shard key connects most of the data points.
Play around, take two or three days, benchmark this, look at your queries, look at what data they are retrieving, look at your code, use the shell, and really get to know your data, because if you get the shard key wrong, that's it.
In our case it was not as hard, luckily. We have documents, which have millions of entries. These entries can be historical: every time somebody edits anything, we create a new one. So we have millions of these, but they are all logically connected through the project.
So basically we said we are going to partition by project. Projects are all sort of the same size; obviously some are bigger and some smaller. But if we take an evenly distributed hash as the hashing function to shard these, then we can assume that there will be roughly equal amounts of data on every server, and that is the ideal. We add randomness to the whole thing and partition by that, because if you want an equal distribution, randomness is the best you can get. Every project gets a project hash, as we call it, and by this project hash we shard. A lot of blogs, for example, do the same. This is really nice because as soon as you access one project you are always hitting the same database, and because your load is equidistributed you are hitting all databases equally. Does that make sense?
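The effect of a random project hash on the distribution can be checked with a small experiment. This sketch assumes 16 shards addressed by the first hex character of the hash, and it uses `secrets.token_hex` as a stand-in for however the real project hash was generated.

```python
import secrets
from collections import Counter


def new_project_hash():
    """Random 32-character hex string (an assumed format for the project hash)."""
    return secrets.token_hex(16)


def shard_for(project_hash):
    """The first hex character selects one of 16 shards."""
    return f"shard_{project_hash[0]}"


# Simulate 16,000 projects and count how many land on each shard.
counts = Counter(shard_for(new_project_hash()) for _ in range(16000))

for shard in sorted(counts):
    print(shard, counts[shard])
```

Each shard should receive roughly 1,000 projects. The number of projects per shard evens out this way; the amount of data per shard only evens out if projects are of similar size, as the talk assumes.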
To make horizontal sharding work, we decided in the model. Does everybody know what a Django model is? In Django you have models which define the way your data looks. You have your views, which render the data, and you have your models, which define what your data should look like. A typical model is a blog post, with the title being a char string, the content being text, and maybe the author being a foreign key. Like this you can define a model, and this is then translated into a table in the database.
So here it is very easy. If we are using the statistics model, we go to the logging database. Everything that comes from the statistics system just goes to logging. It is pretty much a write-only optimized database: we just push into it, and then we do all the offline analytics from that. This is not really time-critical. We don't want to waste time writing to the log, and it doesn't have to commit right away, and that's fine.
Whereas for the rest we really want to know where the data is. We said that the model decides where it is going. So if there is an instance, we call the get_shard method on the instance. It basically gets the project and returns the shard in which the data lives, so shard 1, 2, 3 and so on. The first characters of the hash tell us the shard. Like this we have the randomness of the hash and we can automatically get the shard. It is a really easy operation to find which shard to pull from.
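A rough sketch of the "model decides" idea. The classes below are plain Python stand-ins for Django models, and everything except the `get_shard` method name is hypothetical. The router relies on the `instance` hint that Django passes to `db_for_read` and `db_for_write` when an object is available.

```python
class Project:
    """Stand-in for a project model that carries its random hash."""

    def __init__(self, project_hash):
        self.project_hash = project_hash

    def get_shard(self):
        # First hex character of the hash picks one of 16 shards.
        return f"shard_{self.project_hash[0]}"


class Item:
    """Stand-in for a model whose rows live on their project's shard."""

    def __init__(self, project):
        self.project = project

    def get_shard(self):
        return self.project.get_shard()


class ShardRouter:
    """Route by asking the instance where it lives."""

    def _shard(self, hints):
        instance = hints.get("instance")
        if instance is not None and hasattr(instance, "get_shard"):
            return instance.get_shard()
        return None  # no instance or unsharded model: defer to other routers

    def db_for_read(self, model, **hints):
        return self._shard(hints)

    def db_for_write(self, model, **hints):
        return self._shard(hints)
```

Queries that start without an instance carry no such hint, so this router returns None for them and the database has to be chosen some other way.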
Another problem: every time we go through the database router, we are hitting the database to find out the project. So every time we are saving something or retrieving something, we are always hitting the database. This is a massive performance bottleneck. So we just added some caching, which basically says: try to get the project ID; if we know it, return it; if not, save it into the cache and return it. This is vital for this whole sharding thing, but you are going to run into these horrible things that are really hard to debug.
One of the most painful things is foreign keys. If you've got different shards and you've got foreign keys: you cannot have a foreign key pointing into a different database. So all your foreign keys are pretty much invalid if they cross shards, especially with horizontal partitioning, because this record points to that record, and that record could be on a different database, on a different machine. We've got 16 different machines so far. Foreign keys suddenly become this whole network, and it doesn't work. We even tried it; it just doesn't.
So what you basically have to do is resolve your foreign key to a positive integer and then do the lookup in the application. This is a massive performance bottleneck. Foreign keys just don't really work, and you have to write all this wrapper code to make them work again.
This sort of solution makes it suck a little bit less, but it is still not ideal. What we do is: instead of a country we save a country ID, a positive integer, and we define properties with setters and getters. If you want to get the country, it returns the object. This will always work, because it goes through the router and gets the right object, but of course it is a database hit on another machine. In the setter we do the same: we just set the primary key value. This assumes that primary keys are unique across your whole setup, which is another topic in itself; let's just assume for this that the primary keys are unique across every machine. Getting that right took us weeks, because it is really hard in a distributed environment to get a unique key; that is another really complicated thing.
Obviously, if you can, avoid this, because it is another database query. As before, we also employ caching here, because every time you touch anything in your model you trigger a database hit, which you really don't want. So we keep a local cache.
This makes it suck a little bit less, but it is still not ideal. As soon as you have shards you have all these problems you are going to run into. You can spread the load out, but you can also increase the load. So there is stuff to be careful about, such as hitting the database for every routing decision.
putting the data database for every reads decision yet all that yes we could have done that but then we have foreign keys we know again work and that's thing we so but in our case we have this
right the projects and we have these items I we know that will items in the projects are going to be on the same chart so in we have more
foreign keys In the project will be the items pointing to each other because for example we do is every time we create a new revision we create a new item but what we also know whether W linked lists so we can iterate through all the revisions really fast most of the form he's worked with all items in the same projects we didn't have so many foreign keys between the projects so we decided to just say we're going to expand the inter project communication and not enough because this has to be fast right this is what the user wants this is what the user wants to see the user doesn't want to wait hours for history to build up the you want this to be there right away so that consequences citizen we made we talked about this a lot but I was lucky in the respect that we
have these projects that have lots of items are contained space of social networks for example from really really hard to distribute so if you write a book posts was users will want to shut shop or was used the
other posts company shot by because of all the different uses because a user or the user will upload suppose that it's a city so a very good paper on how to shots active but such networks and will face but ended up doing because a consultative that were just copying every post onto every machine because they so that they couldn't when they were growing so fast at the beginning so every user got the machine and if 1 of your friends posted something you got a copy of that post in your timeline busy with a one-to-one copies that you have a thousand friends and you would would going to a thousand computers because that was the simplest thing possible then retrieval time of posts societal right they want to retrieve that person is really really fast and I want to sort of they want to do all this funky stuff the basic data space is an important we can just on your hard drive and some of them but we can distribute this really easy and then we contain the posts for this 1 user becomes embarrassing easy again so that's a total valid solution if you're working on such a network Twitter had huge problems paralyzing best when there in the ring system now that was really really hard so we still got around doing
it like this, which isn't the nicest way of doing it, but even a programmer who isn't too aware of this stuff can use the field now. They do have to be careful with selects on it, because every select will be a new database call. So be really careful about hitting the database for every routing decision. Hitting all shards is probably your worst case: if you're doing an aggregate or something and you somehow manage to hit all your shards, that's the worst case, because you lose the whole benefit of distribution again.
Not all of the Django tooling will work for you, either. I have never seen it work, and when people say they have never seen it work, you probably shouldn't use it anyway. What you can do is use the using parameter and say which database to use, and like that you can do totals on that shard. But you're not seeing the whole database, you're only seeing one shard, so totals across all the data will not work.
So you've got this whole sharded database, which is hard to monitor and debug. For somebody who hasn't been doing Django for a long time this is incredibly hard: why that database, where is that data record going? Especially when you've got an error somewhere, this is incredibly hard to debug, because you end up with thousands of records, wondering what is wrong, and you can't find it; you have to hit 16 different machines with databases just to find one record. So it gets really, really hard, but I think it's worth it. I personally think that if your project is growing, and you can see the numbers going up and more people using it, you should think about sharding earlier rather than later. Migrating your data might be a lot of work once you've got this huge database and have to move it all around your shards without losing anything. So think about it sooner rather than later.
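As a rough sketch of the point about the using parameter: a query pinned to one alias only ever sees that shard, so a total over all data has to visit every shard and add the results up, which is the worst case described here. The aliases, the fake row counts and the helper names are made up for illustration. In Django the per-shard count would come from something like `Item.objects.using(alias).count()`, where `Item` is a hypothetical model.

```python
from typing import Callable, Dict, Iterable

# Hypothetical shard aliases; in Django these would be keys of settings.DATABASES.
SHARD_ALIASES = ["shard0", "shard1", "shard2"]

# In-memory stand-in for per-shard row counts, so the sketch runs without a database.
FAKE_ROW_COUNTS = {"shard0": 120, "shard1": 95, "shard2": 143}


def count_on_shard(alias: str) -> int:
    """Count rows on one shard only.

    With Django this could be Item.objects.using(alias).count(),
    where Item is a hypothetical model.
    """
    return FAKE_ROW_COUNTS[alias]


def count_per_shard(
    aliases: Iterable[str],
    counter: Callable[[str], int] = count_on_shard,
) -> Dict[str, int]:
    """Fan out to every shard: one query per shard, the expensive case."""
    return {alias: counter(alias) for alias in aliases}


def count_everywhere(aliases: Iterable[str]) -> int:
    """A global total is the sum of the per-shard totals."""
    return sum(count_per_shard(aliases).values())
```

The cost of `count_everywhere` grows with the number of shards, which is why queries that stay inside one project, and therefore one shard, are the ones to optimise for.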
Here are some links; the last one is the most important. You have to fiddle around and be patient, which again can be very frustrating.
Thank you.
Questions? Yes.
So the question was: why not use the primary key as the shard key, the per-database ID that I use here? And the other question was about the UUID, a unique key generated by Django. Ideally in this setup you would not have one web server, or even one web process; every database server would be serving N web servers. Imagine a network diagram where you have probably three or four web servers sitting in front of every database server. But there is no way in Django to define a unique key across them, except by using a queue or something shared. There are different ways of doing that: there is all the caching framework stuff, memcached for example, so you can use the cache to do it. The problem is that the web servers have different threads, and these threads will not be aware of the other servers' threads being around. So what some people do is have one table that generates the unique keys and just increment that with a query, and a lot of people use memcached, the caching software, to generate an incrementing ID. You can also use random keys, but then you've got a problem if random isn't really random. And the second one, and that's what we do, is to put the shard into the ID. Exactly, that's what we actually do. This is massively simplified. So what you
can do, for instance, is have a positive integer field and say that the first n bits of the integer are defined by the machine and the other bits are incremented as normal. But there are loads of different ways to solve this, and it always depends on the load you have and on how many shards you expect.
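A minimal sketch of that bit layout, assuming 4 shard bits (16 shards) and 31 usable bits in total so that the value stays in the range Django documents as safe for a positive integer field (0 to 2147483647). The names `make_id` and `split_id` and the bit widths are illustrative choices, not the speaker's actual scheme.

```python
from typing import Tuple

# Assumed layout: 31 usable bits, the top 4 identify the shard (machine),
# the remaining 27 are an ordinary per-shard incrementing counter.
TOTAL_BITS = 31
SHARD_BITS = 4
COUNTER_BITS = TOTAL_BITS - SHARD_BITS
MAX_SHARD = (1 << SHARD_BITS) - 1
MAX_COUNTER = (1 << COUNTER_BITS) - 1


def make_id(shard: int, counter: int) -> int:
    """Pack the shard number into the high bits and the counter into the low bits."""
    if not 0 <= shard <= MAX_SHARD:
        raise ValueError(f"shard must be between 0 and {MAX_SHARD}")
    if not 0 <= counter <= MAX_COUNTER:
        raise ValueError(f"counter must be between 0 and {MAX_COUNTER}")
    return (shard << COUNTER_BITS) | counter


def split_id(value: int) -> Tuple[int, int]:
    """Recover (shard, counter) from a packed ID."""
    return value >> COUNTER_BITS, value & MAX_COUNTER
```

Because the shard number is part of the ID, two machines can increment their own counters independently without ever producing the same value, and any ID can be routed back to its shard with `split_id`. The price is a fixed upper bound on both the number of shards and the number of rows per shard.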
If you expect millions of database shards at some stage, you need a different strategy than if you just have 16. For example, our database config does not scale to a thousand machines, and we are not targeting a thousand shards. We have figured out how fast we can add machines, and seeing the data growth we have, we sort of know the sweet spot where we are going to run into problems, and that is so far in the future that we decided not to bother.
I agree, those systems are very, very nice and highly distributed. But when we started, this was a small project, an idea on a weekend, and I just wanted something really easy. Then I gave it to friends, the friends started using it, and then it started growing. By the time we discovered this was a product in the first place, we already had so many users that rewriting all the code for Couch, for example, which I was looking at at the time, would have been too much work. We were growing at such a pace that we were too busy to rewrite the database stuff. And for the problems we have, relational databases are quite nice, because you get these advanced index features and we have structured data. Maybe I should have put in some slides about the problem, but our data is quite structured: it is basically a title and a text, and these are ordered. So we do get a lot from having relational structures and indexes on that. The whole backup, for example, we just do with Postgres as well, and the replication too. So basically what we decided is that we are not going to do backups in Django. We do the replication on the Postgres level, because Postgres is really, really good at streaming replication. Of course you can do this with Couch in exactly the same way. Actually, looking back at it now, I would probably not have picked a relational database, but that is what we started with, and I don't think a rewrite is worth the amount of work.
So with the replication, shard one is replicated, and it would be a really stupid thing to have the replica on the same machine, so shard one is replicated onto another machine, and so on. Then we just have a switch in front of it, so we just switch the address over to the replica. Postgres gives us really fast replicas of all the data with streaming replication, and that seems to work quite nicely. But there is nothing in this that relies on ACID, so we could use Couch or something.
Yes, we had performance issues with that. There are things we just don't do. Joins across shards within the database are one of them: we didn't spend too much time on it, but I tried it and it didn't perform well, and it didn't seem scalable to the number of machines we wanted. At some stage, maybe at a hundred machines, we will hit a problem with unique keys and so on, but we can very easily calculate when we will be at that point. Actually, what
was the purpose of this? It shows the development machine. What we do there is run all the shard tables in one big Postgres, and you can see from the ports that they all connect to localhost. It is exactly the same layout, just all on one machine. The logging and all the rest of the tables are databases in that one database server as well. Like this we can really easily replicate the setup: we've got the same config in the development environment. Then in the settings we basically just have an if on production that replaces localhost with the different shard servers, and there we also have authentication.
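A sketch of what such a settings switch could look like, written as a plain function so it runs without Django. The shard count, the port numbering, the host name pattern and the environment variable names are all assumptions for illustration; only the idea (same aliases everywhere, localhost in development, real shard hosts plus credentials in production) comes from the talk.

```python
import os
from typing import Dict

SHARD_COUNT = 16   # assumption: the 16 shards mentioned in the talk
BASE_PORT = 5432   # assumption: "default" on BASE_PORT, local shard i on BASE_PORT + 1 + i


def build_databases(production: bool) -> Dict[str, dict]:
    """Return a DATABASES-style dict with identical aliases in every environment."""
    databases = {
        # Logging and all non-sharded tables live in the default database.
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "main",
            "HOST": "main.example.internal" if production else "localhost",
            "PORT": str(BASE_PORT),
        }
    }
    for i in range(SHARD_COUNT):
        alias = f"shard{i}"
        config = {"ENGINE": "django.db.backends.postgresql", "NAME": alias}
        if production:
            # Hypothetical host naming; credentials only exist in production.
            config["HOST"] = f"{alias}.example.internal"
            config["PORT"] = str(BASE_PORT)
            config["USER"] = os.environ.get("SHARD_DB_USER", "")
            config["PASSWORD"] = os.environ.get("SHARD_DB_PASSWORD", "")
        else:
            # Development: everything on one machine, told apart by port.
            config["HOST"] = "localhost"
            config["PORT"] = str(BASE_PORT + 1 + i)
        databases[alias] = config
    return databases


# In settings.py this would be driven by whatever flag marks production.
DATABASES = build_databases(production=False)
```

Because the aliases are the same in both environments, routing code never needs to know whether it is talking to sixteen machines or to one.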
So like this we can really easily develop locally, and we can really easily take a sample data set from production and put it on test, because we have exactly the same setup there. Test is one big machine again, with one or two database servers running on different ports.
More questions?
What you can see is that our shard key is a hash we generate at random, and all we do is take the first character of the shard key to decide which shard. What we could of course do is take the next character as well, so two characters, and then we basically multiply the number of servers. Does that make sense?
Yes, exactly. That's why I said the shard key is the most important thing you're looking at, because the shard key defines the whole performance, the whole scalability. A project will not sit on two shards at the same time. That would be the worst thing we could do, because we would have to go to two computers or two servers all the time. By using this we can also rebalance quite easily. At the beginning, when we had two machines, we just said these values are going to be on one and those are going to be on the next, and later we split again. So while growing we could scale horizontally quite easily by just copying the database. That was really nice when we got an article on one of the big news sites and traffic was exploding: I just had to change the config, two lines or so, bring up a new machine and rebalance everything, and right away the load went down again. So doing it like this gives a lot of advantages.
What goes wrong if you pick the wrong shard key? That's the test, and we hope it's never going to happen. It's really, really hard for some projects to find the right shard key, and that's why my slide said: be aware, really figure out what the next steps are going to be, implement it, and try whether it still works with that shard key. I have spent a lot of time doing this.
We need some randomness somewhere so we get an equal distribution across shards, and that's why we introduced this hash for every project. Of course you can extend this up to the length of the hash, so we could have a few thousand servers. The hash is a real random string: every project has a random hash, and we partition by the hash.
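The routing by first character of a random per-project hash could be sketched like this. The hash format (hex), the helper names and the shard names are assumptions; the sketch only shows how a prefix map decides the shard and how growing from two to four machines is a change of that map.

```python
import secrets
from typing import Dict, List

HEX_DIGITS = "0123456789abcdef"


def new_project_hash() -> str:
    """Random hex string, assigned to a project once when it is created."""
    return secrets.token_hex(16)


def build_prefix_map(shards: List[str]) -> Dict[str, str]:
    """Spread the 16 possible first characters evenly over the given shards."""
    if not shards or len(HEX_DIGITS) % len(shards) != 0:
        raise ValueError("number of shards must divide 16 in this sketch")
    per_shard = len(HEX_DIGITS) // len(shards)
    return {ch: shards[i // per_shard] for i, ch in enumerate(HEX_DIGITS)}


def shard_for(project_hash: str, prefix_map: Dict[str, str]) -> str:
    """Only the first character of the project's hash decides the shard."""
    return prefix_map[project_hash[0]]


# Two machines at the start, four after growing: only the map changes.
TWO_MACHINES = build_prefix_map(["db0", "db1"])
FOUR_MACHINES = build_prefix_map(["db0", "db1", "db2", "db3"])
```

Changing the map is the small config change; the rows whose prefix now points at a different machine still have to be copied there before the new map goes live. Using the first two characters instead of one would allow up to 256 shards with the same approach.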
So the project always stays on one shard. Everything related to a project, pictures and all the different types of items we have, is attached to the hash and sharded by the hash. That's why we put the decision where to route to into the model; it is always in the model file. We have pictures, texts, all different types of classes, basically models, and every model decides what it is attached to. So here we get the shard, basically, from the project the item is assigned to. This would be an item, a text item, and it's defined in the save function: every item needs a project, so we get the project here and then say save me, or put me, under that shard.
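The save-time routing could look roughly like this. It is a plain-Python stand-in so that it runs without Django: the classes, the field names and the in-memory `FAKE_SHARDS` store are hypothetical. In a real Django model the marked line would be a call to `super().save(using=alias)`.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical mapping from the first hash character to a database alias.
PREFIX_TO_ALIAS = {ch: f"shard{i}" for i, ch in enumerate("0123456789abcdef")}

# In-memory stand-in for the shard databases.
FAKE_SHARDS: Dict[str, list] = {alias: [] for alias in PREFIX_TO_ALIAS.values()}


@dataclass
class Project:
    shard_hash: str  # random string assigned when the project is created


@dataclass
class TextItem:
    project: Project  # every item needs a project
    title: str
    text: str

    def shard_alias(self) -> str:
        """The project's hash, not the item itself, picks the shard."""
        return PREFIX_TO_ALIAS[self.project.shard_hash[0]]

    def save(self) -> str:
        alias = self.shard_alias()
        # In a Django model this line would be: super().save(using=alias)
        FAKE_SHARDS[alias].append(self)
        return alias
```

Keeping the decision in `save` means calling code never chooses a database by hand, so all items of one project end up on the same shard.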
The question is about a globally unique lookup. Exactly, that's why we've got the cache. Yes, we thought about that, and that's what we had to do in the backend. What we actually do now is use the hash. We had the problem that we had one project database where all the projects were, and that database was just hammered, because every lookup went through it. Now we've got a key-value store that is distributed, and we don't even use the project ID anymore. On all the machines we have a key-value store that is balanced, and we use the hash to point to the different projects, so we've distributed that out again. We still have network traffic; ideally the store would be on the same machine, but we haven't done that yet. So there is one network hop, but you no longer have one machine that is needed to find every project. That's why in this example we introduce this cache: the cache is distributed again, and the lookup must be fast.
What about migrations? Yes, that's incredibly easy. You just add databases: shard one, shard two, shard three. All I've done is write a script that has a list of all my shards. I SSH onto the machine, and because of how our machines are named, with the shard name encoded in the machine name, I can run the migration there and name the database. When developing, it is very important to have the database routers set up properly, because if you migrate and the routers are not set up properly, you end up with data everywhere and it doesn't work. So always set the database. This is a good habit anyway if you're working with multiple databases: never leave it out, because then everything goes to the default, and that will ruin a lot of your life. Always name the database you want to migrate. The only other thing to deal with is that when you add a new migration on a model, you have to run it on all the shards; if you don't, the shards become inconsistent. And be sure to set all the default values and everything the same everywhere, because otherwise, once you start rebalancing and copying records over to different shards, you get inconsistent data.
Whether to use something else depends on the project. On the last project I was in we did use Couch, because it just made sense. For this project, I don't know. I think we get a lot from a relational database, because the lookups are really nice and fast, but we always have this pain. Having chosen Django, we had to use a relational database, because the object mappers for anything else were really bad. So if I did it now, I probably wouldn't use Django at all: the front end would be Angular, some JavaScript framework, and the back end would be Couch, because that's what interests me right now and what I want to learn. But this was really something I started on a weekend, and Django was the technology I knew, so I could just code it out.
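The per-shard migration loop described earlier in this answer might be sketched as follows. The host naming scheme, the shard list and the remote invocation are assumptions; the one real piece is Django's `migrate --database=<alias>` option, which is what naming the database refers to. The sketch only builds and prints the commands rather than running them.

```python
import shlex
from typing import List, Sequence

# Hypothetical list of shard aliases, one database alias per shard.
SHARD_ALIASES = [f"shard{i}" for i in range(16)]


def host_for(alias: str) -> str:
    """Hypothetical naming scheme: the shard name is encoded in the machine name."""
    return f"{alias}.example.internal"


def migrate_command(alias: str) -> List[str]:
    """Build the ssh command that migrates exactly one named database."""
    remote = shlex.join(["python", "manage.py", "migrate", f"--database={alias}"])
    return ["ssh", host_for(alias), remote]


def all_migrate_commands(aliases: Sequence[str] = SHARD_ALIASES) -> List[List[str]]:
    """One command per shard, so no shard is left on an older schema."""
    return [migrate_command(alias) for alias in aliases]


if __name__ == "__main__":
    for command in all_migrate_commands():
        # To execute for real, pass `command` to subprocess.run(command, check=True).
        print(shlex.join(command))
```

Always passing an explicit `--database` value is the habit recommended here: without it, the migration runs against the default database only.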
Really fast, actually: the prototype, the first version that was used by lots of people, is something I wrote over a weekend, Saturday night pretty much. Then we had so many users that at some stage we realized this might become a product, then the machine went down because we had too many people, and then we started doing all this stuff. So we started with what we had. I don't know whether using Couch or something similar is really the solution for everything. My experience with Couch was that at some stage it got really big as well. We had a project with four servers rebalancing and mobile phones syncing up, and at some stage it would just crash and we never figured out why; we only got around it with compression. So I don't think that is the ideal solution either. Look at your problem, then look at what you really want to achieve and where the bottlenecks are. I can guarantee you that one table is going to be huge and all the others will be small.
At the beginning we didn't use this, and we really had to write migration scripts. If we were limited to 16 shards now and wanted to use another two, we would obviously have to rebalance everything, and that is just a maintenance task: we go offline and rebalance everything. That is not a big problem, because it turns out users are actually quite easy to predict. Because the key is random, and from historical data, we sort of know that we are going to have so many power users and so many who log in once and never come back. So we know quite well what the mix of users is going to be and what the requirements are going to be, and because we use a random key we know how fast the shards are going to fill. Unless maybe TechCrunch runs a huge article and we crash, we know quite well where the break-even point is, when we need new shards and more machines. So as soon as you have committed to your shard set, in this example you are fixed to it; otherwise you have to write a script to rebalance. It works, it's a lot of work, but at the end of the day you just select everything on the database that looks like this and copy it over. What you could also do is emulate this with views or something like that, which we didn't do; then we could have just copied the view. Other questions? ideas could but little from the room by using much to