
How we scaled GitLab for a 30k-employee company

Thank you all for coming. This talk is in the track "We're living in a distributed world," and I was surprised to find that it seems to be the only talk in that track about scaling Rails applications into distributed systems. I think the reason might be that, as Rails developers, we follow best practices that make our apps not that hard to distribute and scale. But this one — GitLab — is a bad boy: it really has some problems in that regard, and I'm going to talk about how I fixed them. So thank you very much for coming to my talk. My name is Minqi Pan. I came from China, and I work for Alibaba Group. Here are my GitHub account and my Twitter handle — you're welcome to follow me.
So, what is GitLab? Well, simply put, it is just an open-source clone of GitHub — but nobody likes to say that. By the way, the difference from GitHub is that it can be deployed on machines installed on premises. A quick survey: how many of you use GitLab in your organization? ... Thanks.
GitLab, if you see it as a black box, exposes two ports: one is HTTP and the other is SSH. HTTP is used for two purposes: you can clone a repository via HTTP and you can push content to a repository via HTTP, and, more importantly, as a Rails application GitLab provides rich user interactions through web pages. The SSH port, on the other hand, only allows you to do clone and push operations. On the back end, from a very simplistic point of view, it stores content in Git, and that is what makes this thing a monster to scale — very problematic. If you look closer, it also uses some other stores on the back end. One is MySQL — actually it also supports PostgreSQL, because it uses ActiveRecord, which abstracts away the implementation of the database, so it's changeable. Another is Redis, used as a queue for delayed tasks and also as a cache. And the last is the file system: it uses the file system to store the Git repositories. So that's the black box.
If we open it up to see what's inside, you can see it is basically structured like this. GitLab is open source, so you can also download the source code and see for yourself. On the front end there are two parts: NGINX and an OpenSSH server. The reason those components ship inside GitLab is that GitLab has an omnibus package, pre-assembled, which depends on those two packages: NGINX serves the HTTP port, and OpenSSH, as we mentioned, serves the SSH port. When an HTTP request comes in, it goes to the second layer: Unicorn handles the ordinary Rails requests, but requests for git clone and git push go to gitlab-workhorse, a separate server written in Go to make them fast. If a request comes in over SSH, it goes to the third part of the second layer, namely gitlab-shell. The third layer is called by the second-layer components: mainly, Rails is responsible for operations on pages, and gitlab_git is a wrapper around Rugged, and Rugged is a wrapper around libgit2. On the fourth layer, Sidekiq handles background task processing. And on the lowest layer are git and libgit2 — GitLab utilizes both implementations of Git. If you don't know about it, libgit2 is actually a rewrite of Git in a way that is portable and embeddable and works as a library: the name is "git2" because they see it as the second generation of Git, with the "lib" prefix because it's a library.
This structure works really great for small teams, but the company that I work for has 30 thousand employees — this number is from the fiscal-year report of last year; they just published a new one days ago, the day before yesterday, and the stock price looks good — it's a public company. So, how do we scale GitLab for that?
Well, we first considered the problem on the front end. When a request comes in, it is either HTTP or SSH. As Rails developers we are most familiar with HTTP — workhorse and Unicorn are Rack-style HTTP servers, something we're familiar with as well. We can just put NGINX in front of them, set up an upstream in the configuration, and let it point to the Unicorn servers in the back end. But for SSH, how do we load-balance? That was a problem, so I started a project called ssh2http — the source is online, you can have a look. It basically translates all those SSH requests: the way Git interacts with the server is very similar between HTTP and SSH, so a request over SSH can be easily delegated to a request over HTTP. And as we will see from the slides later, SSH is actually such a pain — there are more complications with it — so I guess that is the reason why GitHub has made HTTPS the default: when you go to a public repository on GitHub and clone it, the URL shown, as far as I remember, is an HTTPS URL instead of an SSH one.
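The delegation idea can be sketched roughly like this — a toy Ruby mapping of my own (not the gem's actual code); the URL shapes follow Git's smart-HTTP protocol, where `git clone` invokes `git-upload-pack` on the server and `git push` invokes `git-receive-pack`:

```ruby
# Translate the command Git runs over SSH into the equivalent
# smart-HTTP endpoints, so an SSH front end can proxy to HTTP.
def ssh_command_to_http(command)
  m = command.match(/\A(git-(?:upload|receive)-pack) '([^']+)'\z/)
  raise ArgumentError, "not a git transport command: #{command}" unless m
  service, repo_path = m[1], m[2]
  {
    # The ref advertisement happens over GET ...
    refs: "GET /#{repo_path}/info/refs?service=#{service}",
    # ... then the pack exchange happens over POST.
    rpc:  "POST /#{repo_path}/#{service}",
  }
end
```

With that mapping, an SSH daemon only needs to forward the two requests to the HTTP front end and stream the bytes back.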
There are actually complications in this architecture that make SSH access a little slower than it's intended to be. But actually, we did not use my approach — my approach was this slide, but we actually used this slide. What we did was: we are not using NGINX as the front end; we use something called LVS, which is a feature of the Linux kernel, and the specific part of it that we're using is called IPVS, which expands to IP Virtual Server — LVS stands for Linux Virtual Server. It is basically a load-balancing and switching service: unlike NGINX, which operates on layer 7 of the TCP/IP stack, it does its load balancing on the transport layer, so it supports all communications as long as they are TCP/IP — the difference between HTTP and SSH no longer matters. But that comes at a cost as well: when you go down to layer 4, you lose the ability to do health checking based on the status code returned by a request — on layer 7 you can actually see the status code of your HTTP requests and mark back ends as healthy or unhealthy, but here you cannot see that; you only see packets, you only see the data. And URL rewriting — you lose that ability as well, because that is a layer-7 feature too. And, like I said, it comes with complications, because the SSH protocol involves security mechanisms that check host keys, and if you have more than one machine in the back end, their keys are not the same by default. So when we deploy the application, we first have to copy the host keys across the cluster to make them the same; otherwise, when the client connects to more than one server, it will complain, saying the SSH host key has changed — "this could be a security vulnerability, you'd better check it out" — and it will not connect. Secondly, if you remember, you can add SSH public keys from the client on the web pages, and the same thing happens there: when you add your SSH key to the server, it has to copy the key across the entire cluster to make every machine accept it — it is added as a line in the ~/.ssh/authorized_keys file, and that has to be done on every machine. You cannot do that with Sidekiq, because only one machine in the cluster will take that job and the others will not even know about it; you have to do it in a way that broadcasts the key across the whole cluster. We did that with Redis pub/sub, and that covers the back-end structure there.
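The broadcast can be sketched like this — an in-process stand-in for Redis pub/sub (the class names and file handling are my illustration, not GitLab's or the talk's actual code):

```ruby
# Every app server subscribes to a "new key" channel; publishing once
# reaches every machine, so each appends the key to its own copy of
# ~/.ssh/authorized_keys (simulated here as an in-memory array).
class KeyBroadcaster
  def initialize
    @subscribers = []            # stand-in for Redis SUBSCRIBE handlers
  end

  def subscribe(&handler)
    @subscribers << handler
  end

  def publish(pubkey)            # stand-in for Redis PUBLISH
    @subscribers.each { |h| h.call(pubkey) }
  end
end

def demo_broadcast
  files = Array.new(3) { [] }    # one authorized_keys per machine
  bus = KeyBroadcaster.new
  files.each { |file| bus.subscribe { |key| file << key } }
  bus.publish("ssh-rsa AAAA... user@example")
  files
end
```

The point of pub/sub over a queued job is exactly the one in the talk: a queue delivers the job to one consumer, while a broadcast reaches every machine.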
Now the real trouble begins: the back end stores the repositories on the file system, and I want to pause a moment to remind you of the twelve-factor app. The reason why GitLab is such a bad boy, unlike other Rails applications, is that it violates the fourth rule of the twelve-factor app. Twelve-factor is a set of principles advocated by Heroku, and the fourth rule is that backing services should be treated as attached resources: things like an Amazon S3 service or a MySQL service should all be configured as a URL that can be easily attached and detached. But GitLab keeps content on the file system, and that is the source of all evils. The common culprits are, firstly, the Git repositories, and secondly, user-generated attachments and other artifacts. We are going to move them to the cloud to make it scale. Actually, standing at this point, you have a lot of choices, and the choice that I'm going to elaborate might not be the best one — I'll analyze the options that we had, so if you ever run into a Rails application that has a similar problem, you can evaluate those options as well.
The first option comes from a feature provided by GitLab itself: the Enterprise Edition has something called GitLab Geo. But that doesn't really solve the problem. The way Geo does things is that it makes full replications of your GitLab instance across the cluster: it assumes that each machine in your cluster has enough file-system storage to hold all the content of your Git repositories, and it makes hundred-percent copies across them. It is officially supported by GitLab, but it really didn't solve our problem at Alibaba, because the overall size of all the repositories is big — we don't want to store them on one single machine; there isn't enough disk space to hold them. From a distributed-systems point of view, Geo is a one-master, many-slaves, full-replication design. In terms of the CAP theorem — which says consistency, availability, and partition tolerance cannot all be achieved at the same time, you can only achieve two of them — Geo achieves A and P of those three. It mainly gives you disaster-recovery support, and there is absolutely no sharding, because everything is fully replicated.
The other option we considered seemed, at first, a perfect way to solve the problem. First of all, we eliminate SSH with that gem of mine, ssh2http, so that we can forget about the SSH problem and focus solely on HTTP. Then, seemingly, there is something we can take advantage of: every repository stored on GitLab can be identified by namespace/repo_name, and that part appears in almost every URL of every request — when you view the repository's commit history on a page, the URL contains that part, and when you clone it, the clone URL contains that part too. So we can use that part as a routing key and put some routing logic into NGINX to make it a shard indicator; by doing that, every request that comes to NGINX will be sharded. For example, if we are going to have a cluster of size 3, we can use some hash algorithm that distributes the hash of namespace/repo_name onto any one of those three machines. So this seems perfect — but as we continued, it exposed some problems. One problem is that Sidekiq does not have sharding — well, maybe it does, but you have to manage it and figure out how to do it: the Rails app on each of those three shards will spawn Sidekiq tasks that need to be consumed by the corresponding Sidekiq shard as well, so we would have to start the Sidekiq shards with special queue names. That is one complication, and there are others: changes have to be made at the application level as well, because not every page falls into a single shard. For example, on the admin page you can see a list of all the repositories with their sizes; if that request goes down to only one single shard, you will not get that information, because some repositories reside on other shards. So major changes would be introduced at the application level. Also, you need super-user authentication, because the sharded requests are not designed to access all repositories — a user-authentication layer in front of them is yet another application-layer logic change that would have to be introduced. This is actually not ideal; every way of solving this comes with a cost.
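The routing-key idea above can be sketched in Ruby — in production the equivalent logic would live in the NGINX configuration, and the particular hash used here is my own illustrative choice:

```ruby
require "digest"

# Pick a shard for a repository by hashing its "namespace/name" path —
# the part that appears in nearly every GitLab URL. The same repo
# always maps to the same machine.
def shard_for(repo_path, shard_count)
  Digest::SHA1.hexdigest(repo_path).to_i(16) % shard_count
end
```

Because the mapping is a pure function of the path, any front-end instance computes the same answer with no shared routing table.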
So let's then think about how to deal with the file-system storage. We had other options. First, we could make it twelve-factor by making the file system attachable: there are solutions like hardware network-attached storage — your very classic NAS — and there is software NAS as well; Google, for example, has GFS. Second, we could use remote procedure calls, so that we only shard at the RPC level instead of across the whole application. And third, we might consider the cloud: maybe use Amazon S3 to replace the file system as the storage back end. Well, we evaluated all those options. It turned out that hardware NAS was out — Alibaba does not buy those things, because it has the "no IOE" policy — and software NAS Alibaba does not have yet: Google has GFS, but Alibaba does not have an equivalent. But I have to remind you that those two options might be good options for your organization if you want to scale GitLab; they are really good means of solving the problem, because they introduce very little change at the application level — all the changes are confined to the lower storage layer that gets attached. I did not try them, though, and they surely come with a cost as well, because networked software NAS has to be very complicated. As far as I know there are some good solutions — CephFS, for example, just became stable about a month ago — but if something goes wrong on that layer, you need to have talented operations engineers at hand to solve those problems. Also, by attaching software NAS you will lose performance, because each I/O to the file system now goes over the network, which adds latency to every I/O, and you are replacing things at a very low level, so the added cost will be large. So much for those two options. As for the RPC option — it is a good solution. I looked up how GitHub solved their problem, and it looks like that's what they do: they dispatch access to Git as RPC calls into shards; GitHub shards on a different level, and that kind of sharding looks like a good solution. But what we did at Alibaba was the fourth option: we killed the file system and used the cloud.
What we used is called Alibaba OSS — something not that well known, but you can think of it as the same kind of thing as Amazon S3: object storage in the cloud. So how did we do that? The rest of this talk will be the technical part of it. It turned out that GitLab has three ways to access Git repositories, namely libgit2, git, and Grit. Grit is a very old gem. Well, we found that it could be eliminated, which makes the whole problem easier: GitLab only uses it through its wiki, in a gem called Gollum, and Gollum was designed to have its Git-access part pluggable. So we wrote and plugged in an adapter that uses Rugged, which is backed by libgit2. That leaves only git and libgit2, and we compared those two projects. git is pretty old — it was started by Linus Torvalds — and it did not consider the problem of plugging and unplugging back ends, so its back end is hard to replace: all the code is written to access content from the file system. But libgit2 is very modern — I don't know how its creators were thinking about the problem, but they designed the back ends as replaceable: you can write your own back ends.
So the basic idea is: we write our own back ends that actually store the content on cloud storage. Also, with Grit eliminated, we have to reimplement some pieces on top of libgit2, because git cannot easily have its back end replaced — but it can be made to call into code that does. So what does the cloud-based back end look like? Let me go over some details about Git first. Git has two kinds of storage: one is called the ODB and the other is called the RefDB. The ODB is for the chunks of data that are stored inside repositories, and the RefDB is the branches and tags that you put in the repository. For the ODB, there are also two kinds of storage. The first is the loose ODB: Git is fundamentally a content-addressable file system, with the content address being the SHA-1 — the SHA-1 of the object you are trying to fetch — so the loose storage stores each object under its SHA-1.
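The content addressing works like this — a Git blob's ID is the SHA-1 of a small header plus the content, which is why the loose ODB maps so directly onto a key-value object store. A sketch (the key layout mirrors `.git/objects/ab/cdef...`; the helper names are mine):

```ruby
require "digest"

# A Git blob's object ID is SHA-1 of "blob <size>\0<content>".
# The loose ODB is then a key-value store from that ID to the
# (zlib-deflated) content.
def git_blob_id(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

# Hypothetical object-store key for a loose object, mirroring Git's
# two-character directory fan-out on disk.
def loose_object_key(id)
  "objects/#{id[0, 2]}/#{id[2..-1]}"
end
```

Because the key is derived from the content itself, a loose read or write becomes a single HTTP GET or PUT against that key.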
Let me open up an example. Here is a Git repository: if you go into the .git directory and run tree, you can see there are files like these — those are the loose-stored files — and there are also pack files — those are the pack-stored files. That's what I mean. So our cloud-based back end has to store both types of those files.
The basic idea is this. For the loose files it's pretty straightforward: when you read, you make an HTTP request to read from the cloud. Oh, I forgot to explain the RefDB — it is very similar to the loose files. You can see it in the refs directory: all of your branches are inside refs, like refs/heads/master, and master simply tells you a SHA-1. So it is basically a key-value store, and that translates to HTTP requests pretty straightforwardly: for each ref read we make an HTTP read, for each RefDB write we make an HTTP write, for each loose-ODB store we make an HTTP PUT, and for each loose-ODB read we make an HTTP read. So that's the simple part. The complicated part is the packs and the content inside them, because if you only stored loose content it would be slow — the very reason why Git is so fast is that it has a very good design of packs. Pack files are used both as a way to transfer content between server and client and as a way to store the content of your repository: it is both a transfer file format and a storage file format. The way we write those packs is simple — we translate them into HTTP PUT requests — but the way we read them is complicated: every pack comes with an index file.
The index file tells you, if you are looking for some object in the pack, where to start. So each read is translated into a number of ranged HTTP requests: first we read the .idx file to find the range to read in the pack, and then we read only that small portion of the pack file, using the Range header, from the object store.
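That read path can be sketched with a toy index — sorted `[sha1, offset]` entries standing in for a real .idx — and the Range header each lookup turns into (the slice-size padding reflects the over-fetching trick the talk describes; the function is my illustration):

```ruby
# Binary-search the sorted .idx entries for the object's offset in the
# pack, then express the read as an HTTP Range header so only a slice
# of the pack is fetched from the object store.
def pack_range_for(idx_entries, sha1, slice_bytes)
  lo, hi = 0, idx_entries.size - 1
  while lo <= hi
    mid = (lo + hi) / 2
    cmp = idx_entries[mid][0] <=> sha1
    if cmp.zero?
      offset = idx_entries[mid][1]
      # Fetching a slice larger than strictly needed lets one request
      # often cover a delta and its base together.
      return "Range: bytes=#{offset}-#{offset + slice_bytes - 1}"
    elsif cmp.negative?
      lo = mid + 1
    else
      hi = mid - 1
    end
  end
  nil # object is not in this pack
end
```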
As an example: if Git wants to read some content, first the offset gets looked up — it binary-searches in the index file and gets an offset into the pack file — and then it checks whether the content stored there is a delta or not. If it is a delta, it has to continue looking for the base of that delta, and the whole chain continues until you find the root; by combining all the deltas with the base, you get the object that you were reading. And here is a real-world example where the chain is as long as 5: you have to jump around inside the pack file to actually get the thing that you want, because each time what you read is only a delta. That was a real problem for us, because if the I/O pattern inside a pack file is not good enough, you end up making a lot of range requests over HTTP, and that makes the whole thing awfully slow. But the good news is that git applies some heuristic algorithms when the pack files are generated, so that those I/O patterns are not that bad. So when we make a range request, we can actually make the range bigger than we need; that way we fetch more content with each range request, and that content is usually sufficient to follow the chain all the way to the root of the object. Thanks to this characteristic we eliminated many HTTP requests, which makes the whole solution not that slow. That's one part of it.
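The chain walk just described can be sketched like this — a toy store where each entry is either a full base object or a delta pointing at another entry (the "delta" here just appends a suffix; it is a stand-in for Git's real binary delta encoding, not an implementation of it):

```ruby
# Resolve an object stored as a chain of deltas: follow base pointers
# down to the root, then apply each delta on the way back up. In the
# cloud back end, every hop in this chain can cost one range request.
def resolve(store, id)
  entry = store.fetch(id)
  return entry[:data] if entry[:base].nil?       # root: a full object
  resolve(store, entry[:base]) + entry[:delta]   # one more hop
end
```

A chain of length 5 means five lookups for a single logical read — exactly why long delta chains hurt so much over HTTP.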
The other part, as I said, is that you have to make git talk to libgit2, because git does not have a replaceable back end. It turned out this is pretty easy, actually. The maintainers of git are pretty smart folks: they wrote git in the Unix way, where the commands call each other. For example, when you git clone, on the server side the first command called is git-upload-pack, and git-upload-pack will then call another command called git-pack-objects. The commands that deal with the transmission protocols we do not touch — that part is complicated, and we leave it alone; we only touch the part that does I/O from the disk. So for cloning we only need to replace git-pack-objects, and in the git push scenario we only need to replace git-unpack-objects, and implementing those on top of libgit2 is very easy — not a big task. Also, there are two scenarios when people git push: small pushes get unpacked right away and written to the loose storage, while big pushes do not get unpacked — because unpacking consumes time — instead, an index is created for the pack directly and the pack is written as-is. For that case we also needed to reimplement git-index-pack, which is also an easy task. All right — so after all those changes, let's see what the performance looks like.
It is definitely going to be slower, because we are changing fast file-system I/O into slow HTTP I/O. So let's see how it looks. The test repository we used is called gitlab-ce; it has more than two hundred thousand objects, and when packed it weighs more than 100 megabytes. git push: on the file system we write directly to disk, while on the cloud we write directly over HTTP, and there are not too many new operations created — it only adds a small amount of time for those HTTP operations. git push of a delta: like I said, there are two scenarios — if you push a lot of content, only the pack is stored, which is the large-push case, and if you push only a little content, it gets unpacked and stored loosely — this is the delta case, and it costs not too much extra time either. git clone is actually about a hundred percent slower, because we have all the range operations happening; that's what makes it slow. And git fetch got way slower still: this is the delta-search case, which typically happens when you pull your co-workers' updates to the repo, and it also has to go through the whole process of range operations that I mentioned, so it is really slower. The good news is that although the user has to wait longer, it is not something they cannot wait for. Also, the pages got way slower: all of the Rails operations were affected, because we are operating on a deeper level — Rails calls Rugged, and Rugged calls libgit2, which now goes to the cloud. Like on this page with the listing of files: the show action now took about 5 seconds to run. Note that all of those benchmarks are without cache, so the real-world experience will be better, because we have caching. Here is another Rails operation: before the changes, 50 ms; after, about 5 seconds.
That is why we had to add a lot of caching: we added caches on multiple layers, including those Rails layers. I'm not going to elaborate on all the caches we added, but here is one interesting aspect. libgit2 was designed in a way that it can have more than one ODB back end, and you can even set a priority for each one. So we basically made a hamburger structure of back ends: we added new back ends, among them a cache back end. The servers that we deploy to still have a local file system to use, and we use that as an on-disk cache: if we read some content once, we store it on the file system, so that the next request that hits it can just read the content from the file system instead of making remote calls. The good news is that the ODB of Git never changes — you can only put data into it, never modify it — so we are free from the problem of cache expiry. The RefDB could be cached as well, but that is more complicated and maybe not worth the effort — maybe at some moment in the future — because you would have to expire the RefDB cache all the time: every time you commit new commits, refs/heads/master gets updated, so you have to expire the cache. I'm not going to go into the details of when the cache gets updated. Lastly, I want to say something about future work. For right now, it seems like this idea works more or less acceptably. I would like to add an AWS S3 back end, because it currently works against OSS, which is not so widely used, and there is a need for this: GitLab cannot be deployed to Heroku at this moment, but if we made this work with an S3 back end, then users of GitLab could have a chance to deploy it to Heroku.
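The read-through cache described above relies on Git objects being immutable, so cached entries never expire. The layered-back-end idea can be sketched in plain Ruby — libgit2 really does let you register prioritized ODB back ends, but the classes here are my own illustration, not GitLab's code:

```ruby
# Stand-in for the remote object store, counting round trips.
class CountingRemote
  attr_reader :reads
  def initialize; @reads = 0; end
  def fetch(id); @reads += 1; "data-#{id}"; end
end

# Read-through cache over two ODB back ends: a fast local one (the
# on-disk cache, here a Hash) and a slow remote one (the cloud store).
class LayeredOdb
  def initialize(cache, remote)
    @cache, @remote = cache, remote
  end

  def read(id)
    hit = @cache[id]
    return hit if hit
    data = @remote.fetch(id)   # remote round trip only on a miss
    @cache[id] = data          # safe to cache forever: objects never change
    data
  end
end
```

A second read of the same object ID never touches the remote store, which is exactly what brought the page times back down.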
Also, GitLab still has many direct calls to git — for example, for the commit-history page of a repository, it actually spawns another git process to fetch the result — so we could eliminate some of those direct calls. And if we develop the back ends further so that those calls are no longer necessary, you could add a setting for the user to choose which back end to use — either the file system or the cloud — and that would be perfect. For Gollum, we could do some work to make Rugged the default adapter. And in libgit2 itself we found somewhat lower performance — it is a little slower than git in many scenarios — so we could improve its performance in the future. I will be actively working on those jobs on my GitHub account, so if you're interested, you can look into my account and see how it goes. Thank you very much.


Formal Metadata

Title How we scaled GitLab for a 30k-employee company
Series Title RailsConf 2016
Part 20
Number of Parts 89
Author Pan, Minqi
License CC Attribution - Share Alike 3.0 Unported:
You may use, modify, and reproduce, distribute, and make publicly accessible the work or its content in unmodified or modified form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and distribute the work or this content, including in modified form, only under the terms of this license.
DOI 10.5446/31525
Publisher Confreaks, LLC
Release Year 2016
Language English

Content Metadata

Subject Area Computer Science
Abstract GitLab, the open source alternative to GitHub written in Rails, does not scale automatically out of the box, as it stores its git repositories on a single filesystem, making storage capabilities hard to expand. Rather than attaching a NAS server, we decided to use a cloud-based object storage (such as S3) to replace the FS. This introduced changes to both the Ruby layer and the deeper C layers. In this talk, we will show the audience how we did the change and overcame the performance loss introduced by network I/O. We will also show how we achieved high-availability after the changes.
