
Speed up the monolith


Formal Metadata

Title
Speed up the monolith
Subtitle
building a smart reverse proxy in Go
Series Title
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You may use, modify, and reproduce, distribute, and make publicly available the work or its content in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
GitLab is a Ruby on Rails application, but this didn't prevent us from having fun with Go. Learn how we decomposed our monolith by writing a smart reverse proxy in Go that handles I/O-intensive operations, a technique that every web app can use, regardless of the company stack. We set a deadline for releasing a cloud-native version of GitLab and put a team of engineers to work planning the Helm charts, splitting several components into independently scalable pods. The team faced a few challenges. GitLab's main codebase is written in Ruby, which has a global interpreter lock. We relied on NFS to asynchronously upload files from our worker fleet. Removing the shared file system by uploading directly from the controller was not an option. We wanted to move to an object-storage-based solution, but that was a paid feature and we had to port it to the open-source codebase. Oh, and we also needed to make sure the rest of our engineers could keep shipping new features at our regular monthly cadence. At the same time, we were planning our infrastructure migration from Azure to Google Cloud. Removing the intermediate state, where a file is on a GitLab server's NFS but not yet uploaded to object storage, would make the migration a lot easier. We had to remove the NFS dependency to make GitLab easily deployable on Kubernetes, and we needed a performant multi-cloud object storage uploader that was also viable for on-prem installations, a solution that would work from a single-server setup up to GitLab.com scale. Luckily, we had already written Workhorse, a smart reverse proxy written in Go for handling Git operations. It was time to extend Workhorse's capabilities, leveraging the full power of goroutines. We had a plan, but the devil is in the details. Allow me to guide you through this journey. During the talk I'll tell you how a Ruby on Rails company began to write Go code, how we implemented an object storage uploader inside our proxy, the problems we faced, and the tradeoffs we made to deliver this in time.
Transcript: English (automatically generated)
And the last talk is going to be about how to build a smart reverse proxy in Go. So, round of applause. Okay.
Thank you. So, we're going to build a smart reverse proxy in Go. First, a couple of words about me. My name is Alessio Caiazza. I am a backend engineer in the infrastructure department at GitLab, and we are an all-remote company, so I am one of those faces back there. We gather once a year; this is the last gathering we had, in New Orleans.

So, I'm going to tell you a story. Imagine the infrastructure department announcing that we are going to migrate our production from Azure Cloud to Google Cloud Platform. You say, wow, this is really cool. More or less at the same time, with more or less the same deadline, the distribution team announced that we are going to release a cloud-native charts installation for GitLab. Also really cool. Then you start thinking, wow, we will ship features, we will keep delivering GitLab while migrating all those things, and then you start thinking about all the little technical debts that you have seen, all the dirty tricks in the code base. I'm not really sure this journey will be so fantastic. But before we dig into the story, I need to go back in time, to mid-2015. So, we are a Ruby on Rails company.

Why am I talking here at a Go conference? Well, we had a problem. We had a big problem with slow requests. Nobody likes slow requests, but our problem was not really the performance of some requests; by design, we were supposed to move data. Think about Git operations. If you want to clone a kernel repo over HTTPS, it takes time. No matter how much optimization you put in there, there's a bandwidth limit, and there's data that you have to move. So, it takes time. Back in those days, the only solution we had for this was: yeah, you can clone over HTTPS, but it's better if you do it over SSH.

One of the reasons for this problem was that we had a technology stack based on a forking daemon, which was designed only for serving fast clients on low-latency, high-bandwidth connections. So, this is a forking daemon: you can imagine that you have a master process that loads your code, then it forks and creates some workers, and the master process handles incoming connections, forwarding them to one of those processes. If a process is waiting, doing IO, it cannot serve any other request, because it's not a multi-threaded application, so you can imagine that if you are cloning something in this situation, you're losing capacity while you transmit data. So, the basic idea is this one.
We had an HAProxy in front of GitLab. I removed the database and all the external dependencies; I just want you to focus on this. You have a web server, which handles requests and APIs, and HAProxy in front of it. So, enter Workhorse, a smart reverse proxy.

There are a lot of reverse proxies out there. Why did we have to write a smart one, and what does that mean? The idea is that it is smart because it's not a general-purpose reverse proxy: it really knows your workload and can help you where it's needed. It was named Workhorse to make fun of the magical unicorn, and the idea was that you can have the magical animal, but if you need to do the heavy lifting, you need a workhorse.

So, let's start with a simple example. How hard can it be to write a reverse proxy in Go? This is a reverse proxy in Go. Three lines of code, error checking, and imports. Let's take a look at this. First, you need a URL for your upstream server. And, yeah, that's all you need. Then, you need a proxy from the httputil package, NewSingleHostReverseProxy. You pass the URL in, and then ListenAndServe.
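A minimal sketch of what that slide shows, assuming the upstream Rails server listens on localhost:8080 and the proxy on port 8181 (both addresses are illustrative):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Parse the upstream (Rails/Unicorn) address.
	upstream, err := url.Parse("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}

	// Build a reverse proxy that forwards everything to the upstream.
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Serve it.
	log.Fatal(http.ListenAndServe(":8181", proxy))
}
```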
Done. You have a reverse proxy. Now that we have a reverse proxy, how can we speed up a slow request? Let's imagine that we have a slow endpoint, which is on /slow. This is the amount of code, imports removed, that you need to rewrite that endpoint. Let's go through the code. I cheated a bit because, in order to fit everything into one slide, I imported a package: this is the gorilla/mux router. You can do these things directly with the standard library, but the idea here is that I want to easily declare a handler for a specific route. That's the reason why we have a router here.

So, first thing, you need a router. Then you need a middleware. Yes, because something that we figured out on our logging system is that if you put a reverse proxy in between, all your logs will be filled with localhost incoming connections. So, you need to take care of the address and all the information about the external client. It's just three headers, and you're done. And then, what you need is a handler function that rewrites your slow endpoint in Go. The basic idea here is that you don't rewrite your whole code base; you just pinpoint the pain points that you have, and you rewrite them in a more performant way. Then, basically, we go back to our old code. We parse the upstream URL, we create a NewSingleHostReverseProxy, and we bind it to the router so that everything that doesn't match a specific route will go through the reverse proxy to our upstream.
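Roughly, the routed version could look like this. It is a sketch, not the slide's exact code: the forwarded header names, the /slow handler body, and the addresses are illustrative assumptions.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"

	"github.com/gorilla/mux"
)

// forwardedHeaders sketches the middleware mentioned in the talk: it passes
// client information along so the upstream logs don't show only localhost.
// Conventional X-Forwarded-* names are assumed; r.RemoteAddr still carries
// the port, which a real middleware would strip.
func forwardedHeaders(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		r.Header.Set("X-Forwarded-For", r.RemoteAddr)
		r.Header.Set("X-Forwarded-Host", r.Host)
		r.Header.Set("X-Forwarded-Proto", "http") // assuming plain HTTP in front
		next.ServeHTTP(w, r)
	})
}

func main() {
	upstream, err := url.Parse("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	router := mux.NewRouter()
	router.Use(forwardedHeaders)

	// Rewrite only the slow endpoint in Go...
	router.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("fast response from Go\n"))
	})

	// ...and send everything else to the upstream Rails application.
	router.PathPrefix("/").Handler(proxy)

	log.Fatal(http.ListenAndServe(":8181", router))
}
```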
So, this is what we did. It was the 22nd of September, 2015. We released this idea where HAProxy was connected to Workhorse, and in case of a Git operation, so cloning and pulling, we were doing authorization and authentication in the old way: we were forwarding the information to Unicorn and the old Rails code base. But instead of handling the clone operation in Rails, we were just forking the Git binary and forwarding all the body of the request to it. So, basically, it's kind of a CGI. You can imagine this is like a CGI, but done in the reverse proxy instead of in the Rails application.
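A sketch of that CGI-like idea, assuming authorization has already happened against Rails and that the repository path came back from that call; the real Workhorse code is considerably more elaborate:

```go
package sketch

import (
	"log"
	"net/http"
	"os"
	"os/exec"
)

// gitUploadPack execs the git binary itself and streams the request body to
// it, writing the pack data straight back to the client instead of going
// through a Ruby worker.
func gitUploadPack(repoPath string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		cmd := exec.CommandContext(r.Context(), "git", "upload-pack", "--stateless-rpc", repoPath)
		cmd.Stdin = r.Body // the client's side of the pack negotiation
		cmd.Stdout = w     // the pack data goes straight back to the client
		cmd.Stderr = os.Stderr

		w.Header().Set("Content-Type", "application/x-git-upload-pack-result")
		if err := cmd.Run(); err != nil {
			log.Printf("git upload-pack: %v", err)
		}
	})
}
```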
Over time, this evolved a bit. Today, we have a new component, Gitaly, which is written in Go. It's a gRPC server, and it handles all the Git code. So, if you want to interact with a repository, you do a gRPC call to this external component. So, we were able to speed up Git operations. Bye bye, slow requests.

A couple of months later, we released the CI system of GitLab, and we had another problem. We had a big offender in the context of slow requests, which was the CI runners attempting to upload artifacts. You can easily imagine it: I think we had a limit of one gigabyte, I'm not sure. So, we had this fleet of processes uploading artifacts constantly. And I want to give you some numbers here. I took the memory footprint of our production installation of GitLab: a Unicorn process takes around 800 megabytes of RAM; Workhorse, 70 megabytes. So, there's an order of magnitude in there. You can imagine where you want to spend your resources on your machine if you're under heavy load. So, we came up with this idea of body hijacking, which is more or less described here.
So, the idea is that you have an external client; in our case, it's the CI runner. And this client needs to upload some file, okay? When the request goes to Workhorse, instead of forwarding it directly to Rails, which would dump the file on disk and replace it with a file handle in the hash of parameters of your request, we act before that. We parse the incoming request in Workhorse, and we save the incoming file to disk, because this is what would happen later in the process anyway. But we can do this in a performant way, multi-threaded with goroutines and everything. Then we strip the body out of the incoming request, and we replace it with some metadata that tells the upstream server where we put those files. So, we forward it to Rails, and we had a middleware in Rails that reads the new headers, the new information with the metadata, and basically puts the file back in the hash of parameters. So that as an engineer, when you're just writing your controller code, it's exactly the same whether the request came directly to the Rails application or through Workhorse: you still have a file handle there, so it's completely transparent.
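A minimal sketch of the body-hijacking idea. The multipart field name and the metadata header are hypothetical, and the real Workhorse implementation is more involved:

```go
package sketch

import (
	"io"
	"net/http"
	"os"
)

// hijackUpload saves the uploaded file to disk before the request reaches
// Rails, strips the body, and forwards only metadata telling the upstream
// where the file ended up.
func hijackUpload(upstream http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		src, _, err := r.FormFile("file") // hypothetical field name
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		defer src.Close()

		tmp, err := os.CreateTemp("", "upload-")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer tmp.Close()

		// Stream the file to disk; in Go this doesn't pin a whole
		// application worker the way it would in Unicorn.
		if _, err := io.Copy(tmp, src); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Strip the original body and replace it with metadata for Rails.
		r.Body = http.NoBody
		r.ContentLength = 0
		r.Header.Set("X-Proxy-Uploaded-Path", tmp.Name()) // illustrative header

		upstream.ServeHTTP(w, r)
	})
}
```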
So, this is what we did; it was more or less two months later, the second implementation. And so, we sped up all the uploads. Bye bye, slow requests. And it's time to go back to our story.
So, we had to release cloud-native charts. And we had a big problem now: the network file system, NFS. Let me explain to you why this was a problem. I collapsed everything back into the GitLab box, because I'm going to add new stuff here and I don't want to confuse you with a lot of information. So, this is the same thing: you have Workhorse, and the Rails application, and everything. And we had to do asynchronous operations. Sidekiq is a queue processor for Ruby on Rails applications, and we use Redis as the queue. So, for instance, we have support for object storage. If something needs to be uploaded to object storage, it goes to a temporary location first, then you write the job on Redis, and Sidekiq picks it up and moves it to the object storage.

Now, if you think about this, it works really well if you are on a single machine, but as soon as you have an HA installation, or if it's a Kubernetes installation where you have pods, and each one of these blocks is a pod boundary, you have a big problem, because you can't do this. Basically, we were mounting the same NFS share across all of our fleet, so that regardless of the worker that processed the incoming connection, every machine in the Sidekiq fleet was able to read the file and move it to its final destination.

I want to give you some numbers here as well, because I was surprised when they told me. NFS is something that almost everyone knows about, but very few know the requirements for running this thing in production with a very big storage and IO-intensive operations. You can imagine that everything is constrained by the speed of the disk and the bandwidth that you have on the network. You want to have a lot of memory, because the last thing you want is swapping: you don't want memory swapping on disk to contend with the disk IO for writing and reading your data. In our production, we had an 8-core machine with 50 gigabytes of RAM just for running that box. It's expensive, and it's a single point of failure. And if you have to ship the cloud-native installation on Kubernetes, you can't use this, because Kubernetes can handle NFS, but it's not cloud native: it expects you to have an NFS server outside of the cluster. So, we had to figure out a way of removing NFS from this graph. And we came up with this idea: maybe we should implement object storage support directly in Workhorse.
There's a side story here. At this point in time, object storage was an enterprise feature, so you needed a license for it. We decided that, okay, we want to ship the open-source version on Kubernetes as well, so this had to be backported to the open-source code base first. So, think about the timeline: we were moving from one cloud provider to another, we had to ship the Kubernetes-native installation, and we started realizing that we also had to backport features, make sure everything was working, and build all those things together.

So, first thing, we started with our own use case. We targeted only Google Cloud Storage, because we were moving there, and we started with Git LFS, which was a very easy API to start with, let's say. Git LFS is large file storage for Git. It's an API that you can add to your Git server, and when you want to track, say, a binary or a big file, whatever it is, you can ask LFS to track it directly in object storage, so that when you commit it in Git, the file is replaced with a pointer to a location in that storage, and the Git client just handles the thing for you. So, when you clone and check out, you download the file and you have it, but it's not in the repo, not technically in the repo. And this was easy, because you have a very simple API that tells you: please put this object there. It gives you the size, and the body of the request is just the file. So, a very easy one.

Now, I have a background as a Ruby on Rails developer, and the first thing that I realized looking at the io package was: I don't like it. I expected more features. I expected it to be more powerful. Then I started writing Go code daily and said, oh, I really love it. The idea that io.Reader and io.Writer are so simple, that you can pipe them together, is incredibly powerful. You don't need all those abstractions; everything is just a stream of bytes, and you can read it or write it. So, this still fits in one slide.
Maybe it's a bit hard to read, but this is a handler that gives you the idea of how you can do body hijacking and store the information directly in object storage while it is in transit, so without buffering it, without writing it to disk. Let's go through it.

First thing, we didn't want to move authorization logic to Workhorse, because the idea is that you rewrite only what you need to speed up the operation, and we still have hundreds of engineers that work on Ruby on Rails daily, so we want to keep everything else in the Ruby on Rails code base. So, we made an API that basically receives the request and, with some information from the request, checks whether you are authorized to upload, and gives you back a pre-signed URL. Then, in the context of the handler in the Go proxy, you just create a new HTTP request, a PUT request on the signed URL, and you forward the body of the incoming request wrapped in a NopCloser. You set the content length from the request that is coming in, and then you just call Do on it. Okay, it's not highlighted. Forgive me. The point is that when you run this request, you are moving the body of the incoming request, while you read it from the client, directly into S3 or Google Cloud Storage or MinIO, whatever you're using as object storage. You don't buffer it, and as soon as you read it, it goes directly into the object storage. Once you're done and you have checked that nothing failed, you copy the incoming request, remove the body, and set the content length to zero because you removed it. You add some metadata, which you should definitely sign, telling Rails where you stored the file, and you forward the request like a regular proxy.
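Putting that together, a sketch of such a handler might look like this; getPresignedURL is a hypothetical stand-in for the Rails authorize API, and the metadata header name is illustrative:

```go
package sketch

import (
	"io"
	"net/http"
)

// getPresignedURL stands in for an authenticated call to the Rails
// authorize API that returns a pre-signed object storage URL.
func getPresignedURL(r *http.Request) (string, error) {
	return "https://storage.example.com/bucket/object?signature=...", nil
}

// streamToObjectStorage pipes the incoming body straight to a pre-signed PUT
// URL while it is read from the client, without buffering it on disk, then
// forwards a body-less request carrying metadata to the upstream.
func streamToObjectStorage(upstream http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		signedURL, err := getPresignedURL(r)
		if err != nil {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}

		// PUT the incoming body to object storage as we read it.
		put, err := http.NewRequest(http.MethodPut, signedURL, io.NopCloser(r.Body))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		put.ContentLength = r.ContentLength

		resp, err := http.DefaultClient.Do(put)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			http.Error(w, "object storage upload failed", http.StatusBadGateway)
			return
		}

		// The file is stored: strip the body and forward only the metadata.
		r.Body = http.NoBody
		r.ContentLength = 0
		r.Header.Set("X-Proxy-Stored-URL", signedURL) // illustrative header; sign it in the real world

		upstream.ServeHTTP(w, r)
	})
}
```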
So, when the request reaches the upstream, the Rails application, the file is ready, safely stored on S3, or whatever object storage it is. Mission complete. Well, not exactly. As I said, we had some dirty tricks to take care of. We were lucky, because Google Cloud Storage is not exactly an S3-like implementation; it has one difference that allowed us to ship this. Google Cloud Storage is the only S3-compatible implementation out there that allows you to stream uploads of unknown length. This is not compatible with the S3 API, so MinIO will refuse it, and so will all the other implementations: they want to know upfront how much storage you need for that request. And we had around 35,000 CI runners in the wild, outside of our control, that were sending artifacts without a known length, because they were compressing them in transit, directly on the upload request, so we could not know the size without first writing the file to disk. This was a big problem.

So, next iteration. We went back to the drawing board, we started looking more deeply at the S3 APIs, and we found this thing: the multipart upload, so divide and upload. It was designed for another use case. The idea is that, to increase performance and make better use of your bandwidth, you can split your original object into several parts, upload them concurrently, and then make a final call that finalizes everything into one single object, and then you have your final object.
Now, we decided to implement this thing in our reverse proxy, but we found out that all the libraries out there were designed for that original use case: either they expected to be able to seek the file on disk, so that they could run multiple uploads in parallel, or they were not taking care of memory. If they weren't able to gather the size of the request, like when you have an incoming body, they would just say: okay, the maximum size I can put in a part is 600 megabytes, so I will just start reading 600 megabytes and then upload them. And this was a problem, because we had to keep memory usage under control; we had to take care of multiple concurrent uploads from the outside, so we wanted to do this in a way where we could control memory usage.

So we came up with this very simple idea: whenever an upload comes in, we create a temporary file. We write up to 50 megabytes (an API option controls that number, but just to give you an idea), so we write the first bytes to disk. Then we upload that temporary file as one part of the multipart upload. We delete the file, so we also keep disk usage under control, and we check: are we done? No? Go back to the beginning. Create a temp file, write, upload, delete. Once we reach the end of the incoming stream, the request body, we say, okay, we are done, and we send the finalize call. And that's it.
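A sketch of that loop, with hypothetical uploadPart and completeUpload helpers standing in for the S3 multipart API calls:

```go
package sketch

import (
	"io"
	"os"
)

// 50 MB per part; an API option controls this number in the real code.
const partSize = 50 << 20

func uploadPart(n int, part io.Reader, size int64) error {
	// In the real system: PUT ?partNumber=n&uploadId=... with the part body.
	return nil
}

func completeUpload() error {
	// In the real system: POST ?uploadId=... with the list of uploaded parts.
	return nil
}

// multipartUpload reads at most partSize bytes of the incoming body into a
// temporary file, uploads that file as one part, deletes it, and repeats
// until the stream ends, keeping both memory and disk usage bounded.
func multipartUpload(body io.Reader) error {
	for partNumber := 1; ; partNumber++ {
		tmp, err := os.CreateTemp("", "part-")
		if err != nil {
			return err
		}

		// Copy at most partSize bytes of the request body to disk.
		written, partErr := io.Copy(tmp, io.LimitReader(body, partSize))
		if partErr == nil && written > 0 {
			if _, seekErr := tmp.Seek(0, io.SeekStart); seekErr != nil {
				partErr = seekErr
			} else {
				partErr = uploadPart(partNumber, tmp, written)
			}
		}

		// The part is uploaded (or skipped): free the disk space right away.
		tmp.Close()
		os.Remove(tmp.Name())

		if partErr != nil {
			return partErr
		}
		// A short read means we consumed the whole incoming stream.
		if written < partSize {
			break
		}
	}
	return completeUpload()
}
```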
So we made it. We were able to migrate a live system from one cloud provider to the other, and we were able to release the first iteration of the cloud-native installation. The second iteration then also added support for MinIO and the other implementations. So, yeah, I want to thank you for listening to me, and I want to highlight some takeaways from this talk, what we learned.

The first thing: you can speed up a web application by writing a reverse proxy in Go, no matter if your company writes in another language. You can start incrementally. It's an iterative approach, which is a good thing, because you can rewrite only the slow endpoints. It's not that kind of project where you say, yeah, we are going to rewrite the whole code base because Go is the way to go. Yeah, it is, but
no higher-level management will ever accept: yeah, let's rewrite everything. So you can start small, just showing where you can improve things. You can also forward requests to another service if you need to, which is a very good entry point for splitting a monolith into microservices, or just a service architecture. And always, always remember to sign any request you modify. If you expect to change something, sign it, so that the upstream can check that the things you put in there really come from your reverse proxy, not from the outside.
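A minimal sketch of that idea, using an HMAC over the injected metadata with a secret shared between proxy and upstream; the header names and the exact scheme are illustrative, not the ones GitLab uses:

```go
package sketch

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"net/http"
)

// signStoredPath is what the proxy would run: it injects the metadata and an
// HMAC over it, computed with the shared secret.
func signStoredPath(r *http.Request, secret []byte, storedPath string) {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(storedPath))
	r.Header.Set("X-Proxy-Stored-Path", storedPath)
	r.Header.Set("X-Proxy-Signature", hex.EncodeToString(mac.Sum(nil)))
}

// verifyStoredPath is what the upstream would run before trusting the header.
func verifyStoredPath(r *http.Request, secret []byte) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(r.Header.Get("X-Proxy-Stored-Path")))
	expected := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(r.Header.Get("X-Proxy-Signature")))
}
```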
The Workhorse source code is available at the URLs there, and it's released under the MIT license. All the examples that you have seen here are just small examples, not extracted from the real code base, just to show the key points. If you want to take a look at how we did it, there is more complexity involved; you are free to study the code and contribute if you like. And that's it. Thank you. Thank you. Any questions?
Yes. Thank you for the talk. How do you test the proxy API calls? Okay, how do we test the reverse proxy API calls? As you're leaving, please try to make less noise. Thank you. We have several levels of testing. We have unit testing and acceptance testing in both projects, so they are tested in isolation: for every commit, the CI runs this kind of test. Then, and this is just our case, the Rails application has a reference to the version of the proxy that it's supposed to work with, and when we bundle everything together, we have some QA pipelines that build the entire system and run some use cases end-to-end through all the... Thank you very much. We don't have time for more questions, sorry, but you can come and talk to him. Thank you.