
Playing with casync @ instagram

Formal Metadata

Title
Playing with casync @ instagram
Title of Series
Number of Parts
50
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
At Instagram, we have been experimenting with casync as an alternative package format for deploying the site. This talk describes our findings. Last year, when Lennart presented casync at ASG, we were excited to check it out, so we looked at what was necessary to deploy parts of our site with it and spent some time experimenting. This talk shows our results so far.
Transcript: English(auto-generated)
Hi! Hello. So, my name is Alvaro Leiva, I'm a production engineer at Facebook and Instagram, and I'm the only thing standing between you and lunch, so let's get this over with. So, basically, my talk is about how we saw Lennart's talk last year about casync, and we saw that
it was a really cool project, so we started finding out what we could do with it, and we found this problem. So we tried to solve it with casync. So, basically, I will start by saying why we wanted to experiment with this, what our strategy was, and then
our results. Okay, cool. So, by a show of hands: thinking of the system that you work on, can anybody raise their hand if that system is deployed once a week? Okay, cool. Once a day? Cool. Okay. Twice a day? Ten times a day? 20? Yeah. So, the reason why
I ask this is because, at Instagram, for the past two years, we have deployed
more than 50 times a day, because, basically, what we try to do is deploy every single commit that a developer sends to master directly into production, give it enough time in production to get a signal on whether that commit will break something or not, and then move to the next one. This works really well for us because
it allows us to find things that will break, or security vulnerabilities and stuff like that, really quickly. So the way that this works is simple. A developer commits his code, we package it with our own internal tooling, we run tests on it, and then we send it into production. And this works really well because it allows us to ship small changes. After a developer lands his code, it's
about an hour until it's in production, so he's still around. If he breaks production, he can help us fix it, and it's really easy to roll back to the previous state. So basically, how we do this: you have to imagine that this is version A, and this
is kind of a representation of what our source tree would be. For those who don't know, Instagram is mainly a Django shop, so that's Python. Most of these are just plain text files. So we basically have this package that is a representation of it. We strip things that we don't want, we compile a few things that are C, we convert Python to bytecode, and then we
package it into our format. Then a developer comes, makes a commit, makes a small change, and then we do the same process, and we end up with a package B that is really similar to A but still contains all the components. And then we also have C, which probably also changes a few different things,
and you can see that, if we do this a lot of times a day, the sum of A, B, and C gets really big. So, this was a really interesting problem to solve with casync, and I will explain a little bit how we view casync. For reasons of brevity, and for the abstraction our problem needs, this may be oversimplified, but okay. So, the way that it works is that we take this version A,
and then what casync does is take it and divide it into little chunks of data. The magic of casync is that these chunks are variable length, so that means that this piece over here can weigh 10K, but this one can weigh 20K, and stuff like that.
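[Editor's sketch] The variable-length chunking described here can be illustrated with a toy content-defined chunker. This is not casync's actual algorithm: the rolling hash, the table seed, and the names `split_chunks`, `boundary_bits`, `min_size`, and `max_size` are all assumptions made for the illustration. The point is only that cut points are chosen from the bytes themselves, so chunk sizes vary.

```python
import random

# Hypothetical per-byte table for a gear-style rolling hash (illustration only).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def split_chunks(data, boundary_bits=10, min_size=256, max_size=8192):
    """Split bytes into variable-length chunks at content-defined boundaries."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting left drops old bytes, so h reflects only the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        # Cut when the top bits of the hash are all zero (content decides),
        # or when the chunk reaches the hard size limit.
        at_boundary = size >= min_size and (h >> (32 - boundary_bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # whatever is left becomes the last chunk
    return chunks
```

Because a boundary depends only on nearby bytes, a small edit tends to change only the chunks around it, which is what makes the later deduplication work.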
So we take those packages, and then casync will output two things: the chunks, which are these files, and an index file. That index file is basically a recipe for how we are going to take these chunks and reassemble them to recreate the directory.
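[Editor's sketch] A minimal model of the two outputs, assuming chunks are keyed by their SHA-256 digest and the store is a plain dictionary. The names `store_chunks` and `reassemble` are hypothetical, and casync's real index and store formats are not reproduced here.

```python
import hashlib


def store_chunks(chunks, store):
    """Save each chunk under its digest and return the index (ordered digests)."""
    index = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a chunk already in the store is not stored twice
        index.append(digest)
    return index


def reassemble(index, store):
    """Follow the index as a recipe: fetch chunks in order and join them."""
    return b"".join(store[digest] for digest in index)


# Two versions that differ in one chunk share the rest of the store.
store = {}
index_a = store_chunks([b"models", b"views", b"urls"], store)
index_b = store_chunks([b"models", b"views-v2", b"urls"], store)
```

After both calls the store holds four chunks rather than six, and each index on its own is enough to rebuild its version.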
So, this is the chunking, and then the opposite process is that you take your index file, you grab whatever chunks it says you need, you assemble them in the right order, and then you have your package back. The cool thing about this is that if we now have a package B that has a small change in it, the chunking result will be really similar between
A and B; maybe we will have one or two extra chunks there, and we will get a different index file. So now we don't have to think in terms of versions: we just have all these chunks stored in a single location,
and then what we distribute is the index file, and the index file is what becomes our version. So, okay. That is basically how we use casync. Okay. So,
it's really simple. The way that we worked on this is that we put an intern on it who was really good at his job, and we asked him to come up with abstractions and stuff like that. First of all, we wanted to create an abstract definition of a package that would encompass this idea of having stores and having an index, but would not be tied to just
file systems. Maybe instead of syncing the index file as a file in a directory, we want to sync it as a database record, because this lends itself to a key-value store. The first thing that we did is that we changed the idea of an index to the idea of a manifest. And
the reason is really simple. What casync gives you as an index file is basically a recipe to reconstruct your package, but it doesn't give you any information about how the package was constructed: who constructed it, where it was constructed, whether it used certain compilers, what version,
what hash of the repository. And then we put it into stores. So, this is really important for us: the information about the package is almost as important as the package itself. So, we created this tool called CA package
to make an abstraction over casync, and the way that you get the index file works something like this. You stage the package, and then it gives you a URI. That URI in this case is an SQL query, so we can store all the things in SQL. And this will be our manifest. Basically, here, the data is an encoded version
of what you would get in the index file. And then you see that we put in all the other information that we care about. Stuff like, for instance, what is the package name? At Instagram, of course, we deploy our Instagram package, but what if we want to deploy virtual environments? Or we want to deploy other binaries like this? Not just file systems,
but simple binaries. We have a version: we build the same package multiple times, so it is really good to have versions. And other things. Finally, we basically have a store adapter, so that if we want to retrieve the store from an HTTP server, we can;
if we want to store it on the local disk, we can; we can also do torrent. So that is basically how we did it. So let's see a few experiments that we did to see what the results were, and then we will be done. So the first thing that we did is
we worked on the creation. That's why I put a little cookie there, because it's like, how do we create the package? For the first experiment, we took a hundred versions of Instagram, and we created them in our regular format, and then we created them with CA package, which is basically an abstraction over casync. The first thing that we found out is that
we save about 90% of space, and this is kind of obvious because this is the idea of casync at the end of the day. In our regular model, basically, each version is a full package containing all the things, while when you do it with casync, you are just adding the new stuff
or whatever extra there is. And in terms of resources and time, creating the packages took about the same time, and it makes sense because we use the same technologies, we serialize the same way, we compress and decompress using the same libraries, so basically the big win was in space. You can see that if you deploy more than 50 times a day, by the end of the day, you are saving a lot of
network because you're not shipping all the components, you are saving space, and eventually reconstructing will be faster. So the second part of this experiment was to
actually get these packages and put them into production in the same way that we deploy our normal system. So in parallel, we basically had a handful of machines where, when a commit came and landed in master, we created the CA package and then we shipped it into production. We measured the total bandwidth, we measured the time that it took to stage, and the resource
usage. The total download was, again, about 90% saved. Makes sense. We are basically downloading less stuff because moving from version A to B is really pain-free, but the cool thing is that moving from A to C without going through B is also really pain-free.
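[Editor's sketch] Why skipping B costs nothing extra: a machine only needs the chunks named in the target index that it does not already hold, whichever version it is coming from. The function name `missing_chunks` and the chunk ids below are hypothetical.

```python
def missing_chunks(target_index, local_chunk_ids):
    """Return the chunk ids in target_index that are not held locally, in order, once each."""
    seen = set(local_chunk_ids)
    missing = []
    for chunk_id in target_index:
        if chunk_id not in seen:
            seen.add(chunk_id)  # avoid requesting the same chunk twice
            missing.append(chunk_id)
    return missing


# A machine on version A moves straight to version C.
index_a = ["c1", "c2", "c3"]
index_c = ["c1", "c4", "c5", "c3"]
to_download = missing_chunks(index_c, index_a)  # ["c4", "c5"]
```

Version B never appears in the computation, which matches the observation that A to C is as cheap as A to B.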
The stage time was faster, and the resource usage was basically the same, again, because we use the same technologies. So that basically concludes what we did. What's next? We want to try binary-heavy distributions. Again, as we said, a Python application is mostly text files,
but we want to try it with virtual environments, which basically have a lot of binary things. We want to stop shipping chunks through HTTP and start using torrent, because if you have a big infrastructure, you can leverage the fact that most of your machines already have the
chunks. Right now, we just shell out and execute casync. We would really like to start using casync as a library instead of just shelling out to it. Finally, we want to try other tooling besides casync, because the idea is really cool and we would like to stress test it against other things on the market. So, basically, that's it. I'm finished.
Yeah, don't worry. There will be questions. Also, if we run out of time, I'm going to be here, and I have stickers if people want. The question is, just the last point that you raised, that you
want to stop shelling out: do you want to turn casync into a library or do you want to reimplement the code? So, casync, again, I don't want to overstep my boundaries here, but casync is written in a way that resembles a library a lot. It's just that it's at version 2,
so I don't know if the API is going to be stable or not. Basically, what we want is to take that same thing and put it into a Python binding. Does it work? Yeah. So, my intention was always that it was supposed to be a library, and that's why it's written in a library style, but I haven't come around to making it a library yet. My other question
was actually just: what's the size of the images? The things that we produce? Yeah, the stuff that you actually store there, what's the average size? Oh, you mean the full size? Okay, I don't know if I can say the size of my... Just a rough
estimate? Okay, so it's like, I don't know, I would say 50 megabytes to 200 megabytes, it depends. These are text files, so basically, even if it's really big, when you compress it, it shrinks a lot. Thanks. So, you create a lot of packages per day. What do you do with all the chunks? After a
while, do you garbage collect them, or do you keep everything all the time? So, the cool thing about that is that since we deploy every commit, by the time we deploy, I'm going to say a number, commit number 150, there's no point in going back to anything before that, because we know that we are in a good place, right? So, we purge old chunks all the time,
and the way that you do that is that you have a list of all the chunks that compose whatever you want to keep, and then you just serialize it and find all the chunks that don't belong to that list. Oh, yeah, yeah, yeah, but you cannot deploy
having different branches. You always deploy from master, yes. I would like to ask if something like that can be achieved using git-annex or Git LFS,
and if not, what are the advantages of using casync? So, that's what we are going to discover in the next step. We're going to start using other tooling. The thing that I really like about casync is that it is general purpose. It is not built around this particular problem, so we can
apply the same techniques to binary distribution instead of just text, or maybe we want to keep an entire file system with this, and it works really well, while Git and the other tools tend to be more specific to a single problem, but I cannot say that for sure, because we haven't tried yet. So, one more question.
Go ahead. He was trying to ask a question. So, you have an incremental way to ship packages. Do you also have an incremental way to build, if there are only small changes every time? So, let me see if I can explain. Not really, because when you do A and B, don't think of this
as incremental. That's kind of the first thing that I tried to get out of my mind. Think of it like this: I build A, I build B, and it happens that components of A and components of B are similar, but these could be two different applications, they don't have to be related, so don't think of this as incremental from A to B.
With that in mind, you still need to serialise your whole directory and compare the chunks, and then realise which ones you actually have to build, but when you're at that point, you have already spent like 90 per cent of your time just doing the serialisation and the chunking. I'm going to stay here,
so if people want to ask me questions, you can do it after this. Thanks a lot.
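[Editor's sketch] The chunk purge described in the Q&A (list every chunk used by the versions you still want, then drop everything else) could look like the following, assuming the store is a dictionary keyed by chunk id. The name `purge_unreferenced` is hypothetical.

```python
def purge_unreferenced(store, kept_indexes):
    """Delete chunks not referenced by any kept index; return the deleted ids."""
    live = set()
    for index in kept_indexes:
        live.update(index)  # every chunk a kept version still needs
    dead = [chunk_id for chunk_id in store if chunk_id not in live]
    for chunk_id in dead:
        del store[chunk_id]
    return dead


store = {"c1": b"a", "c2": b"b", "c3": b"c"}
removed = purge_unreferenced(store, [["c1", "c3"]])  # removes "c2"
```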