
Forget Mono vs. Multi-Repo - Building Centralized Git Workflows with Python


Formal Metadata

Title
Forget Mono vs. Multi-Repo - Building Centralized Git Workflows with Python
Number of Parts
112
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Mono- vs. multi-repo is an age-old debate in the DevOps sphere, and one that can still spark flame wars. What if I were to tell you that you don't have to choose? In this talk we will dive into how we built a centralized Git workflow, delivered with Python, that can work with any kind of repo architecture. One of the greatest recurring pains in CI/CD is the need to reinvent the wheel and define your CI workflow for each and every repository or (micro)service, when eventually 99% of the config is the same. What if we could hard-reset this paradigm and create a single, unified workflow that is shared by all of our repos and microservices? In this talk, we will showcase how a simple solution implemented in Python, demoed with GitHub as the SCM and GitHub Actions for our CI, enabled us to unify this process for all of our services and improve our CI/CD velocity by orders of magnitude.
Transcript: English (auto-generated)
Good morning, everyone. I would like to first acknowledge my appreciation of you all showing up and sharing this morning with me. I don't take it for granted. And let's start the content
and let's try to have some fun. So today we'll talk about mono versus multi-repo in the context of CI pipelines. Once we've understood what those concepts are, the pros and cons, the next thought is: maybe there's another way. Maybe there's an optimal way to enjoy the best of both worlds.
And like every other solution, I always try to think: how applicable is it, how maintainable is it, and how easy is it? And before every talk, I always like to state my main goal. My main goal is that at least one person here will say, that's kind of easy, I can do that, I can introduce this into my dev cycle
and my dev organization. But first, let's talk about my favorite topic, me. So my name is Michael Sego. I'm a 30-year-old software engineer from Tel Aviv. I love Django and I love cats.
Not related, but both correct. I've been a software developer for the past six years back in Tel Aviv, and I've kind of moved between industries, trying to find the place where I feel the most impact and, well, just happiness. And I found the most happiness, actually,
in the cyber security industry. The cyber security industry allowed me to do something that felt really right but that I didn't really understand before. Currently, I introduce cyber security, basically security measures, into dev teams, into the dev cycle, and allow them to focus on what they're good at.
You wanna focus on your business logic. What you do, I take care of the security. And that kind of mindset also gave birth to today's talk. I think we can all agree to a certain extent you don't start your day thinking, how am I gonna improve my CI infrastructure? Ooh, I wanna add this feature, I wanna do that.
If you are, please talk to me. I'd love to see that passion. Most people wanna set it once, forget it, and even when there's problems, try to put them to the side. That's not the right way to handle a CI. So let's take that mindset of trying to make things work better for all of us. But first, let's talk about the never-ending debate,
mono versus multi-repo. Just a show of hands, who's a mono-repo kind of developer? Okay, you're outnumbered, that's good. So this is a never-ending debate, and it won't end because of one simple truth.
They're both great solutions. It just depends on your tech stack, depends on your team, depends on your culture. You could use both and get amazing results. But in the context of CI, there are really key differentiations. When we use a mono-repo, just a quick overview: centralized code management, easy refactoring,
teams share the same culture. Problem is, when everyone works on the same files, conflicts will be a daily endeavor. And single tagging for the whole code base kind of makes versioning and refactoring for different clients, different versions of your app, a little bit harder than it should be. In multi-repo: independent deployment, independent versioning, autonomous work,
you wanna enable your dev teams to express themselves and do things their way. Problem, for example, libraries update overhead, and more siloed teams. Yeah, that's a new word for me as well. But in CI context, mono-repo allows us to have one single file, no code duplication,
very easy to go in, understand what's going on, and run with it, while in multi-repo, we'll have to do a single file or system per repo and kind of reuse the same stuff time and time again. The overhead becomes taxing very quickly. Another way to look at it is via this graph.
In the mono-repo, all the different services adhere to the same CI pipeline, while in multi-repo, everyone has its own infrastructure. But what if there was another way? Maybe there's some way to enjoy the flexibility of multi-repo CI with the one-stop shop of mono-repo.
Well, there is. It's called centralized CI, and I'm sure you've seen versions of the solutions. You've heard the versions. Well, today we're gonna see a very lightweight, easy-to-implement version using GitHub and Python, obviously, and the main thing I want all of you
to always keep in mind, I'm gonna show you an instance of all this stack, but you can use your imagination and your skills to expand it and do amazing things with this infrastructure. So why should you even care? Why should you want centralized CI?
Because the current solutions probably work wherever you work, and they do the job. Centralized CI kind of allows us to get the best of both worlds, which is done by adhering to this simple principle: decoupling the checks from the code. There's no real reason that the check should know about, or share the same space with, your code,
and by this decoupling, we get abilities like to enforce styling via Linter, implement security checks, I'll expand on that later, and allow both tailor-made checks and cross-repo checks, and that's like the essence of it.
The centralized CI overview basically talks about our SCM: GitHub, Bitbucket, GitLab. Your developers, yourselves, are working, doing your day-to-day jobs. Events flow everywhere. For example, we're listening to PR events.
Those PR events get transmitted to some backend that analyzes them, verifies them, decides what to do with them, and decides which CI jobs to trigger. Now, everything we're gonna see later always goes back to this differentiation.
We have a place where code happens. PRs are written, comments are added, things are edited. They should not be aware that there's a whole CI framework behind them. That's the whole beauty of it. Now we have this backend, the brain of this whole process, that analyzes those events, verifies your tokens, sees what needs to happen,
and you can very clearly see: if I have this place, this centralized brain, I can probably do very cool things with it. I can start to analyze how much time everything takes, stats, maybe per-developer kind of stats.
Monitoring your own CI and PR activity is a very recommended practice, especially as your team and your company grow, and in the end, we trigger a CI job in a completely separate repo that contains nothing but GitHub workflows, or CI checks, depending
on the platform we're talking about. So now we're gonna go through the same thing, only in a specific GitHub instance. So we have some GitHub organization with the imaginatively named repository A. A user opens some PR on repository A on GitHub, and our GitHub app will send said event
via webhook to our backend. That backend will first and foremost verify the token, that it's valid and in the right format, then analyze what happened in said PR, which files were changed, what the change was, and then decide: okay, I'm gonna trigger this, that, and the third check.
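That first verification step can be sketched with nothing but the standard library. This is a minimal sketch, not the speaker's actual code: the `X-Hub-Signature-256` header name and the `sha256=<hexdigest>` HMAC scheme are GitHub's documented webhook behavior, while the function itself is an illustration.

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate GitHub's X-Hub-Signature-256 webhook header.

    GitHub sends "sha256=<hexdigest>", an HMAC-SHA256 of the raw request
    body, keyed with the webhook secret configured on the GitHub App.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest guards against timing attacks on the comparison
    return hmac.compare_digest(expected, signature_header)
```

Any request whose header does not match the recomputed digest should be rejected before the payload is analyzed.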
That is sent back to GitHub in a workflow dispatch, that's the name of the thing, and it will trigger a workflow in the central CI repository. So again, the whole thing is that we have three different layers: the first layer is your actual code, the second layer is GitHub talking
to your CI brain that decides what to run, and then the actual running of the checks happens in a completely separate third location. So what do we need for our demo, so we can see the whole thing and get a little bit more excited?
So we'll create a GitHub application that will monitor those webhook events and send them to our decided destination. We'll create some simple backend listening to PRs and use ngrok as a tunneling service to our, well, my server here.
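A backend like that can be sketched with the standard library alone. The payload fields read here (`pull_request.head.sha`, `repository.full_name`, `number`) are what GitHub's `pull_request` webhook actually delivers; the handler class, the port, and the print-style hand-off are placeholders of mine, not the talk's implementation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def extract_pr_event(payload: dict):
    """Pull out the fields the CI brain cares about from a GitHub
    pull_request webhook payload; return None for any other event."""
    pr = payload.get("pull_request")
    if pr is None:
        return None
    return {
        "action": payload.get("action"),           # "opened", "synchronize", ...
        "repo": payload["repository"]["full_name"],
        "number": pr["number"],
        "head_sha": pr["head"]["sha"],
    }

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        event = extract_pr_event(payload)
        if event is not None:
            print("PR event:", event)  # hand off to the "brain" logic here
        self.send_response(202)        # acknowledge quickly; work happens async
        self.end_headers()

# Exposed locally and tunneled out in the demo, e.g.:
# HTTPServer(("", 8000), WebhookHandler).serve_forever()
```

In the demo setup, something like `ngrok http 8000` would then expose this handler as the GitHub App's webhook URL.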
I wouldn't be a very good security-aware developer if I didn't briefly mention this: when you're introducing some sort of application into your dev cycle, try to give it as few permissions as possible and be very much aware of what you're giving it. The amount of attacks and issues
that derive from this very problem are too many to count, and I always suggest go over the fine print, even though it's in dark mode, it's a little bit harder. And for just more tech-wise, what you need, you'll define the webhook URL, which is basically where your private server is hosted.
You need to define the webhook secret and define the repository permissions. Actions and checks are two different concepts in GitHub, and they're important. The action is the actual brain of what's running. So for example, our linter running flake8 is the action.
But the check is basically the notification of the result of said action in the appropriate repo. Because if we're running the logic in the centralized CI location, it doesn't really make sense that we'd go over there every time we open a PR to see what happened with it, especially when so many people
are opening PRs at the same time. So for us to notify the original repo, hey, this is what happened with your action, there's the check. Those are just the words. So let's build a centralized linter. This is our EuroPython demo organization,
and the three layers we talked about are presented here by three different repos. The test repo is just our code base, some basic Python stuff, hello Dublin. Our CI is the brain part. We'll dive into it just to show you what it means, but nothing too major. And the third and most interesting part is the central CI,
which is this. This is the repo that contains all the different checks. As you can see, there's just a readme file and GitHub workflows, with two different checks.
I don't know how well you know the syntax and everything, but we'll go over it real quickly because I know it can be quite taxing. This is the flow that we will trigger from the brain part of our service. As you saw in the repo, all it has is YAML files and a simple readme.
It has no idea about the infrastructure it runs on, what it's about to run, the code; it doesn't care. It's a very simple, dumbed-down check. And it gets all its relevant data from the dispatch workflow event from our backend. So what we're gonna do here, and what you're about to see in a moment, is explore GitHub Actions.
We're gonna create the check. That's that notifier we talked about. Check out the original repository. Run lint with flake8. And according to success or failure, get notified.
In the brain part of it, we're going into the src. Basically we have one public endpoint. Yeah, I tried to play with the naming a little bit.
And that public endpoint will trigger a handle-PR-event. That handle-PR-event will package the necessary info. I prefer to look at it from here: this is the handle-PR-event. As you can see, we create this client payload
that also talks about the owner and the relevant repo, has all the data we got back from the opened-PR event, and packages it. Here I just hard-coded the decision: run this check, the CI YAML. This dumb line is basically the essence of your potential
and this is what I want you to take. I just decided run lint or I don't care. But according to your use and your needs and we'll dive into a more complex example, you can make this line into a whole new service that decides which check should run and you could really control your CI.
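The dispatch itself, the piece that carries that decision over to the central repo, is one authenticated POST. A sketch with only the standard library; the endpoint path, headers, and `repository_dispatch` semantics are GitHub's documented REST API, while the function name and the example org/repo/event-type values in use are invented for illustration:

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"

def build_dispatch_request(token: str, owner: str, repo: str,
                           event_type: str, client_payload: dict):
    """Build the POST /repos/{owner}/{repo}/dispatches request that fires
    an `on: repository_dispatch` workflow in the central CI repository."""
    body = json.dumps({
        "event_type": event_type,          # matched by the workflow's `types:` filter
        "client_payload": client_payload,  # owner/repo/PR data from the webhook
    }).encode()
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )

# urllib.request.urlopen(build_dispatch_request(...)) would actually send it;
# a 204 No Content response means the dispatch was accepted.
```

Swapping the hard-coded event type for a computed one is exactly where "this dumb line" grows into a service.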
And that's the point of this monorepo versus multi-repo whole debate because here I basically allow you, just like in monorepo, one place to define everything. That's the power. But just like multi-repo, this area is your place to get a little bit smarter, add and introduce new things.
For example, you go over which files were changed during the PR. Oh, I see some Python files, some JS files, even a Terraform file. So I will run security checks according to those files. I'll run Bandit, I'll run KICS, whatever security measures you want to use. And this is where the potential basically becomes infinite.
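That file-driven decision is essentially a lookup. Here is a sketch of what the hard-coded line could grow into; the suffix-to-workflow mapping and the workflow file names are invented for illustration, only the tools (flake8, Bandit, KICS) come from the talk:

```python
# Hypothetical mapping from changed-file suffixes to central-CI workflows.
CHECKS_BY_SUFFIX = {
    ".py": ["lint.yaml", "bandit.yaml"],  # flake8 styling + Bandit security scan
    ".js": ["eslint.yaml"],
    ".tf": ["kics.yaml"],                 # KICS for infrastructure-as-code
}

def checks_for(changed_files):
    """Decide which central-CI workflows to dispatch for this PR,
    based on the suffixes of the files the PR touched."""
    selected = []
    for path in changed_files:
        for suffix, workflows in CHECKS_BY_SUFFIX.items():
            if path.endswith(suffix):
                for wf in workflows:
                    if wf not in selected:  # dispatch each workflow once
                        selected.append(wf)
    return selected
```

The backend would then fire one repository dispatch per selected workflow instead of always dispatching the same YAML.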
So let's see this whole thing in action. In the end, it's not gonna be so dramatic, but I think you'll see the value. This is my test repo, hello world. I wanna crash the CI's linter.
Of course: too many blank lines. Create a PR; in the background, through ngrok, to our server, events flow and get handled.
And this is the only annoying part: it takes a little bit of time till everything clicks. And I go back to our three layers. All the time I wanna go back to the concept so it sinks in better. We see now the check, the linter check. It ran and we got a response.
But where do all three layers come into place? Well, this is the code there, the src. Someone opened the PR and they got notified of the result regarding the PR. They have no idea about the check, no idea about the tech, no idea about the infrastructure. They just opened the PR and got immediate value. There's a GitHub application that talks to the backend.
That whole area has no idea where those checks are stored, what's gonna happen, what's the next step. It just analyzes and sends said checks. And the checks eventually that run, I'll show you exactly where the checks themselves ran.
See, this happened just seconds ago. This is the action, this is the logic, this is the linter part of it. What we saw in the PR is the check, the notification.
And we'll enter the action and we'll see what's exactly the problem. So we just went through a very simple cycle that I think we can all see the value and how we can use it and how it will impact us all.
An important part was shown here that I wanna dive into, because it's really important in any sort of solution you're trying to introduce into a dev team. If I hadn't conveyed the results via a check,
which was very easy (I would just run the action, I would even need fewer permissions, and everything would be a lot easier), what would be the biggest problem? I would make the solution just uncomfortable. And there's no point in a good solution if it's not easy to use
and easy to maintain for the end user. I tried to talk with my user in the most friendly way possible via check in your respective PR. Always try to think eventually how said person will use what you're developing.
And this is the more complete flow that adheres to that principle. We have the SCM that talks to the backend, which talks to the CI, but the CI reports the outcome all the way back to the original repository. So once we modify the CI job to create and update checks, this is a more complete flow.
We have a repository checkout. We create the check, we run the linter, and then we update said check. So, I've alluded a couple of times to this whole idea, and I've shown you a very basic example of it. How could it be used in better, more creative ways?
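That create-then-update pair maps onto GitHub's Checks API: POST `/repos/{owner}/{repo}/check-runs` creates the check against the PR's head commit before the linter runs, and a PATCH on the same check run records the conclusion afterwards. A sketch of the two payloads; the `status`/`conclusion` fields are the API's own, while the check name here is an invented example:

```python
def check_run_payload(head_sha, conclusion=None):
    """Build the JSON body for GitHub's Checks API.

    With no conclusion: the body for creating the check as in-progress.
    With a conclusion ("success"/"failure"): the body for completing it.
    """
    payload = {"name": "central-lint", "head_sha": head_sha}
    if conclusion is None:
        payload["status"] = "in_progress"   # created before the linter runs
    else:
        payload["status"] = "completed"     # patched once flake8 finishes
        payload["conclusion"] = conclusion
    return payload
```

Because the check is attached to the head SHA in the original repo, the PR author sees the outcome inline without ever visiting the central CI repository.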
Well, we use this infrastructure, or maybe a more robust version of it, in our day-to-day work back at JIT. Not gonna talk too much about what the company does; that's not the point of the conference. But we analyze each PR, its contents, and understand, okay, which security measures
do I need to use? And we give our clients, and our own developers during our cycle, never-ending security measures using this exact framework. And of course, as I said, we convey that via a simple PR comment, which I think is a very friendly way
that we all are accustomed to. So thank you so much for your time. I hope this was short, sweet, and gave you some value. Before I finish, I just wanna say one last point regarding this whole talk.
However you apply the CI, CD, and other infrastructures to your code, don't always try to choose the lane that you saw in some Medium post or that was chosen before you. Try to see if there's a way you can use the best of both worlds, and you'll be quite surprised to see it's a lot easier than you think.
It might not be one click, it might not be one day of work, but this whole infrastructure took us one and a half days to implement, and it's still used to this day, of course at scale, with many more abilities, and a DB that continuously analyzes the results and gives us really great feedback. So just to package the whole thing,
if you are intrigued about how we do things at JIT, if you're intrigued about cyber security, if you're intrigued about other CI, CD pipeline solutions, I'll be here the whole day and I'd love to talk. I prefer people over code, and I'd just love to meet each and every one of you. If you're inspired, you have questions,
I'll be here all day. That's it, short and sweet. Thank you so much for your time. Do we have any remote questions, just quickly? No?
So, can we just use the mic here, please? Yeah, a great talk, and especially great advice regarding the Medium article. My question is, and thank you, it's a great outline, awesome,
how do you cater to different requirements of those repos, like one repo may require, let's say more resources on CI, so much beefier CI box than other repos? Yeah, so first and foremost, thank you so much for the question, what's your name? Pablo, okay, thank you so much.
Pablo brought up a very interesting question: I put everything in the same organization while clearly they have different needs. I've done that just for the demo's sake. In our real operational setup, the area that runs the actual checks themselves, the logic part of it, is in a completely different organization. They have a designated GitHub token
that allows them to communicate, and it has its own threshold and amount of checks it can run, so they don't really affect one another, but as you saw, maybe I'll show you later, in the YAML itself, we define all the infrastructure it needs, it runs it once and kills it at the end.
Hopefully that answer gives you more context. Hi, Christian. You talked about GitHub in particular; what about other providers,
like GitLab? Yeah, again, thank you so much for the participation. This is the fun part of the talk. This entire flow is applicable in all the big three SCMs. It's slightly different, it requires a little bit more work, but we've done the exact same thing in GitLab for a specific POC, when we maybe wanted to switch over,
but yeah, it's applicable in all three. I just wanted to show it specifically on GitHub because it's the easiest and most popular. Hi, thanks for the talk, and your energy throughout was really good.
I can definitely see why you might need this, especially in a monorepo microservices type setting. I think there are maybe some benefits that you have when you have the CI design coupled to the code. You know exactly what the configuration was for the CI
for a specific snapshot of what it was run against. Do you think that cost is always worthwhile with this kind of setup? Yeah. First, again, thank you so much for the participation.
That's the fun part of the talk. Wonderful question, and the simple answer is yes, and the even simpler answer is, as you saw, we configured all the different checks in our central CI. If you wanna do a very specific check that's tailor-made to a specific service, et cetera,
et cetera, you can configure a very specific YAML file. You'll need to write some ugly code back in the brain part of it to choose what to run, but fine, sometimes not everything is best practice; we actually do it quite regularly. A side note: back at JIT, we orchestrate open-source security tools.
All those security tools, in the end, have their own infrastructure and tests, and they need to adhere to some CI. So in the central CI, we have specific checks that are relevant to each one of those tools. You can't always have the same test for everyone, so yeah, we have very specific checks for a good reason.
Hey, yeah, first of all, thanks for the great talk. Very interesting technique. I'm just wondering how you, for example, scope the deploys of the backend for managing the CI. So I could imagine where this becomes a single point of failure for your entire
test and deploy pipeline, where, say, you block your entire company's deploys because you updated this CI backend and made some mistake, whatever. So I guess two questions: how do you test the CI backend itself? Does it use itself to do that as well? And yeah, do you have some ideas
on how to maybe have deploys per team, so that at least you don't take other CI pipelines down by doing that? Okay, first a good question, then an interesting suggestion. Thank you so much for participating. Yeah, the CI could become a bottleneck.
It's one of the few joys of working in small teams. The CI itself checks itself. It's not great; we need to find a better solution for it. But yeah, we sometimes have this very weird thing where the CI fails. We don't take it for granted. We go over the logs and, like, oh, something wrong happened.
A great way to keep it stable is, first and foremost, for all the different tools we use in the CI, to use specific versions, not latest and stuff like that. I assume all of us have, at some point, been burned when the workers got updated, something like that. But it doesn't, of course, replace the basic testing; unit tests, integration, end-to-end, all those things happen.
But yeah, we need to do a better job of making the CI more resilient. And regarding your question of making different pipelines for different teams: we currently don't do it, but it would be very easily applicable. We'd just need to create a new GitHub app.
It would adhere to a different CI server, but it would eventually report to the same CI checks code base. I'm actually kind of excited to maybe try it out with my team. It's a good idea. Where I would use it: if you're part of a really big team that has different products,
you maybe don't want the same checks, or maybe you have one team that's a Python-led team and another that's a Java team, boom. So you want to have different checks for them. That's maybe where I would use such a thing. But a great idea. Yeah, awesome, thank you. Thank you. Hi, thank you very much for your talk.
We are fully multi-repo, and that works great for us, but very occasionally we have two PRs that sort of affect the same thing in two different repos, and then CI suddenly is an absolute nightmare for us, because one tool uses the master version of one, and the other uses the master version of the other,
while they really should test against both PRs. Do you have a solution for that in your current setup, to test two PRs on two different repos simultaneously? I'm sorry, I should probably answer quickly.
I actually don't see how that problem would arise in this architecture, because basically, and if I misunderstood you, please correct me: you have two different PRs, two different repos, whatever, that run pretty much the same kind of check, but in the end, both webhooks will be sent to the CI backend,
which will decide, okay, we want to run check X; they both will run the same check X, and you'll get the same result. No, the problem is that these two repos use each other as a dependency, basically. Oh, no, no, no, no. And then, and.
Stop that immediately. Sometimes you don't get around it; usually you do get around it. Yeah, as we said previously, the world isn't always best practices. I would love to sit with you and try to solve it by the end of the day,
but the kind of checks we've shown, sorry, just to bring the focus back a little. The checks we've shown are more cross-repo checks: for example, a linter, whose whole purpose is to make sure everyone adheres to the same styling guidelines, or very specific checks like security checks. We don't really use our CI to run the abilities of repo A on repo B or vice versa.
If you have a specific shared ability there, I would try to extract it to its own, for example, GitHub Action that you would run. That would preserve the decoupling. The only problem that will arise is that you will have to continuously update that GitHub Action as you change the code base. But that's a very fun thing to solve.
I'll talk to you later. Sure. Okay, there are multiple people waiting, so I'll talk to you as well. Thank you very much. Thank you.