Abusing GitLab CI to Test Kernel Patches
Formal Metadata
Title: Abusing GitLab CI to Test Kernel Patches
Series: FOSDEM 2020 (talk 309 of 490)
Author: Nikolai Kondrashov
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47494 (DOI)
Language: English
Transcript: English (auto-generated)
00:05
So we are going to talk about this thing, and about us. It's me, Nikolai Kondrashov, and Michael is sitting here; he'll join me later. Neither of us actually looks like that anymore.
00:21
So we come from the Red Hat CKI team, or CKI project, and we're a distributed team doing kernel CI at Red Hat. So why do we need kernel CI? Well, as most of you probably know, we do releases of distributions, and each release has a different kernel version.
00:42
That's a lot of kernel versions. Moreover, we are one of the major contributors to the kernel, and certainly the biggest one among distributions. So this shows a comparison of unique email addresses from SUSE, Red Hat, and Canonical that contributed
01:00
to the kernel. And this is commits. Each of those is per year, and Red Hat is the blue one. So we somehow have got to make this consistent and reliable. And if you look at this, this is the big queue of how code goes through the pipeline
01:21
towards Red Hat. Until not so long ago, our tests were only there at the end. The developers would throw the commits or the builds over to QA, who would test them and come back, and then they would retest. That takes a long time. And if you consider how long it takes for the whole code pipeline to digest one patch
01:44
through that and get to a release, it's a long way. So what we want to do is this: we want to do it fast, and provide as much feedback as possible, as fast as possible, so that ideally the bugs are caught before
02:04
even maintainers see them, at the moment the developers submit them. And that is hard, because there is just so damn much email. I mean, look at this.
02:21
I don't have anything against this message; it just shows the complexity of these things. So somehow we are supposed to put our webhooks in here. This is patch number 62 out of 114, and there is this amount of discussion going on. So somehow we are supposed to test those
02:41
and provide feedback to developers. This has been an ongoing discussion recently in kernel circles. For example, at the last Linux Plumbers there was one of those presentations that have been happening recently, by Dmitry Vyukov, who did a very good take on these issues.
03:01
I recommend you watch it if you're interested in the kernel development process; he makes good points. So what we've built is something like this. This is simplified, and I actually lost a slightly more complex slide on the plane here
03:20
because of how slides.com works. But never mind, this is very simple, and we'll go around it. Normally, if you just wanted to check the changes being committed to a git repo, that's kind of easy. So we have a bunch of repos we track, and we check if there are new commits
03:42
and we test those. That's fairly trivial. Inside of that we have a bunch of git repos for different releases, like RHEL 7, RHEL 6, RHEL 5, RHEL 8. And we also track upstream repos, mainly stable at this moment,
04:00
but we also track a few others. And we test those commits, of course; that's relatively easy. We just pull the repo and run our tests, about which I'll tell you a little later. Then there's the interesting part we just started with. Turns out it's not that hard. Well, it's hard.
04:22
It's hard, don't get me wrong, but you can do it more easily. So there's a typical mailing list, like the Linux USB mailing list. There's a message from a series; it looks like this. Turns out there is a project called Patchwork, which many of you probably know, and which is used by maintainers.
04:42
Most of all, they use it to track the patches as they're being processed, reviewed, and tested, and they check which patches were merged and which were not. So it looks like this. If you go to patchwork.kernel.org, there is a bunch of projects, and those projects can be mapped to a particular mailing list
05:02
or even a particular tag, like a tag in the subject being used on the mailing list. Well, at least it can be done this way; I'm not sure if that's upstream, actually. So if you go to the same Linux USB mailing list here, you can see, again, those patches, but this time they're organized into series. And if you click on one of those links,
05:21
you get to the particular patch series. There are two patches in there, and we can go to a specific patch and see what's been going on, what the patch is, and we can download the mbox there on the right, or the whole series. And the main thing that concerns us about this
05:40
is that Patchwork has a REST API. So we can go through those projects, we can extract the patch series, the patches, and everything, and we can track when they appear. This, of course, sounds very simple, but the devil is in the details: you have to expect that not all messages come through at the same time.
06:02
When you go and check, the series might not be complete; there are bugs, people send all kinds of messages in there, sometimes they're not picked up, and things like that. Patch series can have a cover letter or no cover letter, and so on. But you can make it work.
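As an illustration of that kind of polling, here is a minimal sketch (not the CKI project's actual trigger) of watching one Patchwork project for newly completed series over the REST API; the endpoint, query parameters, and field names such as received_all and mbox are assumptions from memory of the Patchwork API and may differ between versions and deployments.

    import time
    import requests

    PATCHWORK_API = "https://patchwork.kernel.org/api/1.2"
    PROJECT = "linux-usb"          # hypothetical project slug
    seen_series = set()

    def poll_once():
        # Newest series first; filters and fields are assumptions (see above).
        resp = requests.get(f"{PATCHWORK_API}/series/",
                            params={"project": PROJECT, "order": "-id"})
        resp.raise_for_status()
        for series in resp.json():
            if series["id"] in seen_series:
                continue
            # Messages can arrive out of order, so skip incomplete series
            # and pick them up on a later poll.
            if not series.get("received_all"):
                continue
            seen_series.add(series["id"])
            trigger_pipeline(series["mbox"])

    def trigger_pipeline(mbox_url):
        # Placeholder: the real trigger would push a commit that starts a
        # GitLab pipeline testing this series.
        print("would test series from", mbox_url)

    while True:
        poll_once()
        time.sleep(300)    # poll every five minutes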
06:20
So the typical Patchwork trigger is tied to a particular Patchwork instance and a particular project there, and associated with the corresponding Git repo. Further on, we also have triggers from our package builds in Koji and the developers' builds in Copr, for Fedora,
06:42
and internally for RHEL, although it's called a little differently there. Pardon, let's go back. So the Fedora build system looks like this. There are a bunch of packages being built there, prepared for releases and reviewed. So we can look for the kernel.
07:00
There's our kernel. Let's take this one. And here's the information on the build of the kernel, like the specific revision and everything, and all the packages that were built for all the architectures. And Copr looks like this. Oh, wonderful, finally connected. So Copr is more for developers.
07:21
You can have your package built and put into an RPM repo, picked up by your users or by other developers, and we track those as well. So you can look for the /kernel there, find one of those, go in there and into the builds, and see there's been a build and here are the packages.
07:41
But we don't go through the web UI, of course. We listen to the Fedora message bus, which is used both by Koji and Copr, and internally at Red Hat there's a message bus as well. We just listen to the messages. Here's a log from our trigger, and it checks:
08:02
okay, a build completed, there's a message coming through the bus; we're not interested in this one, nor in this one, but here is the kernel. We pick it up and we trigger our pipeline.
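A minimal sketch of such a message-bus trigger, assuming the fedora-messaging Python library and Koji's buildsys topics; the topic string and message fields are recalled from memory and may not match the real schema exactly.

    from fedora_messaging import api

    KOJI_TOPIC = "org.fedoraproject.prod.buildsys.build.state.change"

    def on_message(message):
        # Ignore everything that is not a Koji build state change.
        if message.topic != KOJI_TOPIC:
            return
        body = message.body
        # Only completed kernel builds are interesting; treating state 1 as
        # "complete" is an assumption about the Koji message schema.
        if body.get("name") == "kernel" and body.get("new") == 1:
            print("kernel build finished:", body.get("nvr"))
            # The real trigger would now commit to the pipeline repo to
            # start a GitLab pipeline for this build.

    # Blocks forever, calling on_message for every message on the bus.
    api.consume(on_message)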
08:21
We also have to test our own CI, so we have two special kinds of triggers, for GitHub and GitLab, for testing contributions to our CI repos, of which we have many. As you can see, we have some on GitHub and some on GitLab, for historical reasons. So it looks simply like this: you submit a PR or MR, and the bot comes and says,
08:41
hey, I'm a bot, send me a message and I'll test this for you. So the developer says "test please, bot", the bot says "testing", and it triggers the pipelines for the various repos that we have. The developer can go have lunch or two, and then it finishes
09:01
and the bot says passed or failed. As it happens, here's an example of my pull request. So the bot comes in and tells us: here I am, this is what you can do. It's a little different for GitLab and for various repos, and it can say, oh, add this keyword or that keyword and we'll test this and that, things like that. And yeah, I am asking the bot to test.
09:23
So the bot says: yeah, I'm going. Then it posts the result, or failed, of course. So that's how we do testing for our own CI, because there are many repos, and because our own CI is in a separate pipeline, and we internally have two GitLabs, not repos, GitLab instances, actually handling this. Well, that's a complication, a fun one.
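A minimal sketch of the keyword check such a bot performs, assuming GitLab's note-webhook payload layout; the field names and keywords are illustrative, not the real bot's.

    TEST_KEYWORDS = ("test please", "retest")

    def handle_note_webhook(payload):
        # Only react to comments ("note" events) on merge requests.
        if payload.get("object_kind") != "note" or "merge_request" not in payload:
            return
        note = payload["object_attributes"]["note"].lower()
        if any(keyword in note for keyword in TEST_KEYWORDS):
            mr = payload["merge_request"]["iid"]
            print(f"bot: starting test pipelines for MR !{mr}")
        else:
            print("bot: ignoring comment")

    # Example payload, trimmed to the fields used above.
    handle_note_webhook({
        "object_kind": "note",
        "object_attributes": {"note": "Test please, bot"},
        "merge_request": {"iid": 42},
    })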
09:43
Further on, these are the two major parts of our pipeline: the test database, and the tool which lets us pick
10:00
which tests to run, and organizes everything into trees and other dependencies: what we can run with this build or that build, on this architecture or that architecture, et cetera, and whether developers want to test this or that. So basically the data flow is very simple.
10:21
We have kpet-db, which is a repo with the test information. It is currently private, because there is all kinds of stuff for RHEL-internal things in there, and we are still intending to open it up, but it's difficult to separate the upstream tests from downstream tests in a complicated data structure and somehow merge it back together.
10:41
So we have the database, which is basically YAML, and we have a tool which takes it and a bunch of parameters and then spits out the XML for Beaker, the system which actually runs our tests. This system is installed internally; it's open source, but you will have a very hard time actually trying to install it,
11:02
which we are working on very hard right now, and hopefully we'll be able to let you guys enjoy it. But it's all open source, it's out there, it comes with documentation. It's just that nobody has succeeded in installing it by themselves yet. So the database can contain information
11:21
about architectures, of which there are five right now. Then host types, and these describe what we want the host to have: do we want this much RAM, or this many CPUs, or this much storage, or even a particular PCI card or a network card?
11:45
We organize them into host types because that's easier. We have trees, obviously: particular repos or types of repos we want to test, and those affect which tests will run where; for example, one test can run on RHEL 7 but not on RHEL 5.
12:03
Some run on upstream, and some tests are still internal; not many though, most of them are actually out there. Then components, which describe what things the build contains: upstream only contains a kernel image, but internal builds are built using RPMs,
12:23
and there could be debuginfo, headers, the internal kernel headers and things like that, or tools that some tests need. For example, some tests need debuginfo, things like that, or some tests don't run on a debug build.
12:42
And then we organize tests into sets, of course: network tests, file system, memory, et cetera, virtual machines. And of course the descriptions of the test suites themselves, of which there are quite a big number, I assume around a hundred. These range from simple tests,
13:00
like just a shell script which restarts the kernel, tests something and is done, to very big ones like LTP or usex; the top ones are listed there, I guess, but those are not all, and they can contain thousands of tests. So here's an example of a test with its data, like a description and where it is; this is actually quite outdated.
13:23
Anyway, the essence is there: where it is located, for example this one is in our test repository on GitHub, where most of our tests are; which host type it runs on; additional information like,
13:42
I want this very specific host for this, and it could be down to a specific host name in our Beaker system, like: I want to run it exactly here, on this machine, because only this hardware is there. This description could look like this; actually this one says, don't run on these ARM systems
14:02
because they don't work. Then information on maintainers; well, this is a discussion for upstream, let's leave that for now. So these are test maintainers who look after the test and check whether it's working
14:20
or failing, and they actually receive copies of failure reports. They are supposed to take a look as soon as something happens and tell the developer: okay, sorry, that's my bad, it's a test failure. Or say: this is your problem. And they are responsible for those tests, which is going to be
14:41
important upstream, because upstream developers don't see that much into our machines and things like that, so they have a hard time figuring out what actually happened, which we are working on. Then there are the conditions for the test to run, like the sets it belongs to.
15:00
And this is also outdated, my gosh. This is an interesting part: we specify which source files the particular test covers, more or less, so that we can avoid running it when a patch doesn't touch those files.
15:21
That's how we can contain the runtime, at least a little bit, and make it shorter when we don't need everything. This allows us to describe when to run the tests, which architectures a test will run on,
15:41
and which trees it belongs to. There are no components here, because this is old. And there can be multiple cases of this, say I want to run it with this file system or with that file system, for a file system test for example, or with additional parameters.
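As a rough picture of such an entry, here is a hypothetical YAML sketch; the real database is private and its schema is not shown in the talk, so every key and value below is illustrative only.

    # Hypothetical test entry -- not the real kpet-db schema.
    filesystem-sanity:
      description: Quick sanity run for common file systems
      location: https://github.com/example/kernel-tests   # illustrative URL
      host_type: storage-large        # picked from the host-type catalogue
      maintainers:
        - name: Jane Doe              # placeholder
          email: jdoe@example.com
      waived: false                   # true while the test is being stabilized
      sets: [filesystem, storage]
      trees: [upstream, rhel8]        # where the test is allowed to run
      architectures: [x86_64, aarch64, ppc64le]
      sources:                        # only run when a patch touches these paths
        - fs/xfs/*
        - include/linux/fs.h
      cases:
        - name: xfs
          parameters: {FSTYPE: xfs}
        - name: ext4
          parameters: {FSTYPE: ext4}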
16:01
As for invoking the kpet tool: normally people don't invoke it by hand, it runs in the pipeline. You can say, okay, generate me the XML for this run, for this kernel tarball, for the upstream tree, aarch64, with this patch. And here's the output; it looks something like this, and it goes on and on and on.
16:21
I'm not going to bore you with those details. This is the input to Beaker, saying what to run, where to run it, and in which order. So, on to Beaker: it's a big system which maintains an inventory of the hardware, down to the components, lets you match that hardware, and has access control,
16:44
so particular groups have access to this hardware, others to that hardware, and for example some NDA hardware can be in there and protected. It also does the provisioning: it boots up the machines and installs the operating system from scratch
17:04
using Anaconda, normally, because we don't support running from images; that's hard to do, and we are a distribution, so we have to test the whole distribution from install. So it installs everything, it talks to the test harness,
17:21
extracts test results and things like that, and looks after the machine, so that if it does lock up, it releases the machine and erases everything. So here's the system inventory; we can find those machines. These are not very useful right now, this is Itanium, but we still have those.
17:43
You can go into a machine and take a look at the details; this is just one tab of the host information, and there's the CPU info, storage, peripherals, things like that. And this is an example of some of our jobs running for the stable repository of the Linux kernel.
18:04
This one job is just for one architecture and it has four hosts, and here's an example of one host executing those tests. This is the Beaker UI; there's a bunch of tests there.
18:21
Further on, now we're approaching the user-visible stuff. We have a reporter which watches over the pipelines, checks which stage they're on and which job they're at, and sends the email reports to developers or whoever is interested. Sometimes it can send an early email saying,
18:42
okay, we have started this test, watch out; or, we did the test; or, something failed in the pipeline. So here's an example of a successful report that was sent to the stable mailing list. It starts with saying:
19:01
we took this repo, we took this commit, and there's the summary: everything went fine. We actually compiled and tested those commits, and then we ran them on these hosts: this architecture is aarch64, first host, second host; ppc64, two hosts; x86.
19:23
x86 got more hosts, four hosts. There's also the notion of waived tests: a test which you mark in that kpet-db saying this test is waived, which means run this test as normal, maybe at the end of the run, but ignore the result
19:43
and don't take it into account when giving a verdict; whether it failed or not, we ignore it. We use this to test the tests which were just introduced into the system or were being fixed, so that we can track how they are performing, whether they are doing okay,
20:01
and their test maintainer can look after them until they stabilize; then we remove the waived status. This is done manually, because tests are different and you have to look after them. So that's an example of a report that we send upstream. Our internal reports are a little more elaborate;
20:21
you actually get to see links to the Beaker results and explore the logs and everything. But these tests do have artifacts; there's a blue link there, and these contain the binaries, config files, logs, things like that.
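The waived-test rule described a moment ago is simple enough to sketch; this is an illustrative reimplementation, not the pipeline's actual code.

    from dataclasses import dataclass

    @dataclass
    class TestResult:
        name: str
        passed: bool
        waived: bool = False

    def overall_verdict(results):
        """Fail only if a non-waived test failed; waived tests still run and
        are reported, but are ignored for the verdict."""
        return all(r.passed for r in results if not r.waived)

    results = [
        TestResult("boot", passed=True),
        TestResult("ltp-lite", passed=True),
        TestResult("newly-added-suite", passed=False, waived=True),  # ignored
    ]
    assert overall_verdict(results) is True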
20:40
And then, finally, we have the data warehouse. It's a system which uses PostgreSQL and collects all the information about our runs. It's similar to the reporter, and it's kind of a duplication
21:02
of the effort at this moment, but we are working on that. So it goes over all the jobs and collects information like how it went, what the status is, and how many tests ran. And there is a web UI which looks something like this and provides statistics: how much failed, how much succeeded, the pipelines and the various statuses.
21:24
This has been pulled from GitLab using the GitLab API. And here's a particular pipeline, listing all the tests and all the hosts, and you can go and see in Beaker how the results went.
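A minimal sketch of that kind of collection, assuming the python-gitlab library; the instance URL, token, and project path are placeholders, not the project's real values.

    from collections import Counter
    import gitlab

    gl = gitlab.Gitlab("https://gitlab.com", private_token="REDACTED")
    project = gl.projects.get("example-group/kernel-pipelines")   # placeholder path

    pipeline_stats = Counter()
    job_stats = Counter()

    # Tally pipeline and job statuses, roughly what the data warehouse
    # collector gathers before storing it in PostgreSQL.
    for pl in project.pipelines.list(per_page=50):
        pipeline_stats[pl.status] += 1           # success / failed / canceled ...
        for job in project.pipelines.get(pl.id).jobs.list(all=True):
            job_stats[(job.stage, job.status)] += 1

    print(pipeline_stats)
    print(job_stats)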
21:42
We also maintain the test statistics, how tests were failing or passing, exactly for the purpose of deciding when to waive a test, if it has been doing badly, and then send it back to the maintainer and say: okay, deal with it. Or we can actually take it out of the waived state
22:01
if it's doing okay. And the same for hosts: some host might be misbehaving in Beaker, and that's a problem, because there are just so damn many hosts that they break and you have to watch out, and the host maintainer, whoever maintains that host, doesn't have time to look after it. So we look at those and we say: okay,
22:20
this host should be excluded from the runs, and we add a "don't run on that host". So finally, the title of this talk. No, not here; this actually connects to Guillaume's talk. So there is this thing, you've probably seen it on the last slide:
22:42
it's kernelci.org; they run lots of tests, and they were recently accepted as a Linux Foundation project to advance the state of kernel CI. We joined that effort, and right now we're working on a database
23:00
and the system used to aggregate testing information from various CI systems. That's the ultimate goal: that there is a single place to go and check kernel CI results from whoever runs those tests, and that developers get one single email
23:22
with those results, and not five emails from everyone. Right now this is mostly the kernelci.org folks and us, but others are joining. Hopefully we'll soon start aggregating more data, but we already have a tool; you can take a look at how this looks.
23:40
I don't know if Guillaume showed this, but this is an example of how the test reports, at the top level, could look there. We took the Google BigQuery system for storing those results, so they are more readily publicly available, so that people can go and explore the data
24:02
and see how the kernel is doing, and do research if they need to. So this is our repo with the code for that. It looks something like this when it's pushing the data, and we are working on a dashboard to show this off and provide it to developers.
24:22
This is very rudimentary at the moment. Finally, the interesting part; it took a little while to get here. So we store our CI pipeline in YAML, but we store it in separate repos, because of the way we trigger those pipelines.
24:41
To trigger GitLab, we're actually doing commits to the repo, and I'll show that in a moment. So basically these are two repos, and the repo on the left only includes pieces from the other repo.
25:00
This lets the triggers do commits with the information about what we want to test inside that repo. And we need two repos so that these commits don't interfere with the development commits we have, because every time you want to test something there is a new commit,
25:20
and that's an empty commit; it doesn't have any data in it. We use it just to identify the particular pipeline in the GitLab view. So for example the baseline trigger, the Git repo trigger, comes in and sees that there are changes in one repo and another, and does commits to separate branches in that repo.
25:42
This trigger is actually retired now, but it was quite interesting. It also checks, finds something, and does the commit. There's the trigger that finds patches and does a commit in that same branch, and finally the GitHub bot comes and finds, okay, there is a new merge request, and puts it in all the branches
26:01
that we're interested in testing. It might look like this: for example, the stable branch has those commits, all with pipelines running, and this branch has its own commits here. And the commits can look like this. There is data in there, but it's not for GitLab's consumption, only for us, for debugging. As it says here, it's all the variables that we put in there, all the descriptions of what we are triggering on, things like that.
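A minimal sketch of that trick, illustrative rather than the real trigger code: an empty commit on a dedicated branch carries the trigger variables in its message, which is enough to start a pipeline without disturbing any development history.

    import subprocess

    def trigger(branch, variables):
        # Record what we want tested in the commit message; the commit itself
        # carries no changes, it only exists to start a pipeline.
        message = "trigger: automated test run\n\n" + "\n".join(
            f"{key}={value}" for key, value in variables.items())
        subprocess.run(["git", "checkout", branch], check=True)
        subprocess.run(["git", "commit", "--allow-empty", "-m", message], check=True)
        subprocess.run(["git", "push", "origin", branch], check=True)

    # Variable names are illustrative, not the pipeline's real ones.
    trigger("stable-queue", {
        "KERNEL_REPO": "https://example.com/stable.git",
        "KERNEL_COMMIT": "deadbeef",
        "PATCHWORK_SERIES": "1234567",
    })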
26:21
only for us as debugging, like this says. All the variables that we put in there, all the descriptions, like what we are triggering on, things like that, this one is huge, and this one is big as well, it's abbreviated. So we use a lot of GitLab extends property,
26:42
which lets us separate the general pipeline into a menu of jobs and stages, and into the specific information, like our pipeline-specific information. This is our shortest pipeline,
27:01
and it says: okay, pick those four stages from the pipeline, and we have ten, maybe more. And this says: pick this prepare step, where we download all the stuff for the execution, all the dependencies; and this says,
27:21
this is the prepare job, and this is the prepare stage. And in this one we say: okay, again, pick this createrepo x86_64 job and extend it; and this is one of those saying: take this repo, take this template of the job and a bunch of variables
27:43
and conditions, and create a particular job for this specific pipeline. And this time we are using merge keys to merge those. So we're using extends here, because this is a separate YAML file and we cannot use merge keys across files,
28:01
and we use merge keys here, where it's the same, one big-ass YAML file. And this would be a createrepo job, creating an RPM repository with the build results for testing, which are then installed in Beaker. The next stage is composed a little differently;
28:22
we have a huge script which is split into a few YAML objects, and finally the last stage looks similar to that, and so on. We have pipelines which are much longer than that, and more involved.
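A cut-down sketch of that pattern, using standard GitLab CI syntax; the stage, job, and script names are illustrative, not the real pipeline's.

    stages: [prepare, build, createrepo, test]

    # "Menu" templates, normally kept in a shared YAML file and picked up
    # by the pipeline-specific jobs via `extends`.
    .prepare_template:
      stage: prepare
      script:
        - ./pipeline/download-dependencies.sh   # hypothetical helper

    .createrepo_template: &createrepo
      stage: createrepo
      script:
        - createrepo_c artifacts/repo

    prepare:
      extends: .prepare_template

    createrepo_x86_64:
      <<: *createrepo          # merge keys only work within one YAML file
      variables:
        ARCH: x86_64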
28:40
So why did we take GitLab? Well, we started out with Jenkins: we had a Python script controlling Jenkins, which had a job written in Groovy, which controlled another Python script, which checked out the kernel and built it,
29:01
and then handed it off to Beaker. That was not very reliable, hard to debug, hard to understand, hard to maintain. In contrast, with GitLab we have a relatively straightforward system: we can keep everything in the Git repo
29:21
and keep changing it faster, and well, it's more reliable than Jenkins for us. And I hope Michael is able to say something. We have 10 minutes left. Okay.
29:43
So now you will hear me complaining about GitLab, but. So yeah, there we go, there we go. So I don't know how many people use GitLab here. And how many people have used GitHub? Yeah.
30:01
So it's a very familiar system, it's nicely documented, it has a huge API surface. People are familiar with it, so if I say GitLab, people actually know what I'm talking about. If I mention some other CI technology, people look at me like this. If I mention Jenkins, people just go away and say, no, I don't want to be on your team.
30:21
Hopefully there's nobody from Jenkins here. Sorry. But then, we are actually testing kernels, and that's quite similar to testing other software in some aspects, but not in other aspects, especially around testing itself. There are some interesting issues that you will see there.
30:40
So one is that GitHub, GitLab, most of these general CI systems, don't have any concept of a pipeline that failed because of infrastructure issues. If you look into what distributions do for gating, most of those actually have a "test failed" state, where the maintainer fixes it, and then they have something like: oh, our infrastructure failed, or our test system failed,
31:03
and then it's actually for somebody else to fix. Now, most of you might know that kernel maintainers don't react too well if you email them without any good reason; if you email the Linux kernel list with infrastructure issues, they get pissed quite easily. So we really want to avoid that, and that is not very easy, so you actually need
31:22
to put stuff around GitLab to make that happen. On the slide, on the left, you see what the test system actually gives you, which is Beaker in our case, and then there's this one missing. Such a code in Beaker actually means: the hardware messed up, the kernel didn't boot for whatever reason, or the distribution didn't boot,
31:41
or there was some power surge or whatever, or we actually messed up our general infrastructure or had networking issues, stuff like that. And that's not in the system, and might never get in there, because it's not something that you would normally have for your average project. Can you just add a little bit about that? Yeah, sorry. So one consequence of that is that GitLab,
32:02
yeah, okay, sorry, ah, this is the slide, sorry. So the thing is, GitLab, GitLab CI, has infrastructure issues of its own, and you can select: okay, you can restart on this failure, on this failure, on this failure. But it actually doesn't matter to me
32:22
which failure GitLab restarts on, because that's GitLab's own thing. It can fail on various things, but it doesn't allow me to say: okay, restart on this issue or on that issue. I can only say the test passed or failed. And that probably comes from what GitLab is intended to be used for; it's like, a test ran,
32:42
nothing bad can happen, it's just running tests on simple software. But for us, if you remember that job: just one architecture, four hosts for one architecture, and there are two, three, four architectures more than that. And what GitLab does is it just kills us.
33:03
Yeah, and that stays there. And that's a separate slide, I'm mixing up the issues, but basically we cannot tell GitLab: okay, we had an infrastructure issue, can you restart? That's a big deal for us.
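A minimal sketch of the kind of wrapper you end up writing around GitLab for this, illustrative only: classify the outcome of a Beaker run and retry only infrastructure problems, since the GitLab job itself can ultimately report nothing more than pass or fail. The status names are simplified, not Beaker's exact vocabulary.

    import sys
    import time

    INFRASTRUCTURE = {"machine-lost", "install-failed", "watchdog-expired"}

    def run_beaker_job():
        """Placeholder for submitting a Beaker job and waiting for its outcome;
        would return "pass", "fail", or one of the infrastructure statuses."""
        return "pass"

    for attempt in range(3):
        outcome = run_beaker_job()
        if outcome == "pass":
            sys.exit(0)          # GitLab job succeeds
        if outcome == "fail":
            sys.exit(1)          # a real kernel or test failure
        print(f"infrastructure issue ({outcome}), retrying...")
        time.sleep(60)
    # Give up; in GitLab this is indistinguishable from a test failure.
    sys.exit(1)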
33:20
There are more interesting things. As of the beginning of January, we are producing 30 gigabytes of artifacts a day: kernel builds, RPM builds, all kinds of stuff. And if you use a shared GitLab instance, like gitlab.com or what have you, and in Red Hat we have a couple of those, people might not have the storage available.
33:42
So we would want to store them outside, like in S3; if you build in the cloud, you want to keep them there, and you don't want to incur the transfer costs of moving them in and out. That is not possible at the moment in GitLab: you can configure it per instance, but not per project. And it goes on with things like these. So there are certain things where it doesn't really match
34:02
very well with kernel testing. You try to work around it; it's possible, but it gets uglier. You can take a look at the code, it's on GitLab; don't blame us for how we did it. It's also really hard to upgrade, because if you have pipelines running for a day,
34:21
if you do an upgrade, the GitLab runner will stop accepting jobs, and then you don't get any builds or tests for a day. You can work around it, upgrade different runners at different times, stuff like that. What else do we have? We just skipped that one. Oh yeah, that's an interesting one: most test systems don't expect the test to reboot.
34:42
I don't know why, but kernel tests actually reboot a couple of times; it's something that kernel developers think is useful. So you need to boot into your kernel, but then you might also restart a couple of times in there. So you can't really have your test harness,
35:01
like the GitLab part, be restarted itself. So you need another indirection: you just start another VM, or have the GitLab part outside of your testing system, in this case Beaker. Otherwise you could just put it inside your hardware lab, which you can't do at the moment.
35:21
Yeah, maybe we stop here and take some questions. Otherwise I'd just complain about other bugs.
35:54
Now the question is how much time do you actually gain from having CI?
36:01
And I think depending on who you ask, there might be a different answer. Developers would most likely say it doesn't help us at all in the beginning, especially now, where you might actually get infrastructure issues giving you false positives. But I think we find a couple of issues a week. Actually four, maybe.
36:21
Four a week where kernel developers were really sure that they got it right, but they didn't. That could be patches posted to the mailing list, or it could be something merged into stable, for example into the stable Linux tree. The ultimate goal is actually to free resources
36:42
inside of RHEL, because upstream patches pour in. So we want to provide feedback outside of it, on stuff that never actually goes through the whole pipeline; right now we only find out about it when it's already merged and built into an RPM. The ultimate goal is to have the work done upstream,
37:03
which is what you'd most likely want. Okay. So, you mentioned you were running Jenkins before...
37:22
We just rewrote everything and then switched. Yeah, the question was how painful the migration was. Like we took some of the tools that Jenkins was using, and we used them in the new pipeline, but we rewrote everything that was in Jenkins there, because you can't really run that in GitLab.
37:44
So we had to rewrite the big part, and we had to rewrite the triggers and things like that. We replaced the separate tool, which was controlling Jenkins with those little triggers that I showed you. Any more questions?
38:01
Yes? The question was: is GitLab working on those issues? So there is an issue that particularly pisses me off, which is that GitLab simply kills the runners. Yeah, yeah, this one.
38:21
They simply kill the runners with SIGKILL. For us, that's a runner that's controlling Beaker resources, like, I don't know, 10 hosts that are running those tests for hours, and we just forget about them because of that.
38:40
GitLab just forgets, like: oh, whatever. And that host is occupied for those hours, so we cannot clean up. This bug has been open for years, I think, and they're promising they will fix it soon, so I hope they will. Yes, somebody will be there. Okay, our time is up, so catch us in the corridor.
39:00
Okay.