
Bootstrapping Debian-based distributions for new architectures


License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Abstract
Source packages in Debian-based distributions make two assumptions: that they are always compiled natively and that the full distribution is available at compile time. These assumptions conflict with the requirements during a port of a Debian-based distribution. Some source packages must be cross compiled for a minimal build system from which the rest can be compiled natively. Since the full distribution for the new architecture is not yet available, dependency cycles have to be broken during cross as well as native compilation. As support for either is not yet part of the source package description, porting was so far a long, manual and error-prone process. This project aims to develop tools and techniques that make the process of bootstrapping Debian-based distributions for a new architecture from scratch automated, repeatable and deterministic. Starting off from zero, a minimal build system is cross compiled. Through native compilation and breaking of build dependency cycles, the rest of the distribution is built. The talk will cover the current status, the techniques and algorithms used as well as future developments of these tools and what is missing in Debian to make bootstrapping easier in the future.
Transcript: English (auto-generated)
All right. Welcome everybody, and welcome to my talk on bootstrapping Debian-based distributions for new architectures. My name is Johannes Schauer. You will find me on IRC under the nickname josch, and I am currently doing my master's studies at Jacobs University in Bremen.
So as an overview, the whole project started as a Debian Google Summer of Code project in 2012 and was later continued as my master's thesis at Jacobs University. My mentors are Wookey and Pietro.
Wookey covers the practical side of things, the actual bootstrapping work, and Pietro Abate the theoretical and academic side of things, as he is also the guy who mainly contributed to Dose3, which you might know.
So the bootstrapping problem is hard because in Debian and usually in binary distributions, there are two assumptions that are made when source packages are compiled. First of all, source packages are expected to be natively compiled. And second of all, source packages
are compiled with the full archive of binary packages available. Neither holds, of course, when such a binary distribution, in this case Debian or a Debian-based distribution, is bootstrapped for a new architecture, because then at least a minimal build system must be cross-compiled, and only a few binary packages, namely those of the just cross-compiled minimal build system, are initially available. And because source packages tend to depend on the maximum possible amount of binary packages, and those binary packages might build from source packages that cannot be built yet because other source packages must be built before them, dependency cycles are created. To get an overview of the problem size, this is the current status of Debian Sid as of January 1, 2013. It is a strongly connected component,
which means that every node in this dependency graph is involved in a cycle with every other node. If someone today wants to bootstrap Debian Sid for a new architecture, they have to solve that. As we have seen in the past, a human can solve it, but only with lots of effort and months up to a year of struggle. We also see that the size of the graph you just saw increases over time. It starts here in 2005. That is data I got from snapshot.debian.org; it covers Debian Sid and goes until the 1st of January 2013. We see that the problem size generally stayed constant before a new release was done, which is expected, because then no new features were introduced anymore. But we see that, generally, the problem size facing a bootstrap of Debian or Ubuntu seems to be increasing over time. We don't know how that will develop in the future, but it doesn't look good.
So far, whenever bootstrapping was done, the practices differed depending on which architecture was being bootstrapped and what was available. Methods include using Gentoo or OpenEmbedded, which could be compiled for that architecture, to avoid cross-compilation and then compile Debian source packages natively on that foreign system. It continues with manual dependency cycle analysis. At least from Thorsten Glaser I know that he has loads of paper on which he manually draws dependency cycles and dependency graphs to find out where best to break this huge strongly connected component so that he can build something. It involves manual hacking of source packages, because source packages in Debian or Ubuntu are normally just supposed to be compiled with all binary packages, all dependencies, available, and there is no information about which build dependencies can potentially be dropped if needed. So you have to go through those source packages manually, take a lucky guess at what you think might be droppable, maybe some GTK stuff, some documentation stuff, and see where you get when you build that source package with the reduced build dependencies. And the process so far can take up to a year to complete, which is, as one can see, a daunting task. The bootstrapping problem is also not something
that is done only rarely. It comes up about every year for Debian, at least, which means it is not unimportant that it becomes simpler. Now, what would happen if it became simpler? Foremost, of course, porting Debian or Debian-based distributions to new architectures would be easier. It would be faster and deterministic; it would not be so painful anymore. But if it is easier, then we can also expect other benefits, like more subarch builds, which are optimized for a specific CPU, like the Raspberry Pi. What else would be nice? While we are at the topic, well, we could remove the need for Gentoo or OpenEmbedded and thereby make Debian more universal, because it can then bootstrap itself without needing other distributions. We can update lagging architectures.
And we can have a quality assurance tool which continuously checks whether or not the archive is bootstrappable in its current state. Now, the essence of this talk is that we now have the algorithms to automatically unwind the whole mess of a graph that you saw before and turn it into a build order. The tools written since last summer are in OCaml, with some scripts in Python using apt-pkg and some shell. It's all LGPL. Dose3 is heavily used as a helper library; I would not be close to where the whole project is right now without Dose3 and without the contributions by Pietro Abate. It's all online in a Git repository.
More specifically, with those tools we can now create and analyze this dependency graph. So we give developers a way to generate the graph and to search it for places that might be worth looking into further, to see whether a build dependency can be dropped or not. It enables the user to find the source packages that could potentially be modified and, once that has been done for enough source packages, to create a build order out of the then acyclic graph.
Now, this is all only theory. The stuff that I do works only on the package metadata, that is, the build dependency and binary dependency metadata that one can retrieve. And whenever I say compile or install, I don't actually mean compile or install, because nothing is actually compiled or installed; what I check is just whether the build dependencies or binary dependencies of source and binary packages, respectively, can be fulfilled. So it's all very theoretical work so far.
Now, to test all this in practice, and of course we want to test it in practice, some more things are needed. First of all, some more multiarch is needed so that we can analyze the cross phase, which is not possible yet because there are some multiarch conflicts. Some packages need to be made better at cross compilation for the minimal build system. And of course, there needs to be a way to write down in the source package metadata which build dependencies can potentially be dropped if needed, for example during the bootstrapping phase. The syntax for that is still being debated, but a final decision is needed to test all of this in real life. Now, the whole process of bootstrapping a binary distribution for a new architecture
starts with cross compilation: cross compilation of a minimal build system, a system which contains just enough binary packages so that we can start compiling source packages natively. For Debian, this is implicitly all binary packages which are Essential: yes, the binary package build-essential, and, not strictly required but with over 80% of the archive depending on it, debhelper. Surely there is some way to make some packages build without debhelper, so that we would not need debhelper in a minimal build system, but in practice it is presumably easier to include debhelper in a minimal native build system than to solve the dependency cycles needed to build all of its dependencies natively. For cross compilation, we do the selection. Actually, any selection is possible; the tools allow the user to make any selection here
easily. And the next step is to get the co-installation set of those packages, which means we get a set of binary packages which will satisfy all the dependencies of this package selection. Next step is to get the source packages that build those binary packages.
Now comes an algorithm that is used twice, so it's important to understand it. The algorithm starts with the source packages that were selected in the first step on the last slide and works as follows. First, all the source packages which were initially selected are added to the result. Then we get the foreign (cross) build dependencies of those source packages, and we get the source packages that build those binary packages. If those source packages are not yet in the result, we go back to step one and do the whole process over again. So it iterates: it takes the source packages, gets the binary packages they need, gets the source packages to build those binary packages, checks whether those are already in the result, and if not, does the whole thing all over again, until we end up with a final list of source packages which is enough to cross compile the initial selection of binary packages. Now, that is the theory.
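A minimal sketch of this closure in Python may make the iteration concrete. The helper functions here are placeholders for queries against the package metadata (for example via Dose3 or apt), not the actual interfaces of the tools:

    # Conceptual sketch of the closure described above; the helpers are
    # placeholders for metadata queries, not the tools' real interfaces.
    def source_closure(selected_sources, build_deps_of, coinstallation_set, sources_of):
        """selected_sources: initial set of source package names
        build_deps_of(src): binary packages needed to build src (cross or native)
        coinstallation_set(pkgs): binary packages satisfying all dependencies of pkgs
        sources_of(pkgs): source packages that build the given binary packages"""
        result = set()
        todo = set(selected_sources)
        while todo:                               # repeat while new sources appear
            result |= todo                        # 1. add the selected sources to the result
            binaries = set()
            for src in todo:                      # 2. collect their build dependencies
                binaries |= coinstallation_set(build_deps_of(src))
            todo = sources_of(binaries) - result  # 3. sources that build those binaries
        return result

The native phase later reuses the same loop, only with native instead of cross build dependencies.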
This process so far fails because of multiarch conflicts. Due to multiarch conflicts, this analysis is not yet possible. So what we are doing for now is to ignore the cross case, and we just assume that the minimal build system we select can magically be cross-compiled from nothing. As I heard from Wookey, that is not too far from the truth: there are only a few more binary packages that in practice need to be cross-compiled in addition to create a minimal build system. So the assumption is that the minimal build system can be created from nothing. But later, when this is solved, we can use the exact same
tools I will now explain for the native phase in the cross phase as well. But it also seems that the cross phase is sufficiently easy to not need that, but that is to be seen. So now that we assume we have a minimal native build system,
we start from that and create and analyze a dependency graph. Debian is huge, and Ubuntu has even more packages: Debian Sid so far has 18k source packages and 38k binary packages. So to reduce the problem size and to make execution of all the tools a bit faster, we create a so-called reduced distribution. The reduced distribution is simply a selection of source packages and binary packages for which the following two conditions hold: all binary packages must be creatable from the source packages we selected, and all source packages must be compilable with just those binary packages available. So it's a mini distribution which is self-contained, because we can compile all source packages in it, and all binary packages in it can be created from those source packages. This dramatically reduces the problem size.
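As a rough illustration, the two conditions could be checked like this; built_by and build_deps_of are again placeholder metadata lookups, not the real implementation:

    # Sketch of the two conditions a reduced distribution must satisfy.
    def is_reduced(sources, binaries, built_by, build_deps_of):
        # every selected binary package is built by a selected source package
        created_from_selection = all(built_by(b) in sources for b in binaries)
        # every selected source package is buildable with just the selected binaries
        buildable_from_selection = all(build_deps_of(s) <= binaries for s in sources)
        return created_from_selection and buildable_from_selection

A reduced distribution is then simply a selection for which this check succeeds.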
It also gives a name to a small distribution which is sufficiently significant for an initial bootstrap of Debian or Ubuntu or anything else Debian-based. How significant such a reduced distribution is can be seen from this graph. One might think that the minimal possible distribution for which those conditions hold is very small, but it actually is not. As of 2013, the reduced distribution for which the statements on the last slide hold is about 2,000 binary packages and more than 500 (about 600) source packages big. So we already get a useful chunk of the full distribution, which people can already work with. So we go through the same steps as we did for the cross phase.
We start off with the minimal native build system, and with selecting the packages we want to have available in our native system in the end. Just to keep things simple, we again start with those same packages. Again, you can put anything there: you can put your whole GNOME desktop there and it will be available later, or KDE, or really whatever you like, and it will be available at the end of the analysis. Again, we compute a co-installation set, and we get the source packages. And then we again execute this algorithm, which works just the same way as it did for the cross phase: we add the source packages to the result, we get their build dependencies and the source packages to build those build dependencies, and if those new source packages are not yet in the result, we go through the whole process again. Really simple. Now, once we are that far and we again
have a selection of source packages which are in the problem set according to the aforementioned algorithm, we create the dependency graph. The dependency graph, as we model it, contains two different kinds of vertices. One vertex kind represents source packages, and the other vertex kind represents installation sets. An installation set is a set of binary packages which fulfills the installation dependencies of a single other binary package. When I talk about installation set nodes, it is important to note that each one stands for a group of binary packages. The edges come in two kinds as well. A build-depends edge points from a source package to an installation set, so it indicates what build dependencies a source package has by pointing to the respective installation
sets. And there are builds-from edges, meaning that the binary packages within an installation set node are built from certain source packages. Now, the build graph is created by connecting all source packages to the installation set nodes of their build dependencies, as I just explained, except that all installable build dependencies are left out. All build dependencies, that is all binary packages, which we can already install are not part of the problem anymore: because they can already be installed, they are apparently already available by some means or other, so we discard them from the build graph. The next step is to connect all installation set nodes to the source package nodes they build from, again leaving out the available binary packages, which also makes sense, because binary packages which are already available, for example through cross compilation or because they are architecture independent, do not need to be compiled anymore, so they don't need a builds-from edge to a source package node. An example build graph looks like this.
This is an excerpt from the huge mess you saw before. As you can imagine, a couple of edges are missing; for example, from libxcb, I think it was, there are actually 50 other edges that point to other stuff, and a similar thing holds for Python. So the real thing is pretty huge; I just cut this part out to make my point here. Build-depends edges are drawn as dashed lines and builds-from edges as solid lines.
So we can see that Python 2.7, the source package, somehow build-depends on tk, and that the binary packages in the installation set of tk8.5-dev build from all those source packages around it. We can also see two dependency cycles: one here, where GDB actually has Python in its installation set, so this installation set node builds from the source package Python, while Python itself build-depends on GDB. And the same holds for the cycle of length 2 over there.
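A rough sketch of how such a build graph could be assembled from the model just described, with networkx and the same kind of placeholder helpers as before (this is not the actual OCaml implementation):

    import networkx as nx

    def make_build_graph(sources, available, build_deps_of, installation_set_of, built_by):
        g = nx.DiGraph()
        for src in sources:
            for dep in build_deps_of(src):
                if dep in available:                    # already installable: not part of the problem
                    continue
                iset = frozenset(installation_set_of(dep))
                g.add_edge(("src", src), ("iset", iset))            # build-depends edge (dashed)
                for binpkg in iset - available:                     # available binaries need no edge
                    g.add_edge(("iset", iset), ("src", built_by(binpkg)))  # builds-from edge (solid)
        return g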
So how do we break these cycles? We break them by removing build dependencies. The name that we have come up with so far is build profiles. Build profiles indicate profiles that source packages can be built with so that they can be built with fewer build dependencies, by disabling some features, for example through configure switches or similar, and we will probably go into more detail about that. Another way to break cycles is
by moving build dependencies from Build-Depends to Build-Depends-Indep. For everybody who doesn't know, Build-Depends-Indep lists all those build dependencies that are only used to build architecture-independent packages. And as we are bootstrapping, we don't need to build architecture-independent packages, so we can just rely on the hard Build-Depends and ignore all of Build-Depends-Indep. So if you have a source package where some build dependencies can actually be moved to Build-Depends-Indep, you should do so, because it makes bootstrapping easier. Another way is to choose a different installation set for non-strong dependencies. As binary dependencies of binary packages can contain disjunctions (a binary package can say: I need either this other binary package A or B to be installable), there are many different choices that can be made to create the set of packages stored
in every installation set. Here, just one choice is made. Of course, another choice might be possible, in which case this edge might not exist anymore. So one way to break the cycle, for example, would be if GDB could be installed without Python. I don't know offhand whether that's possible or not. Maybe it is a dependency which is necessary, in which case it is a strong dependency; that is the name used in the academic environment for dependencies that are necessary for the installation of a binary package. Another way to break a cycle, any cycle that you can see in here, is to cross-compile something. For example, if we could magically make GDB available, or tk, or the Python xcbgen stuff through cross-compilation, then either cycle would be broken. So that would be the fourth method.
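Just to illustrate the first method: with an annotation syntax along the lines of what was being discussed at the time (the angle-bracket notation and the package names here are purely hypothetical, not a settled format), a droppable build dependency might be written roughly like this in debian/control:

    Source: foo
    Build-Depends: debhelper (>= 9),
                   libbar-dev,
                   python-dev <!stage1>

Building foo with such a stage1 profile active would then omit python-dev, breaking a cycle through Python at the cost of losing that feature in the first build.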
Next, once we have the dependency graph, the user has to analyze it. And there are several ways to analyze the dependency graph and find source packages for which it might be beneficial to drop build dependencies.
One of the implemented algorithms simply finds the source packages for which the fewest build dependencies are missing. So it finds source packages where, say, only one build dependency is missing, and maybe that one can just be dropped, so that our problem size significantly decreases. Other methods use ratios. For example, if we could build the source package evolution without libmx-dev, then those 55 other source packages might not be pulled in. There might, of course, be other binary packages which point to them, but this is not taken into account here. So here we get the suggestion that it might be beneficial to cut this edge so as to possibly no longer have this set of packages in the graph. Another heuristic goes one step further. For example, it says that if we can build the source package tracker without dia, then dia would no longer
need to be built by the source package of dia, and these 22 installation sets would be potentially avoided. Another possible way to identify those cycles is by listing small cycles. For example, cycles of length 2 just
consist of one edge which is a build-depends edge and one edge which is a builds-from edge. Since we only break build-depends edges, there is only one single way to break a cycle of length 2. So by listing cycles of length 2, we already get to know a list of cycles where there is only one single way to break them, apart, of course, from choosing a different installation set or cross-compiling. So we already get a good idea of source packages where little other choice is possible.
Another way to analyze the graph is to create a list of edges which are part of the most cycles. For example, you tell the algorithm to calculate all cycles up to length 12, and you get back a list of edges ordered by how many cycles run through them. If an edge has about 600 cycles through it, then it would be nice if that edge could actually be broken, because we could instantly break 600 dependency cycles. And the last one is to automatically calculate a feedback arc set. Feedback arc set is a term from graph theory: it is simply a set of edges which, if removed from the graph, make it acyclic.
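Two of these analyses, listing cycles of length 2 and counting how many cycles pass through each edge, could be sketched roughly like this (g is a build graph as in the earlier sketch; a real implementation would bound the cycle search instead of enumerating everything first):

    import networkx as nx
    from collections import Counter

    def length_two_cycles(g):
        # cycles of length 2: one build-depends edge plus one builds-from edge,
        # so there is only a single build-depends edge that can be broken
        return [c for c in nx.simple_cycles(g) if len(c) == 2]

    def busiest_edges(g, max_len=12):
        # count, for every edge, how many cycles of at most max_len pass through it
        counter = Counter()
        for cycle in nx.simple_cycles(g):
            if len(cycle) > max_len:
                continue
            for a, b in zip(cycle, cycle[1:] + cycle[:1]):
                counter[(a, b)] += 1
        return counter.most_common()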
The next step is to create a build order. This is related to the feedback vertex set problem of finding a small number of source packages to profile-build so as to make the graph acyclic. The build graph then needs to be turned into a source graph. Here we see the example from before. We see that, well, I just forced those edges to be broken for the sake of the example: this edge and this edge here. We can now see that the graph becomes acyclic. As a next step, all installation set nodes get removed while the original connections are kept; in graph theory this is called a path contraction. And we get a graph here from which we can easily deduce a build order, because it is acyclic. The build order for this specific graph is listed further down here. By topologically sorting those vertices, we can deduce that we should first build Python 2.7 with a profile, which is what the star signals.
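The contraction and the topological sort could be sketched as follows, reusing the node labelling from the earlier graph sketch; this is only an illustration of the idea, not the actual tool:

    import networkx as nx

    def build_order(g):
        # contract away the installation set nodes, keeping only source-to-source edges
        src_graph = nx.DiGraph()
        for node in g:
            if node[0] != "src":
                continue
            src_graph.add_node(node)
            for iset in g.successors(node):        # build-depends edges
                for dep in g.successors(iset):     # builds-from edges
                    if dep != node:
                        src_graph.add_edge(node, dep)
        # a topological sort lists dependents before their dependencies,
        # so reverse it to obtain an order in which everything can be built
        return list(reversed(list(nx.topological_sort(src_graph))))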
I will show the profiles in a demo soon, and then all the others follow. Now I will show a demo which takes just 60 to 80 seconds.
So this takes about 80 seconds, and it is based upon the Debian Sid of the 1st of January 2013. It does exactly the same steps I outlined before, and it works on a reduced distribution.
It takes 75 seconds. For a full distribution, as you can see here in the slides, it would take nine minutes, which is also manageable, but you can probably get that further down; but all is OK. Since we don't have build profiles yet, and to get an example which is more or less realistic, in this specific run we use droppable build dependencies which were harvested from Gentoo, extracted from the USE flags of its ebuilds, and from contributions by Thorsten Glaser, Patrick, Daniel, and Wookey.
While this still runs, I can go to the next slide. The to-do list is: to try this out in real life; to add multiarch to more packages where that is necessary; more cross-compilation; and to decide on the build profiles and their syntax, about which there is currently a thread on the mailing list, and then to implement them. We might want some better heuristics; the current ones work well, but maybe there are more. It would be nice to generalize the whole thing to a larger class of problems. I am interested in also doing similar things for, for example, RPM-based distributions or for other package managers, as I heard.
And well, there's still the issue of finding a name. Here in the other window, we see now the output of the whole thing. And this is the build order which would build those initially mentioned 2,000 binary packages and 600 source packages in just 63 steps. As before, all source packages which
are compiled with a build profile are marked with a star. And as one can see further up here, the whole thing can be done by profile-building only 73 source packages. OK, I come to what is probably my last slide.
Here are some resources. You can find these slides in my latest blog post, so that is really all you need to take away from this talk. And you get the mailing list and IRC channel, the main Git repository, the Git repository for the extraction of build profiles from Gentoo, cycles, Dose3, our wiki page, a wiki page really full of stuff, and the current thread on the mailing list. And now my time is over, so no questions. But if there are any, then maybe we can cut into some of the minutes. But in any case.