Libabigail, State Of The Onion
Formal Metadata
Number of Parts | 542
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/61436 (DOI)
FOSDEM 2023, part 342 of 542
Transcript: English (auto-generated)
00:06
OK. Hello, everybody. My name is Dodji. I work in the tools group at Red Hat. And so we are here today to... OK, first of all, thank you for staying.
00:23
So I wanted to talk about application binary interface analysis today. And OK, first of all, who doesn't know about libabigail and ABI stuff? OK, so I think we'll have something for you guys.
00:46
So what are we going to talk about? So first of all, I'll introduce what Abigail is, and we'll look at how it works, what the project brought recently, and what we're looking for
01:04
as far as the future goes. So Abigail is about doing analysis of application binary interfaces. So it's a set of tools that can do things like compare the ABI of two binaries
01:24
or store the ABI of a binary onto a disk format, can do a comparison of binaries that are in packages, like Debian packages, RPM,
01:40
tar files, et cetera, et cetera. And it is also a shared library that you can use to write more tools if you want. So that's all well and nice as far as marketing goes. But then let's look at what we mean by ABI.
02:01
So suppose you have a simple program, well, a simple binary which has a very complicated, let's say, which has three functions that are here. The types of the functions are defined here in a simple hierarchy.
02:21
Here you have the first type, S0, which inherits a base type. And let's say another type here that inherits S0. So that's the first version of it. Let's see if it compiles.
02:44
Yes, it does. Then I have a second version of it, which looks quite the same. But what does it do? What's the difference between the two? Very simple. I just inserted a data member in the base class.
03:02
And we want to know what the impact of this is on the ABI as far as the binary goes. So where am I here? I'm in the source code of the project. And so I've built a version of it. And so here we have one of the tools, whose name is abidiff,
03:21
which does what you think it does. And so if I run it, what does it say? Basically, there are two changes as far as ABI goes in that binary due to the change you've seen. So the first change is about the first function,
03:42
which is here. And so it is telling us, basically, that that function has a parameter type that changed. And the change is about this structure, remember? Something is interesting. The size hasn't changed, even though I've
04:02
added a data member in there. So yeah, you know the drill, right? If you don't, I can explain it more. But size hasn't changed. The base class has changed. And the change here is a data member insertion at a certain offset, blah, blah, blah.
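Why the size can stay the same after a data member insertion can be sketched with Python's `ctypes`, which follows C struct layout rules. The struct definitions from the demo are not in the transcript, so the member names and types below are hypothetical. A real C++ compiler may also lay out derived classes differently from this C-style model.

```python
import ctypes

# Version 1 of a hypothetical hierarchy: S0 inherits from a base type.
class BaseV1(ctypes.Structure):
    _fields_ = [("m0", ctypes.c_int), ("m1", ctypes.c_char)]

class S0V1(BaseV1):
    _fields_ = [("m2", ctypes.c_int)]

# Version 2: one char data member inserted at the end of the base type.
class BaseV2(ctypes.Structure):
    _fields_ = [("m0", ctypes.c_int),
                ("m1", ctypes.c_char),
                ("inserted", ctypes.c_char)]

class S0V2(BaseV2):
    _fields_ = [("m2", ctypes.c_int)]

def offsets(struct_type, names):
    """Map each member name to its byte offset inside struct_type."""
    return {name: getattr(struct_type, name).offset for name in names}

# The new member lands in what used to be padding after m1,
# so neither the total size nor the offset of m2 moves.
size_v1 = ctypes.sizeof(S0V1)
size_v2 = ctypes.sizeof(S0V2)
layout_v1 = offsets(S0V1, ["m0", "m1", "m2"])
layout_v2 = offsets(S0V2, ["m0", "m1", "inserted", "m2"])
print(size_v1, size_v2, layout_v1, layout_v2)
```

The ABI still changed, because code compiled against version 1 treats the byte now used by `inserted` as padding.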
04:22
So this is the impact of the change of the first type on the first interface. And so there is another interface that got impacted. And the parameter of that function, which was struct S1,
04:44
changed as well. Its base class changed. The base class was struct S0. And the details of S0 change were reported earlier. So we don't have to repeat it again.
05:01
So here you see that we compute the changes, and we also analyze those changes so that we can detect if things have been reported earlier or not. And also, we mess up with more stuff,
05:20
because here we say, for instance, that there were two changes, for instance, but one got filtered out. What does that mean? So let's see, for instance, if I recall the,
05:41
OK, I'll add a special option. So I've asked abidiff to show me redundant changes, because by default, it removes redundant changes. And we see that we have the third function that was impacted as well by the change we created.
06:04
And so, well, all the changes that impact function three were already reported. So this is why it was suppressed. That change was suppressed by default, because it was redundant.
06:21
So we're not just diffing things. We're analyzing the diffs, and we're trying to massage those diffs so that they can be consumed by human beings. So this is what we mean by analyzing ABIs, basically.
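The redundancy filtering described here can be sketched as follows. This is a toy model, not libabigail's implementation. The mapping from functions to changed types is an assumption made for illustration, and `report` and `show_redundant` are hypothetical names.

```python
def report(changed_functions, show_redundant=False):
    """Build report lines for functions impacted by type changes.

    changed_functions: list of (function name, [changed type names]).
    A type change is detailed only the first time it is met. A function
    whose type changes were all reported already is filtered out unless
    show_redundant is True.
    """
    seen = set()
    lines = []
    filtered = 0
    for func, types in changed_functions:
        has_new_change = any(t not in seen for t in types)
        if not has_new_change and not show_redundant:
            filtered += 1
            continue
        lines.append(f"function {func} changed:")
        for t in types:
            if t in seen:
                lines.append(f"  type {t} changed, as reported earlier")
            else:
                lines.append(f"  type {t} changed")
                seen.add(t)
    return lines, filtered

# Hypothetical impact of inserting a data member in the base class.
changes = [
    ("func1", ["S0", "base"]),
    ("func2", ["S1", "S0"]),
    ("func3", ["S1"]),
]

default_lines, default_filtered = report(changes)
all_lines, all_filtered = report(changes, show_redundant=True)
print("\n".join(default_lines))
print(f"{default_filtered} change(s) filtered out")
```

With the default settings, `func3` is dropped because everything that affects it was already detailed for `func2`.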
06:43
So how it works. Libabigail, so the library used to implement the tools, has a front end, which is kind of backward. The front end reads the binary. Usually, it is back ends that write binaries, but here, it's backward.
07:00
So we read the binary, which has to be in the ELF format right now. And we build an internal representation of it. We look at the publicly defined and exported symbols of declarations, basically functions and variables.
07:21
We build a representation of them and their types. And then we construct the graph of the types like that and their subtypes. And we pull all that together, and we call that an ABI corpus. A corpus is an artifact for us that represents the ABI of the binary we were looking at.
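As a rough mental model of a corpus, here is a small sketch: exported declarations plus the graph of types reachable from them. The class and field names are hypothetical and much simpler than libabigail's actual internal representation.

```python
from dataclasses import dataclass, field

@dataclass
class TypeNode:
    name: str
    # Types this one refers to: base classes, data member types, and so on.
    subtypes: list = field(default_factory=list)

@dataclass
class FunctionDecl:
    symbol: str            # name of the exported ELF symbol
    parameter_types: list  # TypeNode objects

@dataclass
class Corpus:
    path: str
    functions: dict = field(default_factory=dict)

    def add_function(self, decl):
        self.functions[decl.symbol] = decl

    def reachable_types(self):
        """Names of all types reachable from the exported functions."""
        seen = set()
        stack = [t for f in self.functions.values() for t in f.parameter_types]
        while stack:
            node = stack.pop()
            if node.name in seen:
                continue  # also protects against cycles in the type graph
            seen.add(node.name)
            stack.extend(node.subtypes)
        return seen

# A hierarchy shaped like the one in the demo: S1 -> S0 -> base.
base = TypeNode("base")
s0 = TypeNode("S0", [base])
s1 = TypeNode("S1", [s0])

corpus = Corpus("libdemo.so")
corpus.add_function(FunctionDecl("func1", [s0]))
corpus.add_function(FunctionDecl("func2", [s1]))
print(sorted(corpus.reachable_types()))
```

Only types reachable from exported declarations matter here, which is why the walk starts from the functions rather than from a list of all types.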
07:42
And so there is a middle end that acts on that internal representation. Said otherwise, it acts on ABI corpora, corpora being the plural of corpus in Latin. Let's be pedantic. So we can, as you've seen, compare two instances of ABI
08:03
corpus. Then we build an internal representation of the results of the comparison. We call that a diff IR, so it's a different IR. And then we perform transformations on that diff IR, like categorization.
08:21
So we would walk the graph and say, OK, this diff node, we've seen it before. So we'll mark it as being redundant to this other one. And then there can be transformations that are suppressions as well. Well, suppression. We will mark the nodes as being suppressed, for instance,
08:44
because the user wrote something that we call a suppression specification file, requiring that some types of changes should not be reported. So once we have that well-massaged diff IR,
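The categorization and suppression passes can be sketched like this. Suppression specifications are INI-style files. The sketch assumes a single `[suppress_type]` section with a `name` property and ignores every other property the real format supports. The diff nodes are plain dictionaries, and the function names are hypothetical.

```python
import configparser

SPEC = """
[suppress_type]
name = S0
"""

def load_suppressed_type_names(text):
    """Collect the type names listed in [suppress_type] sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    names = set()
    for section in parser.sections():
        if section == "suppress_type" and parser.has_option(section, "name"):
            names.add(parser.get(section, "name"))
    return names

def categorize(diff_nodes, suppressed_names):
    """Mark each diff node as suppressed and/or redundant, in place."""
    seen = set()
    for node in diff_nodes:
        node["suppressed"] = node["type"] in suppressed_names
        node["redundant"] = node["type"] in seen
        seen.add(node["type"])
    return diff_nodes

def emit(diff_nodes):
    """A back end: keep only the nodes that should reach the report."""
    return [n for n in diff_nodes
            if not n["suppressed"] and not n["redundant"]]

nodes = [{"type": "S0"}, {"type": "S1"}, {"type": "S1"}]
suppressed = load_suppressed_type_names(SPEC)
reported = emit(categorize(nodes, suppressed))
print(reported)
```

The passes only annotate nodes. Whether to show annotated nodes is left to the back end, which is what allows an option to display redundant changes again.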
09:03
we have backends that walk that diff IR, obviously, or the initial IR, and do useful stuff, like emitting reports, for instance, or emitting the representation of the ABI corpus
09:22
in a disk-saved format that we call ABI XML. So what we've done recently. I'm going a bit fast, to leave time for questions and stuff, and we can go on with a, let's say,
09:42
not very structured discussion afterwards, if you like. So yeah, in the recent times, what we've done is, well, you know DWARF. You know that it changes all the time with new versions of DWARF producers. So with GCC 11 and LLVM 14, the default DWARF version
10:06
was bumped to the version 5, which is quite ancient, actually. I think it was released in 2017 or something. So yeah, we support most of that right now. And another major thing that's happened recently
10:24
was that, thanks to folks in this room (don't worry, I won't give your names), new debug info formats were added, because we started with DWARF only.
10:40
And so the CTF debug info format support was added to libabigail. So basically now, if you have a binary having CTF and/or DWARF, you can choose whatever you want to use as a source of type information.
11:01
So things being how they are, the code got changed a bit to be turned into a multi-frontend architecture. We also have a multi-backend architecture, basically, because we have different types of reports.
11:23
The one I've shown you is the default one, which is quite verbose. So some people like it more terse. And who knows whatever weird request users might come with in the future. So yeah, different report backends.
11:42
And well, it doesn't stop there. We are still working on new stuff, well, coming from user requests. So yeah, apparently the new kids on the block, well, the new kid in town now, the cool stuff, is BPF, right?
12:02
And with BPF comes BTF, which is the type description format of BPF. And so there were some requests to support that. So it is now in mainline, well,
12:23
in libabigail mainline. But it's not released yet. It should be released in the next version. So what do we do with that? What's that thing? Basically, because BTF describes the C types, basically,
12:40
we are using that to compare the interface exposed by the kernel to its modules. So we're doing that with CTF already, with BTF now, and also with DWARF. With DWARF, it is much less fast, shall we say,
13:02
than with the CTF support and BTF. So people are using that feature to analyze the kABI, basically, the kernel ABI,
13:20
that thing that doesn't exist. And then we've had weird project-specific requests over the years. And the last one, which came in last month, in January, was to have a,
13:43
I call that the library set ABI analysis. So basically, it's a project that has a huge library, shared library. And they're planning to split it in different libraries. But then they keep ABI compatibility.
14:03
They're supposed to. And so they would like to ensure that the set of broken down libraries has an ABI that is equivalent or compatible with the first initial one. This is what I call the library set ABI analysis. So we're going to add support for that in,
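One way to picture the library set analysis is to compare the exported symbols of the original library against the union of the symbols of the split libraries. This is only a sketch of the idea, not how libabigail does or will do it. A real analysis would also compare the types behind the symbols, and all library and symbol names below are made up.

```python
def compare_library_set(original, split_libraries):
    """Compare one library's exported symbols to those of a set of libraries.

    original: set of symbol names exported by the initial big library.
    split_libraries: dict mapping library name -> set of exported symbols.
    """
    combined = set().union(*split_libraries.values())
    owners = {}
    for lib, symbols in split_libraries.items():
        for sym in symbols:
            owners.setdefault(sym, []).append(lib)
    return {
        "missing": sorted(original - combined),     # breaks compatibility
        "added": sorted(combined - original),       # usually harmless
        "duplicated": sorted(s for s, libs in owners.items() if len(libs) > 1),
    }

big = {"open_db", "close_db", "render", "parse"}
split = {
    "libcore.so": {"open_db", "close_db"},
    "libui.so": {"render", "close_db", "theme"},
}
result = compare_library_set(big, split)
print(result)
```

In this made-up case, `parse` got lost in the split, `theme` is new, and `close_db` ended up exported by two libraries.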
14:25
I don't know if it is going to be in the next version or not. So yeah, these are the kinds of things we are working on. So yeah. And now I'll let you ask questions, if you have any.
14:44
Does the library have any support for language-specific ABIs? Some languages are built on top of C, for example, but they have their own name mangling schemes. Yeah, exactly. So yes. So of course, DWARF is multi-language.
15:04
So if the compiler of that language emits DWARF, then we're good to go. There is a small layer of language-specific stuff we add for reporting so that we can report stuff in the native language of the programmer who wrote the thing.
15:23
So to give you a concrete example, right now we support C++, C, Fortran. Someone asked me for Rust support. So we added that, basically. We have some crashes on OCaml. So I thought we were supporting it too, but I need to do some stuff.
15:40
So yeah. Basically, it needs work, but for the mangling logic. So OK, I can show you. Let me show you an example. So yeah, I was writing.
16:01
So yeah, let's see. So you see, for instance, in C++, we'll compare. So here, you see this function, the function 3? I'll change it in the second version here, function 3, and I'll add an integer here, right?
16:23
Yes, let's, whoops. We compile that. And whoops, weird stuff happened.
16:42
So look at what it is saying here. So you see here, because we are in C++, I changed function 3 in the source code. Yeah, let me just, yeah, see? I changed function 3 here, and I added a parameter.
17:03
That's what the programmer would say. But then, from the binary standpoint, what we're seeing is that the first function was removed, and then another one got added. This is because in C++, the name of the symbols of the two functions, the two
17:25
are different, they have a different mangling, OK? So we go from the name of the symbol to the name of the declaration, right? But if I do the same in C, then,
17:41
like, yeah, I knew you would ask that question. I don't know you, but, and I have a second version here. Boom, boom, boom. And so here, some, oh, sorry, I changed the name of, sorry,
18:07
I changed the parameter of the function there, but this is in C, OK? And so if I go in the, ah, sorry, if I go in the shell, and I look at, boom, the two, so this is, the first one
18:26
was hello, and this one is bye, of course, because I think this is going to be the last C here. Because in C, the names of the two symbols are the same,
18:42
now we say that the function has changed. So these are the kind of things that we'll have to adapt, basically, but there is not much to do. In some cases, you have mangling, and in the other cases, you don't. So you don't have anything to do for the mangling. Does that answer your question?
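The difference between the C++ and C results can be sketched by matching declarations through their symbol names. The parameter types are hypothetical, since the demo's source is not in the transcript. The mangled names follow the Itanium C++ ABI for `void func3(S1)` and `void func3(S1, int)`.

```python
def diff_symbols(old, new):
    """Compare two {symbol name: declaration} tables.

    A symbol present on one side only is reported as removed or added.
    A symbol present on both sides with a different declaration is
    reported as changed.
    """
    return {
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": sorted(s for s in old.keys() & new.keys()
                          if old[s] != new[s]),
    }

# C++: the parameter list is encoded in the symbol name.
cxx_v1 = {"_Z5func32S1": "void func3(S1)"}
cxx_v2 = {"_Z5func32S1i": "void func3(S1, int)"}

# C: no mangling, so the symbol name stays the same.
c_v1 = {"func3": "void func3(struct S1)"}
c_v2 = {"func3": "void func3(struct S1, int)"}

cxx_diff = diff_symbols(cxx_v1, cxx_v2)
c_diff = diff_symbols(c_v1, c_v2)
print(cxx_diff)
print(c_diff)
```

The same source-level edit shows up as one removal plus one addition in C++, and as a single changed function in C.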
19:02
Roughly. Roughly, yeah. Do you have this code, part of the code, which decodes the mangled name to a readable name? No, because the matching is done by DWARF. So we know that this symbol is for this declaration. So we don't have to do the mangling or demangling.
19:21
We look at the addresses, and we know that this symbol is for that one. So yeah, we don't really care about that. Yeah, please go ahead. Oh, there is none.
19:41
No, no, no, no, no, no, no. So yeah, just to repeat the question, what are the performance issues when we analyze big libraries like, he said, LLVM, but there is WebKit, Gecko, et cetera, et cetera.
20:02
So when we're looking at DWARF, we have a fundamental problem, which is the duplication of types. Here we are in the business of comparing things. And so when we compare types, basically, we
20:23
are in the land of quadratic algorithms. So things are inherently slow if we do them naively. And so the thing is, in DWARF, every single type unit is represented.
20:40
But then when you have the final binary, the final shared library, for instance, and you have 1,000 translation units, and in every single translation unit you had the string type, for instance, that was used, then you will have the string type represented 1,000 times, at least in the DWARF.
21:03
And so we must be sure that those 1,000 occurrences of string are one and the same. We can't just look at the name and say they're the same, because they could be otherwise. And so we have to compare them and make sure they're the same, and then we'll say, OK,
21:22
I'll just keep one and throw away the others. This is deduplication of type, it is called. And so this process takes a huge amount of time, which is, well, for huge libraries, it can take forever. So we have heuristics to make this thing be faster.
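The deduplication problem can be sketched on a toy model where a type is a name plus a tuple of members. Grouping by name and then comparing structurally is an illustrative shortcut, not libabigail's actual algorithm. Real comparisons have to recurse through the whole type graph, which is what makes them expensive.

```python
def deduplicate(type_definitions):
    """Keep one copy of each distinct type definition.

    type_definitions: iterable of (name, members) pairs, one per
    occurrence in a translation unit. Two occurrences with the same
    name are merged only if their members compare equal.
    Returns (canonical, comparisons), where canonical maps each name
    to the list of distinct definitions found for it.
    """
    canonical = {}
    comparisons = 0
    for name, members in type_definitions:
        candidates = canonical.setdefault(name, [])
        for known in candidates:
            comparisons += 1
            if known == members:
                break  # same type: throw this occurrence away
        else:
            candidates.append(members)  # first or genuinely different
    return canonical, comparisons

string_type = (("data", "char*"), ("size", "size_t"))
other_string = (("data", "char*"), ("size", "int"))  # same name, other type

# 1,000 translation units define the same string type; one differs.
occurrences = [("string", string_type)] * 1000 + [("string", other_string)]
canonical, comparisons = deduplicate(occurrences)
print(len(canonical["string"]), comparisons)
```

Even in this simplified form, every occurrence costs at least one comparison, and the name alone is not enough to decide that two types are the same.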
21:46
But then it takes time. Some of the heuristics that we're using now are in the land of partitioning, like we will do things piecewise
22:04
so that we can do things in parallel. It is not mainline yet, but this is the future we're thinking about. Another approach is to have the types be deduplicated
22:20
before we intervene. This is what, for instance, the CTF guys do with C. So they will do the deduplication at debug info production time. And then in that case, we're golden. There is another case where we're doing that is when we are building distribution packages,
22:42
like for instance, RPM or Debian package or whatever, there is a tool which is called DWZ, which does the deduplication to an extent. Well, when it works, it works. It does the deduplication. But the problem is DWZ has the same issue as us.
23:01
And sometimes when the binary is too big, DWZ will just give up. And in that case, well, we have to use our little hands and do the deduplication inline. And then, well, we'll spend time. That's because someone should get DWZ, turn it into a library, and put it in the linker.
23:21
Yes. And yes. Do it at link time. Yeah, we can. Yeah, that's one of the things that we need to do to improve the entire ecosystem of these things. And yeah, that's definitely, yeah. So if someone has some free time, yeah. So yeah. So, do we have other questions, or?
23:49
So are there any other formats that are on your roadmap? Right now, no. But like three months ago, BTF was not on my roadmap. So the future is not what it used to be.
24:02
So I don't know. Anyway, so yeah, we are hosted on Sourceware. We still use mailing lists. You send us patches. And yeah, you can find us on IRC on the OFTC network. And well, thank you very much.