
Automated SBoM generation with OpenEmbedded and the Yocto Project


Formal Metadata

Title
Automated SBoM generation with OpenEmbedded and the Yocto Project
Subtitle
A case study of automated SBoM generation in meta build systems
Author
Joshua Watt
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
SBoMs are becoming a critical component in ensuring the integrity of our software supply chains. Many current tools for SBoM generation focus on two ways of generating SBoMs: generating them from the initial source code, or post-mortem analysis of completed systems and artifacts. While these are both valid and useful methods of analysis, less focus has been put on the tooling that pulls upstream source code together and generates the completed system artifacts, such as a distro build system or, more generically, any "meta-build" system. Using OpenEmbedded as a case study, Joshua will cover the unique strengths that generating SBoMs in meta-build systems can provide, as well as the challenges of doing so.
Transcript: English (auto-generated)
All right. Hi, my name is Joshua Watt and I'm here to talk to you today about automated SBOM generation, using a case study of the way that we generate SBOMs in OpenEmbedded. A little bit about me: I've been working at Garmin since 2009, and we've been using OpenEmbedded and the Yocto Project to do embedded system development since 2016.
I'm a member of the OpenEmbedded Technical Steering Committee and there's all the ways you can contact me if you're interested later. I'll post my slides after my talk. So we're all hopefully familiar with what an SBOM is. We use it to describe what software components we have in our system, what we know about them, what we don't know about them, and importantly what the relationship between them is.
So why are SBOMs important? If we're using software ourselves or allowing other people to use it or shipping it to customers or whatever we're doing with it, we want to know what's in our software. We want to know where it came from, what versions those things are at a very minimum. If there's software licenses, we want to know if we need to do anything to comply with them or things like that or make sure they're
not being used improperly. We don't want to expose ourselves or people using our software or customers or whatever it is to risk by having software that's been tampered with either maliciously or unintentionally. And we also want to know if any vulnerabilities come up after it's shipped so that we can fix them if necessary
or if it's vulnerable to exploit. And really the question that we want to know is, can we trace the binary things that we have given to people back to the source code that produced it? Often when we talk about SBOMs, we talk about them as being nutrition information for software. And I really do like this analogy. I think it
easily encapsulates something that everyone is familiar with, which is a standardized way of encoding something. For SBOMs, we're trying to standardize the way that we encode information about software, just like nutrition labels try to standardize the way that we communicate what's in our food. So most people can look at a nutrition label
and have an understanding of how it works. So we want SBOMs to be the same way. You can look at the SBOM, and it's a way of encoding what we have. I think this is a great analogy, but it is missing a few key pieces. And the pieces that it's missing are really the supply chain part of the analysis. So it can tell
us what's in our software, just like a nutrition label tells us what's in our food. But it doesn't tell us how it got there, right? A nutrition label isn't saying, this grain came from here or whatever. And that's the part that we're sort of missing with SBOMs that we would like to know. And that's what this talk is about.
So I don't have a nice analogy for how to communicate a supply chain that's like the nutrition label. But I do come from a consumer manufacturing background, so I do understand supply chains. So we can relate software supply chains to physical supply chains. And when you have physical supply chains, so you're making some consumer
electronics, you've got all these steps along the path of getting the completed product. And you need to know where every component comes from to make sure that all the right components are in the right place at the right time to be manufactured. You need to know what's being combined in every step for the same reason. And you need to know where this combination takes place, because in
modern supply chains, these steps can be spread out geographically all over the world. And they can also be spread out over time. So if you produce 10,000 of one thing and then put it in storage, and then you pull those out, you might need to know are these 10 years old? Are these five years old? How old are these parts? When were they
manufactured? And when we talk about software supply chains, we have basically the same questions. We need to know where all the components that are in our supply chain came from. However, in this case, we're usually talking about things like source code that we've compiled, and then the tools that we use to compile it instead of physical components.
We need to know what has been combined at each stage. Did we take this library from this other project and put it into what we're currently working on? You know, does it pull in some dependencies from somewhere? Things like that. We need to know where this combination takes place, although we're probably less concerned with the physical location as
much as the build host that's doing the combination. And who did it? Potentially we would like to know who did this step of our supply chain. And then, you know, when did it occur? Was the software compiled 10 years ago? It's probably got vulnerabilities that we should take a closer look at.
To help answer some of these questions, SPDX has a build working group that's been working on the build profile, and it will hopefully be releasing with SPDX3 in a couple months, or whenever that is soon. And it's designed to answer the questions of when a build was done, so it can record timestamps for when builds happen.
Who wanted a build done? So this is going to be the person who initiated the build, or wanted the build done, or did the build themselves, depending on the circumstances. And that's distinct from who actually performed the build, which might be, it could be a person if they're manually, you know, typing in the command to do the build, or
it could be a service like GitHub Actions or something like that, and that's why we have the two different who steps in there, or two different who elements in there that distinguish between the person who clicked the button and GitHub Actions and GitHub Actions that actually did the build, or whatever your service is. So how the build was done, so this is going to be tool specific information about how the build was performed,
like the command line arguments or things like that. It's important to note that the build and runtime dependencies are already actually captured by the SPDX core specification, so we don't include those explicitly here, but they're already included. Where the build was done, so this is going to be the build host,
the computer on which the build was performed. So this might be as complicated as an entire other software bill of materials, if you have one that describes the system you're building on, you could link into that and know all the information about the build host also, but also would capture the tool use, like if you have compilers or, you know, host tools or things like that.
And then the what you're building is already covered by the SPDX core profile, because it can describe packages and files and things like that. So one of the key points is that it's really important to try to generate
build SBOMs at actual build time. And to try to explain that a little further, I'm going to compare generating SBOMs at build time against two other ways that SBOMs are commonly generated, although these aren't the only other ways. So there's source SBOMs, which are generally something like REUSE, something that's just included with the source code.
And then you've got post-mortem SBOM analysis. So this would be the tools that run after you have the final artifact to try to scan it and say you're vulnerable to these vulnerabilities and things like that. Try to determine information after from the final artifact.
Obviously I'm trying to say that we should generate build SBOM information at build time, so I'm not trying to say that the other two things are just terrible and never use them. They all have their strengths and weaknesses. I'm just trying to explain why I think we should build them at build time. So we talk about when something can be built. Source SBOMs obviously can't know this,
because they're not worried about when something is actually built. Build SBOMs should be able to figure this out from when the thing is built. You can record timestamps pretty easily. And post-mortem analysis may or may not be able to figure this out. It just depends on whether that information happens to be encoded in whatever you've produced. What about build time dependencies? Source SBOMs might be able to capture this.
If you're talking about something like cargo or NPM that explicitly encodes specific versions of dependencies in the source code, you could very easily know what the build time dependencies are. Otherwise, if you're talking about shared libraries or something, it might be able to know those, but it wouldn't know them concretely. So you'd know, like, I need OpenSSL, but you wouldn't necessarily know the specific version of OpenSSL
that it built against. At build time, you should be able to know all of this; you basically have to in order to correctly build the software. For post-mortem analysis, you might be able to figure it out, probably with some sort of heuristic.
And static libraries are always very problematic with this. It can be very difficult to tell if a given executable has a static library in it or not, because it's not recorded anywhere in the executable. So those can always be very tricky to trace back to their origin. Run time dependencies are a somewhat similar story. So source SBOMs, you could know what they are,
but probably not concretely again. Build SBOMs, you could know this if you're doing complete packaging. So if you're generating final packages like Debian packages or Fedora packages or OPK packages or whatever,
you could know this, know what these runtime dependencies are, even concretely. And for post-mortem analysis, for shared libraries, you can actually figure this out pretty easily, because it's in the ELF header. But for anything that's dynamically loaded at runtime, like if you do dlopen or something like that,
you probably can't figure that out very easily with post-mortem analysis. In your build environment, obviously source SBOMs don't care about this. Build SBOMs, you should be able to know this information, and for post-mortem analysis, maybe you could figure that out. I don't know, if it was encoded in the executables, maybe some of that information could be known.
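As a concrete illustration of what post-mortem tooling can and cannot see, here is a minimal sketch, not from the talk, that reads the shared-library dependencies recorded in an ELF binary's dynamic section; it assumes the third-party pyelftools package is installed, and it will not notice anything loaded via dlopen:

    # Recover DT_NEEDED (shared library) entries from an ELF binary.
    from elftools.elf.elffile import ELFFile

    def needed_libraries(path):
        with open(path, "rb") as f:
            elf = ELFFile(f)
            dynamic = elf.get_section_by_name(".dynamic")
            if dynamic is None:
                return []  # statically linked, or not a dynamic executable
            return [tag.needed for tag in dynamic.iter_tags()
                    if tag.entry.d_tag == "DT_NEEDED"]

    # Output depends on your system, e.g. ['libc.so.6', ...]
    print(needed_libraries("/bin/ls"))

A static library folded into that same binary leaves no such record, which is exactly the problem described above.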
There's a couple of advantages for generating supply chains from your build tools at build time. I like to say that they're authoritative because they have first-hand knowledge, because they're the ones actually doing the build, so they should know what's actually happening at each step.
Likewise, they're very accurate. There shouldn't need to be a lot of guessing from your build tools about what's going on at each step, unlike the post-mortem analysis, which tends to use a lot of heuristics or things like that. And they're comprehensive. They can analyze a lot of different steps in your build,
especially if your software supply chain is very deep, which I think it is for a lot of things. And so they can generate a lot of information about your builds, as we'll see later. And they're also able to analyze things that are difficult, if not impossible, to analyze at other steps, like particularly static libraries can be very difficult to trace back,
at least as far as I know. So what kind of things could generate this information? So starting from the top down, at the highest level, you'd have things like container build systems. So this would be like Docker Build or Buildah or something like that. As you move down, you kind of get into what I call the meta or distro build systems.
This would be OpenEmbedded, which is what I'm going to give an example of in just a few minutes. Debian, Fedora could generate this every time they generate packages. And then if you go down a further step, you've got the package build systems. It's not a great name for them. But this would be things like Meson or CMake or Autotools. They could all generate this information also,
with what they know about builds. And you could go down an even further step and say, well, maybe GCC should spit out this information. And maybe it should. That's also something that could happen. And then it could sort of flow up the build stack, as you go.
So I'm going to give an example of what we do in OpenEmbedded to generate SBOMs. And if you are unfamiliar with OpenEmbedded and the Yocto Project, OpenEmbedded is a community-driven project that provides the OpenEmbedded-Core layer and the build system, which is called BitBake.
The Yocto Project is a Linux Foundation project that provides the Poky reference distribution and runs a whole bunch of QA tests to make sure everything stays high quality. It manages release schedules, provides funding for personnel to work on the project full-time, provides servers and things like that, and provides excellent documentation.
You should go check out our documentation. And the purpose of these projects is to build primarily, but not exclusively, embedded systems. So we do have our traditional image you could flash on a Raspberry Pi up there. Colloquially we call these target images. But we actually can produce images for a whole bunch of different things that I'm not going to go into in great detail here.
I've got a bunch of other presentations on this that I have links to later. So when people want to build stuff with OpenEmbedded, what they start with is they have some source code, and they have some metadata, and they have some policy information, and they chuck all of this into this magical tool called BitBake,
and it spits out this target image that we talked about, and then you flash that on your widget and profit, right? It's great. A little deeper under the hood, the way that this works is that we start off with some host tools.
So this is the minimal set of things that you need to build with BitBake. So this is going to be your host GCC, Python, and a couple other fairly standard dependencies that run on your host. And we're going to take those host tools, and we're going to parse some recipe metadata that says how to build some source code.
And that source code is going to be used to build what we call the native tools and the cross compiler. So the native tools are still tools that are designed to run on your host system, and then we also build the cross compiler at the same time. An example of this might be the protobuf compiler; we actually build that ourselves and don't require you to provide your own.
We also build our own cross compiler, so you don't even need a cross compiler on your host system. We then use those native tools and cross compiler to process more recipe metadata that's going to take some more source code in, and this is actually going to cross compile and build what we call your target packages that are designed to run your final system,
be it x86 or ARM or MIPS or PowerPC or RISC-V or whatever it is. And then we process yet some more metadata, and this one says how to combine all these target packages to make your root file system and your kernel and all of these other things that you need to actually have your target image.
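In case it helps to see what that recipe metadata looks like, here is a hypothetical minimal recipe, not one taken from the talk; every name, URL and checksum placeholder below is illustrative only. The point is that the build system already holds exactly the information an SBOM needs: where the source comes from, its license, and what it depends on.

    # libfoo_1.0.bb -- hypothetical recipe, illustrative only
    SUMMARY = "Example library"
    LICENSE = "MIT"
    LIC_FILES_CHKSUM = "file://LICENSE;md5=<md5 of the license text>"

    SRC_URI = "https://example.com/releases/libfoo-${PV}.tar.gz"
    SRC_URI[sha256sum] = "<checksum of the source tarball>"

    # Build-time dependencies: a native tool and a target library
    DEPENDS = "protobuf-native openssl"

    inherit autotools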
The way that BitBake keeps all of this sane and tracks the dependencies is that it uses a sophisticated method of hashing, where each step along the way, called a task, has a hash that is the encapsulation of all of the dependencies of that task,
all of the variables that affect that task's execution, and all of the code that it's actually executing. And then that gets combined into a single hash, and then that hash then is the input as a dependency to every task that depends on that one. So you get this chain of hashes all the way down following from your recipes that you start with to your target image.
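As a rough sketch of the idea (a deliberate simplification for illustration, not BitBake's actual implementation): each task hash covers the task's code, the variables it uses, and the hashes of the tasks it depends on, so any upstream change ripples through every downstream hash.

    import hashlib

    def task_hash(code, variables, dep_hashes):
        h = hashlib.sha256()
        h.update(code.encode())
        for name in sorted(variables):                 # variables affecting the task
            h.update(f"{name}={variables[name]}".encode())
        for dep in sorted(dep_hashes):                 # hashes of dependency tasks
            h.update(dep.encode())
        return h.hexdigest()

    protobuf_native = task_hash("do_compile ...", {"PV": "3.21"}, [])
    target_pkg = task_hash("do_compile ...", {"PV": "1.0"}, [protobuf_native])
    rootfs = task_hash("do_rootfs ...", {}, [target_pkg])
    # Changing anything about protobuf_native changes target_pkg and rootfs too.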
So what happens is if, for example, something about the protobuf recipe changes, that's going to change the task hash for that recipe, that's going to cause that protobuf tool to be rebuilt, and that's also going to change all of the downstream hashes that depend on that, all the way through any native tool that depends on that, any target packages that depend on that,
and all the way through the root file system that indirectly depends on that. And so all of those things will be rebuilt by BitBake when you change that particular thing. And so just because of this hashing mechanism, OpenEmbedded and BitBake start out with a very strong software supply chain
because we have these very strict rules about how these hashes change and this causes everything to be rebuilt, and so you can actually trace it back from your target image to the target source code that produced all your target packages, and you can even trace that back to your cross compiler and your native tools that we built, and there are ways in other presentations I've done that you can see
that you can even trace this back to your host tools if you really wanted to do that and have that really deep supply chain. So basically what we do in OpenEmbedded is at each step along the way here when we're building something, while we're building it, we also spit out this SPDX document that says,
this is what we did here at this step. And then at the very end, we take all of the SPDX documents that went into our target image, or native tools that were used to build a target image, and we put them all into one big archive. And we have a rich set of dependencies that we actually report when we do this.
I'm not going to get into too much detail here. Again, there's other talks I've given that you can see that describe all of this in more detail, if you're interested. And these are those talks, if you want to see those. And these talk a lot more specifically about OpenEmbedded and SBOMs.
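As an aside, in recent OpenEmbedded/Yocto releases this SBOM generation is switched on with the create-spdx class; a minimal sketch of the configuration follows, assuming the class and variable names have not changed in your release (check the Yocto documentation for the release you use):

    # conf/local.conf
    INHERIT += "create-spdx"
    # Optional knobs (assumptions -- verify against your release's documentation):
    SPDX_INCLUDE_SOURCES = "1"
    SPDX_ARCHIVE_PACKAGED = "1"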
So when you do this, we can currently generate SPDX 2.2 JSON format. And I did this for a minimal QEMU AArch64 system. So the root file system was 14 megabytes uncompressed, the Linux kernel was 20 megabytes,
and we had 158 megabytes of SPDX document. It's a lot. I was actually going to post up the archive. So it's a lot of data. And we're not even reporting on everything yet.
Some of that is the JSON encoding and things like that, but it's just a ton of data. So the question is, do we really need all of this? There's a lot of stuff to lug around. And I think you can harken back to that nutrition information. As a consumer of a given food product, wheat is wheat, right? I don't necessarily care how the wheat got into my crackers
or whatever it is that I'm eating. But if I'm a manufacturer of that food and I need to track something down that went wrong somewhere to do a recall or something, then I really care where that came from. And so I think the same analogy could probably be made.
It's possible your end consumers don't really care about your software supply chain, but if you're manufacturing something or building something, you probably definitely do so you can trace down problems and things like that. And there's always the possibility that there could be regulatory requirements for this in the future.
That seems to be a thing that's happening now. So if you're trying to track down a supply chain attack or something's gone wrong somewhere, then you probably will definitely want this information. So if you work on a tool that does something that looks like building,
please consider adding build profile support to your tool. It's actually really not that hard. For OpenEmbedded, we already had all of this information. It was just a matter of encoding it. Hopefully it was somewhat clear from what we were doing here. We already had all this information. It was just a matter of writing out the document that had it and then combining it at the end.
And with SPDX3, the combination at the end is going to be a lot better than it was with 2. And that's all I had. Are there any questions? Yes, you have many megabytes of SPDX. So I suppose you don't have a big file, but you have multiple files related with SPDX relationships?
Yeah, so we have a whole bunch of documents. So we're not generating one big SPDX document. We've got a whole bunch of small documents. Yes, we do that.
We use a whole bunch of external document references. And then they're in SPDX2, and this will be better in SPDX3. But in SPDX2, there isn't a standardized way to combine documents together. So we just throw them all into one big tarball. It's not the greatest, but it does put them all in one file for consumption.
At the point of build, they do exist on the file system as individual documents that you could package up however you want. It's just for ease of our end users. It's easiest if we just put them all in one big tarball and they can extract it and do whatever they want with it. But yeah, a whole bunch of external document references in our output.
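For illustration, an external document reference and a cross-document relationship in SPDX 2.2 JSON look roughly like this; the identifiers, names and checksum value are made up, and required fields such as documentNamespace and creationInfo are omitted for brevity:

    {
      "spdxVersion": "SPDX-2.2",
      "SPDXID": "SPDXRef-DOCUMENT",
      "name": "recipe-libfoo",
      "externalDocumentRefs": [
        {
          "externalDocumentId": "DocumentRef-recipe-protobuf-native",
          "spdxDocument": "http://spdx.org/spdxdocs/recipe-protobuf-native",
          "checksum": {
            "algorithm": "SHA1",
            "checksumValue": "da39a3ee5e6b4b0d3255bfef95601890afd80709"
          }
        }
      ],
      "relationships": [
        {
          "spdxElementId": "SPDXRef-Package-libfoo",
          "relationshipType": "DEPENDS_ON",
          "relatedSpdxElement": "DocumentRef-recipe-protobuf-native:SPDXRef-Package-protobuf-native"
        }
      ]
    }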
Yes, I will post the slides. Yes, I will do that.
So there's an option you can turn on in your build that will, I don't remember if we turn it on by default. It's very easy to turn on. So you turn it on and then you just get this tarball as part of your build. So what happens in OpenEmbedded is you generate a file system like my file system and that's your root file system. And then alongside that, there will be my file system
.spdx.tar.gz or whatever it is. I forgot off the top of my head. But yeah, it's that simple. I think I'm not the answer given the size of it. So you don't deliver the SPDX as part of the image? No. We do not deliver the SPDX as part of the image.
So is SPDX not providing some integrity to an end consumer to say, I've got this SPDX, I've got the image, the two things are aligned. Right, so how do you trace the SPDX back to the image? So there's extensive checksumming in the SPDX itself.
Every file in that root file system is going to be expressed in the SPDX, and the SPDX will have its checksum. At the file level. So you can say /usr/lib/foo.so, and then I go look at the SPDX: are the checksums the same? Then they're valid.
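A minimal sketch of that check, assuming the SPDX 2.2 JSON layout and using hypothetical file and document names (the real output is split across many documents, so you may need to search the whole archive for the one describing a given file):

    import hashlib, json, os

    def file_matches_spdx(spdx_json_path, rootfs_dir, file_name):
        # file_name as recorded in the SPDX document, e.g. "./usr/lib/foo.so"
        with open(spdx_json_path) as f:
            doc = json.load(f)
        for entry in doc.get("files", []):
            if entry["fileName"] == file_name:
                recorded = {c["algorithm"]: c["checksumValue"]
                            for c in entry["checksums"]}
                path = os.path.join(rootfs_dir, file_name.lstrip("./"))
                with open(path, "rb") as binary:
                    actual = hashlib.sha256(binary.read()).hexdigest()
                return recorded.get("SHA256") == actual
        return None  # file not described in this document

    print(file_matches_spdx("core-image.spdx.json", "rootfs", "./usr/lib/foo.so"))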
Who's using the SPDX? Have you created this image? Is anybody using the SPDX? Are you using it? No, I'm not. So he's asking who's using it and the short answer is I don't know.
A lot of people ask questions about how to generate it so I assume they're doing something with it. I don't personally generate this yet but that's just because of where I work.
There is a list of consumption tools available. Go ahead, I think you're next. The supply chain part looks like the SLSA provenance. Did you also look at that solution? Because what I like about that solution is that
it's separate from the SLSA build materials, because it can be consumed in a different way. Did you ever look into that version? Yes, so the question is, did we look at SLSA? Yeah, we had people from SLSA on the build profile working group, so a lot of what SLSA did fed into what we're doing here.
I think that we wanted it to be more closely integrated with the SPDX core profile so that you could say this is all the licensing information and the build information and the supply chain information
so that's why we're including a build profile. I think tooling can come along too if you wanted to later on, like if you wanted to strip out all the build profile stuff because you don't want to ship that gigabytes of data to your customers, sure, I think tooling can come along to do that and that should be fairly trivial. But yeah, that's why we chose to do that that way.
What's the main relationship for the generated SPDX? Is it the recipe for BitBake itself, or is it a component that's generated by the recipe? Sorry, I didn't quite understand the question. So basically you have one standard initial source of information that goes into the SPDX.
The initial source will be the BitBake recipe or the component itself? The initial, right, so the question is where does the initial information come from, and the recipe describes how to build. We report on both actually, so we report on the source code and the recipe
and the thing it built currently so we can do all of those things and I'm done. I'm sorry, I can answer more questions if you want but I gotta...
Hopefully the rest of you will know what to do. And as a reminder we have like chocolate, snacks and things here if anybody wants some.