
Making Sense of so many License Compliance Tools


Formal Metadata

Title
Making Sense of so many License Compliance Tools
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
License Compliance has become big business. Many proprietary tools exist, but fortunately in recent years FLOSS tools to aid understanding the licenses of codebases have been created. This panel includes developers of many of these freely available tools. We'll discuss what these tools do, if they can actually address compliance problems in the wild yet, and the general challenges of developing freely available tools in a heavily proprietarized industry.
Transcript: English (auto-generated)
So, this is our panel on making sense of license compliance tools. I'm moderating; my name is Bradley Kuhn from the Software Freedom Conservancy. I'm going to start off by asking each panelist to introduce yourself, in no more than 30 seconds each: who you are, who you're affiliated with, and what your relationship is with license compliance tools.
Okay. So, my name is Thomas Steenbergen. I work for HERE Technologies. We're part of the team behind the Open Source Review Toolkit (ORT). I'm also involved in SPDX, and I'm also involved in ClearlyDefined.
Hello, I'm Valerio Cosentino, software developer at Bitergia. Bitergia basically creates development analytics for open source, and so I started playing with extracting licenses from code. So this is why I'm here.
Hey, I'm Max Sills. I'm an open source attorney at Google. I manage our inbound and outbound compliance processes, and just recently joined the board of ACT at the Linux Foundation.
My name is Philippe Ombredanne, and I'm both the maintainer of a tool called ScanCode, which is a license detection tool, and of several other open source license compliance-related tools, and the CTO of a small software company called nexB.
Hey, my name is Michael Jaeger. Along with other people, I'm a maintainer of FOSSology, which is a Linux Foundation project for scanning for licenses, and of SW360, which is an Eclipse Foundation project providing a component inventory. I am employed at a German engineering company called Siemens, and in my free time I also give trainings about tools.
Okay, so let's just leave the mic at each end, and that way it'll minimize the mic passing. So our first question for the panel is: what GPL or other license compliance problems do you believe that compliance tools can solve for users?
Did you say GPL and other license compliance problems?
I didn't mean... you know, I'm obsessed with GPL, so I'm going to focus on that. But any license compliance problem: what's your laundry list of license compliance problems you believe tools can solve, and how do they solve them?
Actually, as I see it today, the GPL, since you have named it, is not a problem for most organizations, and we are really asking, like, what's your problem actually with the GPL? We have other kinds of problems, like on some FOSSology server, on some instance maybe.
We have 50 packages which contain clauses like: this is proprietary information, this is confidential information, this is a file for which you need the written permission of the copyright owner. Files like these have slipped into open source projects, accidentally maybe. Maybe some of these famous cases have been solved, but still, in the past three years we have experienced 50 cases of files which are just not for commercial use, according to their licenses.
And how does the tool help people actually comply with all those licenses?
You scan it, and then you find it out. That file is actually part of the open source distribution, and you're probably compiling it into your product if you didn't really look for it.
So, I think that's probably one of the benefits of FOSSology, that it finds these licensing-relevant statements which prohibit you from commercial use.
Right. So, it's really about discovery. It's helping people discover what licenses are there so they can read them and comply.
Right.
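The discovery step described here, scanning files for statements such as "proprietary information" or "not for commercial use", can be illustrated with a minimal sketch. This is not how FOSSology or ScanCode work internally; the phrase list and the names `RESTRICTIVE_PATTERNS` and `find_restrictive_statements` are assumptions made up for this illustration, and real scanners rely on much larger curated rule sets.

```python
import re

# Hypothetical phrase list, for illustration only. Real scanners use
# far larger, curated rule sets and full license text matching.
RESTRICTIVE_PATTERNS = {
    "proprietary": re.compile(r"\bproprietary\s+information\b", re.IGNORECASE),
    "confidential": re.compile(r"\bconfidential\b", re.IGNORECASE),
    "written-permission": re.compile(r"\bwritten\s+permission\b", re.IGNORECASE),
    "non-commercial": re.compile(
        r"\bnon-?commercial\b|\bnot\s+for\s+commercial\s+use\b", re.IGNORECASE
    ),
}


def find_restrictive_statements(text):
    """Return the sorted labels of all restrictive patterns found in text."""
    return sorted(
        label
        for label, pattern in RESTRICTIVE_PATTERNS.items()
        if pattern.search(text)
    )


header = "/* This file contains proprietary information. Not for commercial use. */"
print(find_restrictive_statements(header))
```

A hit from a scan like this is only a flag for a human to review; as the panelists go on to say, the tool finds the statement and people decide what it means.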
Okay. Philippe?
To me, the biggest problem is that most developers don't know what third-party code they use. As weird as it sounds, it's a bit as if you were building a car but forgot where you bought the engine from and where the brakes came from. And that's still the case. So the value of the tools is helping you figure out where the code comes from, if it's third-party code. That's the first thing. Then, what's the license? Those are pretty essential things to know before being able to use any bits of code, whether it's proprietary or free/libre and open source software.
Whichever point of view you're coming from, if you don't know that, it's a bit crazy. But we're still very much at the basics these days.
Right. So, it sounds like your answer is very similar to Michael's, that it's about discovery. The tools help you discover what's there.
Yeah. Not only, but that's the essential part.
Okay. Max?
Yeah, I'm going to channel you. I'm supposed to be the impartial moderator. Okay, well, all right. I'm impartial, but. Which is, I don't think tools really do a good job of compliance. And I think there's this fantasy, or I guess this industry push towards tooling. Like, let's just get the best tooling. Let's use the best system and not worry about open source compliance as kind of a substitute for knowledge.
So, tooling is really great as a start, for example when you want to see what the license of some source code is. But it doesn't work without some knowledge of how to actually interpret the terms of the license, or without a thoroughly documented process for building software.
How do you build the software? Where do you store the code? Basic stuff like, what's the linking style of something? Yeah, that's all I was going to say. I think that there's that fantasy out there.
Well, in my effort to be an impartial moderator, I'll push back on that a little bit. I think your other two panelists who have already answered would say, but discovery is the first step. You've got to discover what's there first to be able to do any of the stuff you talked about. Would you agree with that, or do you disagree?
Yes, discovery is definitely the first part. But I guess there are two camps of tooling. ScanCode is a wonderful tool; we use it extensively. There's the camp of tooling that wants to use scanning and tooling as an input to a larger process: you've talked to people, human beings have agreed on social behaviors around code. And then there's the camp, like the Black Duck people and the compliance industrial complex, where it's: let's use tooling as a substitute for thought. Let's not even investigate the software that we're using. So I think with the first camp, it can be very helpful, but I would agree that it is.
So I think the first camp, it can be very helpful, but I would agree that it is. So what do you think tool, what license compliance issues do you think tooling can help with? And how does it help? Okay, so I'm pretty new with licensing things. So the tool we're working on, basically, he tries to use existing tools like scan code or Nomos.
And the idea is to tell a story about licenses: how the evolution of licenses is related to other software development things. Like, maybe you move from a private repository to GitHub, so then you have a license change there. The idea is to study the evolution of licenses and see what we can tell about this.
So we are not really focusing on inspecting the current status of the project, but more on analyzing and understanding why a license can change and the impact on other things that are related to software development.
So dig a little deeper on what you mean by evolution. Are you talking about when other people have contributed under different licenses on top of other ones? That sort of thing? What else is in there?
When you have an update of a license, for instance you pass from GPL 2 to GPL 3, or things like that. Or maybe you are integrating a component from another open source project that has a different license. So maybe you don't discover this at the very beginning. But then over time, you see that maybe that component changed its license, and this can cause problems in your project. So it's more like tracking, I mean, telling a story about the license or the licenses that are in your project.
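The idea of tracking license evolution can be sketched as a comparison of the detected license across successive releases. This is only an illustration under stated assumptions: the function name `license_changes` and the sample history are hypothetical, and in practice the per-release license would come from running a scanner on each tagged version.

```python
def license_changes(history):
    """Return (previous_version, version, old_license, new_license) per change.

    history is an ordered list of (version, license_id) pairs, for example
    the result of scanning each tagged release of a component.
    """
    changes = []
    for (prev_version, prev_license), (version, license_id) in zip(history, history[1:]):
        if license_id != prev_license:
            changes.append((prev_version, version, prev_license, license_id))
    return changes


# Hypothetical release history of one dependency.
history = [
    ("1.0", "GPL-2.0-only"),
    ("1.1", "GPL-2.0-only"),
    ("2.0", "GPL-3.0-only"),
]
print(license_changes(history))
```

Each reported change is a point in the project's history worth explaining, for instance a move to a new hosting platform or the integration of a component under a different license.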
Okay, so same question. What can tooling do to help understand and comply with licenses, and how does it do it?
So where I see the tooling: as software developers, we're now developing everything in CI/CD; we're going faster and faster and faster. And what I basically see is that compliance tooling is usually way behind development tools by default. So that has been our focus, and it is one of the challenges. Basically, we use package managers; all of us use package managers. In my company, we use close to 40 of them. And more and more are released today because, hey, we do CI/CD, so everything goes faster. We have more and more package managers that are not supported by any tools.
So discovery is limited. But then also, when you have all of that information coming in from all of these tools and all of these things, you need to be able to process it. And what we focused on is basically: okay, we discover, but then how can you process it? But to add to Max's comment, we don't say we automate everything.
I fully believe you cannot automate compliance fully. We call our system highly automated. So what I do my best to do, as Max is a lawyer and we have our lawyers, is figure out how I can take Max's thoughts for the simple cases and write them into computable rules. And from the other side, for what you do as a developer in your source code, we also give you a way to indicate: this is documentation, this is examples. We then get this in a computable form as well and make a handshake, so I can reduce the work. So basically both sides get a handshake. I generally call this having cats and dogs talk to each other, the dogs usually being developers and the cats being lawyers.
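The "handshake" between computable legal rules and developer-supplied path markers can be sketched as follows. This is a toy model, not the rule format of any real tool: the policy table, the exclude patterns and the names `LICENSE_POLICY`, `EXCLUDED_PATHS` and `evaluate` are all assumptions made up for this illustration.

```python
from fnmatch import fnmatch

# Hypothetical policy written with the lawyers: simple cases only.
LICENSE_POLICY = {
    "MIT": "approve",
    "Apache-2.0": "approve",
    "GPL-3.0-only": "review",
}

# Hypothetical markers supplied by developers: paths that are documentation
# or examples and do not end up in the shipped product.
EXCLUDED_PATHS = ["docs/*", "examples/*"]


def evaluate(findings):
    """Map each scanned path to a decision.

    findings is a list of (path, license_id) pairs from a scanner.
    Anything the policy does not cover is escalated to a human.
    """
    results = {}
    for path, license_id in findings:
        if any(fnmatch(path, pattern) for pattern in EXCLUDED_PATHS):
            results[path] = "excluded"
        else:
            results[path] = LICENSE_POLICY.get(license_id, "ask-lawyer")
    return results


findings = [
    ("src/a.c", "MIT"),
    ("docs/guide.md", "GPL-3.0-only"),
    ("src/b.c", "GPL-3.0-only"),
    ("src/c.c", "LicenseRef-unknown"),
]
print(evaluate(findings))
```

The default of `"ask-lawyer"` reflects the point made in the discussion: the rules handle the routine cases so that people can spend their attention on the cases the rules cannot decide.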
And they don't speak the same language. So my job, what I'm working on, is making a language that they can both speak, so that it's easier and we can move faster, and so that the lawyers can focus on the really complicated cases which really require dedicated attention. So we take the nitty-gritty work, which is the manual work, out of the equation.
And we just basically say to lawyers: hey, here I have something specific, I don't know. And then we see if we can automate it, which is usually a month-long discussion, because things are complicated and all. We try to automate as best as possible to give hints. That's basically where tooling can help you: taking the general discussion that you have out of the picture so you can focus on the really specific edge cases.
That's what we're focusing on.
That leads me well into my next question, which we'll start at this end and go the other way with. So there's this old phrase in computing: garbage in, garbage out. Today we'd usually say something like faulty data set, or something like that. I would argue, being a little bit unimpartial here, that almost every free software project out there is a poor data set for understanding its licensing. Developers, as you pointed out, do not do a great job at expressing their licensing intent in a way that is consumable and automatable. There are efforts out there to try to get developers to do this. Personally, I believe that, as Alexis' talk years ago said, Sisyphus is happy, because there's always going to be this pushing of the rock up a hill, and developers are not going to want to follow any specific documentation system for licensing. So given that, how can tools actually work in a real way to even do discovery,
let alone anything else, if all the data is so bad that files are not properly annotated and projects are not properly annotated with licensing? How can we actually solve that problem in an automated tooling way when the data is that poor?
So, the solution we came up with is, well, it's actually that we have multiple parties from multiple backgrounds working together. Just to explain this whole panel for people: there's actually a group of people that are working together. How it works is that ORT actually uses the scanner ScanCode.
So we actually use that to parse it. Then we generate data, and we try to output this as SPDX, which is the standard. And then Michael can basically ingest it into FOSSology. So the solution that we're working on is having all these various open source tools work together.
Now, there are lots of companies, and this is maybe the difference for developers: companies have liabilities. So companies care about licensing in open source. You as a developer might not, but your company does. So the way we work is basically: companies have these huge problems. We're inside CI/CD, everything goes faster, and we have this data problem.
So the solution we came to is all the organizations that have this problem working together, and this is why, last year, we founded ClearlyDefined, which is basically a central repository where we can take all of this data in, with a curation platform on top, and then fix it. Just so you know, pretty much all of the companies that do anything serious about compliance have compliance teams that do nothing else than figure out these licenses. They used to keep all of this data inside, and then Jeff McAffer and I spoke a year ago, and it was like: hang on, why are we all doing the same work? So what you now see is that basically all these companies realize: we have no proprietary information in this.
We are all doing the same work. Why not collaborate, and fix this data together? And that's how we now do this, because, in case you haven't noticed, open source is exploding; there is more and more of it. We have more and more package managers, and everything has to go faster and faster. If we don't solve this, companies might basically say: we will not use your open source project, simply because your licensing is so bad that it takes too much effort to push it through our development tool chain. Even if it's a great piece of open source, we will not use it because your licensing is so muddy.
So that's why we're now trying to work together, all the people on this panel (yeah, Max as well, you're also now in ClearlyDefined), on what we can do to provide tooling and work together so that we can lift the whole community to fix this. Again, we do not want to impose on you how to do licensing and force rules down your throat. What you will see is basically us working together, and we'll file a pull request saying: hey, could you please fix this? Here you have the pull request, please.
So let's pass it along. So it sounds like your pitch is that you've solved it all, there's no problem with compliance anymore, all the upstream... well, you have a plan, right? And it's going to be all upstream, and what's the date it's going to be done by?
We're already live.
Okay, so his argument is it's done today. You go to ClearlyDefined, and for every open source project that matters, you can get all the information. Does the whole panel agree with this?
No, no. This is impossible. There are so many open source projects out there. We started working on a solution. So this is basically OSI, the Eclipse Foundation, dozens of companies that are all named.
So let's pass it to Michael. Do you think this solution is going to work? Give me the date. When do you think all upstream projects are going to have perfect licensing information such that all this tooling perfectly tells you everything? Give me the date when you think it's going to happen. I think it was the solution that gave the date. Well, do you agree, feel free to disagree with this solution.
Let's pass it along. Do you agree with this solution? And if so, what's the date it's going to be done by?
It's a good starting point, but the data is bad. Integrating different tools all together and trying to get the information from other sources as well can be a solution. But I would not bet on the date.
So Max, what do you think? Is this data going to get there, perfect?
This is impossible. So let's talk about why that is. The input data will always be garbage for the rest of time because the law, copyright law, is garbage. That's the problem. And I'll just say this. Who knows what a derivative work is?
Who knows unambiguously in every case whether a piece of code A is a derivative of a piece of code B? No one knows. No one knows. And no one can know. Everyone has to do their own risk analysis. And this is the human factor, which is the only thing you, I mean tooling will help, but I think we're in this kind of obsessive compulsive phase,
like you were saying a second ago, where we have these ideas to fix copyright law once and for all. If only we could have annotations in a specific metadata format on the head of every file, then derivative works wouldn't be a problem anymore. Then we'd know the copyright provenance of everything. Forget derivative works.
Who can analyze whether something is even protected by copyright, or whether it's purely functional? That is actually still a novel area of law, still actively developing. So never; it will never happen. But, to your point earlier, if we're integrating our license scanning tools in as part of the development process, that's really the important thing: that we keep it open and keep it tightly integrated into how people are storing source and actually developing programs. Because a lot of people don't want to get their hands dirty. A lot of attorneys don't want to get their hands dirty with software development. But if you're looking at it after the thing's already compiled, already been distributed to someone, it's really impossible to figure out what's going on.
What about you, Philippe? Do you think we can fix the upstream data problem, and if so, when?
Actually, it's going to be fixed on December 31st, 3000.
Okay, great.
Exactly, right before midnight. No, no, it's impossible to fix. Yet there are things which are practically possible.
And contrary to what you said, you can have developers participate in the process and help. So I have two practical examples. First is Linux. I've been involved with some of the top-level maintainers of the kernel for the last two-plus years, with others, to help clarify the licensing of the kernel. And weirdly enough, or not weirdly enough, it's an old code base. It has a lot of history, and probably the largest number of contributors we've ever seen in any free/libre open source project. When we started, there were about 80 different licenses.
And just for the GPL, there were about 700 different ways to state that a file is under the GPL. You know, there's a limited number of words to express that, but nevertheless, you could think about every single permutation, and it had landed at some point in time in the kernel code base. So what we're doing is scanning and reviewing in detail.
I just finished another review of the latest tip of Linus' tree yesterday: every file in the kernel, to decide what's the correct license, whether it's clear-cut or not, and how we can replace any boilerplate with an SPDX license identifier, one by one.
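Replacing boilerplate with an SPDX license identifier means each file carries a single machine-readable comment line of the form `SPDX-License-Identifier: <expression>` near its top. As a minimal sketch of why that helps tooling, the following reads such a tag from the first few lines of a file. The function name `spdx_expression` and the `max_lines` cutoff are assumptions for illustration, and this is not the checker the kernel project itself uses.

```python
import re

# Capture the expression after the tag, dropping a trailing C comment closer.
SPDX_TAG = re.compile(
    r"SPDX-License-Identifier:\s*(.+?)\s*(?:\*/)?\s*$", re.MULTILINE
)


def spdx_expression(source_text, max_lines=5):
    """Return the SPDX expression found in the first max_lines lines, or None."""
    head = "\n".join(source_text.splitlines()[:max_lines])
    match = SPDX_TAG.search(head)
    return match.group(1) if match else None


c_source = "// SPDX-License-Identifier: GPL-2.0\n#include <linux/module.h>\n"
c_header = "/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */\n#ifndef X\n"
print(spdx_expression(c_source))
print(spdx_expression(c_header))
```

Compared with hundreds of differently worded notices, one fixed tag per file turns license detection into a simple lookup, which is the practical payoff of the file-by-file review described above.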
And that's a huge amount of work. We are hoping, maybe by 2019 and with the current push, to have maybe 60% of the files covered. And there are still a lot of ambiguities, and really weird stuff where, as time goes by, you know, companies have disappeared, people have died. And if you have ambiguities, especially in the case of Linux, where some part of the history ends up in old non-git trees and was under BitKeeper, it's a mess. Nevertheless, you see nowadays, if you watch the LKML, the Linux kernel mailing list, developers diligently providing simpler and clearer statements of the license of the code they contribute, and other maintainers nagging them to do that. So I think we're all for that. There is progress there.
I agree with you. I think we're all for expressing licenses more clearly. But I have a follow-up question on the Linux point. What happens when you can't represent the license of a particular set of copyrights with a simple SPDX expression?
Yeah, so the problem is not really about SPDX; it's: what if the license is ambiguous? And there's still a good number of files which have ambiguous licenses. And eventually there are two ways. Either you can get back to the original contributors and trace it back unambiguously and clarify the thing, or you have to get rid of the code.
I agree with that too, but when it is unambiguous, you know what the license is, but you just can't write an SPDX expression for it, what do you do?
I don't think so. Why would you not be able to do that?
Because there's an exception involved that doesn't have an SPDX identifier, things like that.
Yeah, well, so you can still write an SPDX expression for that. And eventually, if there's no official identifier at SPDX for this, quote-unquote, new exception, then you can ask for it to be added.
And if SPDX says no, what do you do?
Well, you can continue to use it as a private identifier. Just to give you an idea, there are about 300-ish licenses which are referenced at SPDX. ScanCode detects about 1,300, so about 1,000 more. And it doesn't mean you can just trash them and ignore these licenses.
So SPDX recommends that people, if there's a missing identifier, just make one up in a private space?
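In SPDX, a private identifier of this kind is written with the `LicenseRef-` prefix, so it can sit in an expression next to identifiers from the official list. A minimal sketch of telling the two apart follows; the tiny `KNOWN_IDS` set, the function name `classify_identifiers` and the sample identifier `LicenseRef-acme-notice` are assumptions for illustration, and a real tool would load the full SPDX license list and parse `WITH` exceptions properly.

```python
# Tiny illustrative subset; a real tool would load the full SPDX license list.
KNOWN_IDS = {"MIT", "Apache-2.0", "GPL-2.0-only"}
OPERATORS = {"AND", "OR", "WITH"}


def classify_identifiers(expression):
    """Label each identifier in an SPDX expression.

    Returns a dict mapping identifier to "listed", "private" or "unknown".
    Simplification: an exception name after WITH is not treated specially
    and is reported as "unknown".
    """
    tokens = expression.replace("(", " ").replace(")", " ").split()
    result = {}
    for token in tokens:
        if token in OPERATORS:
            continue
        if token.startswith("LicenseRef-"):
            result[token] = "private"
        elif token in KNOWN_IDS:
            result[token] = "listed"
        else:
            result[token] = "unknown"
    return result


print(classify_identifiers("MIT OR (GPL-2.0-only AND LicenseRef-acme-notice)"))
```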
Yeah, and eventually there's discussion about having decentralized namespacing to address that. Now, another example: I've taken the top 1,000 packages of several popular application package managers, namely npm for JavaScript, RubyGems for Ruby, PyPI for Python, Maven for Java, and NuGet for C#. And I'm computing a bunch of statistics on the clarity of licensing. It's still in progress, not fully finished.
There's one interesting tidbit of data that came up, which is that, weirdly enough, the licensing of Node packages, that means JavaScript, which are more often smaller and more recent than others, is usually clearer. And one of the reasons I think it's clearer has not so much to do with it being more recent code or smaller packages in general, but with the fact that there's been a significant effort by the JavaScript and Node community to ensure that there's feedback provided to developers. If you submit a package to be uploaded to the npm registry and you don't have a proper SPDX license expression attached to your package, you'll get a warning. It's not rejected, but you'll get a warning. And my only explanation for this difference between Node and other package managers is possibly based on that. So I think if you provide feedback and provide some information to software developers that a license is missing or a license is not clear, they will react. And if you go even further, eventually for the kernel, we'll get checkpatch, which is the tool used to verify that each patch is correct before you submit it, to act as a quasi-license compiler, as if you treat licensing as something which is as important as the code being able to run and compile.
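The feedback loop described here, warning rather than rejecting when a package manifest lacks a license expression, can be sketched with a small check on a `package.json` document. This is not npm's actual implementation; the function name `license_warnings` and the warning text are assumptions for illustration, and the check only tests that the `license` field is a non-empty string, not that it is a valid SPDX expression.

```python
import json


def license_warnings(package_json_text):
    """Return a list of warnings about the manifest's license declaration.

    An empty list means a license string is present. The package is never
    rejected; the caller decides what to do with the warnings.
    """
    manifest = json.loads(package_json_text)
    license_value = manifest.get("license")
    if not isinstance(license_value, str) or not license_value.strip():
        return ["no 'license' field: please add an SPDX license expression"]
    return []


missing = '{"name": "demo", "version": "1.0.0"}'
declared = '{"name": "demo", "version": "1.0.0", "license": "MIT"}'
print(license_warnings(missing))
print(license_warnings(declared))
```

Running a check like this at submission time is the cheap part; the point made in the discussion is that developers tend to react once the feedback reaches them as part of their normal workflow.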
I do have a follow-up question, but I want to give Michael one chance to answer my previous question, which is: when and how do you think the upstream data problem can be fixed?
Okay, okay. So you didn't forget about the date. Actually, I'm asked about dates all week, so I was hoping that on the weekend I wouldn't be asked about dates, as a project manager.
But the answer is something like what Philippe has answered. I think the question is similar to: when will we all have electric cars? And the answer is at no point in time, because there are some who adopt electric cars very quickly and there are some who just don't care. And I think there is some area in open source where people are just not very interested in publishing license-clean, clearly defined packages, and that will stay around. It will also stay around because today open source projects themselves have a lot of dependencies. And if they don't update the dependencies, they hang around for five or ten years. You will find very famous Java components with ten-year-old dependencies.
And then you can ask them, or maybe you contribute something to update their dependencies, but as long as there are dependencies out there, being used by open source software, that are not really clearly defined in terms of licensing, you will have this situation, and it will be like electric cars. In the 2040s, the majority of cars will be electric, but you will have combustion engines hanging around.
And I think the electric cars analogy is very interesting because the same thing happens now in license compliance. There are different players trying to come out with their own solution, right? We have the Linux initiative here. We have reuse.software from the Free Software Foundation Europe. Or we have FOSSology; for example, for my employer, one of the reasons why we invest in FOSSology is that, at the time when we started to contribute to FOSSology, there were not so many license compliance tools out there. And we thought that if a tool is actually available as free software, it will help to clean up licensing in open source software.
So that actually links to the follow-up question I want to ask Philippe. So let's start with you and move... no, let me start there and we'll move it along. But it picks up on what you were saying last, Philippe. I once called upstream license annotation in projects an unfunded mandate to upstream, because from my point of view, the companies are asking for this.
They want perfect upstream annotation of all this licensing to make it easy so all your tools work well and give all this data. But upstream developers, they have other work to be doing. They're trying to make this software work. Making perfect license annotation in their project is a big job. And often it sounds like the tool folks are saying, well, let's collaborate with you to get it right.
How much do you think the obligation is really on the folks who want this annotation to get into these projects, do the annotation for them and offer it as patches to them and say, does this look right to you, use it, versus this collaboration idea you're talking about, which sounds – it sounds interesting, but on the other hand, it's really unfair to ask these developers to do yet another job
when it's not what they want to do. It's really the job of the people who are all obsessed with this license compliance stuff to actually get it done. Yeah, I agree. I think contributing unambiguous annotations is a good job for those who are actually trying to have that or want to have that.
The point is that in some cases you cannot actually contribute it because you're not the copyright owner, if it's ambiguous. Well, you can propose, right? You can propose, I think this is what you meant. And if the copyright holder accepts it and incorporates it, then they've assented.
Yeah, I also think it would probably accelerate the entire thing if those people who are asking for it are actually contributing this clean-up work. And I think maybe ClearlyDefined also goes in this direction, actually. Because, for example, if ClearlyDefined is able to take over analysis work from
Fossology or other tools, then actually someone else can contribute that to the claim. What do you think about that, Philippe? Do you think it should be an unfunded mandate to upstream or do you think somebody has the job to come along and do this, and if so, who? So, I don't think it's either or. Unless you live in a parallel universe using software for which you don't know what license terms you need to abide by, it's just crazy.
I mean, the same way, I wouldn't want to use any software for which I don't know the license. That's a gateway. I agree completely, but most developers are going to throw the GPL in the top-level directory,
start making files, and all of us would consider that a fully GPL project. It's annotated enough for any developer to care about, probably for any lawyer to care about, but the compliance folks tend to want better annotation, right? I think it's perfectly okay for anyone. I don't care about annotation per se. I care about clarity.
And if the convention, and it's widely accepted the convention is, if you slap a GPL at the top level of your project, your project is GPL, then that's perfectly good enough. It may not be perfect. It would be better if you were a bit more expressive, maybe state what the license of each of the files, but nevertheless, that's better than anything and better in many cases than nothing at all that we see in several projects.
So it's not so much about slapping annotation, it's as much as being able to discover whatever convention may be used by a project or a community. The thing that's terrible is when you get nothing. So Max, what do you think about this issue, this unfunded mandate question?
Should upstream have to maintain this, and if not, who should maintain it? If you ask it this way, I wouldn't agree to it, either.
And it's been really low friction to create and use GPL software, for example, under that convention. And I think what we don't realize we're doing is every time we do another iteration of the new obsessive-compulsive behavior of documenting and annotating, we're creating social precedent, and we're
creating commercial conventions, which, if there are ever ambiguities in licenses, eventually could be consulted on. So can you imagine, like a project, there's two projects. One has a GPL license at its root-level directory. It has 10,000 files. I think now I can use that unambiguously. What about when we move to the world where every one of those 10,000 files needs to have a perfect annotation?
So is the convention going to be that if one of those files is missing the annotation, then all of a sudden, the software is... That's actually good, because since you're a lawyer, I'm going to ask you a legal question. You can't give us legal advice because you're not our lawyer, but tell me... I'll give you legal advice. Go ahead. So the compliance world has been feeding me back for years that the file on the
disk has special significance under copyright, that annotating the file with its license is incredibly meaningful. So can you tell me exactly where in the copyright statute it says that the file on the disk is the special unit? Like each source file. What's that? You're saying each source file.
So I've been looking for years in the copyright statutes where it says file on the disk is special and that's the thing you should annotate with permissions. So can you help me find it? No, it's not there, obviously. And actually, I mean, if you really want to freak people out, copyright licenses at least don't even need to be written, right? Like, we can really get freaky with the extent to which convention can start talking about it.
So let's dig into that a little bit. So I'm being a little glib there about the file because the file is not where the copyright controls attach. How do we annotate? How do we annotate copyright in a software project? How do we figure out whose copyrights are whose and what their licenses are? Where does the copyright attach? Where does the licensing attach?
People have argued that C tokens, like if you tokenize the C program, the copyrights attach with each token. I don't think there's any legal backing for that. I think you would probably agree. But I understand the problem. How do we find where to annotate? I think the appropriate thing to do is to be respectful of project maintainers. So if we look at it from the viewpoint of respect, where people have taken an extreme amount of effort
and put something out there for our benefit, then we should take projects as they come. Instead of dictating, I think, how they should annotate, we should say, okay, if it's clear enough that using some kind of tool we can scan it, we bear the burden, we bear the cost of assessing the provenance, then that's probably good enough.
We should probably circle around conventions that don't impose so many burdens on. So Michael's already accused me of, I want to give everybody a chance. Michael's already accused me of changing the question in the middle. So just give us your general thoughts on how you feel about the issue of upstream annotation, who should do it, why, when, and how. I agree more with what they said.
So it's a convention, so agreeing on some rules. But, for instance, the work that npm, for instance, is doing on GitHub, so forcing or anyway putting a warning to have a license in your project can help. So I think it's like a mix from upstream and then also from knowing what you are doing when you write code.
So I would say that is. What about you? How do you feel about the upstream annotation question? So luckily in my company I could write a policy on this. Literally, I had to write the process. And in our set we said, yeah, don't fix the problem on our side. Just file a pull request for this.
Because for us it's basically, if we don't, if we patch it, to basically say, so we patch it internally where we, so just so you know, we do have in our tool an ability where we can say that the convention is, if it's a license on the root, it applies to all the files. It is possible to basically translate convention into machine-readable form. So we try to say like, hey, please upstream it.
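The root-license convention described here can be expressed as a simple fallback rule. What follows is only a minimal sketch under stated assumptions, not the behavior of any tool named on the panel: the function name, the input shape (a mapping from file path to the license found in that file, or None) and the example paths are all hypothetical.

```python
def effective_licenses(detected, root_license):
    """Resolve a license for every file using the root-license convention.

    detected: dict mapping file path -> license identifier found in the file,
              or None when the file carries no license information.
    root_license: identifier of the license text found at the project root.
    Returns a dict mapping path -> (license identifier, reason).
    """
    resolved = {}
    for path, license_id in detected.items():
        if license_id is not None:
            # An explicit per-file statement wins over the convention.
            resolved[path] = (license_id, "declared in file")
        else:
            # No statement in the file: fall back to the root license.
            resolved[path] = (root_license, "inherited from root license")
    return resolved


if __name__ == "__main__":
    # Hypothetical scan result for a small project.
    scan = {"src/main.c": None, "src/vendor/zlib.c": "Zlib", "README": None}
    for path, (lic, why) in sorted(effective_licenses(scan, "GPL-2.0-only").items()):
        print(f"{path}: {lic} ({why})")
```

Recording the reason next to each result keeps it visible which answers come from the project's own statements and which come from the convention, which is the distinction the panel is arguing about.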
Because for us it's basically, if we fix it once, it's basically fixing going forward. And sometimes it's really, really trivial things. And it's like, guys, come on. It takes you five minutes to basically fix this. We're sometimes talking about, most of the time the license is already there.
But just because they didn't perfectly follow how, for instance, Maven specifies how the license does. Because yes, it's in the Maven ref, but it's deeply buried in there. It's a five-minute fix. Like, just fix it and it will be fixed for the whole community. So I have one last question that I want to ask you, and then we're going to turn it to the audience.
So my last question is, the biggest compliance problem I see in the world is, under copyleft licenses, the requirements for complete corresponding source code. The source code that corresponds to the binaries or otherwise minified JavaScript, binary-like things. Tell me, what tooling helps with that, if any, and how?
So you want to know the corresponding source code for… Right, so you have a binary, right? I mean, this is the ultimate compliance problem. I have a binary that I know was built from some sources that were under a copyleft license. How do I produce the source release that goes along with it? What and where is the tooling that helps with that?
So are you the creator of the binary or are you the consumer? Either way. So if you're the creator, basically what we're trying to do is basically give you the tool chain for free. And what we're also working on is giving you instructions on, hey, if you do ksaxel, we basically will be publishing for all the various package managers
how you can comply with that. And literally exact details of like, if you do this in this, if you're doing Maven, do this. If you do Maven, it's that. So those details were not available beforehand. Most companies have written those, and we were like, when I asked companies, oh, you have those?
Can we just open source those? Like, no, no, no. So now I basically decided with a couple of other people to, we're just going to write them, we anyways have them, publish them as basically this is how you can do it, and all the tools will support that. Yes, it will require some time, some tools are a little bit more complicated to do this. But yeah, and then basically for me to think once we have open tooling
and we give you the documentation on like this is how we do it, it's basically us as who needs it, or the companies that need it, is going to all the tools that are part of that stack and basically filing pull requests and saying, hey, Webpack, we would like to do this and this, are you okay with this?
And basically we provide the tooling for that, and then yeah, it's going to take a while before we get to all the tools, but yeah, my solution, we have to, as we are the ones that would like to have it, we have to invest to fix it. I agree with that. So what do you think, what are the tools out there now that help with complete corresponding source code provisioning?
On either side, consumer or producer. I have no idea. I think I agree with you actually, because I haven't seen the tools yet. I'm trying to find out where they are and how to get them. So I agree with you. That would be my answer too if I were asked, so we're going to agree. Max?
Mirko is up there. Hey, Mirko. So I just want to give a shout-out to the Quartermaster Project. It's a great project. It's in development. I think that, yeah, it's going to be really hard, but the way to get closer to making sure that when you convey a binary, you convey the complete and corresponding source is to make sure that whatever tooling you have is really deeply integrated into your build system,
because that way you can create a manifest. You know exactly every source file that went to the binary. It's going to be very easy to convey both the tool chain. Now, as a consumer of binaries, like let's say you're in a relationship with a company, and they give you a binary, and you're required to redistribute it. There it's going to be impossible to comply, because you're going to have to do contractual negotiations.
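The manifest idea can be sketched in a few lines. This is an illustration only, assuming the build system can already hand over the list of source files it used; the function names and the JSON layout are hypothetical and are not taken from Quartermaster or any other tool mentioned here.

```python
import hashlib
import json


def build_manifest(source_paths):
    """Return one entry per source file, with a SHA-256 digest of its content."""
    entries = []
    for path in sorted(source_paths):
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        entries.append({"path": path, "sha256": digest})
    return entries


def write_manifest(entries, manifest_path):
    """Store the manifest as JSON so it can be shipped alongside the binary."""
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)
```

A manifest like this only lists what the build system reported. It does not by itself prove that the list is complete, which is why the panelists stress deep integration with the build.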
They're going to have to give you the source or an offer. You're going to have to pass that along. It's really difficult. But as the producer of a binary, it's not that difficult as long as the scanning is deeply integrated with the build system. And to the point, yeah, if you're not the producer of the binary, it's really hard. Even if you take a package in a popular Linux distribution,
being able to ensure that you get the exact corresponding source code is not a given thing. Now, some tools like Quartermaster can help. I also have a tool called TraceCode, which uses strace to trace the build and figure out which files may be used.
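To give a feel for what tracing a build means, here is a rough sketch that reads strace output and lists the files that were successfully opened read-only. It is not how TraceCode works internally; it assumes a trace captured with something like `strace -f -e trace=open,openat -o build.trace make`, and the regular expression covers only the common single-line form of those calls.

```python
import re

# Matches lines such as:
#   1234 openat(AT_FDCWD, "src/main.c", O_RDONLY) = 3
#   open("out.o", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
OPEN_CALL = re.compile(
    r'open(?:at)?\((?:[^",]+, )?"([^"]+)", ([A-Z_|]+)(?:, [0-7]+)?\)\s+=\s+(-?\d+)'
)


def files_read(trace_text):
    """Return the sorted paths that the traced build opened read-only."""
    paths = set()
    for line in trace_text.splitlines():
        match = OPEN_CALL.search(line)
        if match is None:
            continue
        path, flags, result = match.groups()
        if int(result) < 0:
            continue  # the open failed, so the file was not actually used
        if "O_RDONLY" in flags.split("|"):
            paths.add(path)
    return sorted(paths)
```

The result still mixes real sources with system headers and configuration files, so it is a starting point for the manual work described above rather than a finished source list.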
But it's really low-level help, and it's a hundredth of the work that's eventually needed. To me, the simple thing to ensure you always have the corresponding source code available is to always work from source. And it's something that's surprising.
Everybody's using open source, but very often we consume packages and projects as precompiled binaries coming from left and right, and the software teams, be they open source developers themselves or in a commercial context, don't have the corresponding source code.
It's a real problem, especially after the fact getting back to the source is going to be harder and harder. The website disappears. There's one person and one team that helps to preserve that. That's the Software Heritage project, which is trying to index and preserve all the source code. It's really important.
And we don't realize how important it is. There's a whole ecosystem like in Java that's been used to consume only binaries. There's a huge amount of Java code which is not available and no longer available in source code. And when it's available in source code, there's no license information.
I'm going to agree that everything in the world should be available in source code if it's software. So I'm with you on that. Even if you don't publish it as a consumer, not taking advantage of the fact the code is available is crazy. I agree. It's just you're giving up on the benefits. Now, getting back to the other question just before, I wanted to add something, which is,
if you're publishing source code supposedly under an open source license, you want it to be consumed by somebody else. Otherwise, there's some problem. Why do you publish source code in the first place? So having clear licensing should be part of the standard practice. I would argue that we should not optimize
for the most pedantic corporate user when we write, for example, projects. And to this, for instance, take two examples. So practice in the Linux kernel has always been to always annotate each and every file. So that's the common way for Linux. If you take another ecosystem, for lack of a better word, Ruby,
Ruby developers hate writing any comment in their files. So there's very few comments in general and even fewer license-related comments or annotations. So it would be crazy to force the practice of C and Linux kernel developers on Ruby developers.
I do want to give Michael a chance to answer the source code provisioning question, and then we're going to get some audience questions. Yeah, because also Tom wants to hold up a sign here. So answering this question as the last person is probably redundant because I agree that as a producer, you have Quartermaster there,
and there's Zach from Software Heritage also sitting there doing an interesting project in this area. When you have a binary, the Binary Analysis Toolkit might be interesting. I think there is a new-generation version of it to be published soon, Binary Analysis Toolkit Next Generation, BANG, so to say.
So I think that's interesting, and I think that should be more open source because the old Binary Analysis Toolkit was open source, but the database of fingerprints, what you find in the binary and associations with source code being published, was not public, so I think that's going to be changed now,
and I think that's probably also an interesting tool. So we're going to take a few audience questions with our last ten minutes, which means we have to share this mic because we only have two mics in the room. So Tom is going to run the mic, and I will pass this one around to others. So if you say who on the panel you want to answer first, that would help.
There's basically one burning question with everything I've heard now, which is in a hypothetical scenario, in a big enterprise, doing Java software for ten-some years, starting to care about license compliance now is like,
okay, just break down and pray to whatever deity you have because basically you're in a very bad place and you're not going to leave it soon, or what do you do? Everything you just told me sounds pretty bad, actually. So you asked what to do, so you're basically a new company,
and you wrote down. So what I would recommend for Java, Open Source Review Toolkit, my own tool. We were in exactly the same place, and we basically looked. So the difference is basically to form all the tools, basically you need a tool that understands package managers,
and the trick with all the previous open source tools was like they understand basically just on file level, copyright holders and licenses, but it didn't take the package information into account. So we basically, we really looked at all the, actually we spent two years looking at all the proprietary vendors, and we know everything's out there,
and they don't really work if you really look at it, if you understand. So my first question for tools was really like, how do you get to your source code? So how do you get what the packages are in there? The second question you also had to have, when you show me concluded licenses, how did you get to that conclusion? Those are the two questions that,
so no matter what tool you pick, those are the two questions you have to ask yourself. Can I add something to this, something important? Because I was expecting that he is answering with a tool, but I would answer, even though I maintain tools: situation. You need to become aware of your situation.
Like what open source software are you using? What is actually your compliance risk? How do you distribute software? What's your distribution model? And from then on, you probably end up with ORT, likely, and I think there is also Java support in Quartermaster, if I'm not mistaken. But I think the first thing is situation. Also, when I talk to other companies who want to use Fossology,
sometimes it just turns out it's not the right tool for them, because they're in a different situation and have different compliance needs. And say which panelist you want to answer first. I don't know who is the best panelist to answer. I'm asking about MIT and BSD compliance.
So SPDX identifies the licenses while leaving the copyright year and copyright holder as variables. So how do you, with tooling, when the licenses require you to reproduce a specific copyright year and holder, how do your tools deal with that?
And as toolmakers, how would you like upstream projects to make it easier for you to deal with? For example, Facebook has like a fixed year. Google uses like tautological copyright holders. So it says like Chromium authors. And then maybe Apache plus LLVM
addresses the GPLv2 compatibility in a different way. So just to rephrase your question is how important is it to have all copyright statements? How do you deal with MIT and BSD compliance? And how important are the years in copyrights?
How would you like upstream projects to, what would you like them to do with years and copyright holders? So that it's easy to comply with. For me, writing tools that detect copyrights, I like them to be parsable. But I think the bigger question is how important is it to have the exact statement?
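What "parsable" might mean in practice can be shown with a small sketch that pulls the year and holder out of a conventional notice line. The pattern and the sample notices are illustrative assumptions, not the detection logic of any scanner discussed here, and real-world notices vary far more than this handles.

```python
import re

# Hypothetical pattern for notices shaped like
#   Copyright (c) 2015-2019 Example Corp
#   Copyright 2014 The Example Authors. All rights reserved.
NOTICE = re.compile(
    r"Copyright\s+(?:\(c\)\s+|©\s+)?"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*)\s+"
    r"(?P<holder>.+)",
    re.IGNORECASE,
)


def parse_notice(line):
    """Return {'years': ..., 'holder': ...} for a notice line, or None."""
    match = NOTICE.search(line)
    if match is None:
        return None
    holder = match.group("holder")
    # Drop a trailing "All rights reserved." so only the holder remains.
    holder = re.sub(r"\s*All rights reserved\.?\s*$", "", holder, flags=re.IGNORECASE)
    return {"years": match.group("years"), "holder": holder.strip()}
```

A notice that keeps to a shape like this is easy for a tool to reproduce verbatim, which is what MIT- and BSD-style licenses ask for. Notices without a year, or with the holder before the year, would need additional patterns.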
I remember a discussion with a developer from, actually Google, working on the next version of an operating system called Fuchsia. And he was telling me that it was absolutely essential to have the trailing words "all rights reserved" in the copyright statement. That's a public discussion so we can show that. So I told him no, it's been, it's been, it's over since 1950.
So the question is also a legal question. Go ahead. Max, hold on, Mary has a question. No, I actually have a follow up question to the question there. Go ahead, go ahead. I would say as little as possible. The Git log is a great documenter of both the contributor and the year.
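The point about the Git log can be made concrete. The sketch below assumes the history was exported with `git log --format='%ad%x09%an' --date=format:%Y`, which prints the author year and author name separated by a tab; the function name and the sample names are hypothetical. It only summarizes who committed in which years and says nothing about who legally holds the copyright.

```python
from collections import defaultdict


def years_by_contributor(log_text):
    """Map each author name to the sorted list of years in which they committed.

    log_text: lines of the form "<year>\t<author name>".
    """
    years = defaultdict(set)
    for line in log_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or unexpected lines
        year, author = line.split("\t", 1)
        years[author.strip()].add(int(year))
    return {author: sorted(found) for author, found in years.items()}


if __name__ == "__main__":
    sample = "2019\tAlice Example\n2018\tAlice Example\n2019\tBob Example\n"
    for author, active in sorted(years_by_contributor(sample).items()):
        print(author, active)
```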
If there was ever any infringement or litigation, people would go to that immediately. So if you can just get rid of all copyright statements and all code, I'd be happy with that. I'd also add the easiest way to comply with copyright notice requirements
of non-copylefted licenses, treat them like copyleft licenses, always give the source code to everyone in the world always, and all the copyright notices will be right. Just always give everybody source, don't write any proprietary software. It will solve all of these problems.
Go ahead, Mary. So my question is where do I find or where can I get information on what tool might work for me? So let's say I'm a company just starting out on compliance, and I'm wondering what tool can I use? Now, apart from me telling them these and these and these and tools,
these tools are available and this is what they can do, they might not believe me because, well, you're a lawyer, you don't know anything about tooling, so where can I send them? So I have decided because Max is the biggest consumer of tools on the panel and there's too many people who make tools on the panel, I'm going to let Max answer as a consumer of tools.
I think the response is you're asking the wrong question. And we deal with this a lot, sometimes more than we'd wish to. But if you're looking at a situation where things seem really messy, there's been a history of bad practice, the first thing you need to do is talk to people.
You need to talk to the lowest-level engineers. You need to talk to their management. And before asking for tooling, you need to make sure that there's some kind of coherent process for checking code in. Is there some kind of basic IP training for the employees so they know how they can check code in? Is there code segregation? So I think your problem is purely human.
And then after you've solved the human problem, if you can solve it, then tooling is really, it doesn't matter what you choose. Every one of these tools is going to help you do what you need to do. But you need to be solving the human problem first. Let's take this last question because we're all out of time. A hundred percent right. I mean, process first. And you can choose my tools afterwards.
Okay. So let's assume that we work at a company that is still on the path to putting everything in open source. So you have a mix of open source licenses and binary or proprietary things. Does SPDX or a process using something like that actually grok the fact that some things in there
are binary or non-open licenses? So as a company that has still a lot of proprietary stuff in there, what we do is actually we make our own license identifiers for all our proprietary stuff. So we treat basically open source licenses and proprietary licenses as licenses.
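SPDX reserves the `LicenseRef-` prefix for identifiers that are not on its license list, which is what makes it possible to treat proprietary and open source licenses the same way. A minimal sketch of how a tool might tell the two kinds apart follows; the function name, the tiny set of listed identifiers and the example `LicenseRef-Proprietary-ExampleCorp` are assumptions made for illustration.

```python
import re

# Per the SPDX specification, a user-defined reference is "LicenseRef-"
# followed by letters, digits, "." or "-".
LICENSE_REF = re.compile(r"^LicenseRef-[A-Za-z0-9.\-]+$")


def classify(identifier, known_spdx_ids):
    """Classify an identifier as 'spdx-listed', 'custom' or 'unknown'."""
    if identifier in known_spdx_ids:
        return "spdx-listed"
    if LICENSE_REF.match(identifier):
        return "custom"
    return "unknown"


if __name__ == "__main__":
    # Small illustrative subset of the SPDX license list.
    listed = {"MIT", "GPL-2.0-only", "Apache-2.0"}
    for candidate in ("MIT", "LicenseRef-Proprietary-ExampleCorp", "Some License"):
        print(candidate, "->", classify(candidate, listed))
```

A custom identifier still needs its license text recorded somewhere in the SPDX document so that whoever receives the package can look up what it means.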
So the tool is designed to handle both. And SPDX supports basically also writing your own license identifiers. So what we do, for instance, so just so you know in this section they start with LicenseRef. So we do LicenseRef, proprietary, and then here. This is how we have our own identifier
and all of our packages, when they go to our customers, they will have an SPDX license identifier exactly for that so that our customers, when they insert our packages, they see exactly the license. Another question? Okay, I have one last question, and that is on this question of complete and corresponding source,
I'd like to ask the tools people on the panel, first of all, are you aware of the reproducible builds project, one, and two, has that helped you at all in getting to complete and corresponding source? Yes, we're aware.
It has helped somewhat. The problem is the whole chain is not yet supported to do all of that. So what we want is to have basically all of these tools running at software creation but also be able to parse when the artifact is created afterwards.
So it has to do the whole tool chain. And for that, that still requires a lot of work. It's like where we now currently are at with the open tooling is working at source code creation a lot and figuring out the discovering and processing of that. We're not yet there with really the end-to-end. We'll get there eventually.
Anyone else want to comment? Yeah, I think there are a couple of steps before that which would already solve the problem of providing the complete corresponding source code
because I think reproducible builds always producing the same binary with the same signature or hash value is a step beyond that. And I wanted also to add that I understood Miriam's question differently: suppose that you understood the process and you understood your roles. I think there is a problem that we don't have a central marketing department
for open source tools so far. So people know that there are open source tools for license compliance but it's really difficult to understand the capabilities of the existing solutions. And there is actually an effort on GitHub. It's called Sharing Creates Value.
And there we would like to list all the open source tools and explain their capabilities and how they fit together and how they could be arranged in the tool chain and in the company or so. So I'll take my prerogative as moderator to say on the Reproducible Builds question in my very biased view because Reproducible Builds is a Conservancy member project.
But I thought this before they were a Conservancy member project. It is the best thing to come along in the last 20 years with regard to the complete corresponding source code problem in my view. So with that I want to thank all our panelists. And many of them, before you clap, I want to note many of them submitted talks of their own and we cajoled them into being on a panel together instead of having their own talks. They were very gracious about it and I'd like you to give them a big round of applause.