Cobbles and Potholes: On the Bumpy Road to Secure Software Supply Chains
Formal Metadata
Number of Parts | 12 | |
License | CC Attribution - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/51314 (DOI) | |
Transcript: English (auto-generated)
00:02
So now I'm really very pleased to introduce our first keynote speaker, Henrik Plate from SAP. Henrik is a senior researcher at SAP, and his current research focuses on the security of the software supply chain, especially the use of open source components.
00:22
Henrik is leading the Eclipse Steady project, which supports the detection, assessment and mitigation of vulnerable open source dependencies. Dear Henrik, the floor is yours. So yeah, good afternoon again.
00:40
My name is Henrik Plate. I work for the security research team of SAP, and this talk will be about cobbles and potholes, which is a rather lame metaphor for the kind of problems that we face these days when it comes to securing our software supply chain.
01:02
Before starting the presentation, I'd like to use the opportunity to sincerely thank the organizers for having me. I really feel honored by the opportunity to open this conference. All right, so this is the agenda,
01:23
how we will run this meeting. I will start with a very brief introduction of myself and a few words about my employer and the security research team. And then we will spend most of the time on what I believe are the two main problems that we see in the space of open source supply chain.
01:43
The first has been known for quite some time, especially since Heartbleed in 2014: the use of components with known vulnerabilities. And the second is a problem that has theoretically been known for a long time as well,
02:01
but which has gained quite some attention in the last couple of months and years, and which will gain much more attention in the future; that is my sad prediction. Two quick disclaimers. First, regarding the level of detail.
02:21
So as always, when preparing presentations, I've been carried away by my passion and enthusiasm for the topic. So I put a lot of content, examples and references. My excuse is that I wanted to produce a self-contained slide deck that is useful even after the presentation.
02:41
So that people can refer to it, find the pointers, references, information. But of course it requires quite some discipline from me not to explain every little detail, and some discipline from the audience not to read everything that is on those slides, but rather listen to me. In any case, you can always reach out to me after the presentation in case you have any questions
03:03
regarding whatever I presented. The second disclaimer is regarding fear-mongering. And this is kind of a problem inherent to the whole security domain, right? It's always as if we find pleasure in making people afraid of security risks
03:21
and vulnerabilities and so forth. I try to stay away from that by quoting and presenting information from research papers and studies that have hopefully been produced by independent researchers. Let's see whether this worked out. I leave that assessment to you.
03:43
So very quickly: I'm German, as you can hear from my accent, 46 years old, and I have spent quite a good share of my life in France. I have been developing software for many years already. And rather accidentally, I became a member
04:00
of the security research team at SAP Labs in Southern France. And that was actually a great, great accident. I have been happily working in this domain for more than 10 years, where I did architecture and code reviews, developed some secure developer trainings, acted as the coordinator of European projects and so forth.
04:26
Sadly, I have contributed far too little to the open source community. So shame on me; I try my best to catch up, particularly with Eclipse Steady. And I'm a cycling enthusiast,
04:40
which explains this metaphor and the occasional appearance of some cycling-related pictures in my slides. Those pictures come from the infamous one-day race, Paris-Roubaix, which is actually known for its cobbles or cobbled roads in Belgium.
05:02
SAP and open source, well, how does this fit together? I would say not very much, if you think of the software that we developed 20 years ago. And here's this screenshot of the infamous travel expense management solution. Maybe many of you have seen this bluish, unintuitive UI.
05:25
That was really all 100% closed source: communication protocols, programming languages, IDE and so forth. That situation changed drastically, as it did for the entire software industry, I would say.
05:42
By now we are a heavy consumer of open source, but we also try to give back. So we contribute to existing projects, and we release and start new open source projects on our side. One particularly noteworthy example in the year 2020 is the Corona-Warn-App.
06:04
So this is basically tracking relationships between people using decentralized data storage. It has been developed with a couple of other companies in a rather short timeframe of 50 days
06:20
and is now available on GitHub as well. As for SAP Security Research, we consider ourselves an applied research team, bridging academia on the one hand and SAP product development on the other. So we basically try to transfer
06:43
and communicate new approaches, concepts and tools from academia to product development and see what is applicable and useful. And we try to communicate real-world security problems to academia and see where there is a fit.
07:03
We have been quite successful in terms of peer reviewed publications. And we have eight strategic research areas. I'm not going into those or the majority of those. One area is open source security analysis led by myself.
07:22
And you will hear a lot about our work and current trends in this domain later on in my presentation. The other topic of relevance is the secure Internet of Things, which is led by a colleague named Laurent Gomez. For that topic, I dedicated one slide
07:42
to exemplify the work we are doing in the area of secure IoT. The first interesting observation when talking with Laurent a couple of days ago was that he would actually not call it the secure Internet of Things anymore, but distributed enterprise systems. And this change in the name reflects a trend
08:03
where those things are no longer dumb devices that collect data and send it to some central backend, but become smarter and smarter, such that programming logic and data are moving from the central enterprise systems
08:22
to those devices, which brings its own new challenges. This shift or this trend is also exemplified with two of the areas of work of Laurent and his colleagues.
08:41
In former times, they were working more on secure end-to-end communication channels, where the problem was really getting data to the backend in a secure way. And a good paper in this domain is the one I cited here, where he was basically developing a cryptographic scheme
09:03
to protect the confidentiality of the data being transmitted end-to-end. End-to-end meaning the data is encrypted on the device that sends it out and stays encrypted in the database, and decryption only happens shortly before any processing or display.
09:22
There were a couple of requirements like device specific keys to support authentication, frequent key changes, and a couple of more. This solution is now used in the water distribution system
09:40
of the city of Antibes in Southern France, where basically sensor data like temperature, water pressure, water levels and so forth are sent to some analytics backend and then presented and analyzed with some dashboard. The second paper I wanted to cite relates
10:03
to the problems that appear when you move application logic and data to those devices. Here he was recently working on protecting the intellectual property of machine learning models that are deployed on the device
10:24
and at the same time protect the input and output data. And what he did basically is he was using homomorphic encryption in order to protect the weights and the biases of the layers of the neural network
10:43
and thereby the model, which was working over the encrypted input data; in the evaluation phase, that was encrypted video data. And so the use case of this technology is to basically use the data
11:07
of video cameras deployed in Antibes again, in order to find out whether there are any suspicious activities that could relate to terrorist attacks. So for example, if there are lorries or trucks
11:21
being parked for a long time in front of some critical buildings and so forth. Right, now we come to the two main sections. The first one is dependencies with known vulnerabilities. So I will start by assuming that we all agree on the fact
11:43
that open source consumption is steadily increasing, and so is the number of disclosed vulnerabilities in such components. One number in this context is that an application typically contains roughly 100 dependencies
12:05
or upstream projects, and a good percentage of those have known vulnerabilities or are, let's say, subject to vulnerabilities.
12:20
And what happened is that, I would say at the latest in 2014 with Heartbleed and even more so in 2017 with Equifax, developers entered a kind of hamster wheel, which is an endless cycle of checking whether there are new vulnerabilities for the components that you depend on. For every finding, you try to figure out
12:42
whether these are false positive findings, if not assess whether those vulnerabilities really matter in the specific context of your application, because maybe they're a part of some code that is not invocable, not reachable in the context of your application. And of course you need to keep fingers crossed
13:02
that your checks didn't have any false negatives. The next part of the hamster wheel is the mitigation, which can be very easy if your upstream projects respect SemVer and which can be very difficult in case you have vulnerabilities in projects which are long dead and that you need to fix
13:21
maybe in your own fork. And then you release a patch and you're happy. Congratulations to all people running software in the cloud. That is easy. That is really an advantage of cloud computing. And sorry for all those that need to patch software that is running in devices. So there was this Ripple20 vulnerability
13:42
a couple of months back. And this really exemplified the scale of the problem in the IoT domain, where you had this vulnerable TCP stack that existed in hundreds of thousands of devices. Many of them are actually unknown.
14:01
And for some, you cannot even fix and release a patch to the device, even if you find and identify it. So in this context of the hamster wheel, of this endless cycle, we will talk for the next 10 minutes about the following topics. Quality, timeliness and content of public vulnerability databases.
14:22
Problems for developers in assessing such vulnerabilities. The shortened response windows, which are shortened in particular because exploits become available very quickly. And last, a few notes and comments
14:40
on the possibilities to auto-upgrade, to self-heal if you wish, vulnerable dependencies. So the foundation for the next few slides is to understand the CVE and NVD concepts. Whenever you talk about known vulnerabilities,
15:01
you hear those terms in the first 30 seconds, I assume. This is by now the largest publicly available database with information about software vulnerabilities, both in open source software and commercial software and operating systems. And basically everybody could submit a CVE.
15:24
So basically whenever a security researcher learns about a vulnerability, they would request a CVE from MITRE, an organization in the US. They would reserve a number, then start the discussions with the vendor and the researcher and so forth until it is eventually published in the NVD,
15:43
which adds a severity rating and an enumeration of the affected products. This whole process can take from a few days up to several years, which already indicates one of the problems we will be talking about later on.
16:01
By now, there are 140,000-something CVE entries as of yesterday. This is an example of a vulnerability reported for Eclipse Mojarra, one of the Java Enterprise components, if I'm not mistaken, handed over to the Eclipse Foundation some years back.
16:22
And this is really almost the full entry, right? You have a very short description and a severity rating saying this is a high-severity problem with a base score of 7.5. You have one reference to the fix commit and one to the issue. And you see the affected product is Eclipse Mojarra.
16:41
This looks neat, but it is by far not sufficient to let developers, downstream users of Mojarra, decide whether this vulnerability really matters. There are many problems and I don't have the time to go through all of them. I just want to point out, first of all, that since this is a manual process,
17:02
there are errors, and there is no way to avoid those. So one thing is they actually identified the wrong versions in the first place. In fact, the versions 2.3.5 and 2.3.6 were also affected. And they corrected this after we reported the problem,
17:20
which we actually detected using Eclipse Steady. The second big problem is that there are entire ecosystems not covered, like npm. You will find very, very few vulnerabilities about Node.js packages or npm packages. And there are many more problems, mostly stemming from the fact that you have humans
17:41
involved in the process who give arbitrary names or labels to things that need to be mapped to other things. But I won't go into this here. This was an example. There are very interesting empirical studies of the inconsistencies in the NVD. So the researchers referenced here
18:02
basically compared almost 80,000 CVEs with 70,000 vulnerability reports produced over the last 20 years and tried to find inconsistencies in the names of the products and the versions of the products being referenced. And so they say strict matching is if the CVE
18:22
and the report mention the same product names and the same product versions. Loose matching is if they talk about the same product but different ranges of versions. And if you talk about different version ranges, basically one is over-claiming
18:40
and the other one is under-claiming the number of affected products or product versions. And when they did this, they figured out, just by comparing CVE and NVD, that only 70% of those really have a strict match, talking about the same affected product versions.
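The strict and loose matching described here can be illustrated with a small sketch. It is not the tooling used in the cited study: the function names, the record layout (a dictionary mapping a product name to a set of version strings), the lower-casing of names and the example product are all assumptions made purely for illustration.

```python
def normalise(entry):
    """Lower-case product names so trivial spelling differences do not matter.

    `entry` is a hypothetical, simplified record:
    {product name: set of affected version strings}.
    """
    return {name.strip().lower(): frozenset(versions)
            for name, versions in entry.items()}


def classify_match(cve_entry, report_entry):
    """Return 'strict', 'loose' or 'none' for two vulnerability records."""
    a, b = normalise(cve_entry), normalise(report_entry)
    if a == b:
        return "strict"   # same products and same versions
    if a.keys() == b.keys():
        return "loose"    # same products, different version sets
    return "none"         # the records do not even agree on the products


def claim_direction(cve_versions, report_versions):
    """Say which side lists more versions for one product."""
    cve, report = set(cve_versions), set(report_versions)
    if cve == report:
        return "equal"
    if cve > report:
        return "cve over-claims"
    if cve < report:
        return "cve under-claims"
    return "partial overlap or disjoint"


# Hypothetical example data, not taken from any real CVE.
cve = {"Example-Lib": {"1.0"}}
report = {"example-lib": {"1.0", "1.1"}}
print(classify_match(cve, report))
print(claim_direction(cve["Example-Lib"], report["example-lib"]))
```

Comparing explicit version sets keeps the sketch short; real records often use version ranges, which would need a range comparison instead.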
19:01
90% of those CVE and NVD entries talk about the same product but about different versions. And the consequence of that is that in 10% of the cases, they cannot even agree on the number of products being affected. And this is getting worse if you compare NVD to other sources of vulnerability information
19:23
like the Exploit Database and so forth. Another important problem is related to response windows and the availability of exploits. So here the main problem is that the time between a researcher reporting the problem
19:41
and the description becoming available to the general public can span weeks and weeks. They analyzed that difference, finding that quite a number of CVE entries lag one week behind the first official public report.
20:03
And the second problem, or why this matters so much, is shown by a study regarding the availability of automated exploits. Here Palo Alto Networks looked at 11,000 exploits, readily downloadable exploits from the EDB, and checked when those exploits were available
20:22
compared to when the patches from the product vendors were available. And they figured out that 80% of the exploits that people can find on EDB were available before the CVEs were published, which is quite alarming, I must say. Equifax is a particular case.
20:41
Here there were three days between patch availability and the actual data breach happening on March 10th, which is an interesting time. So I think the consequences here are twofold. One is that there are severe problems with the data quality and timeliness of public vulnerability databases.
21:01
And the second is that, due to those small response windows, automation is really a must. There are two approaches to detect vulnerabilities. One is based on metadata, where you basically compare labels: labels given to software components and labels given to vulnerabilities,
21:21
and you try to see whether there is a match, but that is hard because these are human-provided names. One example is OWASP Dependency-Check. They do surprisingly well, considering this fuzzy mapping. They are very lightweight and map against CVE and NVD. The second approach is code-based,
21:41
where you ignore all metadata and only assume that the real truth is in the code. So here we have a method that has been identified by the fix commit. This is the method fixed by the developers of, I think, some Apache project.
22:01
And you find vulnerable code only by searching for this method and checking whether it is in the vulnerable or in the fixed state. These code-based approaches outperform metadata-based approaches in terms of precision and recall, and allow for nice features such as impact assessments and update metrics.
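To make the code-based idea more concrete, here is a deliberately naive sketch. It assumes that the vulnerable and the fixed body of a method are known from the fix commit, and it compares a candidate method against both via a hash of the whitespace-normalised source. This is only an illustration: the function names and the Java snippets are hypothetical, and real tools compare something more robust than raw text.

```python
import hashlib


def fingerprint(method_source: str) -> str:
    """Hash a method body after collapsing all whitespace runs to single spaces."""
    normalised = " ".join(method_source.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def classify_method(candidate: str, vulnerable_body: str, fixed_body: str) -> str:
    """Report whether `candidate` equals the vulnerable or the fixed state."""
    fp = fingerprint(candidate)
    if fp == fingerprint(fixed_body):
        return "fixed"
    if fp == fingerprint(vulnerable_body):
        return "vulnerable"
    return "unknown"  # neither state matched; a human or a smarter tool must decide


# Hypothetical method bodies as they might appear before and after a fix commit.
VULNERABLE = "void render(String s) { out.write(s); }"
FIXED = "void render(String s) { out.write(escape(s)); }"

found_in_dependency = """
void render(String s) {
    out.write(s);
}
"""
print(classify_method(found_in_dependency, VULNERABLE, FIXED))
```

Because no component name or version label is consulted, a renamed or re-bundled copy of the method would still be found, which is the point of ignoring metadata.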
22:21
So here you see a call graph from application methods to the vulnerable method I was showing before. Then one other important topic I find is, and this is a shameless plug for a session we will be giving at EclipseCon in one or two weeks.
22:41
The thing is that there is no high-quality code-level information publicly available. NVD falls short, as we have seen before. And what happened is that tool providers stepped in and started to build proprietary databases about vulnerabilities in open source software.
23:02
That mining of information is labor-intensive, despite some advances in AI-based commit classification. But this leads to the rather paradoxical consequence that the information about open source components itself is not open.
23:22
And because the data is not open, the open source community cannot really develop proper tooling to solve the security problem by themselves. They basically rely on the proprietary tool vendors to share this data, which they do, admittedly, to be fair,
23:42
but they do this drop by drop. You don't have access to the whole data set. And as all the machine learning AI guys can tell, that is what you would need in order to really work and progress in the area. Our approach to this is what we call it
24:01
rather clumsily Project KB, which is meant to overcome this and which is basically a tool and a database to support distributed, collaborative management of vulnerability information for open source components. The next topic relates to,
24:22
I mean, you can say that to get out of this hamster wheel, automation is key. I think this is rather obvious. And what you should really do is scan early, scan often and automate it with the tools I was mentioning before: OWASP Dependency-Check, Steady, and there is also npm audit. So every ecosystem has kind of its own tooling.
24:43
And on top of that, you have the commercial vendors of course. But they only go as far as to the detection of the vulnerable component. And the fixing, this is really left mostly to the developers. There are some tools that create automated pull requests
25:02
for issues that they find in Git repositories. And the study I am mentioning here showed that projects using these automated pull requests indeed patch more often and are more secure than the baseline, but still a lot of dependencies and pull requests
25:22
are not merged because the developers are afraid of breaking changes. And the root cause of this problem is the wrong use of SemVer. Theoretically SemVer is great. If it is properly used, you can rely on minor and patch versions
25:41
not introducing any backward-incompatible or breaking changes. But there are several studies that show that SemVer is not properly used and even minor and patch releases contain a whole bunch of backward-incompatible changes.
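The promise that SemVer makes can be written down as a short check. The sketch below classifies an upgrade as major, minor or patch and says whether it is nominally non-breaking. It is an illustration with hypothetical function names, it only handles plain MAJOR.MINOR.PATCH strings (no pre-release or build tags), and, as the studies mentioned above show, a nominally safe upgrade is not a guarantee in practice.

```python
def parse_version(version: str):
    """Split a plain 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def upgrade_kind(current: str, candidate: str) -> str:
    """Classify the step from `current` to `candidate`."""
    cur, cand = parse_version(current), parse_version(candidate)
    if cand <= cur:
        return "not-an-upgrade"
    if cand[0] != cur[0]:
        return "major"
    if cand[1] != cur[1]:
        return "minor"
    return "patch"


def is_nominally_safe(current: str, candidate: str) -> bool:
    """True if SemVer promises no breaking change for this upgrade.

    Major version 0 is treated as unsafe because the SemVer specification
    allows anything to change during initial development (0.y.z).
    """
    if parse_version(current)[0] == 0:
        return False
    return upgrade_kind(current, candidate) in ("minor", "patch")


print(upgrade_kind("2.3.6", "2.3.7"), is_nominally_safe("2.3.6", "2.3.7"))
print(upgrade_kind("1.9.0", "2.0.0"), is_nominally_safe("1.9.0", "2.0.0"))
```

An automated pull request bot could use such a check to decide which upgrades to propose without review, provided the upstream project really follows the convention.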
26:00
And so this really has to improve before applications can be fixed automatically. The takeaways of this first part on known vulnerabilities are: CVE and NVD have problems with quality, timeliness and coverage.
26:20
So you should really not use them as your only source of information, you or your tools. You will miss something and you will be late. This is not to put the blame on NVD or CVE, because they do their best, but they are heavily underfunded. So the blame goes to the lack of appropriate support
26:42
and funding to build such a public high quality database. And it was important that commercial vendors stepped in, but I strongly believe that the open source community should solve this problem by itself and that requires a public database. And these are the two other takeaways of that part, automated detection and fixing is really needed
27:03
to address this shortened response window, and code-based approaches improve significantly over approaches that rely on metadata. That concludes my first part. I'm running a little bit out of time. Let's see if I make it.
27:21
The second part is on supply chain attacks. And I would like to start this with a nice quote from a security researcher saying that installing code from a package manager has the same level of security as curl site.com | bash. And what I like about the quote is that it nicely illustrates the dilemma of many developers,
27:41
including myself. If somebody tells me I should fetch whatever webpage and execute it in my terminal, I would become suspicious, or I would ask myself some questions, maybe have a look at the script. But in developer mode, when I want to get things done and develop stuff, there's much less hesitation
28:04
to just run npm install or pip install or whatever other command. And why is that so dangerous? Because many package ecosystems come with pre- and post-installation scripts. So if you install a package,
28:21
there is some script being executed with your user on your computer. And not only for the package that you install, but for all the packages that this package depends on. So all the upstream packages. So you potentially happen to execute quite some stuff on your computer if you install a package.
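For the npm ecosystem, the scripts in question are declared in the `scripts` section of a package's `package.json` under the lifecycle names `preinstall`, `install` and `postinstall`. The following sketch lists them for a single manifest so they can be reviewed before installation. The function name and the example manifest are hypothetical, and a complete check would have to repeat this for every transitive dependency.

```python
import json

# npm lifecycle scripts that run automatically during `npm install`.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_scripts(package_json_text: str) -> dict:
    """Return the install-time scripts declared in a package.json document."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    return {hook: scripts[hook] for hook in INSTALL_HOOKS if hook in scripts}


# Hypothetical manifest, for illustration only.
EXAMPLE_MANIFEST = """
{
  "name": "example-package",
  "version": "1.0.0",
  "scripts": {
    "test": "jest",
    "postinstall": "node setup.js"
  }
}
"""

for hook, command in install_scripts(EXAMPLE_MANIFEST).items():
    print(f"{hook}: {command}")
```

npm also offers the `--ignore-scripts` option for `npm install`, which skips these scripts entirely, at the cost of breaking packages that legitimately need them, for example to compile native code.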
28:45
The former quote was from somebody who developed a proof-of-concept worm for the npm ecosystem that would replicate itself. This here is a real example from November 2018, which gained quite some attention
29:02
because of the high number of downloads and the high number of packages that depended on this package, event-stream. What happened is that the alleged attacker wrote an email to the original developer and asked whether he would like to hand over the ownership. And the original developer, who had lost interest in the open source project, kind of agreed,
29:22
which opened all possibilities to the attacker. And this example is also noteworthy because the attack was relatively sophisticated compared to previous ones. So the malicious payload was encrypted and the payload only triggered for certain downstream packages.
29:42
And it would evade detection by only running in production environments and avoiding execution in test environments or build environments. There's an increasing number of such attacks. And this is work we have done together
30:01
with the University of Bonn. So here we looked at 174 malicious packages for which we could obtain the actual code. And so we were looking at the malicious code. And we looked at the different dimensions and problems. So under which conditions would it trigger,
30:24
how the attacker injected the malicious package into somebody's dependency tree and so forth, as well as temporal aspects. And here there's a clear trend for an increase in number. In 2019, in particular, there was a bigger campaign on RubyGems.
30:43
And on the right-hand side, you see the number of days that malicious packages were available in the different ecosystems. And I think on average, a malicious package was available for something like 209 days
31:01
before it was discovered and yanked from the repository. This is an attack tree, which is far too detailed to go through all the nodes and the attack vectors. I just wanted to highlight two things. The most important attack vector is typosquatting. So this is a technique from domain squatting
31:22
applied to open source ecosystems. So the attacker would basically choose a name similar to a well-known name. And my favorite example is a malicious package called mumpy instead of numpy, the Python library for machine learning. And the second most important vector
31:40
was the use of weak or compromised credentials. So basically package maintainers had weak passwords that were stolen, and the attackers uploaded malicious packages to PyPI, npm and so forth. event-stream was a matter of social engineering. The two ingredients that make supply chain attacks
32:03
so, let's say, possible are the trust that users and developers have in packages and, at the same time, the automation introduced by build systems such as Maven
32:21
that take care of dependency resolution, download and installation in an automated fashion. Again, you install one package and the likelihood of installing another 50 or 100 packages is not that small.
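The typosquatting vector mentioned earlier rests on names that differ from a well-known name by only a character or so. A minimal sketch of a name-similarity check, assuming a caller-supplied list of popular package names, could look as follows. The function names and the distance threshold are illustrative assumptions, not the method used in the study described in the talk, and a real check would need to handle many more squatting patterns (swapped separators, added prefixes and so on).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def possible_typosquats(candidate: str, popular: list[str], max_distance: int = 1) -> list[str]:
    """Return the popular names that `candidate` is suspiciously close to.

    `popular` is an assumed, caller-supplied list of well-known package names.
    An exact match is not reported, since that is the genuine package.
    """
    name = candidate.lower()
    return [
        known for known in popular
        if known.lower() != name and edit_distance(name, known.lower()) <= max_distance
    ]
```

For instance, possible_typosquats("mumpy", ["numpy", "requests"]) flags numpy, because the two names are one substitution apart. Such a check is cheap enough to run before a new dependency is added to a build.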
32:43
I think I just have five minutes, so maybe I hurry up a little bit. Actually, I should probably go to the conclusions of this section already in order not to go too much over time. So there is a number of considerations related to trust,
33:04
the implicit trust in the ecosystem. But I think I will really go to the conclusions here. And again, as before, you're invited to go through the slides and contact me for all the details. So here basically
33:23
the takeaway is that you are putting trust in the security capabilities of many people. And one of the examples I was skipping was showing how weak the passwords used by package maintainers are, which puts entire ecosystems at risk.
33:45
The reason for the increase in supply chain attacks is that the number of dependencies of projects increased so much over time, and so did the number of actors and the complexity of the build processes
34:00
and the related infrastructure, which all resulted in a considerable attack surface. There is this noticeable increase in supply chain attacks, and in particular Python, Node.js and Ruby are the primary targets. I suspect the former two in particular because of the presence of these installation scripts
34:22
and installation hooks. But maybe we just don't know: a few ecosystems like Java's Maven Central have, to my knowledge, not been analyzed in a systematic fashion. And if you want to protect against malicious open source components, the two takeaways for me are that all dependencies matter,
34:41
not only the compile or runtime dependencies but also the test dependencies and all the build plugins that you have, because all of that is executed when you compile and build and test your solution and could possibly modify the compiled code
35:00
that ends up on a package repository. And if ever you're going to review open source projects because you're kind of security sensitive and want to know what you're using, it doesn't help much to look at the source code repository. You should really only look at what you download,
35:21
which is sometimes ugly for compiled languages, but looking at the source code will not help you detect supply chain attacks: known vulnerabilities and accidental vulnerabilities, yes, but not supply chain attacks. It is an active field of research which hopefully will yield some results in the near future
35:43
that will then become integrated into the different tooling by the different stakeholders. A few closing remarks, and I'm really sorry that I had to rush a little bit. I still hope my main messages got through.
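The earlier advice to review what you actually download, rather than the source code repository, can be illustrated with a small comparison. The following is a sketch under the assumption that both the downloaded artifact and the repository checkout are available as unpacked directories with comparable layouts (for example an unpacked source distribution). The function names are hypothetical. For compiled artifacts this file-level comparison does not work and something like a reproducible build would be needed instead.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path relative to `root` to the SHA-256 digest of its content."""
    digests: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_artifact_against_source(artifact_dir: Path, source_dir: Path) -> dict[str, list[str]]:
    """Report files that exist only in the artifact or whose content differs from the source."""
    artifact = file_digests(artifact_dir)
    source = file_digests(source_dir)
    return {
        # Files shipped to users that nobody reviewing the repository would ever see.
        "only_in_artifact": sorted(set(artifact) - set(source)),
        # Files present in both places but with different content.
        "content_differs": sorted(
            path for path in artifact.keys() & source.keys()
            if artifact[path] != source[path]
        ),
    }
```

Anything listed in either category is a natural starting point for a manual review, since legitimate differences (generated files, version stamps) and injected code both show up there.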
36:01
What is really missing in terms of supply chain attacks is a comprehensive and comparative study of the effectiveness of different safeguards in the different ecosystems, and then the subsequent gap analysis. So we kind of have best practices and tips and tricks here and there, but we don't really have a good study
36:22
of how much that is solving the problem. I have a selective and opinionated list of technical safeguards. I didn't want to go into organizational safeguards such as training, awareness and so forth,
36:40
but I'm more interested in the technical stuff. But before showing those I wanted to mention a few things. One in particular goes to all, especially the commercial users: I think they deserve more support, both the upstream projects used by commercial vendors
37:00
as well as the infrastructure providers. And a good example is PyPI, which is run by as few as 10 administrators for more than 450,000 package owners and more than 260,000 projects. And it's no surprise that these few people struggle to fix and run after security issues.
37:26
Out of scope of this presentation are very interesting topics, maybe not to work on but to follow from a distance. One is a number of government regulations and standards that will be imposed in the near future on software development organizations.
37:43
An interesting topic always is liability of commercial software vendors. I mean, for all the open source providers we have our open source licenses solving the problem in terms of legal liability but for commercial software vendors this is a dedicated topic. And then there's also this topic of moral responsibility.
38:03
And this became apparent in this event-stream example I was showing before. He had an open source license. He was fine from the legal side, denying all liability for however the project is used. But there was a huge debate in the issue where they discussed the problem.
38:22
And some people basically said, you should have taken care. You cannot just hand over the project to anybody. How could you do this? And there were other people including the developer who handed over the project saying, well, this is a spare time activity. I'm investing hours and days
38:42
and this could not be demanded from him. So I find this a very interesting discussion and open problem if you want. Right, and this is a list of safeguards. The upper part is just standard stuff,
39:00
well known, mostly relatively cheap, and should be applied where possible. I find the lower ones more interesting, and you don't see the upper part because I didn't want you to read through all of those. The lower ones are more interesting because they address one of the big problems, which is that the builds of open source projects
39:21
happen on arbitrary systems. So very often it's some build systems or even developer machines where the binary package is produced that will be uploaded to PyPI or to Maven Central or elsewhere. And this is a huge problem, and it is addressed by these first three mitigations,
39:45
all from a different angle. And the last topic I wanted to put here because there is ongoing research work that suggests that the whole attack surface, or, well, let me start differently, that a good portion of the open source components
40:02
that you pull into your project is actually not needed by your specific application. And so you can just slice it away in order to reduce the attack surface. And that is one technical countermeasure that personally I find very interesting and promising. And I hope there will be some research in this area soon.
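A much coarser version of this idea can already be tried at the level of whole packages: list the declared dependencies that the application never imports. The sketch below does this for Python sources using the standard ast module. The mapping from distribution name to import name is an assumption that the caller has to supply, the function names are hypothetical, and dynamic imports are not detected. The research referred to in the talk goes further and removes unused parts inside components, which this sketch does not attempt.

```python
import ast


def imported_top_level_modules(source: str) -> set[str]:
    """Collect the top-level module names imported by one Python source text."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 excludes relative imports, which refer to the project itself.
            modules.add(node.module.split(".")[0])
    return modules


def unused_dependencies(declared: dict[str, str], sources: list[str]) -> list[str]:
    """Return declared distributions whose import name never appears in `sources`.

    `declared` maps distribution name -> import name (assumed to be supplied by
    the caller, since the two can differ, e.g. PyYAML is imported as yaml).
    """
    used: set[str] = set()
    for source in sources:
        used |= imported_top_level_modules(source)
    return sorted(dist for dist, module in declared.items() if module not in used)
```

A dependency reported here is only a candidate for removal: it may still be needed as a plugin, loaded dynamically, or required by another dependency.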
40:24
That is, and I was really rushing as I feared in the beginning, basically my presentation. Yeah, sorry again for the few examples that I skipped in the supply chain attacks part but I hope you got my main messages and main points.
40:41
Thank you so much. Are there any questions? Thank you, Henrik. No questions, I think; maybe during the breakout session, certainly. Thank you very much. We are a little bit in a hurry now, so I will give the stage to the moderators,
41:02
to Rosaria and Marco, to follow up with the next speech. We just want to repeat that we took notes on all of the questions that appeared in the chat. So our speakers will be available
41:20
at the end of this session for a breakout. So please join us and you can ask all the questions that you want. Thank you. Okay, very good.