
FASTEN: Scaling static analyses to ecosystems


Formal Metadata

Title: FASTEN: Scaling static analyses to ecosystems
Number of Parts: 490
License: CC Attribution 2.0 Belgium
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: As recent events, such as the left-pad incident and the Equifax data breach, have demonstrated, dependencies on networks of external libraries can expose projects to significant operational and compliance risks, as well as difficult-to-assess security implications. FASTEN introduces fine-grained, method-level tracking of dependencies on top of existing dependency management networks. In our talk, we will present how FASTEN works on top of the Rust/Cargo and Java/Maven ecosystems.
Transcript: English (auto-generated)
Well, thank you all for coming to this session about dependency management. My name is Georgios Gousios. I'm an assistant professor at the Delft University of Technology in the Netherlands. You can find me on the internet using this tag here on Twitter, GitHub, or any other social media of your choice. What I'm going to be talking about is basically a project that we are currently running. What it effectively tries to do is to scale static analysis to the ecosystem level for programming languages.
We're in a room about dependency management, so I don't need to explain a lot about what dependency management is. In any programming language, we can have a dependency on a package, and these packages can have their own dependencies. And, you know, dependency to dependency to dependency: if you're a computer scientist, you immediately understand that this forms a graph. And the graph has certain properties. For example, it is versioned. So every time there is a new version of a package, we have a new version of the graph. The graphs are pretty big for common languages. For example, the Maven repository currently has around 3.5 million artifacts, and an artifact in Maven is a package plus a version. So just in terms of nodes, it's pretty big, and the number of connections can be pretty big as well. What we have been seeing lately is that such graphs are also very fragile.
One thing that basically triggered much of the research community, at least, to do research into ecosystems happened in 2016, when somebody removed a very small package, just 11 lines of code, from NPM. And in response, the whole internet died, basically. It was very, very central in the NPM dependency graph, so there was no resolution that could work for many hours. And, you know, there was quite a bit of trouble there.
The Equifax incident is also related to dependency management. More recent cases of failures: the event-stream incident, where somebody injected code to steal Bitcoin from user wallets; the rest-client incident, which happened like two, three months ago, if I remember correctly. Again, it is a REST API client for Ruby, and somebody injected code to collect login information and post it to another URL, and so on. The list goes on; every week there is a new failure that we need to account for. People have been doing research on that front, and there has been quite a bit of interest lately. Tom will surely be presenting
a lot of his own work on that front. But I have some numbers here. So we did some work in 2017, just before the left-pad incident happened, and we started looking into the dependency graphs. We found that the average JavaScript project had 54 dependencies. Some people replicated our work this year; this grew to 80. And I heard, I think it was Jeff today, that raised this number to more than 100. So the average project on GitHub, or actually the average package on NPM, has around 100 dependencies. And this is growing very, very fast.
Another thing that happens is that those ecosystems change at a very high rate. In our own research, we found that from the transitive dependency closure, that is, all the dependencies that we get by doing a package resolution, 50% change over the course of six months. Why? Because new package versions are released. So even though we don't change our client, the code that we include in our client will change. OK. So it is very important to have an understanding of what's going on in the periphery of our dependencies, not just in our immediate dependencies. We also found that there exist packages in ecosystems like RubyGems, for example, that if we just remove them, and I can tell you which those are and which versions, around 40% of the ecosystem will collapse. So we'll have another left-pad incident.
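The transitive dependency closure the speaker refers to can be computed with a simple breadth-first traversal over the resolved dependency graph. A minimal sketch, with hypothetical package names and a hypothetical `deps` map standing in for the output of a real package resolver:

```python
from collections import deque

def transitive_closure(pkg, deps):
    """Return every package reachable from pkg's direct dependencies."""
    closure, queue = set(), deque(deps.get(pkg, []))
    while queue:
        d = queue.popleft()
        if d not in closure:
            closure.add(d)
            queue.extend(deps.get(d, []))
    return closure

# Hypothetical ecosystem snapshot: a new release of "c" silently changes
# what "app" pulls in, even though "app" itself is unchanged.
deps = {"app": ["a"], "a": ["b", "c"], "c": ["d"]}
print(sorted(transitive_closure("app", deps)))  # ['a', 'b', 'c', 'd']
```

Re-running this after each resolution is what exposes the churn mentioned above: the closure changes even when the client's own manifest does not.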
Another dark side of dependency maintenance is that a large number of core packages is maintained by very few people. What Zimmermann and colleagues found is that just 400 people maintain basically the 10,000 most important, most central packages in the NPM ecosystem. Now, from the consumer side, from the developer side, it is very difficult. Developers complain that it's very difficult to assess the impact of an update. There has been research there, because there is no tooling. So you can update, but you don't know what code you're going to bring in first, and you don't know what the impact of this code, of this update, will be on your own code. OK.
What Raula Kula found in his research is that 85% of the dependencies in Maven are outdated, and this holds even for 50% of the most important packages, the top packages in terms of incoming connections. Even more alarmingly, 70% of the dependencies have some kind of link to a dependency that has a security issue. Why is that? Developers said that they were basically unaware of those security problems. And as a result, vulnerabilities proliferate in ecosystems. There was a technical report by Comcast in 2017 that showed that one fourth of all library downloads actually include a vulnerability in the transitive closure, not necessarily in the library itself. And one third of the top 100k sites include a vulnerable dependency in their JavaScript code base. I am pretty sure that those numbers are a bit alarming. So if we aggregate what researchers have found over the last two, three years that this area has been very active, we can identify, let's say, four basic problems. There is the observability problem: we don't know whether a new dependency was released, and what the impact of this dependency is. And this is also encoded in the update problem: if I update, what will break?
The compliance problem: there was a whole discussion in the morning about that. How do I know that I'm not violating anyone's copyrights? Or, in any case, how do I know that the code I distribute is compatible with the license obligations that I have? And there's also the trust problem.
I have all the data. I have my precious data. How can I trust my precious data to code that somebody else has written on the internet? Somebody else might be a company, but this doesn't actually give any assurance there. There is no assurance. Now on the maintainer side, if you
maintain a library, we have, again, let's say, the maintenance problem. If I update, say, a function signature, how many clients will I break? Because we cannot answer this question, what happens is that people keep stuffing things into the dependency. So the dependencies grow bigger and the code bases become really big. There is also the lack-of-incentive problem. This is a meta problem, I would say. How can we ensure that maintainers that do such an important job are adequately funded by their clients, even though those clients are not explicit and there is no contract? Why should an open source developer respond to issues coming from big companies that are not contributing to the development?
The state of the art in dependency management, well, what people do is that they either lock their dependencies so they don't pull in updates. This might be good, but it also prevents them from pulling in security patches.
Another thing that emerged in the last couple of years is dependency checkers, on GitHub especially, that read the package descriptors and know that there is a new update of a specific library that is included in the package descriptor. Then they propose either a pull request or, in the case of GitHub, a message in the repository. But this is not necessarily a panacea, because the analysis is very high-level.
Just because we include something in our transitive dependency set doesn't mean that we actually use the code. So there are lots of false positives. And beyond that, there is not a lot beyond package version management. We don't help the maintainers to maintain those dependencies.
There is no support for making decisions on which libraries to include in our dependency set. And there is also no support for assessing updates. This is the state of dependency management, at least as it has been documented by researchers over the last two, three years.
Well, we believe that we can do better than that. To get to the root cause of the problem, what we see is that while most dependency management and analysis is done at the package level, the actual use of the dependency happens in the code. So this might look like a dependency tree, or a linear dependency chain in this case. But the actual uses of the dependencies might indicate, for example, that this part of the code is not used at all. So if there is a security bug here, we might want to update, but it's not necessarily important. We believe that this is the root cause: the analysis that we are doing currently is very high-level. So what we propose is, instead of doing analysis
at the package level, to do that at the call graph level. What is a call graph? A call graph is a graph, as the name says, that links function calls across the project. This is as simple as that. Of course, it's a bit complicated to do static analysis in all programming languages
in order to get call graphs, but we are working towards building tools that will allow us to do that. So if we have such a call-based dependency network, as we propose, we can solve quite a few problems
in one go. Perhaps not all the problems. We cannot solve the trust problem, for example, because this is a larger and meta problem, I would say. But many of the problems that we currently face in dependency management, we can do something about those, at least. One of them is, for example, does this vulnerability
affect my code? Because we can have a direct path from a vulnerability in a function down to our code. So if we find a path in the call graph, this means that we are affected and we should update. We can do a more precise impact analysis to help the maintainers. So we can say, for example, well,
if I remove a function argument from a function, how many clients will I break, before I release my update and the clients actually see that I have broken them? If I break lots of clients, perhaps I might decide not to release this update, and perhaps do a function overload instead.
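To make the proposed analysis concrete, here is a minimal sketch of such a reachability check over a call graph. All function and package names are hypothetical; real FASTEN graphs span entire ecosystems and use dedicated graph storage rather than Python dicts:

```python
# A call graph as an adjacency list: caller -> callees. Names are
# hypothetical; "libB.unsafe_eval" stands in for a known-vulnerable function.
call_graph = {
    "client.handle_request": ["libA.parse"],
    "libA.parse": ["libB.decode"],
    "libB.decode": ["libB.unsafe_eval"],  # vulnerable function
    "libB.helper": [],                    # vulnerable-package code we never call
}

def reaches(graph, start, target, seen=None):
    """Depth-first check: is `target` reachable from `start`?"""
    seen = seen or set()
    if start == target:
        return True
    seen.add(start)
    return any(reaches(graph, n, target, seen)
               for n in graph.get(start, []) if n not in seen)

# Package-level analysis would flag any dependency on libB; the call graph
# shows whether our code can actually reach the vulnerable function.
print(reaches(call_graph, "client.handle_request", "libB.unsafe_eval"))  # True
print(reaches(call_graph, "client.handle_request", "libB.helper"))       # False
```

The second query illustrates the false positives of package-level checking: the client depends on libB, but never reaches `libB.helper`.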
perhaps do a function overload. So what we are effectively trying to do is to augment the soundness. By soundness in static analysis, we mean basically the ground truth with more precision. So make the analysis of the reuse more precise.
By changing the unit of analysis. We have done this initial prototype and which we presented last year at Fosden. We have made, basically built call graphs for 70%
of all the cargo packages. Cargo is the package manager for us. It was a very precise attempt to construct these huge call graphs. But the problem was, in our case, it was that the call graph generator we asked was very in a bad shape back then. We're working on it currently.
But it didn't allow us to show the full potential of the approach. Still, it was a very promising prototype. Based on those ideas, we designed and acquired funding for the FASTEN project, which is a big European project with seven partners across Europe that aims basically to implement this idea for Java, C, Python, and Rust. And on top of the actual call graphs, we can run various analyses that will be offered as part of a package manager. Those analyses can be: can I safely update, for example, security bug propagation, and so on. So let's see how this whole thing works, and I can tell you where we are and where we're heading to.
So we get updates from the four repositories that we support, Maven, Debian, PyPI, and Cargo, in a streaming fashion. So we have streams of new package releases. We have call graph generators; those are basically per language. For Java, there are quite a few already out there that are of high quality. For Python, for example, we're developing our own because there is no tooling. For Rust, we're also developing our own. For Debian, we're using existing tools. This gets into two databases. The metadata database will be based on Postgres; it's not fully implemented yet. And we are also building an in-memory, high-performance graph index that will allow fast reachability queries on top of huge graphs. If you take Maven, for example, because it already has millions of package releases, we're expecting billions of nodes and perhaps hundreds of billions of edges. So it's very important to have a custom query layer to do propagation analysis. Then we have this metadata database.
We augment it with data coming from sources that are part of the community. One of them is ClearlyDefined; we're going to have a talk about that afterwards. Another one is GHTorrent, which is a tool that collects all data from GitHub, and we do some analysis on top of that. And we also try to collect vulnerability information from open sources. Snyk is not currently an open source one, but for some reason I have the logo here. The other ones are open. So we're trying to collect this and basically annotate the appropriate location, either a function or a file or a package, with information from all those sources.
Another thing that we're trying to do is to try to come up with a bill of materials, which was a very hot topic today, at least in the meeting rooms that I was. And basically get information about that and annotate our metadata. And this we do by building the actual packages.
And on top of our two databases, we have a custom query layer that allows us to combine data between the two. For example, if you want to get a call graph for five dependencies, you can give it a dependency tree, and it will get individual call graphs for the packages and fit them together in order to come up with a global call graph. The analyses sit on top of the query layer, of course. And there is a REST API that will allow the package managers basically to get information from all those sources. So the streaming sources, which are basically updated as new packages come in,
will also be available to the public. You can take a look at the CodeFeedr project relatively soon, hopefully in a couple of months. Now the question that arises is how we identify a function uniquely. So what is a node in our project? In our project, it's basically a function. So how do we identify a function uniquely across everything? We designed a custom protocol to support that, and we have a grammar to validate those URIs, of course. The core technique that we are using, we call call graph stitching. What we want to do is to only build a call graph per package version.
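As a sketch of what such a unique, URI-based function identifier could look like: the layout below (forge, product, version, namespace, callable) is an approximation, and every value in the example is hypothetical; the actual FASTEN grammar differs in its details.

```python
def make_callable_id(forge, product, version, namespace, callable_sig):
    """Build a FASTEN-style URI naming one function in one package version.
    The exact grammar is defined by the project; this layout is a sketch."""
    return f"fasten://{forge}!{product}${version}/{namespace}/{callable_sig}"

# Hypothetical example: a Maven artifact, pinned to a version, down to
# a single method signature.
uri = make_callable_id("mvn", "org.slf4j.slf4j-api", "1.7.29",
                       "org.slf4j", "Logger.info(String)")
print(uri)
```

The key property is that the identifier pins forge, product, and version, so the same method in two releases gets two distinct nodes in the graph.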
Usually, what call graph generators do is build the whole project first, and then explore it to come up with nodes and draw edges between them. If you scale that up to the Maven scale, for example, you would be analyzing the same thing over and over and over again, because there are lots of packages that are reused: log4j, for example, slf4j, projects like that. So instead of doing that, we analyze packages at the package version level, create a call graph for each, annotate the extension points, and then load and link them together on request. We call this process call graph stitching. This is what our call graphs look like. So we have designed a format to basically exchange call graphs, get the information from the call graph generator,
and put it into our metadata database. But this is part of the open data that we'll be releasing relatively soon. So it records information about the class hierarchy, the graph itself, which is basically an adjacency list, and information about the products that we are analyzing.
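A toy version of such an exchange record, with the three parts just mentioned (product information, class hierarchy, and the graph as an adjacency list). The real FASTEN format is richer, and all field names and values below are assumptions for illustration:

```python
import json

# Hypothetical revision call graph record; real FASTEN JSON differs.
record = {
    "product": "example-lib",
    "version": "1.2.0",
    "cha": {  # class hierarchy: class -> superclasses (hypothetical)
        "example.Child": ["example.Parent"],
    },
    "graph": [  # adjacency list of (caller_id, callee_id) edges
        [0, 1],
        [1, 2],
    ],
}

# Round-trip the record and recover the edge set, as a consumer would.
payload = json.dumps(record)
edges = {(src, dst) for src, dst in json.loads(payload)["graph"]}
print(edges == {(0, 1), (1, 2)})  # True
```

Using integer node ids in the edge list keeps the payload small; the full function URIs only need to be stored once per node.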
Now, how can we use this? One particular pain point that I have described is how we do dependency updates. The prototype that I will be showing here is not currently based on FASTEN, but it will be in the next version. So what people usually do in order to do dependency updates is use Dependabot. This is the default tool on GitHub; it was actually acquired by GitHub a couple of months ago. Pretty successful: two million pull requests and so on. What Dependabot and similar tools do to ensure that updates are safe is run the tests. The description of all those bots says that if you run your test suite, this assures you that the dependency update will be successful. Who thinks that this is a good strategy?
On the one hand, OK, what do you think? Why is it not a good strategy? Sorry? It's not testing transitive dependencies. Is it testing direct dependencies? Are we testing direct dependencies? What do we think?
Oh, we don't have to think a lot, also in the interest of time. We have done this for around 500 projects. What we did is measure function coverage, and by function coverage we mean: if there is a function that makes a call into a dependency, does this line get executed while we're running the tests? So we instrumented the JVM while the tests were running, and we tried to find, from all the dependency calls, how many are executed during a test run. In the case of direct dependencies, it was actually not that bad, to our surprise: around 60% of direct dependency calls are executed by tests. But if you factor in transitive dependencies, it's only about 20%. So of all the paths that lead into transitive dependencies, only 20% are executed. Not a great scenario.
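The measurement itself reduces to a set intersection. A sketch, assuming hypothetical instrumentation output (sets of call sites into dependencies, and the calls actually observed during a test run); the real study instrumented the JVM:

```python
def coverage(dependency_calls, executed_calls):
    """Fraction of call sites into dependencies hit during a test run."""
    if not dependency_calls:
        return 0.0
    return len(dependency_calls & executed_calls) / len(dependency_calls)

# Hypothetical data mirroring the reported 60% / 20% split.
direct = {"libA.f", "libA.g", "libB.h", "libC.i", "libC.j"}
transitive = {"libX.p", "libX.q", "libY.r", "libY.s", "libZ.t"}
executed = {"libA.f", "libA.g", "libB.h", "libX.p"}

print(coverage(direct, executed))      # 0.6
print(coverage(transitive, executed))  # 0.2
```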
Okay, because most of the updates will, statistically, happen in the transitive set, not in your direct dependency set. All right. So what we do is take two dependency versions and do a source-level AST diff, using Spoon. What Spoon allows us to do is come up with a list of precise changes at the function level. So we get information like: in this if statement, the condition changed from X to Z, something like that. So we get a very detailed list of changes. Then we build the call graphs, and we try to see whether, from the functions that changed in the updated dependency set, there is a direct path back to the project that we're analyzing. Okay, this is basically a simple form of reachability analysis.
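The check described above can be sketched as follows: given the set of changed functions from the AST diff and a call graph, find which of the project's entry points can reach a changed function. All project, dependency, and function names here are hypothetical:

```python
def affected_entry_points(call_graph, entry_points, changed):
    """Which of our functions can reach a function changed by the update?"""
    def reachable(start):
        stack, seen = [start], set()
        while stack:
            n = stack.pop()
            if n in changed:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(call_graph.get(n, []))
        return False
    return {e for e in entry_points if reachable(e)}

# Hypothetical project: the AST diff of a patch release reported that
# dep.parse and dep.render changed; only dep.parse is reachable.
call_graph = {
    "app.main": ["app.load", "app.show"],
    "app.load": ["dep.parse"],
    "app.show": ["app.format"],
}
changed = {"dep.parse", "dep.render"}
print(affected_entry_points(call_graph, {"app.main", "app.show"}, changed))
```

Only `app.main` is affected: `dep.render` changed too, but no path from the project reaches it, so that change can be ignored.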
In order to test this idea, we took some pull requests from Dependabot, very fresh ones. We have a project called GHTorrent that collects data live from GitHub, so we mined very recent pull requests from Dependabot. And, you know, what Dependabot tells you is that this version here updated from that one; it's not even a minor, it's a patch version that changed, right? So theoretically, you wouldn't expect a ton of changes. This is the comment that we actually inserted into the Dependabot pull request. What we can see here is that, yes, it is indeed a patch version that changed, but we found that 773 functions actually changed, 84 of which actually affect our own code. Okay. And what we do is that we have information about the paths that were affected, and also the affected functions, which allows the developer to see where the problem in their code is.
And we have also done a bit more research. We tried to evaluate whether Uppdatera, which is the name of the tool that we have developed, actually detects the changes. We introduced artificial changes in the transitive dependency set. In the case of tests, the detection rate was around 40%. In the case of Uppdatera, it was around ninety-something percent, so it's way more precise. The remaining part here basically represents reflection. If you do reflection, we're lost. It's as simple as that. Anyway, my time is up. So I wanted to show you how FASTEN would look in terms of developer workflow. Think about this update scenario that I presented being integrated into pip.
Yeah, let's end it there. So if you want to know more information about the project, it's here. We are also running a survey about dependency management. If you want to help us, this is the URL over there. And we will be around for questions and so on. Thank you. Yes? Are you actually doing binaries? We are doing call graphs for Debian binaries, yes, using CScout and SVM.
One of the examples you mentioned was that you could use the call graph level as a false positive reducer, or for impact analysis when you make a breaking change, so you can tell if you're actually using the part of the library that was updated. Did you also do that when you were analyzing whether or not the downstream packages were testing their dependencies? No, but that's a very good suggestion. We submitted this specific paper like a couple of days ago. Yesterday, when I was putting the slides together, it just crossed my mind, so we didn't do it. And it would change the result. Yeah. I mean, we could add this information in the comment that we make in the Dependabot pull request, and it would be very helpful, actually.
there's all other aspects that you're changing. We don't do that because we expect the compiler to be able to do that in compile languages. So if you compile with a new version, there's an API that breaks, the compiler will tell you, but that's valuable for languages that don't have a compiler. But we don't do that currently.
Yes, at the moment, yes.
Usually, with static analysis, you tend to over-approximate, which will generate some false positives. But for use cases like security, for example, it's better to have false positives than false negatives, because false negatives give a wrong impression to the developers. So the direct answer to your question is that we don't filter anything. We basically let it explode.
The biggest challenge is that call graph generation is not a solved problem. It's actually an unsolvable problem, to be precise, because you can never be 100% sound by definition. So the better the tooling, the more assurance we can give to the developers. For languages like Java, for example, the call graph state of the art is pretty good. Languages like Python or JavaScript, which are completely dynamic, are even worse. There is only so much that we can do with static analysis.
Yes, one plan we have, because we're doing this at a massive scale, is to actually run the tests in all the projects. In many cases, the tests allow projects to execute paths that are affected by reflection. So we'll take those edges and put them into our call graph. Our call graph will be richer by running the tests across all projects. We also plan to actually crowdsource the test running: allow people that use our tooling to upload the part of the call graph that is not in their own code base back to our database. There is only so much we can do with reflection as well. There have been people doing that in Java, with toolkits like Doop, but it is not extremely scalable. And it also relies on heuristics.
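The enrichment step described above amounts to merging dynamically observed call edges into the statically computed graph; at its core it is a set union. A minimal sketch with hypothetical edges:

```python
def merge_dynamic_edges(static_edges, dynamic_edges):
    """Union statically-found call edges with edges observed at run time
    (e.g. calls made through reflection that static analysis misses)."""
    return static_edges | dynamic_edges

# Hypothetical edges: static analysis cannot see the reflective call from
# Plugin.load to Handler.run, but an instrumented test run recorded it.
static_edges = {("Main.start", "Plugin.load")}
dynamic_edges = {("Plugin.load", "Handler.run")}
merged = merge_dynamic_edges(static_edges, dynamic_edges)
print(sorted(merged))
```

In the crowdsourced setting sketched in the talk, the `dynamic_edges` set would be the portion of the graph contributed by users' test runs, uploaded back to the central database.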