
Deduplication on large amounts of code

Formal Metadata

Title
Deduplication on large amounts of code
Subtitle
Fuzzy deduplication of PGA using source{d} stack
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In this talk I will discuss how to deduplicate large amounts of source code using the source{d} stack, and more specifically the Apollo project. The 3 steps of the process used in Apollo will be detailed, i.e. the feature extraction step, the hashing step, and the connected component and community detection step. I'll then go on to describe some of the results found from applying Apollo to Public Git Archive, as well as the issues I faced and how these issues could have been somewhat avoided. The talk will be concluded by discussing Gemini, the production-ready sibling project to Apollo, and imagining applications that could extract value from Apollo. After a quick introduction on the motivation behind Apollo, as said in the abstract, I'll describe each step of Apollo's process. As a rule of thumb I'll first describe each step formally, then go into how we did it in practice. Feature extraction: I'll describe code representation, specifically as UASTs, then from there detail the features used. This will allow me to differentiate Apollo from its inspiration, DejaVu, and talk about the code clone taxonomy a bit. TF-IDF will also be touched upon. Hashing: I'll describe the basic MinHashing algorithm, then the improvements Sergey Ioffe's variant brought, and justify its use in our case. Connected components/Community detection: I'll describe the notions of connected components and communities first (as in graphs), then talk about the different ways we can extract them from the similarity graph. After this I'll talk about the issues I had applying Apollo to PGA due to the amount of data, and how I worked around the major issues faced. Then I'll go on to the results, show some of the communities, and explain in light of these results how the issues could have been avoided and the whole process improved. Finally I'll talk about Gemini, and outline some of the applications that could be imagined for source code deduplication.
Transcript: English (auto-generated)
So this time your talk starts at 30, right? I think so, yeah. Cool, alright, perfect. Cool, so round of applause for Romain.
He's going to be talking about deduplication of large amounts of code. Thanks, so hey, my name is Romain. I'm a former intern at source{d}. And I'm going to talk to you... Wait, wait, wait, wait, wait. Mic. What? Oh, I have to talk in this? Yeah, how are you? Just, wait a second.
What have they done? Oh my God. Okay, in your pocket. And also talk as loud as possible. Because the mic is only for the recording. Okay, let me go real quick.
Alright, you're good. Okay, so I was saying I'm a former source{d} intern. And I'm going to talk about a project I worked on during my time there. It's an open source project called Gemini. And it basically tries to deduplicate large amounts of code. So before talking about how Gemini works and the results I got while applying it,
I'm first going to go over some elements of deduplication. So first off, what are code clones? So code clones at a high level are snippets that share little to no differences. So in natural language usually we can basically say whether two sentences or more are similar
just by looking at syntactical features. So here I highlighted for example exactly similar words. And even if we try to make this problem a bit harder, so for instance by using synonyms like this, today in NLP, so natural language processing, we're able to vectorize tokens in spaces which will yield similar vectors
for words that have similar meaning. However, trying to find similarities between code is a much harder problem, because you usually have syntactical as well as structural information that affects the semantics,
so what happens when you compile the code of the different snippets. There's been extensive research done in this field, which has led to a taxonomy of different kinds of clones. We usually use four types. The first type is basically copy-paste clones, which are identical if you remove all things that are not code in the snippets,
so for instance comments, white spaces, stuff like that. The second type is structurally similar code, so imagine two snippets with different identifiers. Type three is a combination of both with minor changes like additions, insertions, and deletions. And finally type four, which is the hardest to detect,
is semantic clones: snippets that compute the same calculations and output the same result, but do it in different ways. So as an example, here are two functions that do exactly the same thing but do it in completely different ways, with different identifiers.
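The slide's snippets aren't captured in this transcript, but a hypothetical pair like the following illustrates what a type-4 (semantic) clone looks like:

```python
# Two hypothetical functions that are type-4 (semantic) clones: they
# always return the same value, but with different syntax, structure,
# and identifiers.

def sum_upto(n):
    # Iterative accumulation.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def gauss(k):
    # Closed-form arithmetic-series formula.
    return k * (k + 1) // 2
```

No purely syntactic comparison would flag these two as clones, which is exactly the type-4 problem described here.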
Well, just looking at the syntax, one would not be able to judge whether these snippets are the same or not, at least not easily. So I'm going to talk about the baseline approach before talking about Gemini. So this is the DejaVu paper, which was released in 2017,
which tried to deduplicate about 480 million files using a three-level granularity scheme. So the first thing they did was, like I said before, remove all comments and all white space, and just hash each file,
computing a file hash, and then trying to see which hashes are the same and which are not. The second level was, after extracting tokens and creating bags of features, as you can see here, they hashed the strings of tokens, which produced a second hash. And the third level was, using all these extracted tokens,
they used a tool called SourcererCC, which is able to basically tell you that two snippets are similar if they share about 80% of their tokens. So the results they found on applying this on about 400 million files of code was that,
basically, if you look at a file on GitHub, you have about an 80% chance that a similar file exists somewhere on GitHub, which were pretty impressive results. However, from what I just explained to you before, I think you can deduce from these methods
that they can't actually do much more than detect Type 1 and Type 2 clones. Even Type 2 might be a problem if too many tokens are altered, and Type 3 and 4 are probably out of reach. This is why at source{d} we developed a project called Gemini. So I'm first going to outline the main steps we do
when applying this algorithm, and then I'll get more into exactly what we do and how we do it and why it works. So Gemini has four steps. The first step is that out of the dataset, we are going to extract a certain amount of features which are going to be syntactical as well as structural.
I'll get more into depth into what kind of features we use and why we use them. The next step is creating a pairwise similarity graph between all the snippets. So basically in this graph, each node is going to represent a file, and two nodes will be connected if they are similar enough.
Again, I'll get more into depth into what this means. So the third step is out of this huge graph, the most interesting parts of this graph are actually parts of the graph or groups of the graph which consist of files which are all connected one to each other either directly or indirectly through hops.
This is in graph theory. This is called connected components. And we extract this from the graph of similarity that we've created. The final step in Gemini is on each of these connected components, we apply community detection in order to find the final cloned communities.
So I'll go a bit more in depth into why we do that later on. So let's first go on the first step. Extracting syntactical and structural features. As you can imagine, we can't actually extract structural features just from plain text code. To do that, we have to go to a lower representation of code,
namely abstract syntax trees. For those of you who don't know this data structure, it's used in compilers for syntactical analysis, and it looks roughly like something like this. This is clearly a very simplified version of an AST. It's probably not even correct. But one of the properties that each of these nodes have
is an internal type. For instance, here you can see identifiers, statements, declarations, operations. And so we're mostly going to be able to extract structural features from this data structure. Before describing each of the features,
you can imagine that differing languages are going to have different kinds of ASTs. So for those of you who were here before, Vadim talked a bit about this. We developed at source{d} a project named Babelfish, which aims at creating a unified abstract syntax tree.
So, given any file of any language, you would obtain the same data structure, which can then be used in a unified way. This is pretty useful because that way we don't have to create code for each kind of language, and you can just transform everything into universal ASTs
and work on that. First I'm going to go over the two syntactical features that we used, identifiers and literals. Identifiers are basically variable or function names, and literals are values. For instance, here I highlighted them in green, and the identifiers are in blue. I think it's pretty straightforward.
For the structural, we used four kinds of structural features. The two I'm going to talk about are graphlets and children. A graphlet in a graph is basically a given node as well as all its children. As you can see here, we don't actually care for the syntactical data. We don't care that the node is called def, foo, x, or return.
We actually rely purely on the node's internal type. Here I showed you how we extract all the graphlets. The children feature is actually just each node and the number of children nodes that it has. Once you convert this to a weighted bag of features,
it looks something like this. For instance, we have the feature "identifier with no children nodes", which appears three times in this example. As I said, we used four kinds of structural features. The two others are basically just concatenations of different internal types
that we obtained by traversing the ASTs in two different ways, one using a depth first search and the second one using a random walk. I won't get too much into why we used these features, but we thought that we would be able to capture
longer-term structural information of these snippets. We thought that these features were good. However, we don't actually have any metric or any real reason other than intuition that these were good features.
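As an illustration (not the actual Gemini code, which works on Babelfish UASTs), the graphlet and children features can be sketched on Python's own ASTs:

```python
import ast
from collections import Counter

def structural_features(source):
    """Toy 'graphlet' and 'children' feature extraction.

    Only the nodes' internal types matter -- identifiers and literal
    values are deliberately ignored, as in the talk.
    """
    tree = ast.parse(source)
    bag = Counter()
    for node in ast.walk(tree):
        kids = tuple(type(c).__name__ for c in ast.iter_child_nodes(node))
        # Graphlet feature: a node's type plus the types of its children.
        bag[("graphlet", type(node).__name__, kids)] += 1
        # Children feature: a node's type plus its number of children.
        bag[("children", type(node).__name__, len(kids))] += 1
    return bag

bag = structural_features("def foo(x):\n    return x + 1\n")
```

Two structurally identical snippets with different names would produce the same bag, which is what makes these features useful for type-2 clones.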
However, as you'll see a bit later, these actually yielded some pretty good results. As you can imagine, extracting these features from snippets of code actually yields a lot of features. In order to reduce that, we used an NLP technique
which is called term frequency–inverse document frequency (TF-IDF), which basically amounts to calculating a weight for each feature and then thresholding in order to reduce the amount of features. It is basically able to reject very rare features,
so features that only appear maybe once or twice for only one file or one snippet, and it is also able to reject very common features, so features that would appear 10 million times or something like that, so things that are not very discriminative. At the end of the day, feature extraction boils down to this.
First, converting all files to UASTs, then extracting the weighted bag of features of each UAST, applying TF-IDF in order to reduce the amount of features. As you can see, there's a fourth step. This is actually a step that allows us to look more into biased versions of our algorithm.
We could use purely the weighted bags of features as is, but if we wanted, for instance, to look more at the structural differences between code, then we would be able to give more weight to the structural features, and conversely, we could do the same thing for the syntactical features. That was the first step of Gemini.
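A minimal sketch of the TF-IDF weighting step, with made-up feature names; the exact weighting formula and thresholds used in the real pipeline may differ:

```python
import math
from collections import Counter

def tfidf_weights(bags, min_weight=0.0):
    """TF-IDF over bags of features, one bag per file.

    tf: how often the feature appears in this file;
    idf: log of (number of files / number of files containing it).
    Features whose weight falls at or below min_weight are dropped,
    which discards the non-discriminative ones.
    """
    n = len(bags)
    df = Counter()
    for bag in bags:
        df.update(set(bag))
    weighted = []
    for bag in bags:
        w = {f: tf * math.log(n / df[f]) for f, tf in bag.items()}
        weighted.append({f: v for f, v in w.items() if v > min_weight})
    return weighted

bags = [Counter({"ident:foo": 2, "graphlet:X": 1}),
        Counter({"ident:foo": 1, "ident:bar": 3})]
weighted = tfidf_weights(bags)
# "ident:foo" appears in every file, so its idf (and weight) is zero
# and it is dropped; the discriminative features survive.
```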
The second step is the hashing step. We're going to hash the features using an algorithm I'm going to explain in order to create a pairwise similarity graph. At this point, you might be wondering, well, we could simply choose a given distance metric and then apply it to each file in order to compute similarities between files.
However, that doesn't really work. The reason is that it scales quadratically with the amount of files you use. If we're trying to do this at scale, it would basically explode. In order to introduce the algorithm we're going to use,
I'm going to have to explain a couple of concepts. The first thing is the weighted Jaccard similarity: given two sets of features, A and B, the weighted Jaccard similarity is basically just the intersection of both divided by the union. If A and B were equal, then you can see it would be equal to 1,
and conversely, if they were disjoint, to 0. Here I displayed it for integer-weighted features, but it also works for real-valued features: you take the sum of the element-wise minima divided by the sum of the maxima. We use this because it actually has a pretty interesting property, and that is that if we pick a random permutation over our set of features
and apply it to A and B, then hash all the elements that were permuted and select the smallest hash, so the min hash, then the probability that these values will be the same is equal to the Jaccard similarity of both snippets.
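A small sketch of both ideas, with made-up feature bags. The real pipeline uses Sergey Ioffe's weighted MinHash to handle real-valued TF-IDF weights; here, integer weights are simply expanded into repeated elements, which yields the same collision property:

```python
def weighted_jaccard(a, b):
    """Sum of element-wise minima over sum of element-wise maxima."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    union = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return inter / union

def minhash_signature(bag, k=400):
    # Expand integer weights into repeated elements, then take, for
    # each of k hash functions, the smallest hash over all elements.
    items = [f"{f}#{j}" for f, w in bag.items() for j in range(w)]
    return [min(hash((i, x)) for x in items) for i in range(k)]

# Hypothetical feature bags.
a = {"f1": 2, "f2": 1, "f3": 1}
b = {"f1": 1, "f2": 1, "f4": 2}

sa, sb = minhash_signature(a), minhash_signature(b)
estimate = sum(x == y for x, y in zip(sa, sb)) / len(sa)
# The fraction of matching signature positions approximates
# weighted_jaccard(a, b), which here is exactly 1/3.
```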
What's interesting with this is that since it's a random permutation, if we actually pick the second or the third or the fourth or the nth smallest element, then this still holds. If we take k values, creating what we call a min-hash signature, and concatenate the signatures into what we call a min-hash signature matrix,
then we have the interesting property that for any two columns and each row, the values will be the same with probability equal to the Jaccard similarity of the files associated to those columns. From this, what we're going to do is cut our matrix into bands,
so b bands of r rows each, and we're going to call a candidate pair any two files which hold the same values in at least one band. We're now going to calculate quickly, don't worry, the probability of two snippets being a candidate pair.
If we call S the similarity between two snippets, then the probability that the signatures will be the same in one band is just S to the power of R. If we take the opposite event, that is, that they differ in that band, then we can calculate this. If we do this for all bands, since we have B bands, we just raise that to the power of B, and if we take the opposite event,
that is, the event that these two snippets are a candidate pair, you get this. For those of you that don't often do math or haven't done any for a while, this might seem a bit abstract, but basically, this function, regardless of R and B, has this shape,
so what we call an S-curve. Now, if we choose R and B appropriately, we actually get this kind of curve, which is exactly what we want, because this curve means that if we have a candidate pair, then with high probability, they will have a Jaccard similarity over a certain threshold that you can choose beforehand.
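The candidate-pair probability just derived is a one-liner; the r and b values below are example choices, not the ones used in the talk:

```python
def candidate_probability(s, r, b):
    """P(candidate pair) under LSH banding: one band of r rows agrees
    with probability s**r, so at least one of b bands agrees with
    probability 1 - (1 - s**r)**b -- the 'S-curve'."""
    return 1 - (1 - s ** r) ** b

# With r=10 and b=40 the curve rises steeply around s ~ 0.69:
for s in (0.4, 0.6, 0.8, 0.95):
    print(f"s={s:.2f} -> P(candidate)={candidate_probability(s, 10, 40):.3f}")
```

Choosing r and b therefore amounts to choosing where the jump of the S-curve sits, i.e. the similarity threshold.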
Here, this is an example with, I think, a threshold which might be about 0.95 for the similarity. At the end of the day, this basically allows us to compute the similarity graph by simply creating signatures for each of these snippets, then selecting a threshold and deducing the constants R and B,
and then on each sub-signature of each file, hashing these sub-signatures and putting them into buckets. And then, well, simply any snippets that land in at least one bucket in common are going to be our candidate pairs, which, theoretically,
are going to be over a certain threshold of similarity. The next step is pretty straightforward. It's the extraction of the connected components. It basically looks something like this. Hopefully, you can see the edges, but that's not great. You can't really see the edges, but I imagine that these are all connected components.
I haven't talked too much about this step, the community detection step. Why do we do this? You could imagine that all snippets in one connected component would be clones. However, in practice, this isn't true, because if you take three snippets, for example, A, B, and C, and imagine that A and B hashed to the same bucket
and B and C hashed to the same bucket, let's take a threshold of 0.8, then in the worst case, even though there's a similarity of 0.8 between A and B and B and C, you'd only have maybe a 0.64 similarity between A and C.
If you increase the amount of hops you have to do, the similarity is going to drop. This is why we do the community detection. It looks like something like this. You can see it's kind of put into the same groups, files that are relatively close depending on the structure of the edges.
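The connected-components step can be sketched with a simple union-find over the candidate pairs; the edges below are made up, and the separate community-detection pass (which this sketch omits) would then split components further:

```python
def connected_components(edges):
    """Union-find over the candidate-pair edges of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# A-B and B-C are candidate pairs at threshold 0.8, but A and C may be
# only ~0.64 similar: one component is not necessarily one clone group,
# hence the extra community-detection pass described above.
comps = connected_components([("A", "B"), ("B", "C"), ("D", "E")])
```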
That was it for the Gemini algorithm. Now I'm going to talk a bit about the results I got while applying it. I used Public Git Archive. Public Git Archive is the largest dataset of Git repositories in the world. It's maintained by source{d} and is basically created from all the repositories on GitHub
which have over 50 stars. It has a neat property that I didn't use in my case: it includes all of the commit history. But as I said, I didn't use that, and I only took the heads, which amount to 54.5 million files.
I was also restricted with files from which I could extract unified ASTs, which meant that I could actually only use at the time when I did this, files stemming from five different languages, so Python, Java, JavaScript, Ruby, and Go. I decided to apply the pipeline on all five languages at once
because I wanted to see whether the algorithm would be able to detect the structural differences between different languages. When I applied this pipeline, this is what I got. For those of you that know Spark, you know what I'm talking about.
Unfortunately, with these kinds of logs, usually they hide a certain amount of problems. Without going too much into detail, I had to face a certain amount of challenges, for example, corrupt files, parsing errors from Babelfish, GPUs not responding for no reason at all.
During the TF-IDF step, at some point there was basically infinite garbage collection, so my tasks never finished. I also encountered some GitHub repositories made by very clever engineers which were designed to not be cloneable.
We were able to clone them; however, they had a lot of structure: imagine a folder with ten subfolders, and each subfolder has ten more, etc. So, yeah, an exponentially increasing size.
These were actually pretty hard to process, and at the end of the day, I was only able to process 7.8 million files. As you can see, most of them were JavaScript and Java files. I still had a good amount of other Python, Ruby, and Go files. Out of all of these, I was able to extract about 6.2 million distinct features.
Looking more into the features we extracted, most of them were the syntactical features, as you can see, so identifiers and literals, which accounted for about 75% of all the features I extracted. However, when looking at each file individually,
I saw that they each had about a thousand features, and most of them were actually, this time, structural features. This was an average over files of all languages, but it was about the same when looking at files of specific languages.
Next, what I did was apply the hashing with two different thresholds: the 80% threshold, which was also used in DejaVu, as well as a 95% threshold. As could be expected, there is a much bigger number of connected components for the 95% threshold,
but those are actually just connected components of only one file. If you look at the ones which have more than one file, well, there are actually more for the 80% threshold, about 10% more. The connected components are also a bit larger, so on average they had about one more file per connected component.
Looking in more detail at the size of the connected components, as you can see from this log graph, most of the connected components actually had a relatively small amount of files,
with at most for the 80% threshold, a thousand, maybe a couple of thousand files in one connected component. One thing I said earlier was that I applied my pipeline on files of all languages, in order to see if they were able to be separated using our algorithm.
That was mostly the case, apart from one community which was actually pretty huge, which had, if I recall, about 800,000 files, and which had the files from all five languages. When I looked a bit more into this, I found that it was because all of these files were very short
and had very similar syntax. So, what I did was I reapplied my pipeline, adding some bias towards the structural features, and this large community actually exploded into monolingual communities at the end, I mean connected components. So, looking at each kind of connected components per language,
you can see, for connected components with over one file, that we have a big difference between JavaScript and Go, which have many more files in these kinds of connected components than Java, Ruby, and Python. So, that can be explained intuitively.
For example, Java, which is a much more object-oriented language, would probably have files which are much more distinct. Conversely, in Go, for example, we use a practice which is called vendoring, so it's kind of logical that a lot of the same files would appear, hence these results.
So, then I applied the community detection with the Walktrap algorithm. As you can see, looking purely at the connected components with over one file, the amount of connected components was about the same. However, when doing the community detection, we can see that there was a huge increase,
about 50% of the number of communities detected when going from the 95% threshold to the 80%. Okay, so now I'm going to show you a couple of examples. Before that, you might notice that up until now I haven't used any metrics, which is kind of odd in a machine learning problem.
That's because this is actually an unsupervised problem. As you can imagine, I wasn't going to look at each of the 7.8 million files to see, okay, this one is 80% similar, this one is not; that would have been a bit hard. So, in order to judge whether this work was relevant or not, I had to look at communities and connected components individually,
which was how I judged my work. So, as I was saying before, there were a lot of connected components which had a lot of Go files, and one thing that was interesting was that a certain amount of these connected components actually only contained files with a single file name.
So, for instance, this is a connected component with only files which are called textpulses.go. So, the white nodes are actually not files here. They are representations of the buckets. But, however, as you can see, or maybe you can't see from afar,
but there are actually four communities, except that two of them are minuscule here and here. But, as you can see, especially for the blue community that was detected, there are very strong links between each file. However, this doesn't mean that each of our connected components were only formed of files with one file name.
So, looking at this larger connected component, which only has a Ruby file, you can see that we detected three communities, and looking more at the file names inside these connected components, you can see that although they do seem to cluster together, it doesn't actually mean that the community is formed of only files sharing the same file name.
So, this is the final example I'm going to talk about. So, this was actually a pretty interesting thing because it was one of the larger connected components I found, and although the files in this actually stem from three projects,
only 25 files came from two of the projects, and the 4,775 other files all came from the third one. So, as you can see, that's a lot of duplication for only one project, and when looking at the GitHub repository of the Microsoft Azure SDK,
you can actually see that there's this very repetitive kind of structure with a lot of files in common. As you can see, we were able to detect that, and I got this representation when I created my graph in Gephi. So, that was it for me.
I hope you liked this talk. I liked working on this, and I found that the results were very interesting. If you want to look more into the results, you can go on the source blog. I created a blog post which goes much more in-depth into different aspects of the results I found on applying this.
If you want to check out the code and possibly use it, you can look at the source{d} gemini project. And, if you are interested in the PGA dataset, well, you can also download it for free. Although, yeah, it's a bit big, so you might want to set aside about three terabytes of storage. Thanks.
So, we have some time for questions. Focus on duplication. Is the threshold too roughly adjusted because it's really hard to measure?
Yeah. So, the question was whether this technique could be used for refactoring code or if it was much more aimed towards deduplication. So, I think, I'm not sure what you mean by that question.
I think it would be able to find code which is similar if you only focused on structural features. So, in that sense, if you applied it to, let's say, one project in order to find all of the structurally similar files, it might be applied to refactoring. I don't really think that it would necessarily be impacted by the roughness of the estimate
because if you give enough weight to the features and if you apply TF-IDF with a relatively low threshold, then you can keep enough features for, I think, the similarity to be relevant in your case.
So, probably, yeah, it would be interesting to see.