We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Deduplication on large amounts of code

Formal Metadata

Title
Deduplication on large amounts of code
Subtitle
Fuzzy deduplication of PGA using source{d} stack
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In this talk I will discuss how to deduplicate large amounts of source code using the source{d} stack, and more specifically the Apollo project. The 3 steps of the process used in Apollo will be detailed, ie: - the feature extraction step; - the hashing step; - the connected component and community detection step; I'll then go on describing some of the results found from applying Apollo to Public Git Archive, as well as the issues I faced and how these issues could have been somewhat avoided. The talk will be concluded by discussing Gemini, the production-ready sibling project to Apollo, and imagining applications that could extract value from Apollo. After a quick introduction on the motivation behind Apollo, as said in the abstract I'll describe each step of Apollo's process. As a rule of thumb I'll first describe it formally, then go into how we did it in practice. Feature extraction: I'll describe code representation, specifically as UASTs, then from there detail the features used. This will allow me to differentiate Apollo from it's inspiration, DejaVu, and talk about code clones taxonomy a bit. TF-IDF will also be touched upon. Hashing: I'll describe the basic Minhashing algorithm, then the improvements Sergey Ioffe's variant brought. I'll justify it's use in our case simultaneously. Connected components/Community detection: I'll describe the connected components and community notion's first (as in in graphs), then talk about the different ways we can extract them from the similarity graph. After this I'll talk about the issues I had applying Apollo to PGA due to the amount of data, and how I went around the major issued faced. Then I'll go on talking about the results, show some of the communities, and explain in light of these results how issues could have been avoided, and the whole process improved. Finally I'll talk about Gemini, and outline some of the applications that could be imagined to Source code Deduplication.