
Deduplication on large amounts of code


Formal Metadata

Title
Deduplication on large amounts of code
Subtitle
Fuzzy deduplication of PGA using source{d} stack
Title of Series
Number of Parts
561
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In this talk I will discuss how to deduplicate large amounts of source code using the source{d} stack, and more specifically the Apollo project. The three steps of the process used in Apollo will be detailed, namely:
- the feature extraction step;
- the hashing step;
- the connected components and community detection step.

I'll then describe some of the results of applying Apollo to Public Git Archive (PGA), as well as the issues I faced and how they could have been at least partly avoided. The talk will conclude with a discussion of Gemini, the production-ready sibling project to Apollo, and a look at applications that could extract value from Apollo.

After a quick introduction to the motivation behind Apollo, I'll describe each step of Apollo's process. As a rule of thumb, I'll first describe each step formally, then go into how we did it in practice.

Feature extraction: I'll describe code representation, specifically as UASTs, and from there detail the features used. This will allow me to differentiate Apollo from its inspiration, DejaVu, and to talk a bit about the taxonomy of code clones. TF-IDF will also be touched upon.

Hashing: I'll describe the basic MinHash algorithm, then the improvements brought by Sergey Ioffe's variant, and justify its use in our case.

Connected components/Community detection: I'll first describe the notions of connected components and communities (as in graphs), then talk about the different ways we can extract them from the similarity graph.

After this I'll talk about the issues I had applying Apollo to PGA due to the amount of data, and how I worked around the major issues I faced. I'll then present the results, show some of the communities, and explain in light of these results how the issues could have been avoided and the whole process improved. Finally, I'll talk about Gemini and outline some applications that could be imagined for source code deduplication.
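
To make the three steps above concrete, here is a minimal Python sketch of the overall shape of such a pipeline. It is not Apollo's implementation: it assumes plain token sets instead of UAST-derived, TF-IDF-weighted features, uses basic MinHash rather than Ioffe's variant, compares all pairs of files directly, and stops at connected components without community detection. All function names and the 0.8 threshold are hypothetical choices made for illustration.

```python
import hashlib
import re
from itertools import combinations


def features(source):
    """Stand-in for feature extraction: the set of identifier-like tokens.
    (Assumption: Apollo extracts richer features from UASTs instead.)"""
    return set(re.findall(r"[A-Za-z_]\w*", source))


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted with a seed."""
    data = f"{seed}:{token}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_hashes=128):
    """Basic MinHash signature: the minimum hash value per seeded hash function."""
    return [min((_hash(seed, t) for t in tokens), default=0)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components with a union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def deduplicate(files, threshold=0.8):
    """Return groups of file names whose estimated similarity links them."""
    names = sorted(files)
    sigs = [minhash(features(files[name])) for name in names]
    # Brute-force pairwise comparison: fine for a toy example only.
    edges = [(i, j) for i, j in combinations(range(len(names)), 2)
             if estimated_similarity(sigs[i], sigs[j]) >= threshold]
    return [[names[i] for i in group]
            for group in connected_components(len(names), edges)]


if __name__ == "__main__":
    example = {
        "a.py": "def add(x, y): return x + y",
        "b.py": "def add(x, y):\n    return x + y",
        "c.py": "class Widget: pass",
    }
    print(deduplicate(example))
```

In this toy run the two `add` files share the same token set and end up in one group, while the unrelated file stays alone. The all-pairs comparison is quadratic in the number of files, so it would not scale to a dataset like PGA.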
Transcript: English (auto-generated)