Tools for large-scale collection and analysis of source code repositories
Formal Metadata
Title | Tools for large-scale collection and analysis of source code repositories |
Subtitle | Open source Git repository collection pipeline |
Author | Alexander Bezzubov |
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. |
Release Date | 2018 |
Language | English |
00:05
So, continuing this part of the lightning
00:08
talks, where we talk about Apache Software Foundation projects, we have Alexander Bezzubov, who will be talking about visualization of big data.
00:23
Let's test it. Hey, hey. Okay, can you hear me? Okay. So, I think this is a bit different from what was just announced: I used to work on visualization for big data, but this particular talk is about tools for large-scale collection and analysis of source code repositories. It's something I've been working on for the last half-year, and it's exciting, maybe even more exciting than the visualization part. So, my name is Alex.
00:51
I'm an engineer at source{d}, and also a committer and PMC member at the Apache Software Foundation. source{d} is a start-up in Madrid that I joined recently; it's very cool, and all the things we work on there are open source. I'm going to talk about some tools that I and my colleagues built during the day job.
01:12
So, we collect a lot of source code. But why? It's twofold: one, it's research material for academia, and two, it's the fuel for data-driven products built on top of source code, which is a rapidly evolving area of building better tooling to write programs, so it's quite exciting. But first you need to get the data, and that's what we're going to talk about: an open source pipeline that you can use on premises to collect a lot of Git repositories, because Git is the most popular version control system and it's the source of truth for source code. The collection pipeline is pretty standard: a crawler, distributed storage, and parallel processing. After you store all of that, you probably want to go through it and figure something out, so we'll briefly go through the tech stack. The takeaway of this talk is that if you are interested in large-scale data collection, there are awesome existing open source tools, and there are some new ones that I wanted to share today. The things with red or gray boxes around them are the things that we built at source{d}, and we're going to go through all this tech one by one. To run the software you need some hardware, and on the infrastructure side we have a dedicated cluster, which is, I think, what's called immutable infrastructure these days: machines that are provisioned from boot with CoreOS and eventually become part of a Kubernetes cluster where you can schedule your applications. It's very nice and automated, and there's going to be a detailed talk about that at Configuration Management Camp on Tuesday in Ghent, if somebody is up for learning more details about the infrastructure part. On the collection part: given the machines, we want to get some Git repositories. It consists of two parts: getting the URLs of the repositories, and then actually cloning them. We'll focus on Git, since it's the most popular thing, and we need to speak the Git protocol to be able to do that. We implemented a custom implementation of the Git protocol and storage format, called go-git; there was a talk about it last year. It's written in the Go language; it's a pure Go implementation, one of the few complete implementations of Git.
03:34
It's very extensible, and that's interesting: you can do a lot of things, like store objects in memory when you want to, add custom protocols, or store things in a database if you want. That's what we use for the cloning part.
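As a rough illustration of that pluggable-storage design, here is a minimal sketch that uses go-git's documented in-memory storage to clone a repository without touching disk; the repository URL is just an example.

    package main

    import (
        "fmt"

        git "gopkg.in/src-d/go-git.v4"
        "gopkg.in/src-d/go-git.v4/storage/memory"
    )

    func main() {
        // Clone into an in-memory storer instead of the filesystem;
        // any other storer implementation could be swapped in here.
        repo, err := git.Clone(memory.NewStorage(), nil, &git.CloneOptions{
            URL: "https://github.com/src-d/go-git",
        })
        if err != nil {
            panic(err)
        }

        head, err := repo.Head()
        if err != nil {
            panic(err)
        }
        fmt.Println("HEAD is at", head.Hash())
    }

The same program can write to a bare on-disk repository or a custom backend by swapping the storer, which is what makes tricks like the rooted repositories described below possible.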
03:54
Then there are two separate programs: one to find the URLs of Git repositories and store them in a database, and a second one to schedule their cloning; these are called rovers and borges. Well, there are hundreds of millions of Git repositories, so we want to be space efficient, and there are some nice tricks you can do, like storing forks together in a single Git repository, because you have an extensible Git library. That's the concept of the rooted repository depicted in the middle: basically, repositories whose histories start from the initial commit with the same hash get stored in one single big repository, which works really nicely. So, you've collected URLs and you've collected Git repositories; most probably you want to store all that somewhere, and it has to be distributed. For URLs it's just a Postgres database, and we wrote a custom ORM implementation in Go, which is type-safe and quite nice, called kallax. For repository storage it's HDFS, which works very well but scales linearly with the number of files in it, so you want to minimize the number of files; we do that with a custom archive format implementation, so every rooted repository ends up being a single file inside HDFS.
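To make the rooted-repository idea concrete, here is a hypothetical sketch (not the actual borges code) that uses go-git to find a repository's parentless initial commits, the hashes a rooted repository would be grouped under.

    package main

    import (
        "fmt"

        git "gopkg.in/src-d/go-git.v4"
        "gopkg.in/src-d/go-git.v4/plumbing"
        "gopkg.in/src-d/go-git.v4/plumbing/object"
    )

    // rootCommits returns the hashes of parentless (initial) commits.
    // In the rooted-repository scheme, forks that share an initial
    // commit are stored together under that hash.
    func rootCommits(repo *git.Repository) ([]plumbing.Hash, error) {
        iter, err := repo.CommitObjects()
        if err != nil {
            return nil, err
        }
        var roots []plumbing.Hash
        err = iter.ForEach(func(c *object.Commit) error {
            if c.NumParents() == 0 {
                roots = append(roots, c.Hash)
            }
            return nil
        })
        return roots, err
    }

    func main() {
        repo, err := git.PlainOpen(".") // any local repository
        if err != nil {
            panic(err)
        }
        roots, err := rootCommits(repo)
        if err != nil {
            panic(err)
        }
        fmt.Println("root commits:", roots)
    }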
05:10
This is the example of our custom archive format implementation, called siva: a seekable, indexed, verifiable archive format. It is append-friendly, so you can write a repository once and then append to the archive after new clones or fetches happen, and it all gets stored in HDFS.
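A minimal sketch of producing such an archive with the go-siva library, assuming its tar-like writer API (siva.NewWriter and WriteHeader, as described in the project's documentation); the entry name and contents are made up.

    package main

    import (
        "os"
        "time"

        siva "gopkg.in/src-d/go-siva.v1"
    )

    func main() {
        f, err := os.Create("repo.siva")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Entries are appended and the index is written at the end,
        // which is what makes the format append-friendly.
        w := siva.NewWriter(f)
        defer w.Close()

        if err := w.WriteHeader(&siva.Header{
            Name:    "objects/example",
            Mode:    0644,
            ModTime: time.Now(),
        }); err != nil {
            panic(err)
        }
        if _, err := w.Write([]byte("hello siva")); err != nil {
            panic(err)
        }
    }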
05:32
After it gets stored, you want to process it somehow, and Apache Spark is a good way to do batch processing on a cluster of machines; Spark's API is something useful that people understand and know how to work with. We built a custom library to expose those Git repositories at the Spark API level; it's called engine. It exposes references, commits, files and so on in Spark terms, be that in Python or Scala, which is super nice, and it talks through gRPC interfaces to the more advanced stages of analysis of the source code, if you want to do that.
06:07
Here's an example of the usage of that library; I'm not sure if you can see it. With a simple pipeline like that, you express the extraction of references, taking the HEAD reference and then getting all the files, and for every file you do something like detecting its language, and so on. It's quite concise, and it's really great that both engineers and machine learning people can use it, because there are Python and Scala APIs for it.
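The slide is not reproduced here, but the same pipeline shape (take HEAD, list the files, detect each file's language) can be sketched for a single repository with go-git plus enry; this is a standalone illustration, not the engine/Spark API itself.

    package main

    import (
        "fmt"

        enry "gopkg.in/src-d/enry.v1"
        git "gopkg.in/src-d/go-git.v4"
        "gopkg.in/src-d/go-git.v4/plumbing/object"
    )

    func main() {
        repo, err := git.PlainOpen(".") // any local repository
        if err != nil {
            panic(err)
        }
        // Take the HEAD reference and resolve it to a commit.
        ref, err := repo.Head()
        if err != nil {
            panic(err)
        }
        commit, err := repo.CommitObject(ref.Hash())
        if err != nil {
            panic(err)
        }
        // Walk every file of that commit and detect its language.
        files, err := commit.Files()
        if err != nil {
            panic(err)
        }
        err = files.ForEach(func(f *object.File) error {
            content, err := f.Contents()
            if err != nil {
                return err
            }
            fmt.Printf("%s -> %s\n", f.Name, enry.GetLanguage(f.Name, []byte(content)))
            return nil
        })
        if err != nil {
            panic(err)
        }
    }

Roughly speaking, engine expresses this kind of loop as a distributed Spark query across all stored repositories.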
06:40
Well, after just iterating through files, you most probably want to do some more advanced stuff, and there are two projects besides that which we have built. One is enry, a programming language detector. It's a rewrite of parts of GitHub's linguist, the Ruby tool GitHub uses to show you the distribution of languages across your repositories; the one we use is written in Go, is compatible with linguist, and is faster: basically from four to twenty times faster in our measurements than the original Ruby one.
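A minimal sketch of enry's documented detection API; the file names and snippets are made up.

    package main

    import (
        "fmt"

        enry "gopkg.in/src-d/enry.v1"
    )

    func main() {
        // Detection combines filename-based and content-based strategies.
        fmt.Println(enry.GetLanguage("server.py", []byte("import os\nprint(os.name)\n")))

        // Content-only detection still works when no filename is available.
        fmt.Println(enry.GetLanguage("", []byte("#!/usr/bin/env bash\necho hi\n")))
    }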
07:06
Another one is the Babelfish project, which is a bit different, because first of all it's a scalable parser infrastructure: it wraps native parsers inside containers that you can schedule across the cluster, and it exposes a uniform gRPC API, so you can extract a lot of information from source code in a very uniform fashion. It has drivers for many different languages, and something called the universal abstract syntax tree (UAST), which is the native syntax tree of the language annotated with language-independent things that you might be interested in at later analysis stages.
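A rough sketch of talking to a Babelfish server from Go, assuming a server on the default localhost:9432 and the v2 client API (NewClient and NewParseRequest); the parsed snippet is made up.

    package main

    import (
        "fmt"

        bblfsh "gopkg.in/bblfsh/client-go.v2"
    )

    func main() {
        // Connect to a running Babelfish server (typically a container).
        client, err := bblfsh.NewClient("localhost:9432")
        if err != nil {
            panic(err)
        }

        // Ask the Python driver for the UAST of a small snippet.
        res, err := client.NewParseRequest().
            Language("python").
            Content("def add(a, b):\n    return a + b\n").
            Do()
        if err != nil {
            panic(err)
        }

        // res.UAST is the annotated, language-independent tree.
        fmt.Println("top-level UAST nodes:", len(res.UAST.Children))
    }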
07:54
That's the high-level overview, and there are some future things that we want to get done. For example, on the Kubernetes side, having a bare-metal cluster with persistent storage is not a really easy thing to achieve. On the collection side, there's the concept of a staged event-driven architecture, from a research paper on how to make a scalable system that dynamically saturates its resources by having queues in between the stages, which is very interesting if you want to clone 100 million repositories. On the processing side, we're looking into adding distributed indexes to speed up the Apache Spark queries that we have. And on the analysis part, there are more advanced things, like how do you diff an abstract syntax tree, or how do you extract cross-language information from abstract syntax trees, which we are looking at in the Babelfish project.
08:57
Any questions? Right, so the question was whether we look at the READMEs. The thing is, the collection pipeline I described is generic, so you get everything: you get the full repositories. The project that detects the language doesn't do anything beyond what GitHub is already doing; it does the same job. But there is some research in that area which is quite cool, coming from the natural language processing and machine learning side: how do you tell which language it is, basically classifying the language from the source code alone? We've got pretty good results. It's not inside enry yet, it's more in the research stage, but I think it has a lot of potential, hopefully even to be merged upstream into GitHub's implementation so it does a better job. That's a good one. Any other questions? Okay, thank you so much. [Applause]
