Tools for large-scale collection and analysis of source code repositories

Video in TIB AV-Portal: Tools for large-scale collection and analysis of source code repositories

Formal Metadata

Tools for large-scale collection and analysis of source code repositories
Open source Git repository collection pipeline
Title of Series
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

So, continuing this part of the lightning talks, where we talk about Apache Software Foundation projects, here is Alexander Bezzubov, who will be talking about visualization of big data.
Testing... hey, hey. OK, can you hear me? OK. So I think it's a bit different from what has just been announced: I used to work on visualization for big data, but this particular talk is about tools for large-scale collection and analysis of source code repositories. That's something I've been working on for the last half-year, and it's exciting, maybe even more exciting than the visualization part. So my name is Alex and, well, I'm
an engineer at source{d} and also a committer and PMC member at Apache. source{d} is a start-up in Madrid that I joined recently. It's very cool, and all the things that we work on there are open source. So I'm going to talk about some tools that I and my colleagues built during the daytime job. So, well, we
collect a lot of source code. But why? It's twofold: one, it's research material for academia, and two, it's the fuel for data-driven products on top of source code. It's a rapidly evolving area of building better tooling to write programs, so it's quite exciting. But first you need to get the data, and that's what we're going to talk about: an open source pipeline that you can use on premises to collect a lot of git repositories, because, well, git is the most popular version control system, and that's the source of truth for source code.
The collection pipeline is pretty standard: a crawler, distributed storage, and parallel processing, because after you store the data you probably want to go through it and figure something out. So we'll briefly go through the tech stack. The takeaway of this talk is that if you are interested in large-scale data collection, there are awesome existing open source tools, and there are some new ones that I wanted to share today. The things that have red or gray boxes around them are the things that we built at source{d}, and we're going to go through all this tech one by one.
To run the software, well, you need some hardware. On the infrastructure side we have a dedicated cluster, which is, I think, what is called immutable infrastructure these days. Basically, there are machines that, from boot, are provisioned with CoreOS and eventually become part of a Kubernetes cluster that you can schedule your application on. It's very nice and automated. There's going to be a detailed talk about that at Configuration Management Camp on Tuesday in Ghent, if somebody is up for learning more details about the infrastructure part.
On the collection part: so, we've got machines, and we want to get some git repositories. It consists of two parts: getting the URLs of the repositories and then basically cloning them. We focus on git, as that's the most popular thing, so we need to speak the git language to be able to do that. We
implemented a custom implementation of the git protocol and storage format, called go-git. There was a talk about it last year. It's written in the Go language; it's a pure Go implementation, one of the big five implementations of git.
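To give a feel for what implementing the git storage format involves, here is a minimal sketch in Python using only the standard library. It is not go-git code, and the function names are made up for illustration. It relies on one well-documented detail of the format: a blob is named by the SHA-1 of a short header (`blob <size>\0`) followed by the content, and a loose object is that same byte string compressed with zlib.

```python
import hashlib
import zlib
from typing import Tuple


def _with_header(kind: bytes, content: bytes) -> bytes:
    # Git prefixes every object with "<type> <size>\0" before hashing/storing.
    return kind + b" " + str(len(content)).encode("ascii") + b"\x00" + content


def git_blob_id(content: bytes) -> str:
    """Return the object id git assigns to a blob with this content."""
    return hashlib.sha1(_with_header(b"blob", content)).hexdigest()


def encode_loose_blob(content: bytes) -> bytes:
    """Return the bytes of a loose blob object (header + content, zlib-compressed)."""
    return zlib.compress(_with_header(b"blob", content))


def decode_loose_object(data: bytes) -> Tuple[str, bytes]:
    """Decompress a loose object and split it into (type, content)."""
    raw = zlib.decompress(data)
    header, _, body = raw.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body
```

For example, `git_blob_id(b"")` gives `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, the well-known id of the empty blob. A full implementation such as go-git also has to handle trees, commits, tags, packfiles and the transfer protocols, none of which is shown here.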
go-git is very extensible, which is interesting: you can do a lot of things, like store things in memory when you want to, add custom protocols, or store things in a database if you want to. So that's what we use for the cloning part. And then there are two separate programs:
one to find the URLs of git repositories and store them in a database, and a second one to schedule their cloning. These are called rovers and borges. Well, there are hundreds of millions of git repositories, so we want to be space efficient, and there are some nice tricks you can do, like storing forks together in a single git repository, because you have an extensible git library. That's the concept of a rooted repository, depicted in the middle: basically, repositories that share history and start with an initial commit of the same hash get stored in a single big repository, which works really nicely.
So, well, you've collected URLs and you've collected git repositories; most probably you want to store them somewhere, and it's got to be distributed. For URLs it's just a Postgres database. We wrote a custom ORM implementation in Go, which is type safe and quite nice, called kallax. For repositories it's HDFS, which works very well, but its overhead scales linearly with the number of files in it, so you want to minimize the number of files. Basically, we do that with a custom archive format implementation, so every rooted repository ends up being a single file in HDFS. This is
the example of the custom archive implementation, called siva. It's a seekable and indexed format, and it's block-based, so this way you can fetch a repository once and then append to it after new clones or fetches happen. It's all stored in HDFS. So, after it
gets stored, you want to process it somehow. Well, Apache Spark is a good way to do batch processing on a cluster of machines. Spark is something useful that people understand and know how to use, so we built a custom library to expose those git repositories at the Spark API level. It's called engine, and, well, it exposes references, commits, files and so on in your Spark language, be that Python or Scala, which is super nice. It talks through gRPC interfaces to more advanced stages of analysis of the source code, if you want
to do that. Here's an example of usage of that library; I'm not sure if you can see it. With a simple pipeline like that you express the extraction of references, taking the HEAD reference, then getting all the files, and for every file doing something like detecting the language and so on. It's quite concise, and it's really great that both engineers and machine learning people can use it, because there are Python and Scala APIs for that. And, well, after
just iterating through files, you most probably want to do some more advanced stuff, and there are two projects besides that which we have built. One is enry, which is a programming language detector. It's kind of a rewrite of GitHub's linguist, the thing in Ruby that, if you've seen it, shows you the distribution of languages on top of your repositories. The one that we use is in Go; it is compatible with linguist, and it's faster.
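As a rough illustration of what a rule-based language detector and the resulting "distribution of languages" look like, here is a toy sketch in Python. It is not enry's or linguist's API: the lookup tables are tiny and hypothetical, and only two signals (the shebang line and the file extension) are used, whereas the real tools combine many more strategies and a much larger language database.

```python
import os
from collections import Counter

# Hypothetical, deliberately tiny lookup tables for illustration only.
EXTENSIONS = {".py": "Python", ".go": "Go", ".rb": "Ruby", ".scala": "Scala"}
INTERPRETERS = {"python": "Python", "python3": "Python", "ruby": "Ruby", "sh": "Shell"}


def detect_language(filename, content=""):
    """Guess a language from the shebang line, then from the extension."""
    first_line = content.split("\n", 1)[0]
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if parts:
            interpreter = os.path.basename(parts[0])
            # "#!/usr/bin/env python3" names the interpreter as an argument.
            if interpreter == "env" and len(parts) > 1:
                interpreter = parts[1]
            if interpreter in INTERPRETERS:
                return INTERPRETERS[interpreter]
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(extension)


def language_distribution(files):
    """Map each detected language to its share of bytes.

    `files` is a dict of filename -> text content; undetected files are skipped.
    """
    sizes = Counter()
    for name, content in files.items():
        language = detect_language(name, content)
        if language is not None:
            sizes[language] += len(content.encode("utf-8"))
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {language: size / total for language, size in sizes.items()}
```

Calling `language_distribution` on the files of one repository gives the kind of per-language breakdown that is shown as a colored bar on a repository page.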
enry is basically from four to twenty times faster in our measurements than the original Ruby one. Another one is project Babelfish, which is a bit different, because, first, it's a scalable parser infrastructure: it wraps native parsers and puts them inside containers that you can schedule across the cluster, and it exposes a uniform gRPC API. That way you can extract a lot of information from the source code in a very uniform fashion. It has drivers for many different languages and something called the universal abstract syntax tree, which is the native syntax tree of the language annotated with some language-independent things that you might be interested in at later analysis stages.
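The idea of a native syntax tree annotated with language-independent labels can be sketched with Python's own `ast` module. This is only an analogy: the role names below are hypothetical, and Babelfish's actual UAST format, role vocabulary and gRPC API are not reproduced here.

```python
import ast

# Hypothetical language-independent roles attached to native Python node types.
ROLES = {
    ast.FunctionDef: "function_declaration",
    ast.Import: "import",
    ast.ImportFrom: "import",
    ast.Call: "call",
    ast.If: "if",
    ast.Name: "identifier",
}


def annotate(source):
    """Parse Python source and return the nodes that have a known role.

    Each entry keeps the native node type next to the generic role, so a later
    analysis stage can work on roles without knowing Python's grammar.
    """
    annotated = []
    for node in ast.walk(ast.parse(source)):
        role = ROLES.get(type(node))
        if role is not None:
            annotated.append(
                {
                    "native_type": type(node).__name__,
                    "role": role,
                    "line": node.lineno,
                }
            )
    # ast.walk yields nodes in no specified order, so sort for stable output.
    return sorted(annotated, key=lambda n: (n["line"], n["native_type"]))
```

A driver for another language would map its own node types onto the same role names, which is what lets the later stages stay language-independent.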
That's a high-level overview, and there are some future things that we want to get done. For example, on the Kubernetes side, having a bare-metal cluster with persistent storage is not really an easy thing to make. On the collection side, there's the concept of staged event-driven architecture: it's a paper, published by Google I think, about how to make a scalable system that dynamically saturates resources by having queues in between the stages, which is very interesting if you want to clone 100 million repositories. On the processing side, we're looking into adding distributed indexes to speed up the Spark API queries that we have. And on the analysis part there are more advanced things, like how do you diff abstract syntax trees, or how do you extract cross-language information from an abstract syntax tree, which we are looking at in the Babelfish project. That's it.
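The core of the staged, queue-connected design mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not the collection pipeline's code: each stage here is a single thread, the stage handlers passed in are placeholders, and the dynamic resource controllers that a full staged event-driven architecture uses to resize stages are left out.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def run_stage(handler, inbox, outbox):
    """Take items from inbox, apply handler, pass results to outbox."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(handler(item))


def run_pipeline(items, handlers, queue_size=2):
    """Run items through the handlers, one thread per stage.

    Queues between stages are bounded, so a slow stage makes the stages
    before it block (back-pressure) instead of piling up work in memory.
    """
    inboxes = [queue.Queue(maxsize=queue_size) for _ in handlers]
    results = queue.Queue()  # unbounded, drained after all stages finish
    threads = []
    for i, handler in enumerate(handlers):
        outbox = inboxes[i + 1] if i + 1 < len(handlers) else results
        thread = threading.Thread(target=run_stage, args=(handler, inboxes[i], outbox))
        thread.start()
        threads.append(thread)
    for item in items:
        inboxes[0].put(item)  # blocks while the first stage is saturated
    inboxes[0].put(STOP)
    for thread in threads:
        thread.join()
    collected = []
    while True:
        item = results.get()
        if item is STOP:
            return collected
        collected.append(item)
```

With, say, one handler that fetches a repository and another that archives it, the queue sizes and the number of workers per stage become the knobs for keeping network, CPU and disk busy at the same time.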
Any questions? Right, so the question was: do we look at the READMEs? So the thing is, the collection pipeline I described is generic, so you get everything, you get full repositories. As for the project that detects the language, it does not do anything beyond what GitHub is already doing; it's doing the same job. But there is some research in that area which is quite cool, which comes from the natural language processing and machine learning area: how do you tell which language it is, basically classifying the language based on the source code. We've got pretty good results. It's not inside enry yet; it's more in the research stage, but I think it has a lot of potential, hopefully even to be merged upstream into the GitHub implementation so it does a better job. That's a good one. Any other questions? OK, thank you so much. [Applause]