De-anonymizing Programmers from Source Code and Binaries


Formal Metadata

Title
De-anonymizing Programmers from Source Code and Binaries
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
Many hackers like to contribute code, binaries, and exploits under pseudonyms, but how anonymous are these contributions really? In this talk, we will discuss our work on programmer de-anonymization from the standpoint of machine learning. We will show how abstract syntax trees contain stylistic fingerprints and how these can be used to potentially identify programmers from code and binaries. We perform programmer de-anonymization using both obfuscated binaries and real-world code found in single-author GitHub repositories and the leaked Nulled.IO hacker forum.
I'm Aylin Caliskan. I used to be Rachel's PhD student, but I just started as a professor, and this is my first job as a professor, so it's very exciting for me. I'll go over the first half of the slides, and after that Rachel will continue with the new findings in our research. Today we'll be talking about how we can de-anonymize programmers based on their coding style.

Stylometry is the study of style in language, and when we say language we mostly think about natural language, which is for example the English that we speak or our native languages, but there are also artificial languages: programming languages, for example, are artificial languages, and when we do stylometry we want to look at all kinds of languages.
On the natural language side, we have been looking at English, or English as a second language, to identify the native language of a speaker, and at translated text so that we can again identify the native language or the translator that was used, as well as the author. Here we also looked at underground forum texts, where underground forum users engage in business, and we were still able to identify the authors of these messages. In artificial languages, we wanted to see if coding style is unique to each programmer, so that it becomes a fingerprint. We have been focusing on Python as well as C and C++, and when we looked at source code we saw that we get very high accuracy, and that coding style even survives in binaries. The tools that we have developed and made open source in this work are being used by many researchers and by different agencies, such as the FBI, as well as by expert witnesses, who can use this scientific information in court while testifying. High-tech crime units can, for example, identify suspects on different online platforms. And regarding artificial languages, focusing on code, DARPA is interested in this project, as you might imagine, since they are part of the Department of Defense and they might want to know the identities of malicious actors; so are expert witnesses and the US Army Research Laboratory, which we are collaborating with — this has been an ongoing collaboration for four years now.
OK, why do we want to do this? First of all, since we learn programming on an individual basis, we end up developing a unique coding style, and this can be used for software forensics or for detecting plagiarism, but at the same time we can use it, for example, for verification of authorship in disputed-authorship and copyright investigations. However, such security-enhancing technologies can also be very privacy-infringing and can be used at the same time for surveillance, to track certain programmers. Saeed Malekpour is one example where we can see that security-enhancing technologies can at the same time be very privacy-infringing. He is an Iranian citizen, and he was identified as the web programmer of a porn site. When he went to Iran he was arrested and sentenced to death, and he has been in prison for years now and couldn't get out, even though he is a Canadian resident, just because he was identified as the programmer of a site that is against the Iranian government's views.
OK, how can we do source code stylometry from a machine learning perspective? I'll try not to go into too many details, but to give you the basics about machine learning so that you understand the flow of how we can de-anonymize programmers. We are looking at different tasks, such as multi-class or two-class machine learning, where we can for example do software forensics, plagiarism detection, or copyright investigations in two-party cases, as well as authorship verification, which would be a one-class or two-class open-world machine learning task. In order to do this we have the traditional machine learning workflow: first of all we need training data that is representative of what we are looking for, and from this training data we extract features that are representative of coding properties, of coding style. We feed these features into a machine learning classifier, so that the classifier learns each author's coding style from the features that we extracted. After that, we take the test samples and use the machine learning classifier to identify who a source code sample, binary, or text sample belongs to. In this case we are using random forests, because by nature they are multi-class classifiers and they don't tend to overfit.
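As a rough illustration of this workflow, here is a minimal sketch using scikit-learn. The data is hypothetical: it assumes stylometric features have already been extracted into a numeric matrix X (one row per code sample), with y holding each sample's author; the authors' actual open-source tooling differs in detail.

```python
# Minimal sketch of the attribution workflow described above (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((90, 200))          # 90 samples x 200 stylistic features
y = np.repeat(np.arange(10), 9)    # 10 authors, 9 samples each

# Random forests are natively multi-class and resist overfitting,
# which is why the talk uses them.
clf = RandomForestClassifier(n_estimators=500, random_state=0)

# Nine-fold cross-validation mirrors the "9 files per programmer" setup.
scores = cross_val_score(clf, X, y, cv=9)
print(f"mean accuracy: {scores.mean():.2%}")
```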
When we use this classic machine learning workflow, we see that we get very high accuracies in de-anonymizing programmers, which shows that if programmers would like to remain anonymous, this is a serious threat to their anonymity — for example, if they want to contribute to open source repositories and would still like to remain anonymous. At large scale, de-anonymizing 1,600 programmers, each with 9 source code samples that are on average around 70 lines of code, we get 94% accuracy in identifying the authors of 14,400 code samples.
In order to do this we need to develop a method, and while first developing the method we needed a controlled environment; for that we chose Google Code Jam as our development data set. Google Code Jam is an annual competition where contestants from all over the world try to solve algorithmic questions within a limited amount of time, and when they submit correct solutions, these get posted online. They then go on to higher rounds, where the problems become harder and they have to implement more sophisticated functionality, so we can control for the difficulty of the problem as well as for how advanced the programmers are. OK, so we have our data set: we collected source code samples from 1,600 programmers, and we preprocess the code samples, especially to get the abstract syntax trees from the source code. For that we use a fuzzy abstract syntax tree parser, which can parse even incomplete source code, and with the abstract syntax tree, which represents the grammatical structure of the code, we start extracting features. We feed these into a random forest, and each of the 500 trees in the random forest votes for one particular programmer as the most likely programmer to have written a particular disputed test sample, and then we do the classification.
When we are talking about features, we look at different categories. For example, when we look at the source code sample on the left side, we see how function names or variable names are chosen by programmers; these are higher-level features that can be more easily changed, and they are called lexical features. Spacing, the formatting, is also part of these lexical and layout features. But there are also syntactic features: like the grammar of a natural language, this is the syntax, the grammatical properties, of the programming language, and when we get to the abstract syntax tree we can see how complicated this structure can get. Based on this, we extract features: on the lexical side, things such as variable names, function names, spacing, and word bigrams; on the abstract syntax tree side, we look at more structural features, and we extract things from about 50 different abstract syntax tree node types, such as function and statement nodes — node bigrams, meaning two nodes connected to each other by an edge, or the average depth of a node, for example. These are very identifying features, and they are not as trivial to change quickly. So how can we use these features in real-world scenarios?
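As a toy illustration of the two feature families just described — lexical features from the raw text, and syntactic features from the abstract syntax tree — here is a small sketch. Python's built-in ast module stands in for the fuzzy C++ parser used in the talk; the sample function is hypothetical.

```python
# Toy extraction of lexical and AST features (not the authors' actual tool).
import ast
from collections import Counter

source = """
def solve(n):
    total = 0
    for i in range(n):
        total += i * i
    return total
"""

# Lexical features: bigrams over whitespace-separated tokens of the raw text.
tokens = source.split()
lexical_bigrams = Counter(zip(tokens, tokens[1:]))

# Syntactic features: AST node-type bigrams (parent, child) and maximum depth.
tree = ast.parse(source)

def node_bigrams_and_depth(node, depth=0, bigrams=None):
    if bigrams is None:
        bigrams = Counter()
    max_depth = depth
    for child in ast.iter_child_nodes(node):
        bigrams[(type(node).__name__, type(child).__name__)] += 1
        max_depth = max(max_depth, node_bigrams_and_depth(child, depth + 1, bigrams)[1])
    return bigrams, max_depth

ast_bigrams, max_depth = node_bigrams_and_depth(tree)
print(ast_bigrams.most_common(3), "max AST depth:", max_depth)
```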
Let's try to replicate a real-world case: say we would like to find out who Satoshi Nakamoto is, and we have a suspect set of size X. We take the suspect set's source code samples from the past so that we can train a classifier with this training data from the suspect set, extract features, and then take Bitcoin's initial git commit as the test sample and try to see which programmer is most likely the author of Bitcoin's first git commit. When we replicate this scenario with the 1,600 programmers from Google Code Jam — though this is not a real suspect set — using nine files for each of them, we get 94% accuracy in correctly identifying them, using nine-fold cross-validation. What happens if someone would like to stay anonymous and knows that their coding style would give them away? Obfuscation is the first thing that comes to mind, and there are off-the-shelf obfuscators, such as Stunnix, which is available online and which many programmers use. When we use it to obfuscate our code, we start from the original sample, with a certain spacing and formatting, a certain abstract syntax tree structure, as well as certain function and variable names, and once it is obfuscated,
all the function names and other lexical features are refactored with random representations, and all the comments are replaced with hexadecimal ASCII representations. So everything is refactored, spaces are stripped, and so on, but we see that the de-anonymization accuracy is not affected by such obfuscation at all, because when we look at how the obfuscation happened, we see that the refactoring did not change the abstract syntax tree at all — it remains unchanged. So our method is impervious to such off-the-shelf obfuscators.
What happens when we use more sophisticated obfuscators, such as Tigress? I take about 15 lines of code and obfuscate it with Tigress's function virtualizer, and I end up with about 500 lines of code. It looks much more cryptic, I cannot easily tell what's going on from a high level, and, most importantly, the abstract syntax tree changes in this case. And this affects accuracy significantly: the accuracy of identifying C programmers was 96% for 20 programmers, where the random chance of correctly identifying a programmer is 5%. When we obfuscate with Tigress, the accuracy drops down to 67%. That is a significant drop in accuracy, but compared to the 5% random chance, 67% is still a serious threat to anonymity, even when we obfuscate with such sophisticated obfuscators.
Another real-world case would be authorship verification: for example, someone comes up and says, "I'm Satoshi Nakamoto." In that case we can ask for their past coding samples and take those samples to train a classifier where this person — Satoshi, or Mallory — is one class, and the second class is the open world: random programmers from the open world. Then I take the Bitcoin source code and try to see who it belongs to: does it belong to Mallory, or to the open world? Based on this we can tell whether it belongs to this Mallory person, provided we have training data from this person's past, in different open-world scenarios.
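A minimal sketch of this two-class verification setup follows, again with hypothetical data: class 0 is the claimed author ("Mallory"), class 1 is a pool of random programmers from the open world, and the feature matrices are assumed to come from the same stylometric extraction as before.

```python
# Two-class open-world verification sketch (hypothetical data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_mallory = rng.random((9, 200))     # Mallory's past code samples
X_world = rng.random((90, 200))      # samples from open-world programmers
X_train = np.vstack([X_mallory, X_world])
y_train = np.array([0] * 9 + [1] * 90)

# class_weight balances Mallory's few samples against the larger open world.
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                             random_state=0).fit(X_train, y_train)

x_disputed = rng.random((1, 200))    # e.g. features of the disputed commit
p_mallory = clf.predict_proba(x_disputed)[0, 0]
print(f"P(disputed sample is Mallory's): {p_mallory:.2f}")
```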
What about executable binaries? When we compile code, it goes through various transformations — does coding style survive in compiled code? Again, we have a few lines of code, and in binary form it looks quite cryptic, we cannot tell much, but thanks to improvements in reverse engineering methods we can generate rich feature sets even from binaries. In this case, we know that malware authors would like to remain anonymous and to have no identifying information out there in public. There was one interview with a malware author — this used to be recent, but it's from September 2016, so it's not recent anymore — and when this author was asked who they are, the answer was: "Just some guy who likes programming. I'm not a known security researcher, programmer, or a member of any hacker group." So probably the best answer here would be "nobody": malware authors, or people who would like to remain anonymous, would like to be nobodies. But if coding style is embedded in binaries, then that is a fingerprint — identifying information — for certain users online.
Again we have our classical machine learning workflow. We need source code samples from a controlled environment, which we take from Google Code Jam; we compile them, and then reverse-engineer the binaries to get disassembly and assembly features, and we decompile them to get source-like code so that we can generate the abstract syntax tree as well as the control flow graph. For a hundred programmers we are left with about a million features, but with a million features I cannot really tell what's going on with the style of these programmers, so we apply attribute selection methods to select the features that are most representative of style in binaries, feed these again into a random forest of 500 trees, and then do the classification to de-anonymize the programmers.
The features we are talking about are, for example: once the binary is disassembled, we have assembly features, and we take assembly token bigrams — two consecutive tokens — and so on; from the syntactic features, again from the abstract syntax tree, we take node bigrams or the average depth of a certain node, and so on; and from control flow graphs we have features similar to the abstract syntax tree features.
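As a toy version of the assembly-token-bigram idea: tokenize disassembly text and count pairs of consecutive tokens. In practice the disassembly would come from a real disassembler; the snippet below uses hypothetical objdump-style output.

```python
# Toy assembly token bigram features from (hypothetical) disassembly text.
from collections import Counter

disassembly = """
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-0x4], edi
cmp DWORD PTR [rbp-0x4], 0x0
jle 0x401a
"""

tokens = disassembly.replace(",", " ").split()
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(5))
```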
Once we extract all of these, we have a lot of features to deal with, so we apply dimensionality reduction. The first thing we do is apply the information gain criterion: out of these 700,000 features, we keep the features with the highest information gain — the ones that reduce the entropy the most — and we are left with about 2,000 features that keep the accuracy at its highest and are most representative of coding style in binaries. But if I want to understand why coding style survives in binaries, I won't be able to see it from 2,000 very low-level features that don't mean much when you first take a look at them. For that I also apply correlation-based feature selection, which keeps the features that have the highest intra-class correlation — highest correlation for a single author — but the lowest inter-class correlation with other programmers, so that they become the most identifying for individual programmers. With that I'm left with about 50 features, and now I can get a better understanding of what might represent coding style in binaries.
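Here is a sketch of the first stage of that reduction, with scikit-learn's mutual information score playing the role of the information gain criterion. The dimensions are shrunk to toy size (the talk went from roughly 700,000 features to 2,000); the correlation-based second stage (CFS) has no one-line scikit-learn equivalent, so only the first stage is shown.

```python
# Stage 1 of the feature selection described above (toy dimensions).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(2)
X = rng.random((180, 1000))        # toy stand-in for ~700k binary features
y = np.repeat(np.arange(20), 9)    # 20 programmers, 9 samples each

# Keep the 200 features with the highest information gain.
stage1 = SelectKBest(mutual_info_classif, k=200).fit(X, y)
X_reduced = stage1.transform(X)
print(X_reduced.shape)             # (180, 200)
```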
When I analyze these 50 features, even though they are very low level, we still see low-level properties that are representative of style and that remain in binaries: things such as arithmetic and logic operations, stack operations, as well as file input operations and variable declarations and initializations — and these are not very trivial to refactor or change in order to hide your coding style.
We said that we have a controlled environment: we take code samples from Google Code Jam and then compile them, and the reason for doing this is so that we can control for compiler optimizations, which might affect the de-anonymization accuracy — the anonymity of these samples. When I take a hundred programmers and apply no optimizations when compiling, I get 96% accuracy, again with nine samples and nine-fold cross-validation. When I apply optimizations, as well as symbol stripping, the accuracy keeps decreasing: with optimizations it is not affected a lot, dropping to 89 percent, but with stripped symbols the accuracy is affected more. Still, with stripped symbols we have 72 percent accuracy, while the random chance of correctly identifying these programmers is one percent, so even stripping symbols does not anonymize these people. OK, what kind of obfuscations can I apply to anonymize myself in an automated way? For that I used an open source project, Obfuscator-LLVM, and applied three different types of obfuscating transformations: bogus control flow insertion, where code that will never be reached is still in the binary, so it looks like a feature; instruction substitution, where instructions are replaced with equivalent instructions, making the code shorter or more complicated; and control flow flattening, to mess with the control-flow features.
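For concreteness, a sketch of driving those three Obfuscator-LLVM transformations from Python follows. The pass flags -bcf, -sub, and -fla match Obfuscator-LLVM's documented options as I understand them; the compiler path and sample.c are hypothetical placeholders.

```python
# Sketch: compiling one sample with each Obfuscator-LLVM transformation.
import subprocess

CLANG = "./obfuscator-llvm/bin/clang"   # hypothetical path to the O-LLVM clang

passes = {
    "bogus-control-flow": ["-mllvm", "-bcf"],
    "instruction-substitution": ["-mllvm", "-sub"],
    "control-flow-flattening": ["-mllvm", "-fla"],
}

for name, flags in passes.items():
    subprocess.run([CLANG, "sample.c", "-o", f"sample_{name}", *flags],
                   check=True)
    print(f"built sample_{name}")
```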
We see that these obfuscations decrease the accuracy from 96% to 88% for 100 programmers, and again this shows that such obfuscations are not sufficient to hide coding style: even in binaries, coding style remains after many transformations, compilations, and obfuscations. What happens if we increase our class size? With 600 programmers, we see that while we have 99% accuracy in correctly de-anonymizing 20 programmers, with 600 programmers we get 83% accuracy, where the random chance of correctly identifying these people is less than 0.2% — so the accuracy degrades gracefully in this case.
OK, what about real-world cases? This was a very controlled environment: people are implementing specified functionality in a limited amount of time, in small snippets of code, and so on. So, first of all, I collected code from GitHub's top repositories and ended up with a bunch of code from hundreds of programmers, and after that I compiled those — and many of them did not compile; GitHub repositories are works in progress, and that's OK, but it took me days. In the end I was left with 50 GitHub programmers, and I was able to de-anonymize them with 65 percent accuracy. That's one real-world scenario. What about malicious programmers? I'm currently actively working on that, but one case study I had in a published paper was with six malicious programmers and ten samples. Some of these samples came from the Nulled.IO hacker forum: once the forum was leaked, I was able to find live links to malicious code that members were selling and providing to their customers. I was able to download those, find the ones that were relevant to my training set, and reverse-engineer them to get the features; some other malware authors came from security reports and so on. And please, if you have a data set with known authors of malware, or if you have good automated methods for quickly reversing malware, malicious or encrypted software, anything that can help, please come and talk to us after the talk. We see that for these six malicious programmers we get a hundred percent accuracy, but I would like to make this experiment much larger scale, and for that we need help with the data set. And now I will leave it to Rachel, so that she can talk about more fascinating details of programmer de-anonymization.

So I'm going to dig a bit deeper into programmer de-anonymization on GitHub. When we did this experiment we got much lower accuracy than on the original Google Code Jam dataset, and with the experiments I'm going to show over the next couple of slides, I think a lot of this comes down to the fact that for a lot of these repos we only had a couple of files per author, and it turns out that that's the thing that matters the most. There is a lot of noise — sometimes people will link in other things and so on — but I think one of the biggest issues is that when you only have two files to train on, or in this case one file to train on and one to test, as opposed to the nine files that we used for our experiments, that makes a big difference.
Up till now we've only talked about situations where people are writing code individually, on their own, but most people probably don't actually code that way in real life: most of the time, code is collaborative. When we started presenting the initial work, we got a couple of tweets about it. Halvar Flake, whom some of you might know, said he'll believe that this code stylometry stuff works when it can be shown to work on big GitHub commit histories instead of the Google Code Jam datasets, and Zooko talked about hearing from an intern at Apple that they disallowed her from contributing to open source on her own time. So we were interested from both perspectives: privacy — if I want to contribute to something, I want to know whether a particular commit is going to cause me problems later — and also just validating this stuff more in the real world, so we wanted to do some experiments. In this case we only care about who wrote a small piece of code, or we want to de-anonymize some pseudonymous account on GitHub from which we have several snippets or segments of code — we don't have whole, nicely written files. We're using the same feature set as before, trimmed down to about 3,400 features, so quite a bit more than Aylin was using earlier; but these are very small segments and snippets, so sometimes we need more features for that. Ultimately, we get about 73 percent accuracy in identifying the author, among about a hundred programmers, for a snippet of code that's about five lines long. We were interested in understanding when this works and when it doesn't, so we built a calibration curve, which shows us, in general, what the confidence of the classifier is relative to its accuracy for individual samples. We can see that in some cases we have pretty high confidence, and in a lot of cases we don't. This can help us: even though we have 73 percent accuracy overall, given the answer that the attribution has produced, we can decide whether or not to believe it based on this confidence, and if we know it with high confidence, we have a much better belief that this is actually the programmer we're looking for. We were also interested in how long these snippets need to be and how many snippets you need to train on in order to get good results. There's an interesting curve here: say we have fairly large snippets of about 38 lines of code — in the Google Code Jam data we talked about before, the files were about 70 lines of code — but we only have four samples for each author or programmer; this gives us about 54% accuracy on 90 programmers. But if we have smaller samples, and many more of them, our result typically goes up. Even when we're only looking at single lines of code — and I'm a little nervous about this result because it's really preliminary — if we have about 150 samples to train on, we can usually identify the author about 75 percent of the time. So we're still trying to understand what makes lines of code more attributable or not; certain lines of code are going to be pretty generic, but others not so much.
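As a sketch of the calibration analysis mentioned above, the snippet below compares the classifier's reported confidence with how often it is actually right, using scikit-learn's calibration_curve on hypothetical predictions.

```python
# Calibration-curve sketch: does confidence track accuracy? (toy data)
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)
# Confidence of the top predicted author, and whether that guess was correct.
confidence = rng.uniform(0.1, 1.0, size=1000)
correct = rng.random(1000) < confidence   # toy: accuracy tracks confidence

frac_correct, mean_conf = calibration_curve(correct, confidence, n_bins=10)
for c, f in zip(mean_conf, frac_correct):
    print(f"confidence ~{c:.2f} -> accuracy {f:.2f}")
```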
But what happens when we want to identify accounts, not necessarily individual commits? This actually works a lot better, because the errors aren't correlated, so we get close to about a hundred percent accuracy if we have four snippets. The way we analyze this is that we ran git blame on these repositories — that's how we get the repository clipped into snippets — and it turns out that the errors behind this 73 percent are typically not that correlated, so if we try multiple times and then vote, our results get close to 100%, though not perfect. You can see this in a heat map, where the red area is basically over 90% accuracy, and it tends to happen once you have more than about nine samples to train on and more than about five snippets from the account to test on. So once you have a certain amount of training data and a certain amount of testing data — meaning the account you're wondering about has committed more than four or five times — your results get pretty good. Interestingly, one of the other things we tried was, instead of identifying these little snippets individually and then voting, merging them all into one big sample, and it turns out that's better than the individual snippets, but not as good as doing them one at a time and averaging, because the errors tend to get compounded in a single merged sample.
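A minimal sketch of that voting scheme: attribute each snippet independently, then take a majority vote over the per-snippet predictions. The predictions below are hypothetical.

```python
# Account-level attribution by majority vote over per-snippet predictions.
from collections import Counter

def attribute_account(snippet_predictions):
    """snippet_predictions: predicted author label for each snippet."""
    votes = Counter(snippet_predictions)
    author, count = votes.most_common(1)[0]
    return author, count / len(snippet_predictions)

# e.g. 7 snippets from one pseudonymous account, ~73%-accurate per snippet:
preds = ["alice", "alice", "bob", "alice", "carol", "alice", "alice"]
print(attribute_account(preds))   # ('alice', 0.71...)
```

Because per-snippet errors are largely uncorrelated, the vote is much more accurate than any single snippet, which is the effect described above.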
OK, now for something a little bit different: I'm going to talk about deep learning, because it's the new hotness. One of the things that's novel about this work is using the abstract syntax tree; in the past, most people had just used lexical and layout features. People have been doing this sort of code attribution work since the 70s, but it tended to get around eighty percent accuracy and not scale above thirty programmers or so. Using these AST-type features allowed us to get good results, but the thing is, an AST itself is not a feature — a tree is not a feature; you can't just feed it into a random forest and have it tell you who wrote the code. We manually chose features, as Aylin mentioned: unigrams, bigrams, depth, and so on, and these tend to give us very local features and very global features. What we actually want is the ability to get more nuanced features than that — so, enter a deep neural net.
We're going to try to automatically learn a new feature representation. What we do is first map the AST nodes into vectors, using an embedding layer; then we create subtree layers, which use LSTMs or bidirectional LSTMs to learn the structures of the AST; and then we have a softmax layer to actually do the classification.
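A minimal PyTorch sketch of this kind of architecture follows: embed AST node types, run a bidirectional LSTM over a linearized node sequence, and classify with a final linear/softmax layer. The real model operates on subtrees; linearizing the tree into a node sequence is a simplification, and all sizes are hypothetical.

```python
# Simplified LSTM-over-AST attribution model (linearized tree, toy sizes).
import torch
import torch.nn as nn

class ASTAttributor(nn.Module):
    def __init__(self, n_node_types, n_authors, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_node_types, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        # Softmax is applied implicitly by CrossEntropyLoss during training.
        self.out = nn.Linear(2 * hidden, n_authors)

    def forward(self, node_ids):                 # node_ids: (batch, seq_len)
        x = self.embed(node_ids)
        _, (h, _) = self.lstm(x)
        h = torch.cat([h[-2], h[-1]], dim=1)     # final fwd + bwd states
        return self.out(h)

model = ASTAttributor(n_node_types=50, n_authors=25)
fake_batch = torch.randint(0, 50, (8, 120))      # 8 samples, 120 AST nodes each
logits = model(fake_batch)
print(logits.shape)                              # torch.Size([8, 25])
```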
A little bit of background on LSTMs, if you've not learned much about them: the neural net here on the right is an RNN, which allows us to handle sequential input and have some memory, to be able to remember information — that's what those little feedback loops are — and an LSTM adds memory cells to this, to have more useful memory, so that we don't just get super-local features. These cells have gates in them, which ask: what should I remember of this incoming information, what should I ignore, and what should I forget of what we've already learned? So over time we can develop a richer representation of the AST. In this case we're only using the AST features, not any of the layout or lexical features, and that's why the results for the random forests here are lower than what Aylin showed, but you can see — this is using Python and C++ — that we get 86% accuracy on 25 programmers, or 73 percent accuracy on 70 programmers in Python, again just using the AST. So the layout and lexical stuff matters too, but when we use our new learned AST features we do get a big jump here, so this new feature representation does seem helpful. And it's nice to have a sole-AST representation, because, as we mentioned before, it's much harder to obfuscate, it's easy to port, and so on. This allows us to learn features without doing all this manual feature engineering, and it's language-independent. In future work we'd like to combine these learned features with the random forest and the fuller feature set, to see whether we get better results or whether it just overlaps with what we're already learning from the layout and lexical features.

OK, so what about other languages? As I mentioned, porting this to a new language basically requires an AST parser, which exists for almost everything, plus lexical and layout features chosen for the language. So far we've done this for C++, C, Python, and JavaScript, and we get similar accuracy on the Google Code Jam data set; the results using just the AST tend to vary more, which is kind of interesting. One of the Holy Grail applications of this would be to train on one language and test on another — we don't currently know how much your programming style changes when you change languages. To do this we'd need some universal intermediate AST representation, or some pairwise porting between two languages. There is a project working on this, the Babelfish project, but it doesn't really appear ready yet for this kind of application; it's something we're planning to look into a little bit, and if people know about generic AST representations, that'd be another thing we'd love to get your feedback on.

I'm going to end the talk with a couple of interesting software engineering insights we've gathered while doing this work, about what makes programming style unique, which I think is kind of fun. In general, we started by looking at repeated groups of people: there's another programming contest called Codeforces, which has a team competition where teams compete on sets of problems. We have very preliminary results with a hundred and eighteen teams, with about twenty submissions each, and they currently get
about sixty-seven percent accuracy. I think this is one of the hardest cases for group attribution, because the way Codeforces works is that it gives you a big group of problems to work on as a team together, so I think people are mostly splitting those up — it's not actually group coding — and I'm kind of surprised it works as well as it does at identifying the team. In the future we'd like to work with more code repositories, to get a better sense, for code where we know and can control for it, of how much collaboration actually went into it. Difficult versus easy tasks: it turns out that implementing harder functionality makes programming style more unique, and we can control for this because the problems in the programming contests are supposed to get harder as they go on. If we look at the same set of 62 programmers solving seven easy problems, we get 90% accuracy, which is pretty good, but when we look at the same set solving seven harder problems, the accuracy goes up to 95%. Programmer skill matters too: programmers who got further in the contest, which is some measure of skill, were easier to attribute. In general, for the coders who advanced less far, we got 80% accuracy — again, only on the easy problems, because that's as far as they got — but when we look, even on the easy problems, at the people who got further in the competition, we're able to classify them with 95% accuracy. So it's kind of interesting that as you develop programming skill, your style tends to become more unique.
We're also interested in how coding style changes over time, so we looked again at the Codeforces competition, at people who competed in both 2012 and 2014. When we train on 2012 and test on 2014, the accuracy goes down from 92% on the 2012 set to 88% on the 2014 set — a little bit of a drop. I'd be interested in looking at even larger timescales than that, or at particularly formative years, like university, and how they affect people's programming style. Lastly, we're interested in coding style by country: one of the things about this contest is that it has contestants from all over the world, so when we were porting this to JavaScript, we grabbed a bunch of JavaScript files written by programmers in Canada and by programmers in China, and we tried a binary classification — whether we could tell if a file had been written by a Canadian or a Chinese programmer. We expected this to be particularly easy, because there's a native language difference, which may show up in things like variable names and so on, and in fact it worked pretty well: around 91.9 percent accuracy for this task. In the future we're planning to look at a much larger set of countries and a much larger set of files, and to see whether this is actually a native language effect or maybe an education-system or coding-culture effect, and what's going on there; I think it'll be interesting.

As for future applications: as we said, we're really interested in whether this actually works to find malicious code authors, and also in what anonymous contributors have to worry about when they contribute code online. We're interested in breaking this stuff, in writing better obfuscators — all the obfuscators we've tried so far have not been targeted specifically at the AST, so we think that can happen. There was some research done at the University of Washington, building on our work, showing that people can, to some extent, imitate other people's style when they're given that as a task. So there's hope — don't leave here thinking you can't ever write anonymous code again — but be careful, and in particular, if you're going to contribute to a repository anonymously, you might want to create a new account for each commit, even though that's annoying. To find authors who write vulnerable code, we're interested in looking at source code and understanding what software-engineering-type stylistic features lead to vulnerabilities; and some people have talked to us about finding out whom to recruit by looking at how unique someone's coding style is and whether that suggests something about their programmer skill. This was not work done by Aylin and myself alone: we have lots of students and other collaborators at Drexel University, at Princeton, at the Army Research Lab, and at Göttingen in Germany who have all worked on various aspects of this project, so thanks to Bander, Edwin, Rich, Andrew, Spiros, Arvind, Dennis, Konrad, Craig, Mike, and Fabian for all of their contributions to this work. This is our contact information, and our code to do all of this is available, so if you actually want to try to figure out who Satoshi Nakamoto is and have an actual suspect set, you're welcome to try that. It's not
something that we're going to do ourselves — we respect privacy — but the code is out there. We have, I think, about four-ish minutes for questions, so if people have questions we would love to take them, and after the talk we'll walk out the back and you can ask any more questions that you have. Thanks. [Applause]

Any questions? How do we do this? Seriously, no one ever does Q&A? Maybe we should — I intentionally left time.

"For the coding styles that you were going through, for the people in the Google coding challenge: you said you were able to look at people who were getting the furthest in the challenge. Did you see trends along the way that we could later use to help make better coders, using that information?"

No — so, for those of you who are leaving, again, go out through the back door, not through the side door. So yeah, we have not done much analysis of what makes the coding style of people who get further in the programming competition different or more attributable, but I think that would be a really interesting direction and we'd like to look at it. Do you have anything? One property was that more advanced programmers tend to write longer code; that's one trend. Yeah, it's tricky, because we don't know whether that's the causal thing or just a correlated thing, but in general the code was longer, which helped.

"I just have a comment — this is very interesting. One thing I could see this going towards is cataloguing programmer reputation, which is kind of what we've gotten into with penetration testing: instead of a completely open source ecosystem, we're pushing for statistical testing, and this could further be used to look at programmers, see their history with security, and give a score to their code in that regard. Do you think there's any value in cataloguing people in terms of looking at the security of their code, or is that just an underlying ecosystem problem?"

There is ongoing research on automatically understanding the security properties of code, and it works with similar properties. Does this answer your question? All right — so let's give the speakers a hand.