Mining for Bugs with Graph Database Queries

Video in TIB AV-Portal: Mining for Bugs with Graph Database Queries


Purchase DVD

Formal Metadata

Mining for Bugs with Graph Database Queries
Alternative Title
Pattern-based Vulnerability Discovery
Title of Series
Part Number
Number of Parts
Yamaguchi, Fabian
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
While graph databases are primarily known as the backbone of the modern dating world, this nerd has found a much more interesting application for them: program analysis. This talk aims to demonstrate that graph databases and the typical program representations developed in compiler construction are a match made in heaven, allowing large code bases to be mined for vulnerabilities using complex bug descriptions encoded in simple, and not so simple graph database queries. This talk will bring together two well known but previously unrelated topics: static program analysis and graph databases. After briefly covering the "emerging graph landscape" and why it may be interesting for hackers, a graph representation of programs exposing syntax, control-flow, data-dependencies and type information is presented, designed specifically with bug hunting in mind. Our open-source program analysis platform Joern is then introduced, which implements these ideas and has been successfully used to uncover various vulnerabilities in the Linux kernel. Capabilities and limitations of the system will then be demonstrated live as we craft queries for buffer overflows, memory disclosure bugs and integer-related vulnerabilities.
Computer animation
so welcome to this talk on mining boxes with graph database queries of just for my information and I didn't talk like this before at last year's CCC was seen the talk of the same title nobody great circles just taken the slides there already had but OK and make some new ones as well so how well hopefully it's even better from again so this
talk is about vulnerability discovery and if you have nothing to do with vulnerability discovery just to give you a short working definition of vulnerability discovery is the art of navigating inside piles of junk the goal is to uncover small programs that we can use to our vantage so I think this is the so I was searching for a picture for a for very long time which kind of I know gives you communicates the feeling that you have when you 1st see a new code based and somebody tells you please find something explainable in here and I think this picture of the fits it's pretty well now the idea of my
work is to apply pattern recognition vulnerability discovery and in contrast to many approaches that you see in particular in academic work I don't try to focus on exact methods we reason about like confined regions of protein great detail and I you assume that you are you know can can exactly model all of the instructions of of your of your of language that you looking at but instead of I'm and taking a and taking a of a pattern recognition perspective where I try to recognize as opposed to detect parts so this is mostly inspired by all of the books to see you the real all of which is which is usually a which is more from the engineering field where you can aware of the fact that all of this stuff that you're dealing with this very imperfect so you looking at it in terms of the signal and noise and the question is whether we can use these methods and can apply them to what traditionally and people have been doing very exact methods can represented by the books that you see on the top now the an important thing to this is that I'm not trying to build something that automatically detects but instead of this is this is supposed to be tools that help the auditors in day-to-day work so in a way they are and the in that they enable people to ornaments from what can be automatic but also make use of their ideas and of their knowledge of code base to spot honorable now
I've been doing this for quite a while now and interesting and notice it's been 3 and a half years and fortunately and I finally finished my thesis and this and this is going to be coming out very soon so November 15 so if this stuff seems interesting to you please take a look so you know this this is a document that's so you know you write it for very long but then nobody actually takes a look at it so if you ever want to hide any secrets somewhere you best student pieces of but hopefully you know maybe that 1 person or 2 in the room finds this interesting actually checks to see how that would be very very cool OK so we're going to be looking at 2 chapters essentially this and 1 of them it that's like the basis of all this work that's it's a graph mining platform for vulnerability discovery out and build on that graph mining platform on my explore the 3 main directions of machine learning of unsupervised machine learning and see them whether those directions with those methods can be used to help us find lots of and I can't tell you about all of these in the talk but I'm going to talk about 1 of them are which is how clustering can be used to help you identify 1 abilities and the nice part here is that actually in and this this method will again generate queries for the graph database system so but you actually see exactly what the system has learned on so this is something that is very often criticized about machine learning that you know you can easily create a program that kind of classify something but you don't really know what it does so in a security context but that's not really good but but you know if it actually outputs a query in the and then you can see what it has learned and if it's rubbish then you also see right so as to the 1st part of this is about the graph mining platform and this combines the 2 what to things which seem to have not so much to do with 1 another when you start off on 1 hand we have a little computer science compiler construction so what compilers due to optimize your code and analyze you produce removed it was stuff like that and on the other hand you have the shiny new graph database and and the question is how can we combine these 2 actually getting a good system for that's wanted but the discovery and I would start of with an example of this is an example of a block from by famous German hackers defined as a and it's a bug that you should use a scanned 13 talk and what he showed a lot of really cool box that talk and this was moreover more besides finding you could say so but it's interesting because of the way he found it and it's also interesting because it looks very much like a lot of other blogs so that that you see all over the place so let's quickly go over this but there is a rival called name land and it's a 32 bit integer and I'm apparently it comes from a network of because we're using a function called the in the end to each user with the with converting it from that work by court order to by orders so this looks a lot like this might be from the network and and what then happens is we allocate a buffer and we add 1 before allocation and so clearly if we control this name Linfield completely even if we use a max maximum integer and what will happen is we get an awful and actually it allocates 0 bytes and then it copies mainland bites into the buffer so you get the base buffer overflow so this is a very classic see this all over the place and the question is how we define this spot and particularly when when you talk to academics they will they will you know expect something really involved right now so do use white box further enhanced symbolic execution was some machine learning powered animal detector for theorem proving or model checking but it turns out that all they did was use graph right this is that this is the regular expression that that use to to uncover the box and you can see kind of looking for locations and inside those allegations and yeah they're supposed to be a summation or you know any any sort of arithmetical operations and what encoding here at is the fact that you know if people do arithmetical inside the allocation and that's that's often resource for the base buffer overflows traded by integer overflows because obviously but the there's no sort of check after the the uh after the summation or whatever it is so they're going to be a lot of cases where this is not problematic maybe the check was done before so we know these of good points to start and I get I think we can learn a lot from this so 1st of all it even if you take something as primitive as correct and this can be a very powerful tool if you will allow an analyst to actually guide to the analysis if you allow them to introduce some knowledge they have of how things break in this particular language in this particular programming environment another interesting part is when you when you think of this graph query it it can it encodes the domain knowledge so that you can think of it as a as a model of the box and then finally high false positive rates are not that that know it's not like intrusion detection where 1 % false positives is and what you can't really use the system with 1 % false positives because it means the administrator is going to be knocked out the bad like every few minutes or so I here you know if you have a if you have like a hundred hits or so and know 10 of them are interesting so be it I mean it that's that's pretty good at 10 of them are actually vulnerabilities while that's not a problem but it's still here here's the goal of all we want to build a we want to build a robust such that search engine for source approach so there's going to be a robust parser I'm not talking about what robustness means of in this stock and 1 of the there couple of all the torques around on this topic that ever done and I talk about this in great detail but I figured out nobody really cares so it's a positive and that we use this positive to import the source code into this graph databases and in the back you have these pattern recognition algorithms that can be used for most of all the analyst can query the system get responses adapt the queries and query again the idea right so the 1st question we need to ask ourselves is where we actually need to be able to model instead of queries so 1 of the things that the auditor might want to specify a and I think the 3 things but after analyzing lots of clouds we found that the 3 things that you are 1 2 specifies what is what are the statements look like then we get from 1 statement to another and how the statements affect each other and the nice thing is we don't have to come up with new ways of dealing with this because compilers need to deal with this all the time so let's see how compilers still with us so in compiler construction you have all sorts of graphs are representations of road so for example there's the abstract syntax tree of this tree it essentially shows you exactly how a language constructs have been nested to create your program so you can see for example OK there's an if here and there's a predicate of this if and the predicate is fixed axis smaller than max and then there's a statement nested inside the is and so on now 1 of the things that you probably all know is this the Contra control flow graph is what you get in IDA Pro for example but this tells you how how statements can be reached and as in what conditions need to be true for me to reach a statement for example here you see that X needs to be smaller to match to reach this statement y equals 2 times 2 times X and then use 1 that not so many people know it's called the program dependence graph and the program dependence graph gives you data flow information so you can see for example that there is a value called X and it's produced at 1 statement here and it's being used by another statement here so these are the 3 main data structures that can be used to to analyze both compilers
and the problem is you know you can just choose 1 and do everything this so none of these is perfect if you take a look at a typical query it's more like you know find the call to prove this statement where dataflow exist to a statement that contains a call to bar so actually what you're doing is 1st you describing this syntax saying find culture through that's something syntactical then you're saying but now I want to reason about data-flow now I wanna see with this data you know is propagated to and then again when you follow the dataflow link you wanna go back into the syntax because you want to find calls to walk so there is this need to create transitions from the syntax tree to a dependency graph and back to the syntax tree the question is can we get a representation that have all the main observation to do this is that actually in each of these graph representations there's a node for each statement so why not just merge them right I mean the main trick here is that we actually have labels on the edges to to distinguish different types of edges that we have but we know now we have a data structure that represents all of this in 1 and this is what we call the the core property graph and now the question is can we may be described vulnerabilities as subgraphs in the scope of graph that once we have this and
we wanted to some storage to query it and we try to store it in relational database management systems because there are a lot of good reversing tools that use relational database management system so we're thinking of why not and that we failed completely writing a lot of coaches map graphs 2 tables and fail then we tried out document databases for some reason I'm not sure why but of course that feels as well because mapping documents to graphs is hard but actually mapping graphs to graphs exceeds immediately so yeah which are not using a graph database and is a short wrote about the relational database management systems of course you could just as create tables for nodes and then create like a joint table for the edges but this doesn't scale so the lookup time that you get on that they are joined tables of will become proportional to the size of the what's the size of the graph and this will become very inefficient so this is like this is 1 of the 1 of the main things that graph databases you natively and the support for these for efficiently wearing these graph structures for so graph databases they don't use tables as their native storage format but instead they use what is called a property crash and a property graph is really just a graph where you can attach like dictionaries to the notes what happened dictionaries or hashes if you like I and you can put labels on the edges that's it so here's an example of so you have a name and an age uh attached to each of the notes and you have different kinds of edges so created and knows
and with this graphic and of the nice thing about the graph databases you already have languages to query this and there are 2 very popular languages well actually there are a lot more with no new for j the most popular graph database you choose 1 of them is the site of query language which is not so also for what we want to do problem and the other is gremlin and gremlins pretty basic you choose a set of start nodes then you describe how we want to walk on the graph and if it's possible to walk the graph according to your description of the notes that you reach they will be returned that that's it so for example return all objects created by people over the age of 13 that would be the start all the people with age because 30 and then fall out going created edges so this is what that would look like in gremlin so Peter and John Archibald or with 30 right and now we started goes and we then for the created links and we stop at the things that they've created and this will be rich at the really nice
thing is that you can get something like stored procedures you once you've defined you know something that you commonly look for you can give it a name you can say all the people who are over 30 you've created something nice for us are people we can find now and then you can just say people we can fire now and you know that yeah you don't have to rewrite this thing over and over again and you can create really complex traversals they're called and you know save them in these pipes and then reuse these pipes all over so initially this
stuff was created mainly for social networks right so graph databases I think 1 of the events where they became publicly more known was when Facebook introduced graph search and they allow the you've allowed you to look for all sorts of fun social things for example you
could look for people like English Defence League and carry or you could
look for mothers of Catholics from Italy who like directs right and that those with the kind of examples that you know about the broader broader attention but the nice thing is you don't have to store useless crap and database graph databases you can also store for example the court property graph in graph database because by definition of course that's how I was mostly built it is a property graph OK and then
once you do that you can use the stored procedure trick to actually build a domain-specific Query Language for vulnerability discovery will you and called all of your ideas of what kind of things you would like to check for and
we did this and we build a platform and you can use this downloaded from its body everything's available that that form called Europe and some of its processes see and you know participants plus of and there's a Python scripting interface and some shell utilities you can use and it's actually been tested to work on other people's machines so make me very happy that it doesn't just on my machine gun is probably the 1st of my projects where this is true and and you can download it at an L sector or you're right so let's
take a look at what the system does and how we can use it so the 1st step is to import to code and the code is written to disk and the graph database server can then started and it accesses the disk and this graph database server exposes a REST API so you can see this comes from when people and I tried to hide this little bit because you know what I really think the whole web stuff is necessary at this point so I wrote a little I wrote a little library for you it's called Python your and you can just use this database without knowing that there's this rest stuff in between it's a very comfortable but now you can you know just and import of your the your library you specify query you connect to the database you run it you print the results but that's that's that's that's a fully functional script here except for the fact that I I did not include the query or if you like the shell you can use the shell so we we did practical evaluation of what this tool can do and it's going to make this very short in the original talk we talked about all the details of this but that was actually not so interesting but the point here is that this is not some this is not some purely academic thing that really doesn't produce results but in fact we we use the Linux kernel as a case study and I actually got somebody in the industry to to use this so this is my good friend people would to hear from up on and and what he said that they found about a hundred issues in an internal audit now I don't know what they qualify as issues so we were not able to really he was not able to really give this out but then you know we also did some we did some of them more open queries that we could share with everyone so we did 2 words for buffer overflow splits kernel 1 for the white allocations 1 for memory mapping works and 1 from memory disclosure and this was the outcome of we found 18 vulnerabilities acknowledged and fixed by the developers and you can see we also that's since your identifiers so these are not just some problems that our security people like to think of his problems but also the developer said OK yeah this is a vulnerability that's that but now for a case study and I think this is the more interesting case study so obviously so this is what you
see and is what you would do if you were to run this on you see so 1st you you run your to import obviously and this takes a while and then you start the new for graph database server so that's the correct address and you can point your browser to this address here just for some statistics so you can see that the you'll see could produce 2 million nodes about and 5 million properties for millions of almost 5 million relationships these of the edges in there and yeah and and it takes about 705 ones now in those 705 megabytes and lecturer in those yeah I should be in those and there's also integrate text search engine based on Apache Lucene and we're going to use that for for start node selection so this is like an extra feature that's about new so here's an example of a very simple idea of queries so this is just an seemed really so let's say we want to have all the files that have the name the marks inside them so we can write this period type violent viable the marks we can practice you and look up and it will give us all of these past and also the that may have probably thinking OK so when we need this I could just you find right so now comes the magic the magic is that once you once you are the final so you can actually descent into the contents of these files right so the entire code base is just 1 huge graph and you can transition from the different representations can transition from 5 to the content you can transition from syntax trees to the control flow graph and so on you can also transition from functions to the functions that they call you a call call program of all this possible so we put this type of violence part of the marks into 1 of these customs steps that we define which is the query node index of which would just execute this this inquiry into their arms we pull out going edges and take only those which are labeled as function so suddenly you have all the functions inside files called the month OK so let's let's take this force them to find vulnerabilities and let's take a take a look at some of the SSH struck again that this is h but again that we saw such different as a found so we might wanna formulate the following query in our head before we you know make code out of it so let's get calls to mellow out where the 1st argument contains an additive expression and a call to Member companies reached by data flow where the 3rd argument also contains an additive expression and the 2 additive expressions are not equal or exceed formulate something like this it's let's see you know what's what's the idea in the idea is that the size that's been calculated is on equal to that of the buffer that thing reach and it's pretty unsafe because doing the additions directly inside the allocation and the cop because that is what that would look like as a query it cost to a that's 1 of those utilities and from there you go to the of its 1st argument which is the amount to allocate and store that right now I take only those which are additive expressions and from those additive expressions you you in in the tree now you go up to the enclosing statement and now we can fall the dutiful links so now you can say fall data foldings where we can now you go back into the the AST find calls to Member copying where the 3rd argument is unequal to the stuff we save that's the the 1st argument to and this is also answer that's but that's the type that you type it into your and look up you sort it unique you use your location to get the actual locations you by the to flu right was it your file now you cat food and you get a foreign it's in the entire corpus and I really thought 0 and before it's nice to have something the force so great through 4 4 and your code which will look at the code and this is where it outputs and I would ask you have to trigger this part auditory says it so and what you see here is code to parse and for files and here's your model and it has an additive expression inside so that's instead 7 then we would allocate 7 plus 1 minus theta is 0 and then here's the men copy that's actually been reached by flow and the 3rd argument here also contains an additive expression and this time it is the size of minus state and if we insert 7 and that's actually minus 1 so we have enough uh a buffer that's allocated to close to 0 White we have a copy of minus 1 test to unsigned integer which is max into it in there we have a buffer overflow in immediacy right so that's the 1st example and of course the question arises what why search for that and so on what 1 of the things that I found very inspiring about this question is something I said in a book called the art software security assessment which is really cool book if you're into 1 another and in that book it says in fact of office typically start reviewing new code base by finding the equivalent of the you told Eric tree and reading the framework and glue code line by line now what was it means that means let's understand the language 1st as been spoken and by that I don't just mean the programming language but the abstractions that are being used and if we can find something that's fundamentally hockey use about these abstractions then you know looking for problems related to these is probably going to so we'll see I took a look at the stream processing API and that's mainly located in the streamed H and is 1 interesting function that you can find here which is the function of stream size and as the name implies it returns of the size of the stream now clearly stream you know this this could be some media files and you want your videos to be larger than 4 gigabytes so you'd have to return in and 64 and not just in into a 32 see 164 bytes another problem of course is if you have a 32 bit platform than all the locations that you're doing these waiting to be happening you guys have to be very very careful about these because they're probably going to a will fall or be truncated but yeah so the idea was 0 and this is realistic for an attack because you know a video can be larger than 4 gigabytes and nobody would say 0 wow this must be malicious nodes just that's why that right so now
that's look for truncations and center locations related to this so you may statements containing cost size and the symbol in 64 where flow exists the statements containing the simple model and this query is much smaller right get cost to again so costume size go to the enclosing statement from there take only those that contained in 64 and now follow the floor and see what we can actually reach and filter out only those were molecules so these are locations connected to the stream size and this is where you get a not very much there are 2 interesting except for this 1 because this 1 is in the automatic update the to use the code for this
what you see is here stream size being called right and so there's this integer called I read in 64 and and it's being used inside the model here and again we add 1 know what happens is of a 32 bit platform the argument that's expected by my work is actually 32 bit inside so it 1st takes this I read it actually adds 1 and that's not problematic yet so you still have a 64 bit integer but then when it's passed to my look at that point it needs to be truncated so again if you have if you choose this to be the maximum size of 32 bit integer and you add 1 which you actually get is 0 and then down here you have a stream read operations screen readers pretty much like men copy but much better as you'll see in a minute and it again reads I read bytes into this buffer of which is max in so they they they go you have floor again the updated probably going say
OK the updated there's probably going to be some sort of you know signatures involved to verify and update is valid or not the nice thing is all of this actually happens before signature validation it can also be using HTK so you know it's it's just clear text that you need to transfer rights to
execute the fortifications
that that so I check this out I edited easy hosts to you know point update the dual and don't work to my personal web server and created a file with the max in size containing always attached to the Barberton we'll see I check for updates and here's what happens and that's very very convenient because typically you know when you started about something like this you have to do a lot of work to get the instruction pointer to point to your race French but in this case actually worked immediately we're going to see in a minute why that's the case so the nice thing means arguing in favor of patching catching gets very very the the purity control yeah he no there's no question now this is a security but people can tell my program where to jump next year so on exploitation and why why this the work so for 1 thing and this is important you know actually have to transmit 42 by file that would just take too long and you know that that would not be so instead it's perfectly fine to use the content length field here and set it to that value so you actually telling it eventually you will receive 4 gigabytes now I'm just going to give you a few bytes went up so the attacker also fully controls the amount of data to actually copy because you can't you can charge the HET of responses that you have and this is the important part stream read so this version of MEM copy actually copies in blocks and in between these copy operations at the reference is a function pointer meaning that if you can override this function for the words you accidentally overwrite that function pointers I obviously did at that point then as the references the function pointer you already control workshops next baseline is typically unable but there's some position-dependent approach that allows you to build a smaller of change and the downside of this is multithreading so it's not so easy to explain right I have the demo exploits I'm is running out of time on community written with then there
others it's a bit unstable
so it works every 5th time so what you know I mean fired at 5 thousand
host you get a thousand shells check for updates and it doesn't work and it's this was not mentioned in the Quixote the
and then the set of the 2nd
block the incident OK something is broken this is this is the thing of light most of OK something very wrong connecting it lost its connection actually to the to the host that's because this guy's
standard data and that actually display something OK anyway I'm running out of time you can see the
working exploit if you really
want to see it in the CCC video it does work promise and but so hard to have honourably discovery using clustering so wants to explore lots of boxes this you start to see that it would be very nice to actually designed some templates that you could fill up front about box so for example if you take partly then what actually has the same structure as the box we looked at as well so you kind of have this of this attractor controlled payload field here and again there's this network articles by 5 order conversion and in the original coat this check here was not in place and then eventually reaches the 3rd argument of men cropping and since the check is not a place you can copy more than should and by that time you think OK let's just create a template for this you know get cost to some sinc function and then you wanna describe what the arguments to the function look like each of them and how they should be checked and if that's not the case in output so these are the kind of queries that that I use these days so knuckles long queries but something like this and this gives you a graph like functionality but wouldn't be
much nicer if the system could also actually tell us what is inside the predicted codebase so could it maybe tell us these are common data flows can we get L S like functionality can it automatically fill out this template for us to show us the most common data flows in it in the code base and the programming was attached to this so things like there was always the check here and a related question is and if we have a barber ready and can we may be generated like a robust signature for so
yeah that's scenario user template we want to learn regular expressions for this on both for the sources and for the sensitizers but yet you can think of this as a little car and there's what you learn use the model parameters that's a driver of the car and once you have that car you
put it into the core property graph drives down the code property graph and you know as a side effect it gives you a list of file that conceptually seeing this is the machine learning problem yeah machine learning problem machine was machine learning and all that really just means you take data you put data into your system and it creates this model for you so this strike and then use that model to instantiate a predictor in the predictor is a poor thing that it gets more of this data and it needs to somehow make sense of now known increases the same you have the source code you put it into little of and it outputs those templates parameters and then the query that you get the that's the predictor and you from Warsaw good at it and it drives down the quote probably graph is as vulnerable OK wonderful that's the but conceptually we can borrow from signature generation here so in signature generation and you take you take the data and you 1st extract objects from it you map this to a vector space and in there you can use classical clustering algorithms to find similar objects and then for each group of similar objects you generate 1 of these queries you now I'm not going to bore you with the details we build this of and if you're into this kind of stuff you can read this up essentially the objects we classify are invocations as in cost of the functions of them and we then use a representation where each dimension of the vector space is actually actually associated with the regular expressions so that's that's a nice trick if you like machine learning and then use those regular expressions to instantiate the template and if you're interested in the details as a set of this is coming out but also there's a pay it paper available already which is called automatic inference of such patterns has changed of abilities and just to quickly get to the results this is the query that was automatically generated for lead to lock the box so you can see it says that the 3rd argument that that there's a common flow in the database which is the 3rd argument is initialized by into S out and so for the 2nd argument is found OK I don't really know how this is initialized what and it's always checked against small telling the programmer disappointed and then for the 3rd argument I find that there is typically some sort of comparison with the number of this is what's automatically extracted and looks pretty close to what we're all manually and the nice thing is if you run this it is and should just find 1 instance of the heart the book but also the other right so there are 2 instances of this and he's a false positives and yet the system as false positives but yes then how we also use this and we'll see and just ask you know give us all the systems from and copy and 2 of these were interesting in their own right because by committee is looking at these URI see OK this is dangerous so here you have a buffer on the stack and controlled by a rival called land and so on and and there's some sort of addition in here so this is essentially something like a lock you can you can do this in a safe way if this is too large then actually on the you will you have memory that's that's not allocated that you actually copy copy to and this found and then the other 1 is you have to get some sort of get that controls on the 3rd argument to mimic property this from parts as well I'm wrapping up on these indices we got for these are this is pretty much just going over the 60 systems and picking these 2 and these immediately up direct inclusion of I introduce a system to mine coal basis for vulnerabilities and it builds a bridge between program analysis and graph databases and you saw that the you can just specify these traverse the search patterns and by yourself but also you can get the system to enumerate what's there using learning techniques and this is something that on the 1 hand tells you about what you could look for on the other hand give you robust signatures from old books that you have yeah we found real box using this so it kind of works that's all I can say is thank you FIL


  639 ms - page object


AV-Portal 3.16.0 (9cfa3864b8acb689056f9c67aa39bc8ec4c75d58)