AI VILLAGE - Chatting with Your Programs to Find Vulnerabilities

Video thumbnail (Frame 0) Video thumbnail (Frame 737) Video thumbnail (Frame 4177) Video thumbnail (Frame 5032) Video thumbnail (Frame 8186) Video thumbnail (Frame 12115) Video thumbnail (Frame 14605) Video thumbnail (Frame 15656)
Video in TIB AV-Portal: AI VILLAGE - Chatting with Your Programs to Find Vulnerabilities

Formal Metadata

AI VILLAGE - Chatting with Your Programs to Find Vulnerabilities
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
During the Cyber Grand Challenge, an automated vulnerability exploitation competition, all the teams used the same approach: use a fuzzer to find bugs, and symbolic execution to generate an exploit for any bugs found. Fuzzers are great at triggering bugs, but their effectiveness is often limited by the quality of the initial testcase corpus that is fed to them. Testcases are easy for humans to create, but hard to generate automatically. Teams used a wide variety of techniques to generate initial seeds: from using very slow symbolic execution techniques to find inputs that triggered execution paths, to just using the word “fuzz” as the seed and hoping for the best. However, many of the programs in the CGC are console programs designed to be used by humans: meaning they give a prompt in English and expect a response. For this research we trained a chatbot Recurrent Neural Network on a set of testcases generated by humans, and ran the RNN against the test set with the goal of finding testcases that had higher code coverage than random guessing and could be used with a fuzzer to find bugs.
Machine learning Expert system Speech synthesis Self-organization Bit Computer programming Vulnerability (computing) Social class Reverse engineering
NP-hard Scheduling (computing) Thread (computing) Divisor Computer file Code Multiplication sign Patch (Unix) Virtual machine Mereology Perspective (visual) Field (computer science) Software bug Number Fraction (mathematics) Crash (computing) Goodness of fit Term (mathematics) Computer configuration Computer programming Autonomic computing System programming Energy level Software testing Endliche Modelltheorie Computer architecture Vulnerability (computing) Cybersex Algorithm Software engineering Projective plane Fitness function Expert system Bit Exploit (computer security) Measurement Process (computing) Personal digital assistant Computer science output Fuzzy logic Spacetime
Randomization Run time (program lifecycle phase) Computer file Code Multiplication sign Image processing Virtual machine Control flow Mereology Computer programming Junction (traffic) 2 (number) Goodness of fit Mathematics Computer configuration Different (Kate Ryan album) Computer programming System programming Software testing Functional programming Nichtlineares Gleichungssystem Scaling (geometry) Electric generator Information Validity (statistics) File format Interactive television Sound effect Bit Limit (category theory) Symbol table Digital photography Process (computing) Personal digital assistant Order (biology) Fuzzy logic output Heuristic Cycle (graph theory) Library (computing) Probability density function
Run time (program lifecycle phase) Code Multiplication sign Chatterbot Computer reservations system Set (mathematics) Computer programming Software bug Online chat Different (Kate Ryan album) Single-precision floating-point format Data conversion Error message Cybersex Algorithm Block (periodic table) Binary code Sequence Recurrence relation Process (computing) Befehlsprozessor Internet service provider Order (biology) output Resultant Point (geometry) Divisor Virtual machine Number Wave packet 2 (number) Average Computer programming Energy level System programming Software testing Mathematical optimization Hydraulic jump Dependent and independent variables Artificial neural network Projective plane Cartesian coordinate system Symbol table Word Software Personal digital assistant Speech synthesis Fuzzy logic Natural language
Axiom of choice Parsing Multiplication sign Patch (Unix) Chatterbot Range (statistics) Virtual machine Combinational logic Menu (computing) Function (mathematics) Computer programming Number Wave packet Subset Software bug Web 2.0 Crash (computing) Machine learning Computer configuration Different (Kate Ryan album) Computer programming Operator (mathematics) System programming Software testing Form (programming) Vulnerability (computing) Area Dependent and independent variables Constraint (mathematics) Information File format Artificial neural network Interactive television Sound effect Bit Limit (category theory) Exploit (computer security) Symbol table Wave Word Befehlsprozessor Personal digital assistant Password Network topology Fuzzy logic Video game console
I'm Chris Gardner and this is chatting with your programs to find vulnerabilities where we are going to learn how to become therapists for our programs so to start off introduce
myself I recently graduated from UMBC which is where I did this research and while I was there I took a couple of classes on machine learning which in some organizations makes me a data science expert I also have done a lot of reverse engineering over the years and competed in many capsule fly competitions so I know a little bit about vulnerability research and speaking of vulnerability research it's
pretty expensive to do these days experts in the field are in high demand and costs a lot and plus they take a lot of time to do their jobs I think it's high time to replace people like me with a computer program and DARPA agrees with me so a couple years ago DARPA ran the cyber grand challenge here at DEFCON our CGC which challenged eight teams from industry and academia to create fully autonomous cyber reasoning systems to find exploit and patch bugs and computer programs as part of this DARPA created decree a special OS designed to be easy to model and analyze and decree you don't have to worry about networking file i/o thread scheduling remember the other hard problems in computer science and software engineering since there are no environmental factors each program written for decree is deterministic which makes to create good target for research projects so DARPA ran the cyber Grand Challenge or not have a job yet the winning cyber reasoning system went on to play against the humans and the DEFCON CTF and it didn't crash and burn horribly but it did get last place and additionally during the competition only a fraction of the program's actually had exploits written for them so we can do a little bit better there's space for new research in the field so if you want to build an automated vulnerability finding machine you basically have one option in terms of overall architecture we use a tool called the fuzzer to find crashes in a program and then use a technique known as symbolic execution to solve for an exploit symbolic execution is somewhat of a solved problem from a research perspective but there's a lot of space in the fuzzer level to apply cool AI tricks so let's talk about fuzzing basically if you're not familiar fussn't well fuzzing is as fuzzing takes an initial corpus of test cases which are inputs that cause some program to do something and randomly mutates them in the hopes of finding a bug many fuzzers called coverage guided fuzzers measure the code coverage each mutation to try and make sure that different parts of the code are tested and take a little bit of the variants out of the process one popular puzzle called American fuzzy lopper AFL actually uses some AI to do this it implements genetic algorithms to evolve the inputs using code coverage as the fitness function and fuzzers in AFL in particular work really really well in fact they work so well the combining AFL with more advanced techniques such as involving symbolic execution in the fuzzing process only marginally increases the number of crashes found but unfortunately buzzers aren't quite perfect coverage guided buzzers only
measure how much of the code is tested they cannot reason at all about what the code actually is they rely on random
chance and heuristics to boost the amount of code tested in a case like this program where asks you to enter 1 2 3 4 5 6 to trigger buggy code the photo breaks now even though the correct value is in the program it's even outputted to whoever's using the program the Flusser cannot make uses information and in order to trigger the buggy path the Flusser has to basically just guess 48 bits of information which is not really going to happen on any reasonable time
scale now this brings us to the limitations of fuzzers fuzzers are good at making small changes to find interesting execution paths but tend to be bad at making big coherent changes and the way you get around this is by providing the initial test case corpus if you're fuzzing a PDF reader the fuzzer isn't going to be able to create a valid PDF file from nothing most of the time so you should give the buzzer some valid PDF files to work with just to start it off and so in a sense the effectiveness of a buzzer is a function of the quality of the test case corpus and a good quality test case corpus should include inputs that exercises different parts of the program so for an image processing library good test cases would be a PNG file and a jpg file and maybe like a tiff or other weird formats generally test cases are easy for unskilled humans to create but hard for machines to create automatically but we can do something about that so to generate test cases automatically there's a few different solutions people have thought of and here are the three so the first method is use symbolic execution which I mentioned a little bit earlier to eight in the fuzzing process if the fuzzer ends up at a junction in a program so there's two paths and the fuzzer can't get the mutation engine in the fuzzer can't get the program to go down one specific path it can invoke a symbolic execution engine to solve for the input required for the program to take that path this has the advantage that it always works however it's very very slow we're talking execution times in the order of seconds whereas AFL mutations run in microseconds and this needs to happen thousands or millions of times during one Fuzz cycle and one example this kind of fuzzer is driller by the team shellfish which competed in the CDC but many of the other teams in the CDC also used this technique so another option is we can cheat and just add humans into the equation and a bunch of the shellfish guys developed a system where they would have random people on Amazon Mechanical Turk interact with the programs and then use those interactions as the initial test case corpus this actually worked really really well it almost actually worked better than just using symbolic execution in the fuzzing
process however it is cheating because we're using humans even though it's like a little more automatic since Amazon Mechanical Turk was pretty fast at finding work finding workers but and it's not quite automatic in the humans do take some time so you're still looking at run times in the order of seconds or minutes and while the humans didn't find any bugs they were in this paper they were able to create great test cases and AFL was able to take those and find bugs with those lastly we can try to apply AI to this problem there have been a few applications of AI to fuzzing namely using genetic algorithms in AFL which I mentioned which has been very successful it's also paper called deep reinforcement fuzzing which outlines a method use reinforcement learning instead of genetic algorithms and fuzzing the method the method is a little young and I don't think they released any code from that paper but early results seem to outperform AFL by about 10% which is definitely worth the effort and lastly there's our idea which approaches the problem in a different way than these two methods so we have a bunch of conversations between unskilled humans and these programs it sounds like we have data many of the programs from the cyber Grand Challenge are console programs meant to be used by humans like a good application for a chatbot and that's what we did we trained a chap on the results of the human assistance CRS paper that the shellfish guys wrote and sampled that on the input from new programs to provide test cases and also a big shout-out to czardas from shellfish for giving us the data and making this research possible so despite the despite the opinions of my machine learning professor I'm not actually good at machine learning so instead of writing my own like chat pod imitation I kind of just pulled one off of github it's this kind of a standard chat bot uses recurrent neural networks sequence to sequence blah blah blah buzzwords buzzwords and basically I just modified it to work on a byte level instead of a word level kind of discarding all the optimizations that people can do with words and let it rain and speaking of training since no one likes to fund undergrad research projects we only trained the network for about five hours on a CPU although was a nice CPU for the data we used about 3000 test cases from the H acrs paper which corresponded to about 50,000 question response pairs so 50,000 samples and for the test set we had 200 programs in the cyber grant challenge that we had data for so we excluded six of those from the training set and that was our test set and so the way we tested it is we measured the code coverage of the generated test cases and then compared that with the code coverage of the H acrs test cases code coverage the way you calculated is it's the number of basic blocks in each
program a basic block is one uninterrupted sequence of assembly instructions with no jumps and we just you take the number of basic blocks you hit you measured using instrumentation and divide that by the total number and basically we or I decided if a test case has higher code coverage it's a higher quality test case that's not the best way to measure these things but it was a quick and dirty method and it worked really fast so I did it and specifically we generated 10 different test cases for each program and then we took the average code coverage of our test case and also took the Union code coverage of those 10 test cases and then compared those with the H acrs test cases and we had about 30 test cases per binary to Train on so the results as expected we didn't really do quite as well as the humans but we only under perform them by a single percentage point in some cases and the worst we ever did was only ten percent under the humans which is pretty great and we way outperformed restoring random data or using the word fuzz as your initial seed for everything which some teams did during CDC and thankfully this system is also really really fast even though it's not quite as good as humans or using symbolic execution it runs in milliseconds compared to seconds for symbolic execution in humans and the actually the big bottleneck for that is not even the neural network it's just communicating with the target program and also the network wasn't just like throwing garbage and like hitting error cases and doing stuff it was actually navigating the menus of these applications and responding with sometimes invalid but a lot of times mostly valid data it wasn't perfect and so sometimes you only got half way there so for example
here's one of the programs tested the program prompts it asks for a number between 100 and 1000 and it responded with a number but not in that range it responded with 40 with love leading zero but it's important to note that that is actually only one bit away from the correct response it can just flip a bit to turn the leading zero into a one or two or whatever and then the fuzzer should be able to easily make that change and take it from there and then it's presented with a menu and it actually does pick it up pick a valid option from those choices showing that this approach is capable of recognizing and navigating the basic menus that are in many of the CGC programs the approach does have some limitations so and the
big one is that it only works on CDC programs or and most of CDC programs or toys this doesn't quite work as well in the real world because the CDC programs are little contrived compared to real the real-world programs however it might be possible to apply this to other like human interactable programs so just like GUI programs or web forms or other like simple console programs or stuff like that and lastly this research is going to make many waves in file format fuzzing which is what a lot of the new research is focused on that's me AFL is mainly a file format fuzzer and since this kind of relies on output program before the program does like really any major parsing in most like file format fuzzing situations it doesn't work there are a few ways this research could be extended though mainly we can train do we can train for longer add more data like I said we train for five hours in a CPU which is barely nothing in machine learning time and we were only able to acquire a small subset of the data in the HSE RS paper so it might be helpful to get more data can also try training on different data it might be possible to train on exploits or crashes instead of just test cases and then get crashes from those in the biggest area to combine this tree to move this research forward we would be to combine this technique with a symbolic buzzer like driller giving it another fast option to try when the mutator engine gets stuck instead of just diving into a long symbolic execution operation the fuzzer can try to approximate a first using machine learning this would allow driller to focus on solving the hard constraints like a hard-coded password rather than wasting time on navigating simple menus so to wrap this up here's a quick summary here's what I told you vulnerability research is expensive takes a really long time and it's high time we work on automating it cyber reasoning systems are one approach to that and they work by or they automatically find exploit and patch bugs generally using a combination of fuzzing and symbolic execution fuzzer effectiveness is often limited by the quality of the initial test case corpus which machines are currently bad at making automatically we can use symbolic execution to make test cases slowly or we can do it really fast with a recurrent neural network chat bot which was trained on human interactions and the chat bot works pretty well way better than just the word fuzz and finally we did really well with just a few hours on a CPU it would probably work a lot better with a GPU more time or more data all right thank you guys so much for listening me that's and my talk my contact info is listed if you don't have quite time for questions now please feel free to reach out to me on either of those I'm very responsive yeah please questions comments complaints private party invites you know to do alright to have time for questions all right any questions No okay alright thank you guys so much for listening [Applause]