Go Speed Tracer

Video in TIB AV-Portal: Go Speed Tracer

Formal Metadata

Go Speed Tracer
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

The past few years have seen a leap in fuzzing technology. The original paradigm established a decade ago resulted in two widely deployed approaches to fuzzing: sample-based mutation and model-based generation. Thanks to ever-increasing computational performance and better engineering, newer guided fuzzing approaches have proven to be supremely effective with a low cost of deployment. This talk will explore a few different approaches to guided fuzzing through dynamic analysis, including code coverage analysis, constraint solving, and sampling/profiling-based feedback mechanisms. Novel contributions in this talk include:
- Open-source Windows driver enabling Intel "Processor Trace"
- DBI-based tracing engine for Windows/Linux/OSX binaries
- American Fuzzy Lop with full support for Windows binary targets
OK, we're going to get started again. So this is the last talk of the conference, actually; it's been great. Two important announcements: first of all, all the speakers should please go see Kat at the registration desk before we all clear out of here. And a second important announcement, just a warning, that the mysterious bag of Doritos on the table outside is not official conference-provided snacks, so we suspect it's a crunchy honeytrap. So with that I'd like to introduce Richard Johnson. Hello, yes, I'm Richard Johnson, and today my talk is Go Speed Tracer.
To start off, I'm honored to be speaking here at the 10th anniversary REcon. This is the third time in the last four years I've been here; you might have seen some of my earlier talks on concolic execution and other high-speed, high-performance fuzzing. In this talk we're going to drill a bit further down into tracing engine design, as well as some of the background that has taken us from the early explorations into evolutionary testing and genetic algorithms as applied to fuzzing ten years ago, to how we arrived at the modern toolkits like American Fuzzy Lop. I'm a research manager for a team at Cisco, by way of Sourcefire. Several of our team members are in the audience, and hopefully you've had a chance to talk to them; generally speaking, we work on finding bugs and writing tools to help assess software. I do want to give a special shout-out to a collaborator of ours who is also part of Talos on my team, Andrea Allievi; he's been really essential in helping put together some of the Windows driver side of this talk. At the beginning I'm going to give a brief overview of how tracing can be applied in your day-to-day; I'll probably skim through some of the early stuff fairly quickly, since we've added a substantial amount of information to this talk since the first version. Then we'll go into guided fuzzing: which approaches to guided fuzzing had merit along the way over the last 10 years, which attributes actually influenced where we are today, and why we have things like high-performance, feedback-signal-oriented mutation fuzzing that is highly effective. Then we'll specifically talk about engines such as binary translation and the hardware tracing engines that are available in your CPUs, specifically Intel's. The goal is to get an idea of which attributes are useful if you're writing your own tools, what things to keep in mind that we can learn from the past work of the giants we stand on, and then to figure out what things remain to be optimized and how we can push the technology forward continually. And then finally, unlocking this new feature set in the Intel processors, Broadwell and Skylake: a completely new hardware tracing mode that was recently added to the Linux kernel but to date has not really had support on Windows, and it's very efficient, so we'll get into details on that later. As far as applications for doing tracing: obviously in software engineering for unit testing and performance monitoring; it's useful in malware analysis for unpacking and behavioral analysis in automated sandboxes and things like that; we saw Microsoft give away a quarter million dollars for mitigations based upon LBR tracing engines; and memory safety checkers oftentimes also do tracing for data flow and things like that. But for security we're mostly interested in applying tracing to things like corpus distillation, so that you can observe whether or not the inputs you might be using in fuzz testing are useful, and try to reach the minimal set, reducing your total input space down to the things that will be effective if you're going to go and randomly mutate them. More recently, that means getting into guided fuzzing and being able to provide a feedback signal during the testing process. And of course in post-crash analysis, tracing will allow you to do things like recover data flow and try to automate some of the root cause analysis of the crashes that are occurring. We released tools previously that did something like that, in my talk a couple of years ago, and then some other tools written on top of that which do slicing and symbolic execution and so forth. Most recently, for interactive debugging, GDB now has some hooks into the Linux perf subsystem so that you can turn on some interesting features that are available, so when it comes time to do crash
analysis, you have a trace log there, so if you have a corrupted stack, for example, you can simply walk back through the trace, which is a complete recording. Those are the general applications people have written tracing for, plus a kind of higher-level hooking-style tracing at the function level or API level. Last year Alex's talk on hooking Nirvana was really fantastic, if you haven't seen it, and it influenced me quite a bit; I actually wrote a hooking engine built upon the Nirvana and App Verifier stuff to help launch some of this tracing logic as well. In general there are things like the trace interface to the debug engine; this is REcon, so I won't have to go into detail there. Then for binary instrumentation we see things like compiler plug-ins that are efficiently able to interleave callbacks into all functions or blocks, in GCC and LLVM, and we've also probably all been familiar with things like Pin and DynamoRIO, and more recently Dyninst, which does static rewriting; we released some tools based upon that. I will go into detail about the benefits and the internals and design of some of these engines, so that we can understand, with Pin and DynamoRIO, what the performance differentials are and which design decisions determine whether or not a DBI is actually performant. And recently we have new engines, things like Frida, and I'm sure there are others that I'm not going to enumerate. Tracing is ubiquitous, available, and useful, and something you should be interested in, but it's also important when you go to pick a tracing engine that you pick one that's tuned towards the application you're looking to apply it to. Besides the software-oriented DBI engines, there's been hardware-supported tracing going all the way back to the Pentium 4, and actually the P6
and even before, if you count single-step breakpoints. I'm sure we've all done a little bit of an interactive debugging solution with single steps and breakpoints, maybe back in the Process Stalker days, or even in the discussion this year about the branch trace flag, which is a simple toggle that turns a single step into an actual branch step. Last Branch Record is functionality in the CPU that is essentially a set of MSR registers: whenever you take a branch or a call within your logic, it will put the last branch location into a set of dedicated registers. Branch Trace Store is an extension on that, which takes those registers and pushes them into a cache that you can retrieve later as well. The latest thing I'll talk about is Intel Processor Trace, and then there's ARM CoreSight, which was discussed the other day. So before we get really deep into the internal architecture of tracing engines, let's take a kind of walk through guided fuzzing and how we got to where we are today. Can I get a quick raise of hands if anybody in the audience has used American Fuzzy Lop, or honggfuzz, or Jared DeMott's EFS back in the day, anything like that? Maybe a quarter, OK. So evolutionary testing was originally white-box testing done on source code, typically for highly critical systems that needed safety against failure, including denial of service and things like that. The source allowed them to do a priori analysis of the graph and to do things like static analysis that determined areas that might lead to exceptional conditions, and then they would use evolutionary testing to try to craft inputs that would get you to those locations. This allowed them to define a fitness function that was essentially distance from a target area in the graph: you know where you want to go on the graph, you do a trace, record where you visited in this particular iteration, and determine whether or not you got close to where you're trying to land. Now, when we do vulnerability research we don't necessarily know where the bugs are; this is just how that work influenced the later development of fuzz testing. Ultimately it was still difficult to define properties that tell you how to traverse graph edges and get to the location you're interested in, and the applications were not really security-specific in the sense of finding vulns you can exploit; it was the early generation, mostly governments, avionics, some automotive, things like that. Bringing it into the fuzzing realm, what we're trying to do is apply this concept of having an evaluation of fitness, an evaluation of where you went in the graph over your execution, to determine if it was better than your last mutation: are you visiting new areas of the graph, essentially. In the concept of evolutionary testing you have a sample set, which is your pool of samples; you have operations that you perform on your samples, which tend to be, just like in genetics, random mutations and crossovers; and then you have a way to evaluate whether or not those random mutations or crossovers were actually fit to continue, where in this case fitness means generating inputs that will eventually lead to crashes or discover more code coverage. So the first time
this was discussed in public, as applied to security testing for finding vulns, was at Black Hat in 2006, when Embleton and Sparks talked about a tool called Sidewinder. It was actually really cool; I think the approach was pretty intelligent. They took a Markov process, which is a way to weight a graph for essentially the likelihood of visiting an edge over time, so that you can compare one trace against the entire history of all your other traces very efficiently. They took a pretty simple interpretation of genetic algorithms, really only applying random mutation and occasionally crossover between previously known good inputs that got new code coverage. And they used a context-free grammar to describe the input rather than just doing simple bit flipping, so they were able to actually mutate the grammar itself and then use the grammar to generate new inputs that match the now-modified grammar. The only real weakness was that they didn't have a well-engineered tracing engine that was very efficient. The actual genetic, evolutionary-testing setup was pretty sound, but they didn't have a quick way to apply it in practice the way that we do today, and it takes millions of iterations to really shake out the bugs when fuzzing. Their major contributions were, first of all, actually taking genetic algorithms, applying them to fuzzing directly, and talking about it publicly, which put that seed into the community for further research; and second, the fact that they used the Markov process, which is an efficient data structure to store the entire history of traces. Rather than having to parse a binary format or a list of addresses that you executed each time and do diffs over that, with a Markov process you keep that state all in one data structure. This system also allowed them to select target code locations ahead of time, which you do by adding weights to the graph. In this case what you're looking for is unique paths, typically: you want to find the exceptional path, the one that's not taken by all the inputs. When you find a bug it tends to be in less-traveled code, so you weight the paths in a way that says: if this input reached a transition on the graph that was rare, it is probably a better input than one that only reaches transitions on the graph that are common. So in addition to holding the entire history of all your inputs, doing it this way you can also specifically point out parts of the graph and weight them the way you want, so you can drive the feedback directly. The only problems were that they never open-sourced it or really discussed much about the internal architecture beyond one set of slides, and that's too bad, because a lot of the ideas I think were off to a great start; they were just really missing the tracing engine.
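As a rough sketch of that rarity-weighting idea (hypothetical names; Sidewinder's actual implementation was never released), you can keep a global count of how often each graph edge has ever been seen and score an input by the rarity of the edges it hits:

```python
from collections import Counter

class RarityScorer:
    """Score traces so inputs hitting rarely-seen edges rank higher."""

    def __init__(self):
        self.edge_counts = Counter()  # global history across all runs

    def score(self, trace_edges):
        # Rarer edges contribute more; never-seen edges are worth the most.
        return sum(1.0 / (1 + self.edge_counts[e]) for e in set(trace_edges))

    def record(self, trace_edges):
        # Fold this run into the global history.
        self.edge_counts.update(set(trace_edges))

scorer = RarityScorer()
scorer.record([("a", "b"), ("b", "c")])        # common path, seen once
scorer.record([("a", "b"), ("b", "c")])        # seen again
common = scorer.score([("a", "b"), ("b", "c")])
rare = scorer.score([("a", "b"), ("b", "d")])  # ("b","d") never seen before
assert rare > common
```

Targeting specific code ahead of time then just means seeding `edge_counts` (or a separate weight table) with artificially low counts for the edges you care about.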
The next year we had a talk from Jared DeMott on his PhD work on evolutionary testing, and unfortunately I feel as though he took it to an extreme, trying to track too many variables and getting too close to the biology of things, rather than remembering that we're trying to come up with an efficient signal to let us determine whether or not the execution that just occurred was useful. That led to a system that was both slow, due to the tracing mechanism of BTS or the debug APIs, and clunky, because it stored a lot of state in a database: you had to query the database every time, comparing your current iteration to all prior iterations through SQL queries, and as we know that's just not the fastest way to go. He had a larger concept of genetic algorithms as applied to testing that didn't really pan out that well. He did experiment with hooking at a function level and trying to use that for evolutionary testing, and it worked somewhat in his specific examples, but he was only really looking at text-mode protocols. Generally speaking, when you architect software, you're more likely to have a dedicated function per verb in a textual grammar protocol than when you have a densely packed structure that gets torn apart with a lot of logic in one particular kind of parser; since 2007 we've discovered that block-based granularity is the better one to go with. And then, of course, what I would call the new evolution in the last 2 to 3 years is a real, practical, well-engineered tool that is designed to be idiot-proof right out of the box: American Fuzzy Lop. It's a form of mutational fuzzer just like any other mutational fuzzer, except it has a lot more logic and intelligence to it. Most mutational fuzzers don't take a lot of setup, no reasoning over configurations or variables or figuring out weights on a Markov graph; they're just fire-and-forget, and American Fuzzy Lop has pretty much pulled off that same goal. What it does is use compile-time instrumentation out of the box: when you compile your code, at every logic branch, every if-then statement, it injects a callback, actually an inlined set of a couple of instructions, which updates a byte map that contains the state of the current trace and makes it comparable to all previous traces. What this looks like is 64 KB of memory, a byte map where you increment offsets to indicate the number of times you visited a branch. The way you determine the offset into that byte map is you take your previous block ID and your current block ID, XOR them together, and then take that modulus 64K; that's the offset you bump every time you take a branch. What this lets you do very efficiently, with just a couple of instructions, is encode an edge transition using nothing but a byte array, and then you can also key, depending on the current bytes in this array, into a total historical, global array that shows your entire corpus coverage. AFL also took a comprehensive approach to mutation strategies, the kind we've seen since SPIKE, I suppose: encoding specific mutation strategies that aren't just random bit flips at a percentage, but edge values for integers, format strings, and various other inputs that we know to be good sets for doing mutation. So in general, even in its dumb mode, it's already a good mutational fuzzer, and it adds this feedback mechanism through a compiler plug-in that tracks code coverage via these edge transitions encoded into a byte map. Ultimately it's the first practical high-performance guided fuzzer that's
available to the public. It even comes with a nice little UI, a kind of old-school ncurses-style layout, and it gives you a global coverage map. It ships tools to do corpus minimization and to minimize individual inputs, a lot of extra tooling, and it also includes a fork server, which reduces the initialization time of every iteration of the process: once you get the process launched, it just forks every time, which on Unix is copy-on-write. Last year in my talk we ported AFL to the Windows environment, which simply required implementing some of the shared-memory APIs, and then we also tried to write a Windows-native fork, which was successful using NtCreateUserProcess; however, when you create processes using that low-level system call, the CSRSS subsystem is unaware of them, and so you're limited to fuzzing essentially only binaries you've compiled yourself without user32, because on Windows almost everything links user32. So it turns out that wasn't exactly the best approach on Windows, which is why I got interested in looking at other tracing engines, and Intel PT, and things like that. AFL has certainly added a lot since then, with contributions from the community. Persistent mode is another one: essentially you can add annotations to a function and it will be able to restart at the beginning of that function, with a little cleanup logic you provide, so it launches the program and just does in-memory fuzzing. There haven't been very many tools ready to go that do that; since AFL, in the last two years, we've seen LLVM come out with libFuzzer, and we've seen some other things that accomplish the same goals, but American Fuzzy Lop really did guide the way in that direction.
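A minimal sketch of the edge-encoding scheme described above (simplified; real AFL also right-shifts the previous block ID by one so that the A→B and B→A edges map to different offsets, which this version deliberately omits to match the spoken description):

```python
MAP_SIZE = 1 << 16  # 64 KB coverage map, as in AFL

class EdgeCoverage:
    """AFL-style edge coverage: index = (prev ^ cur) % MAP_SIZE."""

    def __init__(self):
        self.bitmap = bytearray(MAP_SIZE)
        self.prev = 0

    def visit(self, block_id):
        # XOR previous and current block IDs to name the edge, then
        # bump a saturating 8-bit hit counter at that offset.
        idx = (self.prev ^ block_id) % MAP_SIZE
        self.bitmap[idx] = (self.bitmap[idx] + 1) & 0xFF
        self.prev = block_id

cov = EdgeCoverage()
for block in [0x1111, 0x2222, 0x1111, 0x2222]:  # A -> B -> A -> B
    cov.visit(block)
# Plain XOR is symmetric, so A->B and B->A collide on one offset here,
# which is exactly the collision AFL's extra shift avoids.
assert cov.bitmap[0x3333] == 3   # the two directions share 0x1111 ^ 0x2222
assert cov.bitmap[0x1111] == 1   # entry edge: 0 ^ 0x1111
```

The point of the scheme is that the whole feedback signal is two or three inline instructions per branch plus one fixed-size array, with no per-run allocation or log parsing.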
And the last thing I'll mention here, something people have done talks on: reusing the inputs that were generated through directed fuzzing like this. It's very useful. For example, if you want to target a binary program that you don't have a way to efficiently instrument, you can use a reference or open-source implementation of the same parser to generate your inputs. The fuzzer will save every unique file that went somewhere interesting, somewhere unique on the graph. Every saved input explores a new part of the graph of the target under test, meaning it probably hits new features, or exercises loops longer, or is otherwise doing something interesting. Each of those inputs is saved for you, and you can then reapply them to any other parser for that format. A lot of fuzzers don't do that, even other guided ones; some, when doing evolutionary testing, have a fitness function that leads to what's called an elite sample, and they throw away all the others. American Fuzzy Lop (AFL) keeps them all.

Some observations. First, AFL's opinions are hard-coded; it's not a modularly written system. So if you roll your own fuzzer, take some attributes from it: "keep it simple, stupid" still applies to fuzzing even when you get into more technical approaches, and performance needs to be a priority, because it's a first-class function of finding bugs. If you're going to use mutation, the more iterations per second you can do, the more likely you are to stumble upon the input that actually leads to a crash. Source instrumentation really can't be beat here; there are a lot of reasons it's more performant than dynamic binary instrumentation or any other type of tracing. But obviously there's a lot of software out there that we can't apply source instrumentation to, so we'll talk about some engines we worked on that can augment AFL to handle binary targets.

AFL found a lot of bugs, and it actually fostered a user community, because it was efficient and because it did a good job of racking up high-profile bugs. Being at Google, the author was able to scale it up a bit and apply it significantly. Last year, as Shellshock hit the web, he found something like four or five different variants within, I think, a couple of hours or a day. So it's pretty cool: if you have an idea of an area of code you think is weak anyway, you can take a trigger for a previously patched vuln and let it go, and because of the feedback mechanism it will explore the areas around that as well. I would call American Fuzzy Lop the current state of the art for mutational fuzzing.

There are new tools coming out, though. honggfuzz just added guided fuzzing last year as well. If you're not familiar with it, honggfuzz is a portable fuzzing engine, previously pretty much purely mutational, from another Googler. He added wrappers to interact with the Linux perf subsystem, which itself wraps all kinds of hardware-counter-oriented feedback, and it now has wrappers for doing BTS and Intel PT, which we'll get into later. He added some logic for doing tracing-based feedback, but he didn't keep the entire corpus; he recently added a Bloom filter to determine whether an input has been generated before, so it can skip some of that cost. The nice thing about it is that it can operate both in a binary mode and in a compiler-instrumentation mode, and it works pretty much everywhere that has some form of perf support.

The latest fuzzer in the guided direction is Choronzon, which was discussed at ZeroNights last year. I would say its main benefit is that it adds serialization capabilities and some knowledge about structure: not format-specific data structures, but general structure, like container formats. Is this a block-based type of format, a file-system-style storage format, or a hierarchical one, and so on. Unfortunately it uses off-the-shelf tracing, and its code-coverage traces are hit traces by default, so the logic for keeping the feedback around is a little bit weaker; but the up-front serialization, being able to tear an input apart and actually get deeper into it, is done a bit better. All kinds of other fuzzers have done similar pieces of this over the years: some of the earliest evolutionary fuzzers, older blanket code-coverage fuzzers, attempts at memory tagging so you have a concept of whether you may have tainted a specific API, and miniature clones of AFL.

The whole point of this little review is: what are the attributes we want to pay attention to when writing our own fuzzing engines? Point one, guided fuzzing does find more bugs than blind mutational fuzzing. To do it, you need a fast tracing engine with basic-block granularity. Not only that, you need a fast lookup mechanism so that you're not diffing lists of basic blocks, you need a way to store that data very efficiently, and you need a method to compare the current run against the history of all the runs.
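AFL's own mechanism is a good concrete picture of those requirements: a fixed-size byte map indexed by a hash of the previous and current block, plus a "virgin" map holding the union of all prior runs. This is a simplified sketch of that idea, not AFL's exact code (real AFL also bucketizes hit counts before comparing):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of AFL-style edge coverage with a 64KB shared map.
constexpr size_t kMapSize = 1 << 16;

struct Coverage {
  std::array<uint8_t, kMapSize> trace{};   // hits for the current run
  std::array<uint8_t, kMapSize> history{}; // union of all previous runs
  uint16_t prev = 0;

  void OnBlock(uint16_t block_id) {
    trace[block_id ^ prev]++;  // edge id = current block XOR shifted previous
    prev = block_id >> 1;      // shift so edges A->B and B->A differ
  }

  // The "compare current run against the history of all runs" step:
  // returns true (and updates history) if any never-seen edge was hit.
  bool SawNewEdges() {
    bool fresh = false;
    for (size_t i = 0; i < kMapSize; i++) {
      if (trace[i] && !history[i]) { history[i] = 1; fresh = true; }
    }
    return fresh;
  }

  void Reset() { trace.fill(0); prev = 0; }
};
```

An input that makes `SawNewEdges()` return true is exactly the kind of input the fuzzer keeps in its corpus; everything else gets discarded after mutation.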
That's pretty much what AFL and Sidewinder contributed, and we learned from earlier work not to get overly complicated when designing and adopting evolutionary testing: keep it as minimal and as efficient as possible, so that for the amount of data you're recording and dealing with, you're getting the most return, and you can keep finding bugs as quickly as possible. Other things we would like to see: portable code, since almost all of this stuff is written for Linux and very few people have set out, for some reason, to write evolutionary testing tools for Windows, so we're working in that direction; plus idiot-proof helper tools you can pretty much use out of the box, and the grammar-detection type features that are being added now.

So let's talk about binary translation. This is a fun topic; I enjoy it. Binary translation is essentially a JIT for native code. It was originally applied across ISAs, back when there were more architectures around, but in modern times you pretty much see x86-to-x86, and in rare cases x86-to-x86_64, using the extra register space to do things like memory tracing; Mathias Payer has some work like that. In general, you have a custom loader: you replace the system linker with a preloader that loads the binary itself, and as it discovers code during execution, it copies it into a code cache. As it copies code into the cache, the various DBI engines allow you to instrument that code, inserting either callbacks or instructions directly. Then it links these blocks into what they call traces, essentially a series of basic blocks executed in sequence. Within the structure of the graph, if you hit a back edge, that begins a new trace; if you have a call or any far branch, that begins a new trace. A series of branches that execute and build up a larger graph structure becomes an individual trace, kind of like a single pass through a larger function.

All the performance comes from the way they load the code into the cache and how they generate these traces. Originally the whole point of trace generation was to determine which parts executed most commonly and then optimize how those are arranged in memory, so they're better attuned to the requirements of your hardware, like the L2 cache: avoiding branch misses and keeping the TLB warm. If you're hitting the same trace over and over in a loop, you want it all resident, not spread out through the code.

This originally came out of HP, then work at MIT expanded on it, and those guys started a company called Determina, later acquired by VMware; that's how we ended up with open-source DynamoRIO. On the Pin side, Intel was obviously doing something similar, and there's ongoing work in this space right now: papers, patents, and research still happening.

The advantages of dynamic binary translation: it tends to be supported on most mainstream OSes. You can pick Pin, DynamoRIO, or Frida and run them on Windows, Linux, OS X, maybe some BSDs. And until recently they were faster than the traditional hardware tracing mechanisms, because the cost of adding special functionality into a processor seemed to exceed the need for good diagnostics, so the hardware tracing mechanisms that were available were slow, clunky, and not really efficient enough for the online testing we want. DBT can also easily target certain parts of the software you're working in: it's much easier to turn the tracing functionality on and off, and these engines have built-in hooking, so you can hook certain APIs and delay enabling your trace, as opposed to programming low-level hardware interrupts.

Of course, they do incur a performance overhead, and this is mostly architectural. The loader that copies code from the native mapping into the code cache is itself executable code, essentially a whole extra layer of context switching. Rather than context switching only when you need I/O or other resources, now every time you discover a new piece of code you get a context switch, and it's high-cost because, again, code locality: this code is unlikely to be in your cache already. So you end up with fairly massive overhead. Also, just something I've observed: ISA compatibility is not guaranteed. You would assume that Pin would be pretty darn close to what's actually in the silicon for Intel, because it's using Intel's own decoder on the back end, but we have no guarantee it supports the latest feature sets, and we certainly have no guarantee that DynamoRIO or Dyninst or any of the other engines are at one-to-one parity with the native ISA. So if the tracing you're doing needs to be exactly faithful to what the hardware does, you might run into pitfalls using a software-based translation engine. And finally, if you're doing tracing in a hostile setting, up against hostile binaries, these engines are easily detectable, just like virtualization of any form: there are dead giveaways, certain memory maps being present, and other things anomalous to a normal operating system.

I won't really talk about Valgrind, other than to say its VEX IR is really useful, and if you've seen tools written on Valgrind, like Memcheck and other crash-analysis tools, a lot of that functionality is now available in DynamoRIO, which is at least twice as fast and portable across operating systems. That includes things like shadow memory and taint tracking in Dr. Memory, an advanced debugging tool built on top of DynamoRIO. So my main reference point is: Valgrind is awesome, it's just older and slower; a lot of people have access to it in academia as they go through systems engineering classes, but I'd like to see people move toward something portable, because tooling written on Valgrind doesn't port to Windows. The more significant tools have been written on top of it: Memcheck, and Flayer. Flayer is really cool; Choronzon's ability to tear apart components of an input reminds me of it. Flayer was essentially a method for hooking deep into a program to let you get past things like checksums and compression and reach the guts of the code. That was worked on by Tavis Ormandy and Will Drewry.

So, Pin. Has anybody used Pin before? Yeah, it's probably the most commonly used DBT today. I call it "DBT with training wheels": a lot of people migrate toward it because it has a higher-level API, it's written in C++, and it's really easy to get into. My next slide shows that in five lines you have a callback on any basic block in your target. Unfortunately, a few things about the design reduce efficiency. Number one, you don't get to observe a basic block when it's loaded into the cache; you observe basic blocks as they execute in traces. This is significant because traces can contain multiple basic blocks, meaning you can re-instrument the same block over and over as multiple traces are generated that include it, and each time, your hook runs again, which is slow. Another problem is closed-source licensing, which is pretty restrictive: you can't just distribute Pin with your tool. But in general it's really easy to get into. For example, this is all you need to write: besides a couple of callback functions and a couple of initialization API calls, this right here lets you insert a call anywhere in a basic block that calls back to your analysis function, and you can do whatever logic you want to observe and analyze the program. (I don't know why my fonts are rendering like that; we'll blame Office.) So Pin is really easy to get into, but it's not very fast, quite a bit slower in fact, and I do have demos of each of these tracing engines.
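To make that trace-granularity overhead concrete, here's a toy model (my own illustration, not Pin's internals): if instrumentation fires per trace, a block shared by several traces gets processed once for every trace containing it, whereas a cache-insertion model processes each block exactly once.

```cpp
#include <set>
#include <vector>

// Toy comparison of instrumentation work under two models:
// per-trace observation (Pin-style) vs. per-cache-insertion.
int InstrumentationEvents(const std::vector<std::vector<int>>& traces,
                          bool per_insertion) {
  int events = 0;
  std::set<int> seen;  // blocks already in the code cache
  for (const auto& trace : traces) {
    for (int block : trace) {
      if (per_insertion) {
        if (seen.insert(block).second) events++;  // instrument on first sight
      } else {
        events++;  // re-instrumented in every trace that contains the block
      }
    }
  }
  return events;
}
```

With overlapping traces {1,2,3}, {2,3,4}, {3,4,5}, the per-trace model does nine instrumentation passes where per-insertion does five, and the gap grows with hot, heavily shared blocks.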
Here's what we're looking at: a demo that shows a block trace using Pin, run against libpng over a set of 100 inputs. Just watch the time it takes; it will take roughly 60 seconds. We're running the default pngtest functionality, which parses these PNG files rapidly. Essentially, the reason Pin is slower is that it's not able to inline your analysis functions directly. In theory, if you don't have any conditionals in your analysis function, it can inline it; in practice, I think it's very conservative about doing that. It doesn't allow you to interleave instructions, it doesn't do register-liveness checks and instrument the instruction stream directly; it's more or less just doing callbacks out to your functions. This should be about done here, but you'll see the difference visually between this and the next demo anyway, so I'll cut it short so we don't go over time.

DynamoRIO essentially gives you much deeper introspection into the DBT. It allows you to directly modify the instruction list inside the instrumentation engine: when DynamoRIO discovers a new basic block, it converts the block into an internal representation, and then it allows you to do checks on the registers and the flags, everything you want to do, and synthesize instructions on the fly. It's also portable and BSD-licensed, so if you're writing tooling you want to ship in a package, it's available. And it has powerful tools like Dr. Memory, which, as I already mentioned, has shadow memory, taint tracking, and all kinds of stuff similar to Memcheck.

It does take a little bit more effort to do the instrumentation. What this code is doing, essentially, is saying: we want to insert some instructions into each basic block; we don't really care where in the block the instructions go, as long as they're in sequence; and we're looking for a point where the flags are dead, because we're going to do an increment, which impacts the flags, and if they're dead we don't have to save and restore them. It's looking for a point in the code where our injected instructions won't force a full save and restore of previously live registers and flags, which greatly increases the efficiency of your instrumentation. So it takes a little longer to write; this is the first part of the function, and this is the second part, and we'll open-source a tool that is a full-fledged block tracer. I can show you an example; this one is visually much faster. OK, so this is the same test, and it's much, much faster. It's literally not doing anything different; the more observant will notice it's an inlined analysis function doing the exact same amount of work. This one took 11 seconds where the other took over a minute. So just by using the right platform and doing slightly more work up front, you can have a much faster tracing engine.
But that's not all; next up is Dyninst. Dyninst is a really cool platform that comes out of the University of Wisconsin-Madison and the University of Maryland. Barton Miller, who coined the term fuzzing to begin with, is actually the head of the research group behind it, which is kind of a fun coincidence. Basically, it's the only open-source platform that allows you to do static binary rewriting, as opposed to only runtime code injection. This is really critical, and once again it all has to do with code cache locality: by being able to rewrite your instrumented binary back to disk, the operating system keeps its advantages, doing things like copy-on-write when you fork, and all the code is serialized into one blob of data rather than being dispersed and allocated on the heap. That cache locality makes a massive, massive difference. It's a little bit harder to write tooling for; it does have an online mode that only allows you to do callbacks, and it has a whole bunch of APIs. My tagline for it used to be "the kitchen sink of dynamic instrumentation," but really the most novel thing is that it allows static rewriting. We use it all the time: when we test commercial software that ships Linux binaries, this is the go-to tool, and we released an AFL plugin that uses Dyninst directly, so you just rewrite the target binary once, write it to disk, and then use it with AFL with only about a 30 percent overhead. If you use the QEMU plugin, AFL's QEMU binary tracing mode, the overhead is really significant; Dyninst is actually super, super performant, and I'll show you the overhead. Of course, building Dyninst is NP-hard, so don't try it yourself: there's a whole bunch of dependencies and you have to figure out which ones should come from the vendor and which ones to compile yourself. But I uploaded Docker files, so you can either run it in Docker or copy the instructions for yourself; that's how we ship the plugin. It has a number of APIs; this one is the binary-patching API, where you essentially iteratively walk the graph and find your blocks, and so on. You can reference the code later; it'll all be available when we publish.

Let me show you a demo of that one. So the last tool took 11 seconds; we expect this one to be even faster. There it goes, just one second. We went from over a minute, if you write the tool in Pin, down to one second if you choose to write the tool in Dyninst, and you're able to run it against commercial, closed-source software without source code. It's obviously way, way faster: a roughly 60x speedup just by choosing the right platform and doing a little bit of extra work up front in the code. So my recommendation is: if you're going to be writing tracing tools, pick the right platform for the job; don't necessarily just go with the easiest, most approachable one.

If you are going to use dynamic binary translation, here are the tips I can give you. First of all, you only need to instrument indirect branches; there's no real point instrumenting direct branches, so that's an easy savings. You can delay instrumentation until input is seen: if you're using this for fuzzing, we typically know where you have a file-based or network-based input, so you can delay until that occurs. You can isolate a library, or if you don't have lightweight hook-and-wrap logic, you can at least wait until the program has initialized before you turn on heavyweight tracing. You can also limit tracing to the thread that touches the data: in a larger application, typically one thread will be the one reading the input and parsing the data structures before handing results back to the other threads, so isolate that one. These APIs have if-then-else logic built in statically, which is handled inside the generated code much more efficiently than putting conditional logic in your analysis routine. Avoid trampolines; they're very, very slow compared to serializing synthesized instructions directly into a block. And inject a fork server if you're repeatedly executing: we have a tracing tool (it's not on GitHub yet, I don't think) that's generic; if you're executing the same target iteratively over a corpus, it injects a fork server and gives you about a 30 percent speed boost. So, let's get into the hardware tracing
section. How am I doing on time? OK, one more section. Essentially, when you use common hardware tracing, you have a performance monitoring unit (PMU) that exposes things like counters and sampling. There's a guy at Intel who wrote a toolkit for this called pmu-tools; I recommend you experiment with it, but let's get to the more important stuff. When you program the CPU to enable a hardware tracing mode, what you're essentially doing is configuring model-specific registers (MSRs). These are registers that are not exposed to you in user land; they're only available at CPL 0, in kernel mode. There are dedicated instructions designed to turn special feature sets in the CPU on and off, and when you program them, essentially what you're doing is setting up the interrupts so the CPU can notify you when events occur after you've set these flags, maybe giving it pointers to buffers and setting a threshold so it can tell you "OK, your buffer is full now, you might want to copy this from the kernel driver into user space," and so on. I'll go over some of the basic stuff briefly and let you reference it later, because I want to get to the newer stuff before I run out of time. Single-stepping: we all know it's slow. The branch trap flag is just a mode of single-stepping that steps on branches instead of instructions; look it up later if you want to know more. Then there's a specific MSR, IA32_DEBUGCTL, with the number of flags you see here; each one enables a different type of tracing mode, and these were kind of the traditional ones: LBR (Last Branch Record), which I'll show you a diagram of, and the BTS flag, Branch Trace Store, which is like a cache for doing graph tracing. BTS essentially allowed you to set up a structure that says "here's the base of my buffer and how full it can get before I want to be interrupted and told it's full," and then the CPU would store this DS-area record structure, which includes the branch-from and branch-to instruction pointers, the flags, and so on. Essentially, every time it took a branch, it would dump a record into a buffer for you. Graphically it looks like this in the instruction pipeline; I ripped this slide off somebody else, but it's a nice illustration. When a branch like a ret or a jump is loaded into the instruction pipeline, there's a series of registers behind the scenes that you don't normally see, called the LBR registers, and those keep the branch "from" and "to" addresses, plus a bit on whether or not the branch was predicted, which is there for internal cache tuning. Then, if you have the Branch Trace Message flag enabled, those messages actually go outside of the CPU, onto the system bus and into your L2 cache; the L2 cache holds a structure that points to the DS area for storage, and you just get a series of these structures giving your context at the time of each branch. Unfortunately it's really, really slow. I've got a demo of that, but it's super slow, so it's not really worth going much further into, and then I'll
let you reference these slides later. So: Intel Processor Trace is the new hotness, available for the last two years or so, pretty much in consumer devices. It's not on the server side yet, but they'll be catching up soon, and those parts are gaining features like SGX and Intel PT, and apparently this new Intel CET stuff; all super exciting. What you need to know about Intel Processor Trace is that it's the first efficiently engineered tracing to go into an Intel CPU. It's per-core, and it gives you essentially full-system tracing, all the way from SMM up to user space. If you want to toy with it, the easiest way right now is on Linux: as of last year, support was added to the perf subsystem, and there's also a simple reference driver, simple-pt, from the same guy who wrote pmu-tools, up on GitHub. There's some support in Intel System Studio, but it only allows a remote slave-debug mode, which doesn't make sense to me. And as of today, we have a working Windows whole-host hardware-supported tracing engine.

As I mentioned, Intel PT can actually trace into SMM if you have the right digital certificates, and most of us won't have that access, but it is able to trace hypervisors, kernels, and user space, do filtering based on the CR3 page table of your choosing, and log directly to physical memory through a linear address range. It can also take not just one contiguous address range: they added the ability to give it an array of ranges, so you can have an arbitrarily large trace buffer that doesn't have to be contiguous, which can be difficult to allocate. One of the main features is, again, a minimal log format. That's what I was saying earlier about directed fuzzing: you need to figure out ways to keep the traces small, so you're not wasting time on I/O instead of actually doing what you want to do. For example, it can use one bit per conditional branch, and it can stack up to six of those in a single packet. This means that if you have a loop, say, where you're waiting for a counter to count down, the number of times the branch was taken gets tightly compressed. For indirect branches, it does of course have to record the destination address, and for interrupts the source and destination. The main difficulty with all of this is that you need access to the original memory-map information to decode the trace, because of those gaps.

This is roughly what the format looks like: each of the top ones is an individual packet type, and these tend to be literally just a couple of bits; I'll show you a decoding of that in a second. And I can't tell you how much of a pain it is to configure this. It's 90 pages in the Intel Software Developer's Manual, and you can't just randomly flip bits to get it going; you really have to know what you're doing to figure out which feature sets you want enabled. This slide isn't really meant for you to read right now, so look at it later, but basically the idea is that per-CPU you can enable these MSRs, and then the CPU will write this heavily compressed format out to your linear buffer; you can register interrupts to handle when that buffer is full, or you can use ring-buffer mode and dump it later when the process exits, if you want to play with it.
You can use the Linux perf subsystem if you want to write tools around it, though I'd actually recommend simple-pt on Linux, or our new Intel PT driver. The structures in it are similar, and we're going to publish the code next week anyway, so you can just go through the code, since I'm a little short on time. The idea is that per-processor we have a trace buffer, each buffer is allocated as physical memory in Windows, we initialize a filter on each core, and then there's a whole bunch of configuration for which features you want to enable; this is more or less straight out of the manual. So let's do a demo here. What we should expect: we load a driver, run a program under trace, and then use the reference decoding library, libipt, which knows how to decode this packet format, to dump out these records right here. These are those packet types that I showed you in that image, and they can then be mapped against a process memory map, so you essentially end up with the series of blocks that were executed. So let's see if the demo gods will be with us today. All of this
code, implementations of every engine I've talked about today, will be open source on GitHub in the coming week or two. I go directly from here to another conference, so after I get back from that it will all get its final tuning; look out for it. OK.

So here we go. We're going to use the driver installer here for our Windows PT driver. This is just a driver; it needs to be a kernel driver, or a hypervisor, so that you have access to program the MSRs directly. It's doing all the heavy lifting of configuring the flags and setting up the buffers, allocating the physical address ranges so they won't get trampled by anything else. So now we're installed. Next, we want to go ahead and run this test. What it does is launch an application suspended, get its CR3 value, and then tell the Intel PT driver to start tracing for that process. So this is our little test here; on exit, we do a zero-copy of the log from physical memory up to user space, out to a file, and then over here we can do a dump of that new log file. Here it's decoding the packets. What we're seeing at the very top are some of the sync packets, essentially recording what mode the CPU entered and that Intel PT tracing was enabled. Then you see these patterns of T and NT, that's taken and not-taken, with a number after them for the number of branches represented: an exclamation mark means the branch was taken, a dot means not taken. Below that you'll see the TIP packets; that's the "target IP," the target address of a taken indirect branch, and so on. So that's working Intel PT on Windows, and this will all be available to you soon. As far as conclusions: I know writing tracing engines isn't everybody's favorite hobby the way it's mine, so I think ultimately people should
try to engineer tooling on top of tracing engines that fit whatever their needs are. I'm releasing this code and I want to support it: the DynamoRIO, Dyninst, and Intel PT engines. So if you have an application that needs a tracing engine, I would like to hear from you, and we can work together to increase the efficiency of tracing and fuzz testing overall. That's it, so thank you very much. [We have time for one or two questions... no questions? OK.] Thank you.