
I am a packer and so can you


Formal Metadata

Title
I am a packer and so can you
Series title
Number of parts
109
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Publication year
Language

Content Metadata

Subject area
Genre
Abstract
Automating packer and compiler/toolchain detection can be tricky at best and downright frustrating at worst. The majority of existing solutions are old, closed source, or not cross-platform. Originally, a method of packer identification that leveraged some text analysis algorithms was presented. The goal is to create a method to identify compilers and packers based on the structural changes they leave behind in PE files. This iteration builds upon previous work of using assembly mnemonics for packer detection and grouping. New features and analysis are covered for identification and clustering of PE files. Speaker bio: Mike Sconzo has been around the security industry for quite some time, and is interested in creating and implementing new methods of detecting unknown and suspicious network activity, as well as different approaches to file/malware analysis. This includes looking for protocol anomalies, patterns of network traffic, and various forms of static and dynamic file analysis. He works on reversing malware, tool creation for analysis, and threat intelligence. Currently, a lot of his time is spent doing data exploration and tinkering with statistical analysis and machine learning.
Transcript: English (automatically generated)
So hopefully everybody will be entertained. I know everybody thinks, you know what I really want to do at 5 p.m. is go to a talk that involves math. So hopefully this will ‑‑ excellent. Came for the math. Stay for the mustache. So that's me.
"I am a packer and so can you." I'm going to attempt to keep this to the 45-ish minute mark so I can do some Q&A. Hopefully I'll get some hard questions. Hopefully I'll get some good questions. All right. Now we're on to the agenda. Do a little
bit of an intro, talk about the product ‑‑ not the product, the project. A little bit about me because why wouldn't I? I'm up here. Give everybody a little refresher, kind of techniques, talking a little bit about the PE format since that's mostly what we're going to be focused on today. We're going to look at the data and pull out our magnifying glass and look at ones and zeros, do a little bit of math.
We're going to then look at the solution and then finally we're going to look at the results. So the most important part, me. What do I do? Currently threat research at Bit9 Carbon Black. Those are my hobbies, static analysis, machine learning. Anybody else from Texas? There we go. If you're in Austin, I will totally buy you a beer. I run a
little project and a website called secrepo.com. If you guys are looking for various security data, I try to keep a somewhat updated and curated list. So everything from, kind of, Bro logs and Snort logs to other projects that have way more information than I could possibly host. You can follow
me on Twitter, @sooshie, and then finally I'm a sometimes, occasionally contributing member to the project. Thanks, Alex. And feel free to tweet about this and use the hashtag #securebecausemath, because we are going to be talking about math. All right. So what's the main problem here? I'm
sure a lot of people are familiar with the idea of detecting compilers and packers and encryptors and all sorts of other stuff, right? There are some good tools. Some of the tools are really old. So I'm going to pick on PEiD here. PEiD was written in 2005, so in essence it's 10-year-old technology. Maybe there's a more interesting way or a better way to manage this problem. So really the goal was set out as: can we do something new and different? So we've got some goals. We've got some great projects out there like PEiD and some of the other ones. Yeah, they might be a
little old, but there's probably some validity. However, for this we're going to try and adopt kind of a zero trust towards them. In other words, if somebody as an analyst says, oh yeah, this PEiD signature is verifiably correct, then great. We, being myself or anybody else in this room, can create a signature and directly translate it into this new language. The other one is easy-to-create signatures. So looking at PEiD and some of the other associated tools, you've got to live in a hex editor, right? You've got to maybe open up IDA and find the exact pattern that you're looking for. It requires a certain bar to entry. So the idea here is: can this really be distilled down to something anybody can get value out of, right? So let's make it easy, and we're going to talk a little bit about the signatures as well. Cross-platform. So running PEiD on a Mac itself, that's not going to happen. There are a couple of solutions that attempt to let you run PEiD signatures on Linux or on a Mac. They're really good, but they're not as full-featured as actually using PEiD on Windows. So that's kind of a negative there. The other thing, once again, right, simple to extend and understand. So in my opinion, what I'm going to start with here is kind of this base notion, this idea,
present some data and say, look, I'm pretty sure this mostly works. And then hopefully somebody, multiple somebody's in this room or elsewhere will go, wow, that guy wasn't really dumb. He was only mildly dumb. And instead here's a couple of enhancements, right? And the other thing that I really wanted to get out of this was this idea of fuzzy
matching. So if you've got something like PEiD or another signature-based language, generally it's: the signature hit or it didn't. So instead I want to introduce a notion of, well, part of the signature hit, and this is about how much of the signature hit. So in other words, when I use this, or when anybody else uses this for
signature management, you can kind of figure out where your overlapping signatures lie and you can maybe be a little bit more effective out of the gate. So with this, we're going to jump in, just an easy refresher, talk a little bit about the terms. When I say certain words, what I mean
might be different from what other people said, so I want to make sure to do basic level setting. Talk a little bit about the PE file structure. I'm sure most of you in this room go home and dream about the PE headers. Probably not everybody does. All right. So this is a very simplified look of the PE
file structure, right? You've got kind of this DOS stub at the beginning. You've got these other various headers, some of which are optional, some of which are only generated by certain compilers. You have this notion of sections, right? Some sections contain the code and some contain data and so forth and so on. This idea of resources, so if you ever look
at, you know, an executable's icon, it's generally stored in the resource section. So there are many, many different parts. This is one of my favorite graphs, and I apologize if you can't see it all that well. These are all the header values that you can have in a PE file. Now keep in mind, not all of them are required to exist. Not all
of them are required to be filled out in an entirely accurate way, but this is what you can deal with. So there's a lot of things to mess with. They're color coded. So really as far as the PE format itself and the header structure, this is what we're going to care about today. The three
basic things that I decided, and whether I'm correct or not, that's fine, but three basic features out of the PE header that I said these can be kind of interesting and these should generally vary enough from compiler to compiler or decrypter to decrypter that they should be useful features in
kind of doing this type of analysis. The other one is the number of sections. So things like UPX and a lot of other packers, right, maybe they jam the entire executable, and we'll get a little bit more into this in a second, into one section and then just have their little tiny data section.
So when I use the phrase tool chain, right, what I'm talking about is the set of tools used to develop software. So you have things like IDEs and linkers and compilers and all that kind of stuff, and each one of these actually leaves a relatively unique fingerprint upon the binary that it creates. Now, once again, you can manually go in and change these; not a lot of people do. So for this, when I talk about tool chain, we're actually going to talk about the build environment, so GCC versus Visual C++. So packers, what are they? Packers are generally this program within a program. When I want to pack a binary, what I'll do is take the original executable, kind of smush it down and ram
it somewhere inside this new packed executable. So generally you want to do that to evade AV, right, make analysts' lives harder, because who doesn't love stepping through a debugger trying to figure out how do I get the unpacked version of this in memory? Because this is just ridiculous. So at least if you know, you can
identify what packer, if it's similar to anything you've seen before, right, you know what steps you have to go through or maybe you know what tool to pull out of your toolbox in order to do the unpacking. So there are really two parts to a packer. You get the packer executable that you run on the original file. This is the thing that actually does the
compression or the obfuscation and creates this new executable. And then you get the unpacker. The unpacker is generally this little stub that comes out in the new program; when this new executable is run, the stub is generally the first thing that is executed, and it goes through and, you know, unpacks the original binary and goes, oh, okay, now I'm going to run this. So really, when I talk about packer detection in this context, I'm actually going to be referring to the unpacker stub. So unpackers, how do they work? What you really want to do is take control of the address of the entry point, right? So when a Windows PE file is loaded, where should it go and begin executing code? You want that to now point to your stub. And then once you unpack it, right, so maybe you decrypt it or maybe you deobfuscate it or whatever it is, you find the packed data, you restore it, you get this little in-memory image. You've got to do a couple
relocation fixes because it's not the Windows loader doing the actual loading for execution. You have to mimic some of that. And then you jump into the original program and keep going. All right. So now on to the popular kids. So these are kind of the three, in my opinion, and there's probably several more tools that when people do compiler detection
or do packer or crypter detection, this is what they're talking about. So PEiD, I mentioned that one before. It's nice, the signature language is pretty good, it's been around forever, and in my opinion it's kind of the de facto standard. Yara has its own signature language, and there are several projects that will allow you to take PEiD rule sets and convert them to Yara rules so you can update your analyst tools, but you're still using this limited idea of what it is you're looking at, or this harder way to describe data. And then this last one, RDG Packer Detector: I actually really like their slogan. All right. So now we're going to dig into data.
And who doesn't love data? And honestly, if you're going to talk about math and if you're going to talk about doing any type of analysis, if you don't use data and you don't understand your data, right, it's really, really hard to get good results. And a lot of times, data is really
ugly, right? It's not this beautiful end result. It's this nasty thing you have to slog through and dissect and understand. So this is the data that I used in my testing setup. So I went and I found and I Googled and I threw together 3,977 unique PEiD signatures. That's a lot of PEiD signatures, right? So that alone kind of got me thinking maybe we can address the signature management problem. We've got some file sets, various sizes, right? We've got smaller ones that I understood, that I could pull apart and go, oh, okay, I get it. Yeah, these two are right and this technique seems to be working. And then we have this
giant random sample at the bottom, right? So 411,000 files. Because everybody loves big data, and this wouldn't be a math talk unless I used the phrase big data. So there you go. So that was kind of the end-all: after I felt comfortable with the technique and comfortable with the tool, that's what I ran it over to verify, and I did some spot checking with that giant data set. We'll talk about that as well. So let's get into some of the data analysis, right? So for this, there's a handful of slides. We'll go through them. We're going to talk about the basic exploration of the Zeus data set. So if I go back a slide, I think, yeah, there we go. So 6,700 samples, roughly, is what these slides
are based off of. There we go. Okay. So the first thing I did was, all right, what happens if I run PEiD on these 6,700 files? Well, it turns out PEiD signatures don't match 4,600 of them. Really disappointing. So you get some other ones. So
this different UPX and another UPX version and Microsoft Visual Basic and Armadillo Packer, which I'm sure just by looking at the numbers you could probably make a relatively educated guess that maybe Microsoft Visual Basic 5.0 and 6.0 and Armadillo Packer are really, really closely
related. So that's kind of those numbers and what they look like in visual format. It's a bar chart; you don't have to worry about the numbers. That really tall line is the 4,600. This is kind of another way to visualize it, just to get into the idea that creating signatures is hard. It's not trivial. So having an easier way to do it would be great, because then that really big giant blue box, or bluish-purplish box (and I apologize for not using grayscale), would get smaller, and you'd get more things that you can actually label and understand. Okay. Cool. So this graph, in my
opinion, is what science looks like. Right? You show this to somebody and they're going to go, that dude up there totally did science. So this is simply a correlation matrix. And the idea being is you take all of these PEiD signatures, and for files that had multiple PEiD signatures flagged, you want to
see which signatures flagged, right, with a high correlation or flagged, when one flagged the other one was very, very likely to flag. So the diagonal is basically the signature correlating with itself, which makes sense, right, because every time a signature fires, it's going to be observed. So
with this you kind of want to pull out the little black dots. And while this one is kind of hard to view, we can zoom in on one little snippet of the graph. So this is kind of that upper left-hand corner. And you can see that there are a couple of signatures that are highly, highly correlated, right? So there's a lot of signature overlap. There could be signature overlap, right, in your environments. There's obviously signature overlap on
the stuff I downloaded from the Internet, right? So every time, you know, one of these ASPack signatures flagged, the other one did. And so with that, you kind of get a feel for, oh, this is where I'm lacking, or this is maybe where I have some duplication. So that's where we've got, we understand what PEiD looks like in a sample dataset. So
now looking at maybe some of the other features that we can use in addition to the header features that may allow us to definitively say or say with a very high probability that we're looking at a, you know, a specific packer or a specific compiler. So we can use PDB strings. I love it
when any type of malware author, or any author in general, includes a PDB string, because sometimes it's like hitting gold. Sometimes they're awesome, right? And they're like, oh yeah, by the way, we're using this crypter called crypto evolution; it's our Visual C++ project. Sometimes, you know, it's kind of random garbage. It doesn't really give you anything. It's important to keep in mind that these are just text, right? So there's no reason why you
can't create your own, say, as a misinformation strategy, right? So now, I kind of mentioned this linker version. You've got these major and minor linker versions. What do they look like in the sample set? So this is just kind of breaking it down. So if you've got the first one, right, linker 2.5: 2,000 of them. So while you can group, you know, this Zeus sample set or many other sample sets just by looking at the linker versions and their counts, it still really doesn't tell you the whole story. So we looked at the number of sections. And you can kind of see a relatively similar distribution,
right? You've got a couple of really, really big groups of files, right, that might indicate a specific campaign or something like that within the Zeus data set. And you kind of have this longer tail. So another thing we really wanted to look at, assembly mnemonics. So I think these are kind of cool. So the idea here is, right, when an
executable runs, there's code. And that code, those bytes, can be translated into a mnemonic. And all the mnemonic is, is simply: instead of, right, the byte representation for add, it just prints out the word add. And it's easier for me and a lot of other people to understand. So the idea is maybe we can use assembly mnemonics to help understand exactly what it is we're
looking at. And Johnny 5 is alive, but in order to get assembly mnemonics, you must disassemble. So sorry, Johnny. For this, Capstone engine was used. I don't know if anybody has played with Capstone engine; if you're looking for a free and really awesome disassembler, it's great. I love it. It runs on multiple architectures, there are bindings for multiple languages, and it's super easy to use. The reason I call this out specifically is I'm sure a lot of you have noticed that every single time you run a different disassembler on an executable or some code, you will get different results, right? So really you only get consistency within a disassembly engine. So if you were to write your own or use one of the other disassembly libraries, the technique itself would still work, and that's totally cool, right? I'm not completely pimping Capstone engine. I like it a lot. But the point is just to be consistent with this type of stuff.
I then had what I thought was a really bright idea: I was going to look at the correlation between assembly mnemonics, right? So every time an add appears, how likely is it that a mov or maybe a call or a jump also appears? Yeah, that was an awful idea. So we moved on. So now let's get into some of the math, right? Because how do you not love math? Math is so fantastic. So going back to the assembly mnemonics, right? These mnemonics describe the program behavior, and that's what we're looking to capture: what exactly is this unpacker doing, or how exactly does the executable get set up, right? Because it's generally compiler-specific, or in the case of a packer or cryptor, right, they have to know what to undo so they can then run whatever code they want. So we want to capture this program behavior, and that's what we're doing with the assembly mnemonics. So
how can we look at these various assembly mnemonics? We looked at correlation. Correlation doesn't really take order into account. You saw the correlation matrix; it looked ridiculous, right? So imagine looking at that for 400,000 samples. It's going to be some massive gray blob, and you're going to go blind and be sad. So, you know, there's this notion of distance or similarity. That fuzzy idea is: if I have a signature, I want to know how close what I'm looking at is to the signature, right? How similar are two things, this idea of similarity. So we'll talk a little bit about Jaccard distance. Jaccard's awesome. It's cool. However, it doesn't take order into account. The idea being that with assembly, right, it executes in order. It doesn't jump around. I mean, there's, you know, flow control and all that kind of stuff, but generally, if you see an add, a mov, and an xor, they'll be executed in that order and not, you know, mov, xor, add, or vice versa. So while Jaccard's great and it might be useful, order I thought was pretty important to take into account. So there's this idea of Levenshtein distance. It's another cool distance metric: the number of edits determines the distance, and position is important. So let's look at one of the examples of Jaccard distance. Here we have two seemingly random sets of assembly mnemonics, right? We can say the leftmost is the one at the address of entry point. So this is where the executable will start, and then it moves from left to right. And you can see there are various ones. So the easy way to view computing Jaccard distance is to take the total number of shared elements, divide it by the total number of unique elements, and that's your distance. So in this case, it's mov and push, which is 2, divided by the union of the sets, right, which is 8, and you get 0.25. So as far as set membership is concerned, these two things have a distance of 0.25. And while that's okay, it just didn't quite feel right.
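That shared-over-unique calculation is easy to sketch in Python. The mnemonic lists below are made up for illustration (they are not the slide's actual sequences), and note that the talk's "shared over unique" convention is really the Jaccard similarity; the classic Jaccard distance is one minus it:

```python
def jaccard(a, b):
    """Shared elements divided by total unique elements; order is ignored."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Two made-up mnemonic sequences sharing only mov and push,
# with 8 unique mnemonics between them:
left  = ["mov", "push", "add", "sub", "inc"]
right = ["mov", "push", "xor", "cmp", "jmp"]
print(jaccard(left, right))  # 0.25
```

Because everything goes through `set()`, repeated mnemonics and their ordering contribute nothing, which is exactly the weakness discussed next.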
order. So how many things have to change to make one into the other? So this kind of fit the domain a little bit better. So once again, kind of just doing a quick compare, you know, looking at if they're different. So, right, there's one difference and then they're not different and so forth and so on. So basically seven changes are necessary
to make one set into the other set. Therefore, we get a distance of 7. So kind of what we were talking about before. But code is executed in order, there may be branches. I really didn't want to build any type of, you know, flow graph or any of that kind of stuff. I wanted to keep it simple and understandable and efficient. So in theory, the
assumption was, what I worked with was the assembly mnemonics to the left should be more than the assembly mnemonics on the right, right, because it will execute starting on the left and finish somewhere off the right. And if there's a jump in there, maybe you want to care about it, but maybe you don't really want to care about this stuff after as much as the fact that there was a jump. So there were a bunch of testing and metrics where
I tried to figure out where the cutoff was, how many assembly mnemonics were required, so forth and so on, and we'll get into that. We also had to take into account how big the stub is, and if you don't know what you're looking at, then you don't know what you're looking at, and some of these questions are really hard to answer. So we turned to tapered Levenshtein. And this I think is a really, really cool algorithm. So basically the
idea is it's position dependent like regular Levenshtein, except, right, any edit to the left will have a higher weight than an edit to the right, which kind of makes sense. So this is kind of a way to capture: now we care more about the things that are executed first, in case there's something
like a branch or a jump, right, and now we have a language, these assembly mnemonics, to kind of capture program behavior. So we can put those two together, and the way you basically calculate this for every single position is one minus the position of the thing you're looking at, right, divided by the length of the set. So in this case
there's ten things in the set. So the first thing, right, requires one full edit. The second thing requires zero edits. And the third thing requires, right, .8 of an edit. So you kind of go on and now you have a distance of 3.5. So to me this was great because it said, yeah, these things are separate and different, but there
might be some sort of similarities. The nice thing you can also do with Levenshtein is you can actually use it as a similarity calculation, right? So if you want to use it as a similarity, it says basically those two sets are 65% similar. So this is how the idea of
similarity is saying, oh, we get this fuzzy hashing, this idea of similarity, mixed into the algorithm. All right. So now that we've made it kind of through this great refresher (everybody loves PE files and their headers and all the various values) and we have an idea of the
features we're going to look at, right, what we're going to use. We're going to use the major linker version, the minor linker version, these various assembly mnemonics, right, number of sections. We have some really fancy sounding algorithms that are actually really simple to understand, which is great. We have a way to do fuzzy matching. Awesome. So what do we do? Well,
first step, gather samples. We already talked about the data sets, so you know there are well over 411,000 samples that we dealt with. So the second thing was, right, let's get PEiD, kind of this industry standard, this thing that I'm very comfortable with, I've used a lot in the past. Let's see what it looks like for everything. Then from there, for
each of the executables, we're going to disassemble them, right, because we need the assembly mnemonics, and in this case we wound up using the first 30 assembly mnemonics. We need the header features. We'll talk a little bit about clustering so you can kind of understand which PE files are similar based on these three features, right, assembly mnemonics and the various
header features. Then when I ran this across all the data sets, my threshold was this 90% similar. So I felt that if an executable's signature and a signature that I was matching against were not at least 90% similar, that wasn't good enough to call it an actual match. So one of
the things that I started off using was banded minhash. It's a similarity comparison optimization, because I didn't feel like doing big O, right, of N squared comparisons, especially on 400,000 things. However, the implementation of banded minhash that I was using was broken, so I wound up doing a lot of comparisons, but luckily not by hand. All right.
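The banded minhash idea mentioned above can be sketched in a few lines. This is a toy illustration, not the (broken) implementation referenced in the talk, and the hash count and band sizes are made-up parameters:

```python
import hashlib

def minhash_signature(items, num_hashes=8):
    """MinHash sketch: for each of k seeded hash functions, keep the
    minimum hash value over the set's elements. Similar sets tend to
    produce signatures that agree in many positions."""
    return [
        min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items)
        for seed in range(num_hashes)
    ]

def band_keys(sig, bands=4):
    """Banding: split the signature into bands. Two files that share
    any whole band land in the same bucket and become a candidate
    pair, so you only compare candidates instead of all O(n^2) pairs."""
    rows = len(sig) // bands
    return [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(bands)]

# The same mnemonic set, regardless of order, buckets identically:
a = minhash_signature(["mov", "push", "call", "ret"])
b = minhash_signature(["ret", "call", "push", "mov"])
print(band_keys(a) == band_keys(b))  # True
```

The point of the banding step is purely to prune: files that share no band are never compared at all, which is what makes a 400,000-sample run tractable when it works.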
Then we created signatures so we could test and verify. So one of the things that I kind of want to talk about briefly is: why signatures, right? So everybody, we live in this great age where it's like, oh, my God, security data science, we have to do supervised machine learning with random forests, or run unsupervised, and if we're not
using DBSCAN or k-means with k optimization, you're like, no. Sometimes it's overkill, right? So one of the nice things about signatures in this case is we can use them to capture this domain specific language, but me or anybody else, we don't have to worry about model drift. So after you create this awesome
machine learning model that might have great accuracy, what happens when you get new data and you go to train it, right? That accuracy begins to drift, the model gets out of whack, so to speak, and you keep going through this large process, right? This is one of the issues with operationalizing machine learning. Also the model
will vary based on training source, right? So if I trained it against only my APT1 set, well, then it would be really good at probably finding things labeled APT1, but it would be worse about trying to determine which packer or which cryptor is what, right? And likely everybody else will have different data than me, so it really
wasn't a good fit. Kind of that last bullet is really where I was going: simple, right? You want to play. You want to do things. You want to tinker. Sometimes machine learning is fun to tinker with. Sometimes you really just want to get something done. So here's kind of what the signature language itself looks like. So really, really simple. It's kind of highlighted
to show you the signature, and I'm going to go into a demo in a second, but so the signature for Microsoft Visual Basic is the top line, and then the parts where it matched on the file, so you can kind of see those blue highlighted regions. There's quite a few. The ones on the left, there's a really long run, right? So you get the similarity of .902, right,
because it required roughly 2.9 tapered edits. So in my opinion, I think that accurately captured, yeah, this signature is pretty similar to the file, and I feel pretty confident that this file matches my signature. All right. Now let's move
into a demo. Oh, God. I'm going to minimize that real quick. We should be good. I think I
broke everything. That's phenomenal. Yeah,
seriously. All right. This is what I get for doing, trying to do like an honest to God demo, huh? Maybe if I don't do it in full screen.
We'll figure it out. There we go. It just hates full screen. So I'm going to ‑‑ Asian guy showed up. Awesome. Went from friendly math talk to clan rally quickly. I apologize. So I scripted this out because I was kind of a chicken as well. I didn't
want to type commands, and quite frankly, watching me type commands is boring. So I'll direct your attention to the top kind of small box and walk through the demo. This really is ‑‑ you know
what? You're going to get back up here. Just sit there for like two minutes. Got to want it. That's
all right. If it doesn't work, I have slides, but I thought a demo would be way more entertaining for everybody. Third time's a charm. Nope. Third time
was not a charm. When in doubt, try a different port. Oh. Maybe that's awesome if that was the
case. All right. I really didn't want to like try and lean over. You know what? Screw it. I'm going to unplug one more time. We're just going to go back to the presentation. If anybody wants to actually see a demo, I promise ‑‑ I literally promise it
works. I swear. I don't promise ‑‑ oh, no, no. I was trying to make sure of that. I think we're good. All right. We're good? All right. Screw you,
demo gods. Now you get completely unreadable slides. So I'll try to describe what's going on. There's two phases to this. One, there's the signature generation phase, and that simply says run this one script on a binary (that I can't even show on a computer; that's what I get for trying to do a demo) and generate the
signature. And all the signature is going to be is a simple list of assembly mnemonics and then give you this major, minor linker version and then as well as the number of sections. And then all you have to do, if you're not giving a demo, is run this other
script, mmpes.py (if you can see it), on the signature, and you can do all sorts of things. You can give it a threshold. So if your idea of similarity is different than mine, if you say I want to know everything that's 50% similar, you can do that. You can give it this crazy verbose mode where it says, all right, here's the signature that I have, and here's what I'm matching against. You can
do that in case you really want to interrogate everything. It also tells you when the major and minor linker versions match or when they don't match or when the number of sections don't match. It tells you how many edits you have and then the actual similarity. So this is actually between two APT1 samples. And you can kind of see the
signature generated on the two files in this directory. The first one really didn't match all that well. It had this .844; it required roughly 4.5 edits. But then this other file matched exactly. So all 30 assembly mnemonics were perfectly in order. Both the major and minor linker numbers matched as well as
the number of sections. And here's kind of a better description of the rule that you guys might be able to see. All right. Apologies for the demo. So let's look at some of the data sets. We'll start with the APT1. So for here, this is kind of
describing the clusters. So in other words, the like things grouped with other like things. And it's two bar charts superimposed on one another, which is why you get the color variations. Once again, apologies for not doing it in grayscale. So that very far one on the right, the idea is PEiD found, in that yellow thing,
and said this many things are similar. And then kind of that green bar is the assembly mnemonic comparison saying this many things were similar. So kind of the cool thing, even with having zero trust in the labels of something like PEiD, right, you get kind of
this anticipated view. You expect a lot of things to kind of fall into a few buckets and then you get this really long tail that as an analyst is always a pain in the ass to deal with. So one of the other ways you can represent this is kind of these neat-looking bubble graphs. It's not really science unless you have sweet graphs. So this is just clustered on
assembly mnemonics. So once again kind of representing what you can see: this one really large cluster and kind of these other ones. But the signature language and this work revolved around a couple other features. So what do they look like? All right. So the darker blue is the actual group. So in
this case, that big orange one is the big dark blue one. And then within that one cluster, right, based only on assembly mnemonic similarity, you have kind of these three subclusters based on number of sections. So this is kind of interesting. There's maybe a little bit of variation. So maybe somebody used a slightly different version of something,
so forth and so on. Likewise with linker versions, I thought this was kind of neat: there's very little deviation for linker versions in this example set when used as a subclustering. So then this is kind of a two-dimensional view of a three-dimensional set of
features. So once again kind of that dark blue is the assembly mnemonic circle. Then you've got these various subcircles. Kind of the one on the lower right-hand corner. You can see the cluster and then you can see one cluster that was actually based off of number of sections. Then you can see two subclusters in that. And then
everything else only had that one cluster. So it's kind of cool. So let's look at Zeus. Much bigger data set, much bigger graphs. Much more science. So this is what Zeus looked like. And once again, like the earlier teaser, you get this massive, massive, massive PEiD unknown label. But the clustering actually breaks it up. So
this one and the stacked one, you can see the assembly mnemonic clustering on that yellow bar kind of in that blown up window is a little bit more manageable. And you kind of get this slightly more gentle sloping curve. But you get a
lot of bubbles. So either the end result is I shouldn't do anything in D3 or you should never D3 while you're high. Because both scenarios end badly. So once again, what
does it look like if we subcluster on a number of sections versus the initial cluster on assembly mnemonics? You get more circles. What if we do it on linker version? You get these crazy sub spirals. Things look so bizarre. This was for me kind of enjoyable because it was a really neat
exploration of Zeus and kind of a way to visualize this entire data set. And then when you subcluster on both, you just kind of want to go home and cry. It's never very good. All right. So I mentioned that I did something on 411,000 files, which was awesome. So let's talk about them. All right.
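A run like that boils down to pairwise comparisons of mnemonic streams. Here's a minimal sketch of the tapered Levenshtein described earlier, where an edit at (zero-based) position i costs 1 - i/n so left-hand edits weigh more, plus the 90% similarity cutoff. The exact weighting and tie-breaking in the actual tool may differ, and the mnemonic streams below are made up for illustration:

```python
def tapered_levenshtein(a, b):
    """Position-weighted edit distance: an edit at (0-based) position i
    costs 1 - i/n, so edits near the entry point count for more than
    edits further into the mnemonic stream."""
    m, n = len(a), len(b)
    size = max(m, n, 1)
    w = lambda i: 1.0 - i / size  # weight of an edit at position i
    prev = [0.0] * (n + 1)
    for j in range(1, n + 1):
        prev[j] = prev[j - 1] + w(j - 1)
    for i in range(1, m + 1):
        cur = [prev[0] + w(i - 1)] + [0.0] * n
        for j in range(1, n + 1):
            pos = max(i, j) - 1
            sub = 0.0 if a[i - 1] == b[j - 1] else w(pos)
            cur[j] = min(prev[j] + w(pos),      # delete
                         cur[j - 1] + w(pos),   # insert
                         prev[j - 1] + sub)     # match / substitute
        prev = cur
    return prev[n]

def similar_enough(a, b, threshold=0.90):
    """Similarity = 1 - distance/length; call it a match at >= 90%."""
    sim = 1.0 - tapered_levenshtein(a, b) / max(len(a), len(b), 1)
    return sim >= threshold

# Two 10-mnemonic streams differing only at positions 0 and 2:
a = ["push", "mov", "call", "add", "xor", "sub", "inc", "dec", "cmp", "ret"]
b = ["nop",  "mov", "lea",  "add", "xor", "sub", "inc", "dec", "cmp", "ret"]
# distance = 1.0 + 0.8 = 1.8, similarity = 0.82, below the 90% bar
print(similar_enough(a, b))  # False
```

Note how the same two edits would cost a flat 2 under plain Levenshtein; here the one at the entry point costs a full edit while the later one costs only 0.8, which is exactly the left-weighting the talk argues for.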
This is just the assembly mnemonic graph. So you can see there are tons of clusters based on, similarity based on assembly mnemonics. This is awful to read. So one of the fun facts about this is roughly 5800 out of these
411,000 files are not 90% similar to any other file in this entire corpus. I thought that was really cool and really surprising. So this might be some polymorphic stuff, right? It might be various cryptors. Who knows? But it
was cool. 5800 things is way too many for me to actually dig through. So we'll kind of skip through some of these. Everybody loves spirals, and I really wanted to leave 15 minutes for questions. So don't D3, or I shouldn't D3. I actually broke D3 on one of these. My adjacency matrix was
too big and it just wouldn't work. So once again, subclustered on number of sections, and this is the one that I broke. So this is where D3 just simply said, I give up. Or: you're doing it really wrong. And it might very well be that I was doing it really wrong, but it cried. So there were a couple really,
really cool things that popped out of this relatively large data set. Like Google Chrome. There were 97 Google Chrome instances, right, hashes in this 411,000 and they all matched the same signature. Right? They all had this kind of same assembly mnemonic string. So they're very
consistent with their builds at Google. So if anybody is in the room from Google, thanks. Appreciate that. Right? They're very consistent with what linkers they have, what linkers they use. So out of the 97, kind of the take home is, 94 of those 97 have matching linker versions, matching
number of sections and assembly mnemonics within 90%, right, this .9 distance. So this is kind of cool. And then it really wouldn't be a talk about packers if we didn't talk about UPX. Because somebody was going to ask about it. So this was kind of cool. This was kind of telling. I dug into
UPX some in the past, but this actually forced me to do a little bit more digging. So I kind of cheated and I said, all right, what if I do this really, really naively and just look for the string, right, UPX0, UPX1, or UPX! in the file, and say it's probably UPX, right, because once again
I didn't want to test any prior solution and I wanted to really see how this kind of stuff stacked up. So with the assembly mnemonics, right, just from doing that simple thing, it got 65 different groups, and I thought, shit, now I'm going to be laughed off stage. However,
there's some pretty cool results in here. So you can see in the table there's this group label and there's this count. So that's group label or the cluster label is just the arbitrary number that I assigned to it, this group. So you can kind of see once again you get this neat little slope. And I was like, all right, so maybe there's some variations of UPX. Maybe I'm much smarter than I thought I
was and I can do UPX version detection with this. Maybe my head's going to explode, or maybe I failed miserably. So I looked at how the algorithm stacked up against PEiD. While I didn't trust the PEiD results fully, it was neat to say either me and every random person that I pulled
signatures from on the internet were making the same mistakes, or maybe we were totally on to something. So kind of the cool thing was: here's the numbers, and it looks like maybe I was on to something after all. There's also kind of that none label. I dug through that a little bit to see what was going on and whether this algorithm was completely failing. It turns out
there's a bunch of packers that basically wrap UPX, which I really hadn't had much exposure to. So I thought that was awesome. So I learned a whole bunch there, these kinds of variations. So, right, let's go through the recap. The idea was easy-to-generate signatures. Had I had a working
demo, you would have literally seen me type one command and the signature would have appeared out of nowhere. It would have been awesome. But I can show you later, I'm happy to. It's not a lot of math. It's all written in Python, because Python is the new old Ruby. It's cross platform.
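For the curious, the header half of a signature (number of sections plus major/minor linker version) can be pulled straight out of a PE with nothing but stdlib struct offsets. This is my own sketch for illustration, not the released tool's code, which presumably uses a proper PE parser:

```python
import struct

def pe_header_features(data: bytes):
    """Read the header features the signatures lean on:
    NumberOfSections from the COFF header, and Major/MinorLinkerVersion
    from the start of the optional header."""
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature.
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file")
    # COFF header follows the 4-byte signature; NumberOfSections is
    # the 2-byte field right after Machine, i.e. at +6 overall.
    num_sections = struct.unpack_from("<H", data, e_lfanew + 6)[0]
    # Optional header starts 24 bytes past "PE\0\0"; the linker
    # versions sit right after its 2-byte Magic field.
    major, minor = struct.unpack_from("<BB", data, e_lfanew + 26)
    return {"num_sections": num_sections,
            "linker_major": major, "linker_minor": minor}

# Build a minimal synthetic header to exercise it (not a real binary):
hdr = bytearray(0x100)
hdr[0:2] = b"MZ"
struct.pack_into("<I", hdr, 0x3C, 0x80)      # e_lfanew
hdr[0x80:0x84] = b"PE\x00\x00"
struct.pack_into("<H", hdr, 0x86, 3)         # NumberOfSections = 3
hdr[0x80 + 26], hdr[0x80 + 27] = 6, 0        # linker version 6.0
print(pe_header_features(bytes(hdr)))
# -> {'num_sections': 3, 'linker_major': 6, 'linker_minor': 0}
```

Pair those three values with the first 30 disassembled mnemonics and you have the whole feature set the talk describes.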
It met that requirement. It's mostly easy to understand. It involves a little bit of math, but hopefully not too bad even for 5 o'clock on a Friday. And probably most important for me is: it works. So even though the promised demo didn't work, I'm going to release it online. The guys at work were more than happy to say, yeah, you
can totally release this tool and sample signatures for people to play with and use. That's the URL where it and these slides will live, the updated slides because the old ones are on the CD. So feel free to take a picture of it or you can ping me on Twitter. However, it's not up there yet because I'm a slacker so it will probably get done next week. And last but not least, if anybody has any
questions, I'm more than happy to answer them. Yeah. So the question is: once you have all of this data, what's the
action? And that's actually a really good question. So aside from why did I do it (because I love messing with things), it's important in my opinion for any analysis to drive an action. And the action is to understand what you're looking at as a malware analyst or somebody looking for extra context. So if I can kind of help solve part of the
signature management problem, and you can get this idea of fuzzy matching out of signatures and whatnot, and you have fairly accurate signatures with very low lift, right, when you're at your home organization and you go, man, I've got this piece of malware that I've never seen, and you grab 3,900 signatures off the Internet, right, you can go, oh, right, here's a technique that uses
these types of signatures that works that tells me how similar it is to some of these other things that other people have seen. So it kind of helps give you a starting point for analysis. Any more? Okay. Honestly, I haven't looked at
much, so I don't really know if I have a good opinion on it. Sorry. Any more? I mean, it would be awesome. Would you believe it? Oh, if anyone is using it, I haven't run into it. So the question was: have I run into anybody
actually putting the packer information into the packed files? My answer is no, because I didn't run into it in any of my sample sets. However, even at 410 or 411,000 binaries, given the number of executables that everybody
talks about, right, that's still a relatively small sample set, so it's nowhere near everything. Any more? Like a question? Yep. So: does this apply to protectors as well, or am I using packer in a broad sense? Yeah. When I say
packer, I mean protectors, cryptors, the whole gamut: the whole idea that it's going to obfuscate some intellectual property or something in a binary and make it hard for someone to get the juicy bits. Any more? Man, is my math that much on point that everybody fell asleep and nobody has questions on math? All right. Cool. So I'll
be around if anybody has questions. One more. How do I make this mustache happen? I think it is genetics. It is math. This is what happens when you do too much math. You
wear super classy shoes. So I actually had a really long beard at one point in time, and my wife hated my long beard because I told her I was going for wizard length. So I said,
you know, if I can't have a long beard, I'm going to have a long mustache. Now I sleep on the couch. Too much D3. Exactly. All right. Any more questions? Nobody? All right. Cool. Thanks for coming. I appreciate it.