AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood
Formal Metadata

Title: It's a Beautiful Day in the Malware Neighborhood
Number of Parts: 322
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/39783 (DOI)
Transcript: English (auto-generated)
00:00
Welcome to AI Village. The next talk is "It's a Beautiful Day in the Malware Neighborhood" by Matt. We'd like to thank our sponsors Endgame, Cylance, Sophos, and Tinder. And of course, silence your cell phones, and if you have an open seat next to you, please raise your hand so people coming in know there's a seat. Thanks. Hey, good afternoon everyone. Even though Cylance is a
00:23
sponsor, don't worry, this isn't a sponsored talk; this tool is completely open source. So again, my name's Matt Maisel, and I'm the manager of security data science at Cylance. Today I'm going to be talking about the use of nearest neighbor search techniques applied to malware similarity, specifically in a tool
00:41
called Rogers. It's open source on GitHub right now, and I have a feature branch where I'm working on some updates from the content I'll present today. This tool is designed for malware analysts and security data scientists to perform malware similarity research. So, just some motivation. Building databases of our malware is interesting for analysts and for data scientists. Search and
01:03
retrieval of similar samples can provide valuable context to analysts and systems. The objective in this case is to build a database, index our malware by some attribute or set of attributes, and then, when we have some unknown sample, query that database. Hopefully, if we've ever seen a similar sample, we get back
01:22
valuable context: maybe other labeled samples, maybe samples that have been reversed and that we have a lot of details on. That's use case number one for these systems. Use case number two is that if it's a sample we've never seen before and it doesn't match anything in our corpus, we can prioritize it for manual analysis or maybe more advanced reverse
01:43
engineering. Finally, the last use case for search and retrieval systems in malware similarity is to augment larger systems that are doing clustering or classification. We can use nearest neighbor search techniques to process incoming alerts and leverage any hits we get back, with their
02:02
context, to determine whether we want to route that sample to other workflows in our environment. Historically this has of course been done with big databases of cryptographic hashes. Fuzzy hashing, notably ssdeep, is also still a standard approach, and still the de facto one, I'd argue. So how does this relate to nearest neighbor search? Consider malware similarity as being performed
02:23
through comparison of raw bytes, or of extracted static and dynamic features that distill the semantic characteristics. We can take these features and represent them in an n-dimensional feature space, and with that we can feed them into a lot of nearest neighbor search algorithms, as well as other machine learning algorithms. Nearest neighbor search, simply put, is the task of being
02:43
given a set of samples X, our corpus, taking a query or unknown sample x_q, querying the index, and getting back the k nearest neighbors according to some distance function. There are many different distance functions that could be used here. A gentleman earlier today mentioned Euclidean; we can use cosine, and we can even use other
03:02
metrics like string metrics that operate on edit distance or Levenshtein distance. Ultimately, nearest neighbor search is hard at scale in high-dimensional spaces, so we have to look at approximate variants that allow some error threshold epsilon, which bounds how far we can be from the true distance whenever we query our index. As a
03:24
really simple example, in this two-dimensional space, if we query this red dot, the k nearest neighbors for k equals three would be these three samples in the inner circle. If this were an approximate variant, there might be a chance that we accidentally return some of the samples in the larger radius here.
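To make the query concrete, here is a minimal brute-force sketch of exact k nearest neighbor search, assuming NumPy (illustrative only, not code from Rogers; the distance function could just as well be cosine or an edit distance):

import numpy as np

def knn_query(X, x_q, k=3):
    # Euclidean distance from the query to every indexed sample (brute force).
    dists = np.linalg.norm(X - x_q, axis=1)
    # Indices of the k smallest distances, i.e. the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    return nearest, dists[nearest]

# Toy corpus of 2-D feature vectors and a query point.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [5.0, 5.0]])
neighbors, distances = knn_query(X, np.array([0.15, 0.15]), k=3)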
03:45
So don't worry, I don't have any algorithms on the slides, and I actually cut a lot of this out, but there's a lot of interesting theory and literature around nearest neighbor search from the past several decades. I categorize the methods into three different areas. First, there are tree-based methods, where we partition our data set into
04:04
cells in our feature space, and use tree data structures to exploit that structure and rapidly look up and identify the right cell, shown here again in a two-dimensional space. We use a tree data structure to quickly look up and identify nearest neighbors in
04:20
that particular cell. There are also hashing-based nearest neighbor methods, where we typically apply a non-cryptographic hash with the property that any small change in the input space only results in a small change in the output space. The idea is that we're actually looking for collisions between similar objects. There are different hashing algorithms we can come up with here;
04:44
locality sensitive hashing is one popular one, where the whole idea is to find hashing functions that take some input, in this case a sample represented by our feature vector, such that similar samples end up in the same bucket, with the same hash code. Ultimately this reduces the number of candidates, the ones that land in the same bucket,
05:03
that we actually have to do distance comparisons on, because doing full pairwise distance is super expensive and no one wants to do that.
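A minimal sketch of one flavor of this idea, random-hyperplane (cosine) LSH; this is illustrative only and not the specific scheme used by any of the systems mentioned later:

import numpy as np

rng = np.random.default_rng(0)

def lsh_bucket_ids(X, n_bits=16):
    # One random hyperplane per bit; similar vectors tend to fall on the
    # same side of most hyperplanes and therefore share a bucket id.
    planes = rng.normal(size=(X.shape[1], n_bits))
    bits = (X @ planes) >= 0
    # Pack the sign pattern into a single integer bucket id per sample.
    return bits.astype(int) @ (1 << np.arange(n_bits))

X = rng.normal(size=(1000, 64))
buckets = lsh_bucket_ids(X)
# Only samples that collide in the same bucket get compared with a real
# distance function, which avoids the full pairwise comparison.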
05:20
Finally, a more recent approach to nearest neighbor search is graph-based methods. The general idea is that we build proximity graphs, maybe several layers of graphs stacked together. The algorithms do an initial offline phase of building the graph, connecting neighbors with edges based on their similarity. Then at query time we land in one part of the graph and navigate around,
05:42
basically traversing the graph, potentially across multiple layers, and building up our candidate set for comparison. The downside is that a lot of the graph construction algorithms are extremely expensive: we have to build specific types of graphs during the offline build phase that make them easy to search and traverse in a short amount of time. So
06:03
there's a crap ton of methods out there. I highly recommend checking out the ANN-Benchmarks page on GitHub; there's a paper associated with it as well. Every so often the developer reruns all the latest and greatest implementations of these various nearest neighbor search methods across a wide
06:22
variety of data sets for benchmarking. The typical benchmark used here is the trade-off between queries per second, so how fast we can look up items, and recall, which in this case is the fraction of the true nearest neighbors returned in our search. The general idea is that up and to the right is better, but you can see there's usually
06:42
a trade-off: we can query our index for nearest neighbors very quickly, which matters in a large production system, but at the expense of really low recall. Conversely, if we want high recall, meaning we want our approximate methods to bring back the exact results, we typically trade that off at the cost of queries per
07:01
second. One algorithm to point out here, which I'll get into in a bit, is HNSW, or hierarchical navigable small world. That's this line right here, and it does fairly well. This is just one example, on the New York Times data set for k equals 100. If you go back to their site you'll see that it actually does
07:21
fairly well across a wide variety of data sets, and also does well at varying levels of k. Hence why it's one of the algorithms I specifically picked to look at, and whose implementation I use in Rogers for malware similarity. This method was created recently, in 2017, building on a lot of different algorithms in
07:41
graph-based nearest neighbor search. The basic idea at a high level is to construct a multi-layer graph and use it to greedily identify candidates for comparison. As I alluded to in my overview slide, there's a phase where we construct the graph; then we query candidates through a traversal mechanism and iteratively search
08:01
neighboring nodes until some stopping criterion is met. HNSW defines all of that: the stopping criteria and the way the graph is built. Just to sketch it out, after we've built this graph consisting of multiple layers (it's really multiple layers of graphs), we set parameters for the algorithm that determine how frequently, basically how shallow, a sample ends up
08:24
in one of these layers. We start from the top layer here and go down to the bottom layer. The idea at query time is that once we've constructed these layers of graphs, we start at some point and navigate. In this case there's only one sample, so it searches the neighborhood, goes right to the sample, eventually
08:42
reaches a local minimum because there are no other neighbors to look for, and then drops down to the next layer. This process continues until we eventually reach the final layer. It can also be tuned at query time to determine how deep into the layers of graphs we'd like to go.
09:00
Ultimately, the paper's construction ensures that any samples we visit across all the layers are likely to be nearest neighbors, and that's what we use to determine the candidates we end up doing comparisons against for our query. So that's a graph-based method.
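For reference, the build-then-query flow looks roughly like this with one common Python binding, hnswlib (parameters here are illustrative, and this may not be the exact binding or settings Rogers uses):

import numpy as np
import hnswlib

dim = 256
X = np.random.rand(10_000, dim).astype("float32")

# Offline build phase: construct the multi-layer graph over the corpus.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=X.shape[0], ef_construction=200, M=16)
index.add_items(X, np.arange(X.shape[0]))

# Query phase: ef controls how much of the graph is explored per query.
index.set_ef(50)
labels, distances = index.knn_query(X[:1], k=5)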
09:20
There's also another really recent method that I caught at the NIPS workshop, a machine learning research conference, back last fall. It's really interesting: it's called prioritized dynamic continuous indexing, or PDCI, and there's an earlier iteration of it just called dynamic continuous indexing. The authors actually designed an exact randomized algorithm. It's built
09:43
around the idea of avoiding partitioning samples by vector space. Going back to the example of the tree-based methods, where we split up our feature space along each feature, that approach has a lot of issues. The PDCI authors noticed that instead we can just build indices that
10:03
take our samples and project them along a random direction. We can control the number of indices we build, which largely determines how well the method works. So we construct multiple indices, and the main gist is that as you visit each index and query it, you end up at the place where
10:24
your query is projected to. Any samples that are nearby in projection, whether larger or smaller, you pop off. And if a sample ends up appearing in all of the indices, the paper shows it's highly likely to be one of the exact nearest
10:40
neighbors, so you add it to the candidate set for comparison. This is particularly interesting because it's an exact nearest neighbor search method, and some of the guarantees in the paper are pretty compelling. Unfortunately, because it's an academic paper (the author is a very well respected PhD
11:01
student, I think at Berkeley), there's no open source implementation yet. So I went ahead and tried to do a naive implementation in Rogers. So that's PDCI. Again, these two algorithms are the ones I focus on for this talk.
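Here is a very loose sketch of the core DCI intuition, random one-dimensional projections with candidates that stay close to the query in all of them; this is nowhere near the prioritized algorithm from the paper (or the Rogers code), just the general shape of the idea:

import numpy as np

rng = np.random.default_rng(0)

class ToyDCI:
    def __init__(self, X, n_indices=10):
        self.X = X
        # One random direction per index; each sample is stored by its
        # scalar projection onto that direction.
        self.dirs = rng.normal(size=(n_indices, X.shape[1]))
        self.proj = X @ self.dirs.T  # shape (n_samples, n_indices)

    def query(self, x_q, k=5, per_index=50):
        q = self.dirs @ x_q
        hits = {}
        for j in range(self.dirs.shape[0]):
            # Samples whose projection lies closest to the query's projection.
            near = np.argsort(np.abs(self.proj[:, j] - q[j]))[:per_index]
            for i in near:
                hits[i] = hits.get(i, 0) + 1
        # Keep samples that are near the query in every projection, then rank
        # the survivors by their true distance.
        cand = [i for i, c in hits.items() if c == self.dirs.shape[0]] or list(hits)
        d = np.linalg.norm(self.X[cand] - x_q, axis=1)
        order = np.argsort(d)[:k]
        return [cand[i] for i in order], d[order]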
11:22
So, other malware similarity systems: there are quite a few that take different approaches to nearest neighbor search, or to similarity in general. Of course VirusTotal has different ways to index data, using ssdeep, but also a clustering API that, from my understanding of the docs, is based on feature hashing of the structural data pulled out by static feature
11:42
extraction. That's actually where I sourced some of my data sets for evaluating these methods, which, as you'll soon see, was to my detriment unfortunately. Our very own Brian Wallace, one of the AI Village core team members, wrote a blog post in Virus Bulletin a few years back that
12:02
exploited the way the ssdeep message digest is built in order to eliminate the number of comparisons needed. More recently, you can take that idea and apply it to Elasticsearch as well, so you can use an off-the-shelf database for the same kind of indexed ssdeep approach.
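For context, plain (un-indexed) ssdeep comparison looks like this with the Python ssdeep bindings; the file paths are made up, and the indexed variants described above avoid running compare() against every stored digest by first matching on chunks of the digest:

import ssdeep  # Python bindings for the ssdeep/libfuzzy fuzzy-hash library

h1 = ssdeep.hash(open("sample_a.bin", "rb").read())  # hypothetical file
h2 = ssdeep.hash(open("sample_b.bin", "rb").read())  # hypothetical file

# Similarity score from 0 (no match) to 100 (near-identical).
score = ssdeep.compare(h1, h2)
print(h1, h2, score)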
12:24
So again, that's using one of these similarity digest methods, which fall into the larger group of hashing-based nearest neighbor methods. Then there are the popular academic implementations of malware similarity systems. BitShred, from back in 2011, is highly cited; it
12:40
uses pairwise Jaccard similarity, and it uses Hadoop to do it, because fully pairwise comparison is very expensive, hence the need for Hadoop. There's also the malware provenance system, which is a little more recent, that uses MinHash, a locality sensitive hashing family that approximates Jaccard similarity,
13:02
applied across a sliding window of n-gram features on disassembled samples. Those are two of the final ones here. There's also Malheur, which focuses more on behavioral feature similarity comparisons, specifically for Cuckoo. It has a lot of
13:20
different capabilities built in for clustering and also classification, but under the hood it uses behavior features for prototype identification, or prototype selection. You can identify prototypes within a large cluster, pretty much like a centroid around those points, and use those as a way to quickly do comparisons. Then there's also SARVAM, which takes some
13:45
computer vision ideas from image indexing: it takes a binary as raw bytes, converts it into a grayscale image, and then indexes that. So there's a handful of systems out there and a lot of algorithms to choose from, each with different properties, again going back to that performance trade-off of queries per second versus recall.
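The byte-to-image trick is roughly the following sketch, under my reading of that approach and with a hypothetical file path; the real system then indexes image features computed from the resulting array:

import numpy as np

def bytes_to_grayscale(path, width=256):
    # Interpret the binary's raw bytes as pixel intensities and reshape them
    # into a 2-D array, one row per `width` bytes.
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    rows = len(data) // width
    return data[: rows * width].reshape(rows, width)

# img = bytes_to_grayscale("sample.bin")  # hypothetical path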
14:02
So when approaching the design of a system to do malware similarity, and specifically to use that system to evaluate different nearest neighbor search techniques, I defined four key design ideas. Number one, of course, we
14:21
have to extract and store sample metadata from our raw features. We then have to transform that raw data into some feature representation, again in this n-dimensional feature space. We might have a variety of vectorization pipelines we want to experiment with; talks earlier today mentioned a few, like TF-IDF, and
14:42
I use feature hashing specifically in my more recent approach. So we might want to swap out that vectorization pipeline depending on what we want to evaluate, what features we want to include, and also how large our data set might be. After we transform our features with one of these pipelines, we then fit the different nearest neighbor methods and do some bookkeeping, maybe
15:02
saving some of the database structures that are required. That's the fit stage. Finally, once we've fit all these indices, we want to actually query samples, picking the parameter k to determine how many nearest neighbors we pull back. And possibly, if we have that database of sample metadata, we might also want to display
15:23
contextual features: maybe we have a case database of previous incidents in our environment and we want to annotate the samples we pull back with some of that. It might help give the analyst more context.
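Those four steps, extract, transform, fit, query, look roughly like this with stock scikit-learn pieces (feature names and values here are invented, and Rogers has its own classes for each stage):

from sklearn.feature_extraction import FeatureHasher
from sklearn.neighbors import NearestNeighbors

# 1. Extract: per-sample metadata as sparse dicts of feature name -> value.
samples = [
    {"section:.text": 1, "import:CreateFileA": 1, "entropy": 6.1},
    {"section:.text": 1, "import:RegSetValueExA": 1, "entropy": 7.4},
    {"section:UPX0": 1, "import:LoadLibraryA": 1, "entropy": 7.9},
]

# 2. Transform: feature hashing into a fixed-width n-dimensional space.
hasher = FeatureHasher(n_features=2**16)
X = hasher.transform(samples)

# 3. Fit: build the nearest-neighbor index over the corpus.
index = NearestNeighbors(metric="cosine").fit(X)

# 4. Query: vectorize the unknown sample the same way, pick k, pull neighbors.
x_q = hasher.transform([{"section:.text": 1, "import:CreateFileA": 1}])
distances, ids = index.kneighbors(x_q, n_neighbors=2)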
15:42
What I came up with from all of that is this design for Rogers. Rogers is a Python 3 application that has a sample class; in this case it's really only a sample class consisting of the PE class, which focuses on basic static feature extraction using pefile. But it's built in a way that you
16:00
could expand the number of sample classes if you bring in things other than just portable executables. Then there are vectorizers: these use the scikit-learn pipeline APIs, which I rely on extensively, and right now I have a latent semantic analysis vectorizer which, as an earlier talk mentioned, uses TF-IDF and then projects down.
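That LSA vectorizer is conceptually just TF-IDF followed by a truncated SVD projection; a minimal sketch with made-up token strings (not Rogers' actual pipeline definition):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Pretend each sample has been flattened to a string of static-feature tokens.
docs = [
    "kernel32 CreateFileA .text .rdata",
    "kernel32 RegSetValueExA .text .rdata",
    "UPX0 UPX1 LoadLibraryA GetProcAddress",
]

# TF-IDF weighting, then project down to a small dense feature space.
lsa = make_pipeline(TfidfVectorizer(token_pattern=r"\S+"),
                    TruncatedSVD(n_components=2))
X = lsa.fit_transform(docs)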
16:23
More recently, because I started getting into data sets that were a little larger to handle, I started looking at feature hashing approaches. This can be extended with anything supported by scikit-learn, or with other vectorizers as well. The final component is the index class, where I have implementations or use
16:42
libraries, for example for HNSW, or LSHForest in scikit-learn, which is going to be deprecated soon anyway. I also have implementations for indexed ssdeep and for PDCI, and at this point they use SQLite as the store, at least for the indexed ssdeep and PDCI
17:02
methods. All the feature data, I forgot to mention, is stored in protobuf, in a message definition with a structure that lets you add different modalities of features. So you can add static features, dynamic features, and contextual features, and then, if you want, give each one a
17:21
variable type, if you'd like to automatically build feature vectorizers later on, but that's not really supported yet in the vectorizer class. Cool, so now, unfortunately, the sad panda part of my talk. Doing these types of experiments and getting data sets for malware similarity is difficult, and
17:41
unfortunately I didn't get as much time for the experiments as I wanted. What we're looking at here are two charts. This chart is recall at k for exact nearest neighbors: on the X axis we have the values of k we picked for each experiment, and on the Y axis is the recall. Again,
18:02
this is the fraction of the true nearest neighbors returned for the query. On the other side we have precision at k for the neighborhood class. This is a similar kind of metric, where we're looking at the relevant documents over k, or, sorry, the relevant results returned over the total number of relevant documents
18:22
in our data set. In this case I'm just using the class labels for the samples to say: if I query a sample and all the results that come back are the same class (because I have labels for them), that indicates high precision at k. If you look at this chart, it does very well for
18:41
these samples, and I'll illustrate why in a second. But the actual exact nearest neighbor results are pretty bad: we get about 0.3 for PDCI, and HNSW is also pretty low. And I did do parameter selection, a grid search over parameters for each of these methods, and tried to come up with something, but I really couldn't get
19:02
anywhere. Just to highlight it again, this data set comes from the VT clustering data set, with about twenty-seven thousand samples and fifteen classes. So, quick demo time, to illustrate at least the interface. Rogers is exposed as a command line application, so if you want to use it on the CLI you can, but I also have APIs
19:23
exposed so that you can import Rogers and build an index, and here I use Plotly just to visualize some of the results. I'll blow this up a little bit real quick. Like I said, it's still probably hard to
19:41
see in the back, but the idea is that I've previously fit an index, this one specifically HNSW, and I have some standard APIs here for passing in a sample, setting the number k, and getting back the neighbors. In this case, if we clear this, we can see we get back the sample; here's the query sample, and I'm just
20:01
doing some print statements to make it easy to see what we're getting back. Then here are the nearest neighbors, and look how similar these are; this is cosine similarity. The graph just displays them, and if we look at some of the features, here's the query sample, which has the Lamer label, and if we look at some of these other samples,
20:21
one is RegRun, there's another one too, and yeah, totally different. So the labels themselves are different, and what I realized after getting into this is that, given that my feature space is limited to the static feature extraction methods, and given that a lot of these
20:42
are, I think, just packed samples, everything really looks identical. So I think that might explain why I had really bad results, and it illustrates the need for me to get better data sets for evaluation. So, just as a little bonus: Xori was released at Black Hat, and because of the frustration I had with the feature extraction around the basic static stuff, I went ahead and
21:02
implemented an Xori feature extraction class for PE. In this case I only got as far as pulling out a bag of words of the mnemonics, so this is just an example of running Xori on this particular sample. I was able to run it on around 500 samples and built a vectorizer.
21:20
Now we can rerun the same query (sorry, this is a different sample) and leverage the bag-of-words mnemonics in addition to the other static features that are pulled out, and it gives us slightly different results. Again, I haven't done any formal comparison of these methods.
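The bag of words over mnemonics amounts to something like this; the mnemonic sequences below are invented, and the extraction step from the disassembler output is assumed to have already happened:

from sklearn.feature_extraction.text import CountVectorizer

# One space-separated mnemonic sequence per sample, e.g. pulled from
# disassembler output (sequences here are made up).
mnemonic_docs = [
    "push mov call pop ret mov xor jmp",
    "push push call add ret xor xor jmp",
]

# One column per mnemonic, counts per sample; these columns can then be
# stacked alongside the basic static features before indexing.
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(mnemonic_docs)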
21:41
So this kind of wraps things up, I think. The general idea is that Rogers is a tool for experimenting with different nearest neighbor search techniques, but it's also a tool for building out vectorizers for the different methods. Doing similarity in your environment might depend on your use cases: you might only want to do static comparison,
22:02
you might only want to do dynamic, you might want to do both, and you might want to apply an automated disassembler like Xori. At least in the vectorizers I published with this tool, it's very limited to PE, so there's definitely opportunity for different modalities. There are also opportunities for doing feature selection and learning representations, to
22:22
come up with a better feature space that could be used for similarity comparison. For experiments, I did run some parameter optimization in my case, but unfortunately I just need more data sets for doing benchmarks. I also think it would be interesting to evaluate different distance metrics beyond just Euclidean for some of this,
22:40
and use that to determine similarity for some of these methods; for example, HNSW uses cosine by default, but PDCI is Euclidean. And finally, more use cases: here we're only indexing malware samples, but you could also index benign samples as well, and of course the key is being able to
23:02
continuously update the index with new samples as they come in and have been classified or analyzed. So doing a partial fit, or an insert operation, would be pretty easy to extend as well. Cool, so at this point, questions. Again, this tool is up on GitHub. I do have a feature branch, I apologize,
23:22
I've got to get it out; it's just been crazy the past week. That feature branch will add feature hashing, will add PDCI, and will publish the experiment. Again, I do feel the experimental results were pretty weak, but maybe that can be explained by the data set I was using; it would certainly be interesting to experiment more. So yeah, pull
23:42
requests are welcome. Any questions at all? Go ahead. Sorry, go. Show the vectorizers? Yeah, so, Xori... actually, because I was experimenting with this one right here: this is a
24:06
signature vectorizer, and this one is actually using the YARA rules repo. I just used the YARA detections as features themselves and tried to figure out, applying TF-IDF in this case, which signatures are more
24:21
useful compared to others. Go ahead. Oh, so yeah, in this case there really isn't any feature selection, other than applying TF-IDF and then projecting down. Any final questions? I'm sorry, I can't hear
24:47
you at all. I apologize, I still can't hear you. Oh, the metrics and the distance?
25:01
So yeah, as I mentioned, HNSW uses cosine and PDCI is using Euclidean, but the metrics are just this recall at k. Basically, I do exact nearest neighbor search on my data set using brute force, and I compute that for Euclidean and for cosine, and those are
25:20
basically the ground truth I use in this case for the recall at k, for the exact neighbors.
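In other words, the recall-at-k number for a query is just the overlap between the approximate result and the brute-force ground truth; a small sketch with made-up neighbor ids:

def recall_at_k(approx_ids, exact_ids):
    # Fraction of the true k nearest neighbors the approximate index returned.
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = [4, 17, 23, 42, 99]    # brute-force neighbors (Euclidean or cosine)
approx = [4, 23, 7, 42, 101]   # what the approximate index returned
print(recall_at_k(approx, exact))  # 0.6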
25:41
Yeah, so as I mentioned, I don't do any feature learning or anything like that, and I also think that going beyond some of the basic static features would obviously change the results significantly, so I hope to do that in the future. Oh sorry, didn't see you. Right.
26:16
Yeah, so just a little background: the question is basically to do more experiments. Experiments are fun;
26:21
sometimes they're sad, but you've always got to push forward to the next one. The other question was what did I learn doing this implementation. It was definitely a lot of fun to do the PDCI implementation; again, it's really naive, in Python 3, using SQLite to store those indices. It was great to have the chance
26:40
to go through a paper that has no published source code, no published implementation for reference, and try to implement it, and there were also a lot of Python 3 things I picked up during this project. Yeah, so those
27:03
algorithms don't do any feature engineering and they don't do any feature selection; you basically pass your feature vector into the index, so that's a pre-processing step, which in this case I did at the vectorization phase. Cool, thanks everyone.