AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood
Formal Metadata
Title 
AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood

Title of Series  
Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2018

Language 
English

Content Metadata
Subject Area  
Abstract 
Malware similarity analysis compares and identifies samples with shared static or behavioral characteristics. Identification of similar malware samples provides analysts with more context during triage and malware analysis. Most approaches to malware similarity in this domain have focused on fuzzy hashing, locality-sensitive hashing, and other approximate matching methods that index a malware corpus on structural features and raw bytes. ssdeep or sdhash are often utilized for similarity comparison despite known weaknesses and limitations. Signatures and IOCs are generated from static and dynamic analysis to capture features and matched against unknown samples. Incident management systems (RTIR, FIR) store contextual features, e.g. environment, device, and user metadata, which are used to catalog specific sample groups observed. In the data mining and machine learning communities, the nearest neighbor search (NN) task takes an input query represented as a feature vector and returns the k nearest neighbors in an index according to some distance metric. Feature engineering is used to extract, represent, and select the most distinguishing features of malware samples as a feature vector. Similarity between samples is defined as the inverse of a distance metric and used to find the neighborhood of a query vector. Historically, tree-based approaches have worked for splitting dense vectors into partitions but are limited to problems with low dimensionality. Locality-sensitive hashing attempts to map similar vectors into the same hash bucket. More recent advances make use of k-nearest neighbor graphs that iteratively navigate between neighboring vertices representing the samples. The NN methods reviewed in this talk are evaluated using standard performance metrics and several malware datasets. 
Optimized ssdeep and selected NN methods are implemented in Rogers, an open source malware similarity tool that allows analysts to process local samples and run queries for comparison of NN methods.

00:00
Welcome to AI Village. The next talk is "It's a Beautiful Day in the Malware Neighborhood" with Matt. We'd like to thank our sponsors: Endgame, Cylance, Sophos, and Tinder. And of course, silence your cellphones, and if you have an open seat next to you, please raise your hand so people coming in know there's a seat. Thanks.

Hey, good afternoon everyone. Even though Cylance is a sponsor, don't worry, this isn't a sponsored talk; this tool is completely open source. Again, my name is Matt Maisel, I'm the manager of security data science at Cylance, and specifically today I'm going to be talking about the use of nearest-neighbor search techniques applied to malware similarity, in a tool called Rogers that's open source on GitHub right now. I have a feature branch that I'm working on, so there will be updates to the content I present today, but this tool is designed for malware analysts and security data scientists to perform malware similarity research. Just some motivation: building databases of malware is interesting for analysts and for data scientists, since search and retrieval of similar samples can provide valuable context to analysts and systems. The objective in this case is to build a database that indexes malware by some attribute or set of attributes, and when we get some unknown sample, to query that database; hopefully, if we've ever seen a similar sample, we get back valuable context, maybe other labeled samples, maybe samples that have been reversed and that we have a lot of details on. That's use case number one for these systems. Use case number two: if a sample we've never seen before also doesn't match anything in our corpus, we can prioritize it for manual analysis or maybe more advanced reverse engineering. The final use case of search and retrieval systems for malware similarity is to augment larger systems that may be doing clustering or classification: we can use nearest-neighbor search techniques to process incoming alerts and leverage any hits we get back, with their context, to determine whether we want to route that sample to other workflows in our environment. Historically, of course, this has been done with big databases of cryptographic hashes and fuzzy hashing; notably ssdeep, which is still kind of the standard approach; in fact, I'd argue, still the standard.
02:19
So how does this relate to nearest-neighbor search? If we consider malware similarity as being performed through comparison over raw bytes, or over extracted static and dynamic features that distill the semantic characteristics, we can take these features and represent them in an n-dimensional feature space, and with that we can feed them into a lot of nearest-neighbor search algorithms, as well as other machine learning algorithms. Nearest-neighbor search, simply put, is the task of being given a set of samples X, our corpus, taking a query or unknown sample x_q, querying the index, and getting back the K nearest neighbors according to some distance function. There are many different distance functions that could be used here; a gentleman earlier today mentioned Euclidean, we can use cosine, and we can even use string metrics like edit (Levenshtein) distance. Ultimately, nearest-neighbor search is hard at scale in a high-dimensional space, so we have to look at approximate variants that allow some error threshold epsilon, which bounds our true distance whenever we query the index. There's a really simple example here in this two-dimensional space: if we query this red dot, the K nearest neighbors for K = 3 would be the three samples in the inner circle; with an approximate variant, there might be a chance that we accidentally return some of the samples in the larger radius.
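As a minimal illustration of that definition (my own sketch, not code from the talk), an exact K-nearest-neighbor query over a small corpus of feature vectors looks like this:

```python
import math

def cosine_distance(a, b):
    """1 - cos(a, b); 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def knn_query(corpus, x_q, k, dist=cosine_distance):
    """Exact k-NN: score every sample against the query and return the k
    closest as (sample_id, distance) pairs. This is the O(n) baseline that
    the approximate methods below try to beat at scale."""
    scored = sorted(enumerate(dist(x, x_q) for x in corpus), key=lambda t: t[1])
    return scored[:k]
```

Swapping `dist` for Euclidean or an edit-distance function changes the notion of similarity without touching the query logic.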
03:46
Don't worry, I don't have any algorithm deep dives; I actually cut a lot of that out, but there's a lot of interesting theory and literature around nearest-neighbor search from the past several decades. I categorize it into three different areas. There are tree-based methods, where we partition our data set into cells in our feature space and use tree data structures to exploit that, rapidly looking up and identifying the cell, as shown here, still in a two-dimensional space: we use a tree data structure to quickly look up nearest neighbors in a particular cell. There are also hashing-based nearest-neighbor methods, where typically we apply a non-cryptographic hash that has the property that any small change in the input space only results in a small change in the output space. The idea is that we're actually looking for collisions between similar objects, and we can come up with different hashing algorithms for that. Locality-sensitive hashing is one popular approach, where the whole idea is to find hash functions that take some input sample, represented by our feature vector, such that similar samples end up in the same bucket, with the same hash code. Ultimately this reduces the number of candidates, the ones that land in the same bucket, that we actually do distance comparisons on, because doing fully pairwise distance is super expensive and no one wants to do that. Finally, a more recent approach to nearest-neighbor search is graph-based methods. The general idea is that we build proximity graphs, maybe several layers of graphs stacked together, and we have algorithms that do an initial offline phase of building the graph, connecting neighbors with edges based on their similarity. Then at query time we land in one part of the graph and navigate around, traversing the graph, potentially across multiple layers, building up our candidate set for comparison. The downside is that a lot of the graph-building algorithms are extremely expensive: we have to find specific types of graphs during the offline build phase that make them easy to search and traverse in a short amount of time.
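The locality-sensitive-hashing idea above can be sketched with random-hyperplane LSH (an illustrative toy, not any particular library's implementation):

```python
import random

def lsh_key(planes, x):
    """Bucket key: the sign of x against each hyperplane (1 on the positive
    side, 0 otherwise). Nearby vectors tend to produce the same key."""
    return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                 for p in planes)

def random_planes(num_planes, dim, seed=0):
    """Random hyperplanes through the origin (Gaussian components)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def build_buckets(corpus, planes):
    """Index: map each bucket key to the sample ids hashing into it. Only
    samples sharing a bucket are later compared with a real distance metric."""
    buckets = {}
    for i, x in enumerate(corpus):
        buckets.setdefault(lsh_key(planes, x), []).append(i)
    return buckets
```

More planes means smaller, more selective buckets; multiple independent hash tables are usually combined to recover recall.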
06:05
So there's a crapton of methods out there. I highly recommend checking out the ann-benchmarks page on GitHub; there's a paper associated with it, and every so often the developer reruns all the latest and greatest implementations of these various nearest-neighbor search methods across a wide variety of data sets for benchmarking. The typical benchmark used here is the tradeoff between queries per second, i.e. how fast we can look up items, and recall, where recall is the fraction of the true nearest neighbors returned by the search. The general idea is that up and to the right is better, but you can usually see this tradeoff: we can query our index for nearest neighbors very quickly, say in a large production system, but at the expense of really low recall. Conversely, if we really want high recall, we want our approximate method to bring back nearly the exact results, and we typically trade that off at the cost of queries per second. One algorithm to point out here, which I'll get into in a bit, is HNSW, hierarchical navigable small world; that's this line right here. It does fairly well, and this is just one example on the New York Times data set for K = 100. If you go to their site, you'll see it actually does fairly well across a wide variety of data sets, and also at varying levels of K.
07:30
Hence that's one of the algorithms I specifically picked to look at, and I use an implementation of it in Rogers for malware similarity. This method was recently created, in 2017, and builds on a lot of different algorithms in graph-based nearest-neighbor search, but the basic idea at a high level is to construct a multi-layer graph and use it to greedily identify candidates for comparison. As I alluded to in my overview slide, there's a phase where we construct the graph, then we query for candidates through a traversal mechanism, iteratively searching neighboring nodes until some stopping criterion is met; HNSW defines all of that, the stopping criteria and the way the graph is built. Just to sketch this out: after we've built this graph consisting of multiple layers (it's really multiple layers of graphs), we set parameters for the algorithm that determine how deep, or I should say how shallow, a sample ends up in these layers; we basically start from the top layer and go down to the bottom layer. At query time, once these layers of graphs are constructed, we start at some entry point and navigate. In this case there's only one sample in the top layer, so the search visits its neighborhood, goes right to that sample, eventually reaches a local minimum because there are no other neighbors to look at, and then drops down to the next layer. This process continues until eventually we reach the final layer, and this can also be tuned at query time to determine how deep into the layers of graphs we'd like to go. Ultimately, the paper this approach is based on ensures that any samples we visit across all the layers are likely to be nearest neighbors, and that's what we use to determine the set of candidates that we end up doing a comparison against with our query. So that's the graph-based method.
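The greedy navigation step at the heart of that query procedure can be sketched on a single layer (a toy graph of my own, not the talk's implementation):

```python
def greedy_search(graph, dist, entry):
    """Greedy descent on one proximity-graph layer: repeatedly hop to whichever
    neighbor of the current node is closest to the query, stopping at a local
    minimum. HNSW repeats this per layer, using the result as the entry point
    for the layer below. `dist` maps a node id to its distance from the query."""
    current = entry
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):  # no neighbor improves: local minimum
            return current
        current = best
```

The real algorithm also keeps a beam of candidates (the `ef` parameter) rather than a single current node, which is what trades query speed against recall.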
09:18
There's also another really recent method that I caught at the NIPS workshop, a machine learning research conference, back last fall; it's really interesting. It's called prioritized dynamic continuous indexing, or PDCI, and there's an earlier iteration of it just called dynamic continuous indexing. The authors designed an exact randomized algorithm built around the idea of avoiding partitioning samples by the axes of the vector space. Going back to the tree-based example, where we split up our feature space along each feature: that has a lot of issues. The PDCI authors noticed that instead we can build indices by projecting our samples along random directions, and we can control the number of indices we build to determine how well the method works. The main gist is that at query time, as you visit the indices, your query is projected next to the samples that are nearby along each direction (either larger or smaller); you pop those samples off, and if a sample appears in all the indices, the paper shows it's highly likely to be among the exact nearest neighbors, so you add it to the candidate set for comparison. This is particularly interesting because it's an exact nearest-neighbor search method, and some of the guarantees in the paper are pretty compelling. Unfortunately, because it's an academic paper, there's no open source implementation, even though the author is a very well-respected PhD student, I think at Berkeley; so I went ahead and tried to do a naive implementation in Rogers. So that's PDCI.
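A simplified sketch of that projection-index idea (my own toy version, far simpler than the paper's prioritized algorithm, which visits entries incrementally rather than in one batch):

```python
import random

def project(u, x):
    """Dot product: the position of sample x along direction u."""
    return sum(a * b for a, b in zip(u, x))

def build_indices(corpus, num_indices, dim, seed=0):
    """One sorted list of (projection, sample_id) per random direction."""
    rng = random.Random(seed)
    directions = [[rng.gauss(0, 1) for _ in range(dim)]
                  for _ in range(num_indices)]
    indices = [sorted((project(u, x), i) for i, x in enumerate(corpus))
               for u in directions]
    return directions, indices

def candidate_set(query, directions, indices, m):
    """From each index take the m samples whose projections are closest to the
    query's projection; keep only samples appearing in every index. Those
    survivors are the candidates for a true distance comparison."""
    per_index = []
    for u, index in zip(directions, indices):
        q = project(u, query)
        nearest = sorted(index, key=lambda t: abs(t[0] - q))[:m]
        per_index.append({i for _, i in nearest})
    return set.intersection(*per_index)
```

The intuition: a far-away point can look close along one random direction, but it is unlikely to look close along all of them.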
11:13
These two algorithms, HNSW and PDCI, are the ones I focus on in this talk. Now, other malware similarity systems: there are quite a few with different approaches to nearest-neighbor search, or just similarity in general. Of course, VirusTotal has different ways to index data, using ssdeep but also a clustering API that, from my understanding of the docs, is based on feature hashing of the structural data pulled out by static feature extraction. This is actually where I sourced some of my datasets for evaluating these methods, which, as you'll soon see, was to my detriment, unfortunately. Our very own Brian Wallace, one of the AI Village core team members, wrote a blog post and a Virus Bulletin article a few years back that basically exploited the way the ssdeep message digest is built to eliminate most of the comparisons needed. More recently, you can take this idea and apply it to Elasticsearch as well, so you can use an off-the-shelf database for the same kind of indexed-ssdeep method. That's one of these similarity-digest methods, which sit in the larger group of hashing-based nearest-neighbor methods. Then of course there are the popular academic implementations of malware similarity systems. BitShred, highly cited, from back in 2011, uses pairwise Jaccard similarity, and it uses Hadoop to do that; it's fully pairwise, so it's very expensive, hence the need for Hadoop. There's also the malware provenance system, a little more recent, that uses MinHash, a locality-sensitive hashing family that approximates Jaccard similarity, across a sliding window of n-gram features on disassembled samples.
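The indexed-ssdeep trick can be sketched roughly like this (my own simplification of the idea; real digests come from the ssdeep tool, and ssdeep's comparison returns zero unless two digests share a 7-character substring, which is what makes n-gram keys a safe pre-filter):

```python
def digest_index_keys(digest, n=7):
    """An ssdeep digest has the form 'blocksize:chunk:double_chunk'. Emit every
    n-gram of each chunk tagged with its effective blocksize; two digests are
    only worth a full ssdeep comparison if they share at least one key, so an
    inverted index over these keys prunes the pairwise work."""
    blocksize, chunk, double_chunk = digest.split(':')
    keys = set()
    for bs, body in ((int(blocksize), chunk), (int(blocksize) * 2, double_chunk)):
        for i in range(len(body) - n + 1):
            keys.add((bs, body[i:i + n]))
    return keys
```

Storing these keys in any database with an inverted index (SQLite, Elasticsearch) turns a full pairwise scan into a handful of candidate lookups.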
13:11
The two final ones: there's Malheur, which focuses more on behavioral feature similarity comparisons, specifically for Cuckoo-style reports. It has a lot of different capabilities built in for clustering and also classification, but underneath the hood it uses behavior features for prototype identification, or prototype selection: you can identify prototypes in a large cluster, pretty much like centroids around those points, and use them as a way to do comparisons quickly. And there's SARVAM, which takes some computer vision ideas from image indexing: it takes a binary's raw bytes, converts them into a grayscale image, and then indexes that.
13:56
So there's a handful of systems out there and a lot of algorithms to choose from, with a lot of different properties, again going back to that performance tradeoff of queries per second against recall. When approaching the design of a system for malware similarity, specifically one to evaluate different nearest-neighbor search techniques with, I defined four key design ideas. Number one: we have to extract and store sample metadata and raw features. Two: we have to transform that raw data into some feature representation in an n-dimensional feature space. We might have a variety of vectorization pipelines we want to experiment with; talks earlier today mentioned a few, like TF-IDF, and I use feature hashing in my more recent approach. So we might want to swap out that vectorization pipeline depending on what we want to evaluate, which features we want to include, and how large our dataset is. Three: after we transform our features with one of these pipelines, we fit the different nearest-neighbor methods and do some bookkeeping, maybe saving the database structures that are required; that's the fit stage. Finally, once we've fit all these indices, we want to actually query samples, picking the parameter K to determine how many nearest neighbors to pull back. And possibly, if we have a database of sample metadata, we might also want to display contextual features: maybe we have a case database of previous incidents in our environment and want to annotate the samples we pull back with some of that, which might help provide more analyst context.
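Those four stages fit a small skeleton like this (all class and method names here are illustrative, not the actual Rogers API; the brute-force index stands in for any of the NN methods above):

```python
class BruteForceIndex:
    """Stand-in NN index: exact search by squared Euclidean distance."""
    def fit(self, vectors):
        self.vectors = vectors
    def query(self, v, k):
        d = lambda a: sum((x - y) ** 2 for x, y in zip(a, v))
        return sorted(range(len(self.vectors)),
                      key=lambda i: d(self.vectors[i]))[:k]

class SimilarityPipeline:
    """1) extract raw features, 2) vectorize, 3) fit an index, 4) query it.
    Swapping `vectorize` or `index` changes the experiment without touching
    the rest of the pipeline."""
    def __init__(self, extract, vectorize, index):
        self.extract, self.vectorize, self.index = extract, vectorize, index
    def fit(self, samples):
        self.index.fit([self.vectorize(self.extract(s)) for s in samples])
    def query(self, sample, k=5):
        return self.index.query(self.vectorize(self.extract(sample)), k)
```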
15:37
That really gives us the design I came up with for Rogers. Rogers is a Python 3 application that has a sample class; right now it's really only a PE class that focuses on basic static feature extraction using pefile, but it's built so that you can expand the number of sample classes if you bring in things other than portable executables. For vectorizers, there's a scikit-learn pipeline API that I use extensively. Right now I have latent semantic analysis, which, as an earlier talk mentioned, uses TF-IDF and then projects down; and more recently, because I started getting into datasets that were a little larger to handle, I started looking at feature hashing approaches. This can be extended with anything supported by scikit-learn, or other vectorizers as well. The final component is the index class: in some cases I use libraries, like for HNSW, or the LSH Forest in scikit-learn (which is going to be deprecated soon anyway), and then I have my own implementations of indexed ssdeep and PDCI, which at this point use SQLite as the store. All the feature data, I forgot to mention, is stored in a protobuf message definition whose structure lets you add different modalities of features, so you can add static features, dynamic features, and contextual features, and if you want, give each a variable type so that you could automatically build a feature vectorizer later on; but that's not really supported yet in the vectorizer class.
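The feature hashing (hashing trick) mentioned above can be sketched in a few lines (a generic illustration, not the Rogers vectorizer):

```python
import hashlib

def hash_features(tokens, dim=16):
    """Hashing trick: hash each token into one of `dim` buckets and count.
    The vector length stays fixed no matter how many distinct features
    (imports, section names, mnemonics, ...) the corpus contains, which is
    what makes it attractive for larger datasets."""
    vec = [0] * dim
    for token in tokens:
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec
```

The cost is collisions: unrelated features can share a bucket, so `dim` is chosen large enough that collisions rarely dominate.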
17:30
Cool, so now, unfortunately, to the sad-panda part of my talk. Doing these types of experiments and getting datasets for malware similarity is difficult, and unfortunately I didn't get as much time for experiments as I wanted. What we're looking at here are two charts. This chart is recall at K against the exact nearest neighbors: on the x-axis we have the values of K we picked for each experiment, and on the y-axis the recall, again the fraction of the true nearest neighbors returned for the query. On this side we have precision at K for the neighborhood's class, which is more of a search-engine kind of metric: the number of relevant documents in the top K, over K. In this case I'm just using the class of the samples, basically saying: if I query a sample and all the results of my query are the same class (I have labels for them), that indicates high precision at K. If you look at that, it does very well for these samples, and I'll illustrate why in a sec; but the actual exact-nearest-neighbor results are pretty bad: 0.3 for PDCI, and HNSW is pretty low too. And again, I did parameter selection, I did grid search over parameters for each of these methods and tried to tune things, but I really couldn't get anywhere. Just to highlight, this dataset is the VT clustering dataset with roughly twenty-five to twenty-seven thousand samples and fifteen classes.
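The two evaluation metrics just described can be written out directly (hypothetical helper names, shown only to pin down the definitions):

```python
def recall_at_k(true_neighbors, returned, k):
    """Fraction of the exact k nearest neighbors present in the returned
    top k; 1.0 means the approximate index matched the exact search."""
    return len(set(true_neighbors[:k]) & set(returned[:k])) / k

def precision_at_k(labels, query_label, returned, k):
    """Fraction of the returned top k that share the query sample's class
    label, using labels as a proxy for relevance."""
    return sum(1 for i in returned[:k] if labels[i] == query_label) / k
```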
19:12
Cool, so a quick demo to illustrate at least the interface. Rogers exposes a command-line application, so you can use it on the CLI, but I also have APIs exposed so that you can import Rogers, build an index, and then, in this case, use Plotly to visualize some of the results. I'll blow this up a little bit real quick.
19:40
Like I said, it's probably still hard to see in the back, but the idea here is that I've previously fit an index, this one specifically HNSW, and I have a standard API for passing in a sample, setting the number K, and getting back the neighbors. If we run this, we get back the query sample (I'm just doing some print statements here to make it easy to see what we're getting back), and then here are the nearest neighbors. And look how similar these are; this is cosine similarity, and the graph just displays them. But if we look at some of the features: here's the query sample, and we can see it has a "Lehmer" label, while some of these other samples, the "Reg Ron" one, and there's another too, are totally different; the labels themselves are different. What I realized after getting into this is that, given that my feature space is limited to these basic static feature extraction methods, and given that a lot of these samples are, I think, just packed samples, everything really looks identical. I think that might explain why I had really bad results, and it illustrates the need for me to get better datasets for evaluation.
20:50
better datasets for evaluation here. So, as a little bonus: Xori was released at Black Hat, and because of the frustration I had with some of the feature extraction around the basic static stuff, I went ahead and implemented a Xori feature extraction class for PE files. In this case I only got as far as pulling out a bag of words over the mnemonics. As an example, running Xori on this particular sample: I was able to run this on about five hundred samples and build a vectorizer, and now we can rerun the same query (sorry, this is a different sample) and leverage the bag-of-words mnemonics in addition to the other static features that are pulled out, and it gives us slightly different results. Again, I haven't done any formal comparison of these methods. So this kind of wraps things up, but the general idea is that Rogers is a tool for, one, experimenting with different nearest neighbor search techniques, but also a tool to build out vectorizers for the different similarity methods. Doing similarity in your environment might depend on your use cases: you might only want to do static comparison, you might only want to do dynamic, you may want to do both, or you might want to apply an automated disassembler like Xori. The vectorizers that I published with this tool are limited to pretty much just PE, so there's definitely opportunity for different modalities. There are also opportunities for doing feature selection and learning representations, to come up with a better feature space that could be used for similarity comparison. For fair experiments, in my case I did run some parameter optimization, but unfortunately I just need to get more datasets for doing benchmarks, and I also think it would be interesting to evaluate different distance metrics
beyond just Euclidean, and to use those to determine similarity for some of these methods; some of these methods, like HNSW, use cosine by default, while PDCI uses Euclidean. And finally, more use cases: in this case we're indexing malware samples, but you could also potentially index benign samples as well, and of course the key is being able to continuously update the index with new samples as they come in and have been classified or analyzed, so a partial-fit or insert operation would be pretty easy to extend as well. Cool, so at this point, time for
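The bag-of-words over mnemonics can be sketched in a few lines. The mnemonic stream and vocabulary below are made-up stand-ins for what an automated disassembler such as Xori would emit, not real Xori output:

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Map a token stream onto a fixed vocabulary as a count vector."""
    counts = Counter(tokens)
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical mnemonic stream for one PE sample (invented data).
mnemonics = ["push", "mov", "mov", "call", "xor", "mov", "ret"]
vocab = ["call", "mov", "push", "ret", "xor"]

print(bag_of_words(mnemonics, vocab))   # [1, 3, 1, 1, 1]
```

In practice the vocabulary would be built from the whole corpus during the vectorizer's fit step, and the resulting vectors concatenated with the other static features before indexing.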
23:17
questions. So again, this tool is up on GitHub. I do have a feature branch; I apologize, I've got to get it out, it's just been crazy the past week. That feature branch will add feature hashing, will add PDCI, and will basically publish the experiments. Again, I do feel that the experimental results were pretty weak, but maybe that can be explained by the dataset I was using; it would certainly be interesting to experiment more. So yeah, pull requests are welcome. Any questions at all? Sorry?
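The feature hashing mentioned for that branch is the standard "hashing trick". This is a minimal generic sketch of the idea, not the branch's actual code, and the feature strings are invented:

```python
import hashlib

def feature_hash(features, dim=8):
    """Hash arbitrary string features into a fixed-size count vector,
    so no global vocabulary is needed and new, unseen features never
    change the vector's dimensionality."""
    vec = [0] * dim
    for feat in features:
        bucket = int(hashlib.md5(feat.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1
    return vec

# Hypothetical static features for one sample (invented names).
v = feature_hash(["section:.text", "import:CreateFileA", "import:WriteFile"])
print(len(v), sum(v))   # 8 3 -- each feature lands in exactly one bucket
```

Libraries typically also apply a second hash to pick a sign for each feature, which reduces the bias from collisions; that refinement is omitted here for brevity.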
23:53
show the vectorizer? Yeah, so, I guess the Xori one, since I was experimenting with this right here. And this is the signature vectorizer: this one is actually using the Yara-Rules repo, so I just used the Yara detections as features themselves, and then applied TF-IDF to try to figure out which signatures are more useful compared to others. So yeah, in this case there really isn't any feature selection other than applying TF-IDF and then projecting down. Any final questions?
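The TF-IDF weighting over Yara rule hits can be illustrated like this; the rule names and per-sample hit sets are invented for the example, and this uses a plain unsmoothed IDF rather than whatever variant the tool applies:

```python
import math

def tfidf_weight(rule, doc, corpus):
    """Binary term frequency times inverse document frequency."""
    tf = 1.0 if rule in doc else 0.0
    df = sum(1 for d in corpus if rule in d)  # document frequency
    return tf * math.log(len(corpus) / df) if df else 0.0

# Invented Yara rule hits for three samples.
corpus = [
    {"upx_packer", "suspicious_strings"},
    {"upx_packer", "keylogger_api"},
    {"upx_packer", "suspicious_strings"},
]

# A rule that fires on every sample carries zero weight ...
print(tfidf_weight("upx_packer", corpus[0], corpus))                    # 0.0
# ... while a rarer rule is more discriminative.
print(round(tfidf_weight("suspicious_strings", corpus[0], corpus), 3))  # 0.405
```

This is the sense in which TF-IDF acts as a weak form of feature selection: rules that match everything, like a generic packer detection, contribute nothing to the distance between samples.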
24:45
I'm sorry, I can't hear you at all, I apologize. I still can't hear you. Oh, the metrics and the distances? So yeah, as I mentioned, HNSW uses cosine and PDCI is using Euclidean, but the metrics are just this recall at k. Basically, I do exact nearest neighbor search on my dataset using brute force, and I compute that for Euclidean and then for cosine, and those are basically the ground truth that I use in this case for recall at k for the exact nearest neighbors.
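The brute-force ground truth under both metrics can be sketched as follows. The 2-D vectors are toy data, not malware features; the point is that the choice of metric changes which samples count as neighbors:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def brute_force_nn(vectors, query, k, dist):
    """Exact k nearest neighbors by scanning every vector."""
    return sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:k]

# Vector 1 points in the same direction as the query but is far
# away in Euclidean terms, so the two metrics disagree.
vecs = [[1.0, 0.0], [10.0, 0.0], [0.0, 1.0]]
q = [2.0, 0.0]
print(brute_force_nn(vecs, q, 2, euclidean))        # [0, 2]
print(brute_force_nn(vecs, q, 2, cosine_distance))  # [0, 1]
```

Comparing an approximate index against the brute-force result for the *same* metric is what keeps the recall-at-k numbers meaningful.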
25:36
yeah, so as I mentioned, I don't do any feature learning
25:40
or any of that kind of stuff, and I also think that going beyond some of the basic static features would obviously change the results significantly, so I hope to do that in the future. Oh, sorry? Yeah, right. So, a little background: the question was, do more experiments? Pretty much. Experiments are fun; sometimes they're sad, but you've always got to push forward to the next one. The other question was what I learned doing this implementation. It was definitely a lot of fun to do the PDCI implementation; again, it's really naive, in Python 3, using SQLite to store those indices. It was great to have a chance to go through a paper that has no published source code, no published implementation for reference, and try to do that, and I picked up a lot of Python 3 things doing this project as well. And no, those algorithms don't do any feature engineering or feature selection; you basically pass your feature vector into the index, so that's a preprocessing step, which in this case I did at the vectorization phase. Cool, thanks everyone.
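For readers curious about the indexing idea behind PDCI, here is a highly simplified sketch of the Dynamic Continuous Indexing concept: project every vector onto a few random lines and rerank only the candidates whose projections lie closest to the query's. This is an illustration of the concept only, not the speaker's SQLite-backed implementation, and it omits the prioritized candidate traversal that gives Prioritized DCI its name:

```python
import math
import random

class SimpleDCI:
    """Toy DCI-style index: random projection lines plus exact
    reranking of the candidates nearest on each line."""

    def __init__(self, dim, num_lines=3, seed=0):
        rng = random.Random(seed)
        self.lines = []
        for _ in range(num_lines):
            v = [rng.gauss(0, 1) for _ in range(dim)]
            n = math.sqrt(sum(x * x for x in v))
            self.lines.append([x / n for x in v])  # unit direction
        self.vectors = []

    def insert(self, vec):
        # Appending is all an "insert" needs here; a real index keeps
        # each projection list sorted for fast lookups.
        self.vectors.append(vec)

    def _proj(self, line, vec):
        return sum(a * b for a, b in zip(line, vec))

    def query(self, q, k, candidates_per_line=4):
        cand = set()
        for line in self.lines:
            qp = self._proj(line, q)
            # Candidates whose projections are closest to the query's.
            order = sorted(range(len(self.vectors)),
                           key=lambda i: abs(self._proj(line, self.vectors[i]) - qp))
            cand.update(order[:candidates_per_line])
        # Exact Euclidean rerank over the candidate set only.
        dist = lambda i: math.sqrt(sum((a - b) ** 2
                                       for a, b in zip(q, self.vectors[i])))
        return sorted(cand, key=dist)[:k]

index = SimpleDCI(dim=3)
for v in ([0.0, 0.0, 0.0], [0.1, 0.0, 0.1],
          [5.0, 5.0, 5.0], [4.9, 5.1, 5.0]):
    index.insert(v)
print(index.query([0.08, 0.0, 0.08], 2))   # [1, 0]
```

The `insert` method also shows why continuously updating such an index with newly analyzed samples is cheap, which is the "partial fit" extension mentioned above.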
27:22
[Applause]