AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood

Video in TIB AV-Portal: AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood

Formal Metadata

AI VILLAGE - It's a Beautiful Day in the Malware Neighborhood
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Malware similarity analysis compares and identifies samples with shared static or behavioral characteristics. Identification of similar malware samples provides analysts with more context during triage and malware analysis. Most domain approaches to malware similarity have focused on fuzzy hashing, locality-sensitive hashing, and other approximate matching methods that index a malware corpus on structural features and raw bytes. ssdeep and sdhash are often used for similarity comparison despite known weaknesses and limitations. Signatures and IOCs are generated from static and dynamic analysis to capture features and are matched against unknown samples. Incident management systems (RTIR, FIR) store contextual features, e.g. environment, device, and user metadata, which are used to catalog specific sample groups observed. In the data mining and machine learning communities, the nearest neighbor search (NN) task takes an input query represented as a feature vector and returns the k nearest neighbors in an index according to some distance metric. Feature engineering is used to extract, represent, and select the most distinguishing features of malware samples as a feature vector. Similarity between samples is defined as the inverse of a distance metric and used to find the neighborhood of a query vector. Historically, tree-based approaches have worked for splitting dense vectors into partitions but are limited to problems with low dimensionality. Locality-sensitive hashing attempts to map similar vectors into the same hash bucket. More recent advances make use of k-nearest neighbor graphs that iteratively navigate between neighboring vertices representing the samples. The NN methods reviewed in this talk are evaluated using standard performance metrics and several malware datasets.
Optimized ssdeep and selected NN methods are implemented in Rogers, an open source malware similarity tool that allows analysts to process local samples and run queries for comparison of NN methods.
Welcome to AI Village. The next talk is "It's a Beautiful Day in the Malware Neighborhood" by Matt. We'd like to thank our sponsors Endgame, Cylance, Sophos, and Tinder. And of course, silence your cellphones, and if you have an open seat next to you, please raise your hand so that people coming in know there's a seat. Thanks.

Hey, good afternoon everyone. Even though Cylance is a sponsor, don't worry, this isn't a sponsored talk; this tool is completely open source. Again, my name is Matt Maisel, I'm the manager of security data science at Cylance, and today I'm going to be talking about nearest neighbor search techniques applied to malware similarity, specifically in a tool called Rogers that's open source on GitHub right now. I have a feature branch that I'm working on, so there are updates beyond the content I'll present today, but this tool is designed for malware analysts and security data scientists to perform malware similarity research.

So, just some motivation. Building databases of malware is interesting for analysts and for data scientists: search and retrieval of similar samples can provide valuable context to analysts and systems. The objective is to build a database that indexes malware by some attribute or set of attributes, so that when we have some unknown sample we can query that database and, hopefully, if we've ever seen a similar sample, get back valuable context: maybe other labeled samples, maybe samples that have been reversed and that we have a lot of details on. That's use case number one for these systems. Use case number two: if it's a sample we've never seen before and it also doesn't match anything in our corpus, we can prioritize it for manual analysis or more advanced reverse engineering. Finally, the last use case of search and retrieval systems for malware similarity is to augment larger systems, maybe doing
clustering or maybe doing classification. We can use nearest neighbor search techniques to process incoming alerts and leverage any hits that we get back, along with their context, to determine whether we want to route that sample to other workflows in our environment. Historically, of course, this has been done with big databases of cryptographic hashes. Fuzzy hashing, notably ssdeep, is also a standard approach, still, in fact, I'd argue. So how does
this relate to nearest neighbor search? If we consider malware similarity as a comparison over raw bytes, or over extracted static and dynamic features that distill the semantic characteristics, we can take these features and represent them in an n-dimensional feature space, and feed that into a lot of nearest neighbor search algorithms, as well as other machine learning algorithms.

Nearest neighbor search, simply put, is the task of being given a set of samples X, our corpus, taking a query sample or unknown sample x_q, querying the index, and getting back the k nearest neighbors according to some distance function. There are many different distance functions that could be used here. A gentleman earlier today mentioned Euclidean; we can use cosine; we can even use other metrics, like string metrics such as edit distance or Levenshtein distance.

Ultimately, nearest neighbor search is of course hard at scale in a high-dimensional space, so we have to look at approximate variants that allow some error threshold epsilon, which bounds how far we can be from the true distance whenever we query the index. There's a really simple example here in this two-dimensional space: if we query this red dot, the k nearest neighbors for k = 3 would be these three samples in the inner circle. If this is an approximate variant, there's a chance that we could accidentally return some of these samples in the larger radius here. So don't worry, I don't
have any algorithms; I actually cut a lot of this out. But there's a lot of interesting theory and literature around nearest neighbor search over the past several decades. I categorize the methods into three different areas.

There are tree-based methods, where we partition our data set into cells in our feature space. We use tree data structures to exploit that and rapidly look up and identify the cell; in the case shown here, still in a two-dimensional space, we use a tree data structure to quickly look up and identify the nearest neighbors in this particular cell.

There are also hashing-based nearest neighbor methods, where we typically apply a non-cryptographic hash with the property that any small change in the input space only results in a small change in the output space. The idea is that we're actually looking for collisions between similar objects. We can come up with different hashing algorithms; locality-sensitive hashing is one popular one, where the whole idea is to find hashing functions that take some input sample, represented by our feature vector, such that similar samples end up in the same bucket, or with the same hash code. This reduces the number of candidates, the ones that end up in the same bucket, that we actually do distance comparisons on, because doing full pairwise distance comparison is super expensive and no one wants to do that.

Finally, a more recent approach to nearest neighbor search is graph-based methods. The general idea is that we build proximity graphs, maybe several layers of graphs that we stack together, and we have algorithms that do an initial kind of
offline phase of building our graph, connecting neighbors with edges based on their similarity. Then, at query time, we end up in one part of the graph and navigate around, traversing the graph, potentially across multiple layers, and build up our candidate set for comparison. The downside is that a lot of the graph algorithms are extremely expensive: in the offline build phase we have to find specific types of graphs that are easy to search and traverse in a short amount of time. So there's a crap-ton of methods
out there. I highly recommend checking out the ann-benchmarks page. It's on GitHub, and there's a paper associated with it as well. Every so often this developer reruns all the latest and greatest implementations of these various nearest neighbor search methods across a wide variety of data sets for benchmarking. The typical benchmark used here is the trade-off between queries per second, so how fast we can look up items, and recall, which in this case is the fraction of the true nearest neighbors returned in our search. The general idea is that up and to the right is better, but you can see there's usually a trade-off: we can query our index for nearest neighbors very quickly, say in a large production system, but that comes at the expense of really low recall. Conversely, if we really want high recall, meaning we want our approximate methods to bring back the exact results, we typically pay for that in queries per second.

One algorithm to point out here, which I'll get into in a bit, is HNSW, or Hierarchical Navigable Small World. That's this line right here. It does fairly well, and again this is just one example, on the New York Times data set for k = 100. If you go to their site you'll see that it does fairly well across a wide variety of data sets and also at varying levels of k. So hence why that's one of the
algorithms I specifically picked to look at, using the implementation in Rogers for malware similarity. This method was created recently, in 2017, and builds on a lot of different algorithms in graph-based nearest neighbor search. The basic idea at a high level is to construct a multi-layer graph and use it to greedily identify candidates for comparison. As I alluded to in my overview slide, there's a phase where we construct the graph, then we query for candidates through this traversal mechanism, iteratively searching neighboring nodes until some stopping criterion is met. HNSW defines all of that: the stopping criteria and the way we build the graph.

Just to sketch this out: we build a graph consisting of multiple layers, so it's really multiple layers of graphs. We set different parameters for the algorithm to determine how deep a sample ends up in these layers, or I should say how shallow, since we start from the top layer and go down to the bottom layer. At query time, once we've constructed these layers of graphs, we start at some point here and navigate. In this case there's only one sample, so it searches the neighborhood, goes right to that sample, and reaches a local minimum because there are no other neighbors to look at; then it drops down to the next layer. This process continues until eventually we get to the final layer. It can also be tuned at query time to determine how deep into the layers of graphs we'd like to go. Ultimately, and this paper is the basis of the approach, this ensures that any samples we visit across all the layers are likely to be nearest neighbors, and that's what determines the number of candidates that we end up
querying, or rather end up doing a comparison against with our query. So that's the
graph-based method. There's also another really recent method that I actually caught at a NIPS workshop, typically a machine learning research conference, back last fall. It's really interesting. It's called Prioritized Dynamic Continuous Indexing, or PDCI, and there's an earlier iteration just called Dynamic Continuous Indexing. The authors designed an exact randomized algorithm, built around the idea of avoiding partitioning samples by vector space. Going back to the example with the tree-based methods, where we split up our feature space along each feature: that approach has a lot of issues. With PDCI, the authors noticed that what we can do instead is build indices by taking our samples and projecting them along a random direction. We can control the number of indices we build, which determines how well the method works. The idea is that we construct multiple indices, and the main gist is that as you visit the indices with your query, you land at the place where your query is projected, and you pop off the samples that are nearby, either larger or smaller. If a sample ends up appearing in all the indices, the paper shows it's highly likely to be one of the exact nearest neighbors, so you add it to the candidate set for comparison.

This is particularly interesting because some of the guarantees in the paper for this exact nearest neighbor search method are pretty compelling. Unfortunately, because it's an academic paper (the author is a very well-respected PhD student, I think at Berkeley), there's no open source implementation, so I went ahead and tried to do a naive implementation in Rogers. So that's PDCI. Again,
these two algorithms are the ones that
I focus on right now for this talk. As for other malware similarity systems, there are quite a few with different approaches to nearest neighbor search, or just similarity in general. Of course VirusTotal has different ways to index data, including ssdeep, but also a clustering API that, from my understanding of their docs, is based on feature hashing of the structural data pulled out by static feature extraction. This is actually where I sourced some of my data sets for evaluating these methods, which, as you'll soon see, was to my detriment, unfortunately.

Our very own Brian Wallace, one of the AI Village core team members, wrote a post in Virus Bulletin a few years back that exploits the way the ssdeep message digest is built to cut down the number of comparisons needed. More recently, you can take this idea and apply it to Elasticsearch as well, so you can use an off-the-shelf database with the same method for indexed ssdeep. That's using one of these similarity digest methods, which fall into the larger group of hashing-based nearest neighbor methods.

Then of course there are the popular academic implementations of malware similarity systems. BitShred, from 2011, is highly cited; it uses pairwise Jaccard similarity, and it uses Hadoop to do that. It's fully pairwise, so it's very expensive, hence the need for Hadoop. There's also the malware provenance system, a little more recent, which uses MinHash, an LSH (locality-sensitive hashing) family that approximates Jaccard similarity, applied across a sliding window of n-gram features on disassembled samples. And the two final ones here: there's
also Malheur, which focuses more on behavioral feature similarity comparisons, specifically for Cuckoo. There are a lot of different capabilities built in there for clustering and classification, but under the hood it uses behavioral features to do prototype selection: you identify prototypes in a large cluster, pretty much like the centroid of those points, and use them as a way to do comparison quickly. Then there's also SARVAM, which takes some computer vision ideas from indexing images: it takes a binary's raw bytes, converts them into a grayscale image, and then indexes that. So there's a
handful of systems out there, and a lot of algorithms to choose from with a lot of different properties, going back to that performance metric of queries per second versus recall. When approaching the design of a system to do malware similarity, and specifically to use that system to evaluate different nearest neighbor search techniques, I defined these four key design ideas.

Number one, of course, we have to extract and store sample metadata and raw features. We then have to transform that raw data into some feature representation, again in this n-dimensional feature space. We might have a variety of vectorization pipelines that we want to experiment with: talks earlier today mentioned a few, like TF-IDF, and I use feature hashing in my more recent approach. So we might want to swap out that vectorization pipeline depending on what we want to evaluate, what features we want to include, and how large our data set is. After we transform our features with one of these pipelines, we then fit the different nearest neighbor methods and do some bookkeeping, such as saving the database structures that are required; that's the fit stage. Finally, once we've fit all these indices, we want to actually query samples, picking the parameter k to determine how many nearest neighbors we pull back. And if we have this database of sample metadata, we might also want to display the contextual features: maybe we have a case database of previous incidents in our environment and we want to annotate the samples we pull back with some of that, which might help with analyst context.
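The four stages described above (extract, transform, fit, query) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not Rogers' actual code: the sample names, the raw feature strings, `hash_features`, and `BruteForceIndex` are all hypothetical, and the index is an exact brute-force baseline rather than one of the approximate methods from the talk.

```python
import hashlib
import math


def hash_features(tokens, n_buckets=1024):
    """Feature hashing: map string tokens into a fixed-length count vector."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # A stable hash is used because Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        vec[int.from_bytes(digest, "big") % n_buckets] += 1.0
    return vec


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


class BruteForceIndex:
    """Exact k-NN baseline: compares the query against every stored vector."""

    def fit(self, names, vectors):
        self.names, self.vectors = list(names), list(vectors)
        return self

    def query(self, vector, k=3):
        scored = sorted(
            (cosine_distance(vector, v), n) for n, v in zip(self.names, self.vectors)
        )
        return [(name, dist) for dist, name in scored[:k]]


# 1. Extract: hypothetical raw static features (imports and section names).
raw = {
    "a.exe": ["kernel32.CreateFileA", "kernel32.WriteFile", ".text", ".rsrc"],
    "b.exe": ["kernel32.CreateFileA", "kernel32.WriteFile", ".text", ".data"],
    "c.exe": ["ws2_32.connect", "ws2_32.send", ".upx0", ".upx1"],
}
# 2. Transform: vectorize each sample with feature hashing.
vectors = {name: hash_features(tokens) for name, tokens in raw.items()}
# 3. Fit: build the index.
index = BruteForceIndex().fit(vectors.keys(), vectors.values())
# 4. Query: pull back the k nearest neighbors of an unknown sample.
unknown = hash_features(["kernel32.CreateFileA", "kernel32.WriteFile", ".text"])
neighbors = index.query(unknown, k=2)
```

Swapping `BruteForceIndex` for an approximate index with the same `fit` and `query` methods is the kind of comparison the design is meant to allow.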
That really gives us the design I came up with for Rogers. Rogers is a Python 3 application that has a sample class; right now it's really only the PE class, which focuses on really basic static feature extraction using pefile, but it's built so that you can add more sample classes if you bring in things other than portable executables.

Then there are vectorizers. I use the scikit-learn pipeline API pretty much extensively. Right now I have latent semantic analysis, and an earlier talk used some of those ideas too, with TF-IDF and then projecting down. More recently, because I started getting into data sets that were a little too large to handle, I started looking at feature hashing approaches. This can be extended with anything supported by scikit-learn, or with other vectorizers as well.

The final component is the index class. Here I use libraries for HNSW, and I use LSHForest from scikit-learn, which is going to be deprecated soon anyway, and then I have my own implementations of indexed ssdeep and PDCI. At this point it uses SQLite as the store, at least for the indexed ssdeep and PDCI methods. All the feature data, I forgot to mention, is stored in a protobuf message definition whose structure lets you add different modalities of features: static features, dynamic features, contextual features. If you want, you can also give each one a variable type, in case you'd like to automatically build feature vectorizers later on, but that's not really supported yet in the vectorizer class.
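To make the idea of a hashing-based index class concrete, here is a toy locality-sensitive hashing index using random hyperplanes, a common LSH family for cosine similarity. It is a sketch under assumptions: the class name, its `fit` and `candidates` methods, and the sample vectors are invented for illustration, and this is neither the algorithm inside scikit-learn's LSHForest nor Rogers' index class.

```python
import random
from collections import defaultdict


class HyperplaneLSHIndex:
    """Toy LSH index for cosine similarity using random hyperplanes."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _code(self, planes, vec):
        # Each bit records which side of one hyperplane the vector falls on,
        # so vectors separated by a small angle tend to share the same code.
        return tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0.0 for plane in planes
        )

    def fit(self, names, vectors):
        for name, vec in zip(names, vectors):
            for planes, table in zip(self.planes, self.tables):
                table[self._code(planes, vec)].append(name)
        return self

    def candidates(self, vec):
        """Union of bucket-mates across all tables: the only samples that
        still need an exact distance comparison."""
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._code(planes, vec), []))
        return found


index = HyperplaneLSHIndex(dim=3).fit(
    ["a.exe", "b.exe"],
    [[1.0, 0.0, 2.0], [-1.0, 3.0, 0.0]],
)
# A query pointing in the same direction as a.exe always lands in its buckets.
cands = index.candidates([2.0, 0.0, 4.0])
```

More bits per table make buckets smaller (fewer candidates, lower recall), while more tables raise the chance that a true neighbor shares at least one bucket with the query, which is the same speed versus recall trade-off discussed earlier in the talk.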
Cool. So now, unfortunately, to the sad panda part of my talk. Doing these types of experiments and getting data sets for malware similarity is difficult, and unfortunately I didn't get as much time on these experiments as I wanted.

What we're looking at here is two charts. This chart is recall at K for exact nearest neighbors: on the x-axis we have the values of K that we picked for each of the experiments, and on the y-axis is the recall, again the fraction of the true nearest neighbors returned for the query. On this side we have precision at K for a neighborhood class. This is a similar kind of metric: we're looking at the relevant documents over K, or, sorry, relevant over the total number of relevant documents in our data set. In this case I'm just using the class of the samples: if I query a sample and all the results I get back are the same class, since I have labels for them, that indicates to me that we have high precision at K.

If you look at this, it does very well on precision for these samples, and I'll illustrate why in a sec. But the actual exact nearest neighbor results are pretty bad: we have 0.3 for PDCI, and HNSW is pretty low. I did parameter selection, some grid search over parameters for each of these methods, but I really couldn't get anywhere. Just to highlight this again, the data set comes from the VT clustering data set, with some 25,000 to 27,000 samples and fifteen classes.
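The two metrics on the charts can be written down compactly. The functions below are a minimal sketch of one common reading of them, assuming recall at K compares an approximate result list against the exact K nearest neighbors, and precision at K counts how many of the K returned neighbors share the query's class label. The function names and the example data are hypothetical and are not taken from Rogers.

```python
def recall_at_k(approx_neighbors, exact_neighbors, k):
    """Fraction of the true k nearest neighbors that the approximate method returned."""
    truth = set(exact_neighbors[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_neighbors[:k])) / len(truth)


def precision_at_k(neighbor_labels, query_label, k):
    """Fraction of the top-k returned neighbors that share the query's class label."""
    top = neighbor_labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label == query_label) / len(top)


# Hypothetical query: the exact index and the approximate index disagree on two hits.
exact = ["s1", "s2", "s3", "s4"]
approx = ["s1", "s9", "s3", "s7"]
recall = recall_at_k(approx, exact, k=4)  # 2 of 4 true neighbors found -> 0.5

# Class labels of the approximate hits; the query sample is labeled "familyA".
labels = ["familyA", "familyA", "familyB", "familyA"]
precision = precision_at_k(labels, "familyA", k=4)  # 3 of 4 share the label -> 0.75
```

The example also shows how the two numbers can diverge, as they do in the results above: neighbors can share a class label with the query without being the true nearest neighbors.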
Cool, so a quick demo to illustrate at least the interface. Rogers exposes a command line application, so you can use it on the CLI if you want, but I also have APIs exposed so that you can import rogers and then build an index. I use Plotly in this case just to visualize some of the results. I'll blow this up a little bit real quick. Like I said,
it's still probably hard to see in the back, but the idea is that I've previously fit an index, this one specifically HNSW, and I have a standard API for passing in a sample set, setting K, and getting back the neighbors. If we clear this and run it, we can see what we get back: here's the query sample (I'm just doing some print statements to make it easy to see what we're getting back), and here are the nearest neighbors. And look how similar these are; this is cosine similarity. The graph here just kind
of displays them and if we look at some
of the the kind of features here so here's the query sample you know we can see that this is a with Lehmer label and
we look at some of these other samples that the Reg Ron there's another one too
and the totally different so the the labels themselves are different and what
I kind of realized after getting into this is that given the fact that my feature space is limited to these static future extraction methods you know and given that you know we're really only looking at samples that are usually I think a lot of these are just like pack samples everything really looks identical so I think that kind of might explain why I had really bad results and kind of illustrates need for me to get
better datasets for evaluation here. So just for a little bit of a bonus: Xori was released at Black Hat, and because of the frustration I had with some of the feature extraction around the basic static stuff, I went ahead and implemented a Xori feature extraction class for PE. In this case I only got to pulling out a bag of words for the mnemonics, so this is just an example of this particular sample, running Xori on it. I was able to run this on like five hundred samples or so and build a vectorizer, and now we can rerun the same query with, I'm sorry, this is a different sample. We can now leverage the bag-of-words mnemonics in addition to the other static features that are pulled out, and it gives us slightly different results. Again, I haven't really done any formal comparison of these methods.

So this kind of wraps up stuff, I think, but the general idea here is that Rogers is a tool for, one, experimenting with different nearest neighbor search techniques, but it's also a tool to build out vectorizers for the different methods. Doing similarity in your environment might depend on your use cases: you might only want to do static comparison, you might only want to do dynamic, you may want to do both, you might want to apply an automated disassembler like Xori on it. Clearly, at least in the vectorizers that I published with this tool, it's very limited to just PE, and there's definitely opportunity for different modalities. There are also opportunities for doing feature selection and learning representations, to come up with a better feature space that could be used for similarity comparison. For experiments, in my case I did run some parameter optimization, but unfortunately I just need to get more data sets for doing benchmarks. I also think it would be interesting to evaluate different distance metrics, you know, beyond just Euclidean for some of this, and basically use that to determine similarity for some of these methods. And again, some of these methods, like HNSW, by default use cosine, but PDCI is Euclidean.

And finally, more use cases. In this case we're only indexing malware samples, but you could also potentially index benign samples as well. And of course the key is being able to continuously update the index with new samples as they come in and have been classified or analyzed, so doing a partial fit or an insert operation would be pretty easy to extend as well. Cool, so this point is for
questions. So again, this tool is up on GitHub. I do have a feature branch, I apologize, I've got to get it out, it's just been crazy the past week. That feature branch will add feature hashing, will add the PDCI, and basically publish the experiment. Again, I do feel that the experimental results were pretty weak, but maybe that could be explained just by the data set I was using. Certainly it would be interesting to experiment more. So yeah, and pull requests are welcome. Any questions at all? Sorry?
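[Editor's sketch] The bag-of-words vectorizer over mnemonics mentioned in the talk could look roughly like the following. This is not the Rogers or Xori API: the function names, the mnemonic lists, and the static feature values are all hypothetical, and the disassembly step that produces the mnemonics is assumed to have happened already.

```python
from collections import Counter


def fit_vocabulary(mnemonic_lists):
    """Map every mnemonic seen in the corpus to a fixed column index."""
    vocab = sorted({m for mnemonics in mnemonic_lists for m in mnemonics})
    return {m: i for i, m in enumerate(vocab)}


def bag_of_mnemonics(mnemonics, vocabulary):
    """Count how often each known mnemonic occurs; unseen mnemonics are ignored."""
    counts = Counter(mnemonics)
    vec = [0] * len(vocabulary)
    for mnemonic, count in counts.items():
        idx = vocabulary.get(mnemonic)
        if idx is not None:
            vec[idx] = count
    return vec


# Hypothetical disassembly output for two samples.
corpus = [["push", "mov", "call", "mov"], ["xor", "mov", "jmp"]]
vocabulary = fit_vocabulary(corpus)  # columns: call, jmp, mov, push, xor

# Hypothetical static features, concatenated with the mnemonic counts so that
# both feed the same feature vector that is passed to the index.
static_features = [0.5, 1.0]
combined = static_features + bag_of_mnemonics(corpus[0], vocabulary)
print(combined)
```

Fitting the vocabulary once on a corpus and reusing it at query time keeps the columns aligned between indexed samples and new queries.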
Show the vectorizer? Yeah, so I guess with Xori, because I was kind of experimenting with this right here. So this is a signature vectorizer, and this one is actually using the Yara rules repo, so I just used the Yara detections as the features themselves, and in this case applied TF-IDF to determine which signatures are more useful compared to others. Oh, so yeah, in this case there really isn't any feature selection other than applying TF-IDF and then projecting down. Any final questions?
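[Editor's sketch] Weighting signature hits with TF-IDF, as described in the answer above, can be illustrated as follows. The rule names are made up, the smoothed IDF formula `log((1 + n) / (1 + df)) + 1` is an assumption (the talk does not say which variant was used), and the "projecting down" step is not shown.

```python
import math


def fit_idf(samples):
    """Compute a smoothed inverse document frequency for each rule name."""
    n = len(samples)
    df = {}
    for rules in samples:
        for rule in set(rules):
            df[rule] = df.get(rule, 0) + 1
    return {rule: math.log((1 + n) / (1 + count)) + 1.0 for rule, count in df.items()}


def tfidf_vector(rules, idf, columns):
    """Return an L2-normalized TF-IDF vector over a fixed column order."""
    weights = {}
    for rule in rules:
        if rule in idf:  # rules never seen during fitting are ignored
            weights[rule] = weights.get(rule, 0.0) + idf[rule]
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return [weights.get(rule, 0.0) / norm if norm else 0.0 for rule in columns]


# Hypothetical rule hits per sample; these are not real rule names from any repo.
samples = [
    ["is_pe", "upx_packed"],
    ["is_pe", "antidebug"],
    ["is_pe", "upx_packed", "keylogger"],
]
idf = fit_idf(samples)
columns = sorted(idf)
vector = tfidf_vector(samples[2], idf, columns)
print(dict(zip(columns, vector)))
```

A rule that fires on every sample, like the hypothetical `is_pe` here, gets the lowest weight, which is the sense in which TF-IDF surfaces the more useful signatures.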
I'm sorry, I can't hear you at all, I apologize. I still can't hear you. Oh, so the metrics and the distance. So yeah, as I mentioned, HNSW uses cosine and the PDCI is using Euclidean, but the metric is just this recall at K. So basically I do exact nearest neighbor search on my dataset just using brute force, and I compute that for Euclidean, then I compute that for cosine, and then those are basically the ground truth that I use in this case for this recall at K for the nearest, sorry, for the exact neighbors.
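[Editor's sketch] The brute-force ground truth described above, computed once per distance metric, might look like this. The function names and the toy vectors are hypothetical; the point is only that the exact neighbor list depends on the metric, so each index should be scored against the ground truth for the metric it uses.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def exact_neighbors(query, vectors, k, distance):
    """Brute force: score every vector, sort by distance, keep the k closest ids."""
    order = sorted(range(len(vectors)), key=lambda i: distance(query, vectors[i]))
    return order[:k]


# Hypothetical toy data chosen so the two metrics disagree.
vectors = [[1.0, 0.0], [10.0, 1.0], [0.0, 1.0], [2.0, 2.0]]
query = [1.0, 0.1]
truth_euclidean = exact_neighbors(query, vectors, 2, euclidean)
truth_cosine = exact_neighbors(query, vectors, 2, cosine_distance)
print(truth_euclidean, truth_cosine)
```

In this toy case vector 1 points in almost exactly the same direction as the query but is far away in magnitude, so it is the nearest neighbor under cosine distance and not among the two nearest under Euclidean distance.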
Yeah, so I mentioned, you know, I don't do any feature learning, don't do any of that kind of stuff, and I also think that going beyond some of the basic static features would definitely, obviously, change the results significantly, so I hope to do that in the future. Oh sorry, did, yeah, right. Yeah, so just a little background. So the question is, you know, do more experiments, pretty much. The experiments are fun, sometimes they're sad, but you've always got to push forward to the next one. The other question was what did I learn doing this implementation. It was definitely a lot of fun to do the PDCI implementation. Again, it's really naive, in Python 3, using SQLite basically to store those indices. So it was great to have a chance to go through a paper that, again, has no published source code, or no published implementation for reference, and then try to do that. And then yeah, a lot of Python 3 kind of things I picked up doing this project as well. Yeah, there's not, so those algorithms don't do any feature engineering, they don't do any feature selection. You basically pass your feature vector into the index, so that's kind of a pre-processing step in this case, which I did at the vectorization phase. Cool, thanks everyone.