AI VILLAGE - Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification

Video in TIB AV-Portal: AI VILLAGE - Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification

Formal Metadata

Title
AI VILLAGE - Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
The proliferation of ransomware has become a widespread problem culminating in numerous incidents that have affected users worldwide. Current ransomware detection approaches are limited in that they either take too long to determine if a process is truly malicious or tend to miss certain processes due to focusing solely on static analysis of executables. To address these shortcomings, we developed a machine learning model to classify forensic artifacts common to ransomware infections: ransom notes. Leveraging this model, we built a ransomware detection capability that is more efficient and effective than the status quo. I will highlight the limitations to current ransomware detection technologies and how that instigated our new approach, including our research design, data collection, high value features, and how we performed testing to ensure acceptable detection rates while being resilient to false positives. I will also be conducting a live demonstration with ransomware samples to demonstrate our technology’s effectiveness. Additionally, we will be releasing all related source code and our model to the public, which will enable users to generate and test their own models, as we hope to further push innovative research on effective ransomware detection capabilities.
Next up we have Mark Mager on "Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification." We'd like to thank our sponsors Endgame, Cylance, Tinder and Sophos. And a reminder: if you could, please sit down in the seats, since we don't want to have a fire code violation. With that, enjoy the talk.
All right, morning everybody. So, just to get things started, a little bit about me. I'm not a data scientist, so take whatever I say up here on stage with a very big grain of salt, and please feel free to ridicule and embarrass me after the talk about the things that I get wrong. But anyway, a little bit about me: I'm a senior malware researcher at Endgame. I do reverse engineering and [unclear] development, and for the past two and a half years, pretty much since I've been at Endgame, I've been doing ransomware protection research.
Just to get into the agenda: I'm going to provide a brief overview of ransomware, what current detection methodology looks like, and ransom notes. Then I'm going to delve into some exploratory research that I did on detection, discuss in depth the proof-of-concept framework I came up with, and then wrap things up with a conclusion and hopefully have a little bit of time for questions.
So if you don't know about ransomware, basically it's software that's written to deny users access to data on their hosts. The most typical approach is encrypting individual files on the file system, and the files are targeted by file extension, so think of high-value documents like PDFs, text files, Word documents, Excel spreadsheets, things of that nature. There are two typical types of output from ransomware: the encrypted files that I was just alluding to, and the actual ransom notes.
So detection methodology can be broken down pretty simply into two areas right now. You have static detections, which can be signature based, heuristics based or machine learning based. The main benefit of this approach is that all data is preserved if the detection is successful, but the drawback is that you essentially have one chance to detect whether a binary is ransomware or not, malware or not, and if you miss that, then all data is going to be compromised.
Then for dynamic detections, basically the way those work is that a process is going to be running in the background, monitoring for any sort of anomalous behavior on the host. There could be a focus on detecting encrypted files. In certain cases, some approaches leverage canary files, which are files that are written to disk and spread out in different locations, and if they're modified in particular ways, that can be a trigger for it. So the main benefit of dynamic detections is that, hypothetically, as a process is executing there will always be an ongoing
chance that it will be detected. So it's not just that there's one initial chance to detect and then you're hosed after that; you should still be able to detect it later on. The drawback to dynamic approaches is that essentially you're sacrificing a large number of files in order to determine whether or not there's ransomware executing on the host. In certain cases it may be easier to detect the anomalous behavior; in some cases that might not even be possible, or it might take a very long time.
So how can we improve on what the current state of the art is right now? Probably the best approach is to combine the benefits of static and dynamic detections. In the ideal case, yes, you would detect everything with machine learning immediately and nothing would ever execute on the host, but that's not always the case. There are definitely false negatives, so you need a robust dynamic detection to serve as a fallback. Leveraging a layered security approach is probably the most recommended way to make sure you cover ransomware. Optimizing your machine learning models to specifically classify ransomware, as opposed to just malware in general, can prove very beneficial for this problem. And then, to go back to dynamic detection, perhaps there might be a way to reduce the amount of time that's required for detecting anomalous behavior.
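The canary-file idea mentioned above can be sketched in a few lines. This is a minimal illustration and not any product's implementation: the function names `snapshot` and `tampered` are hypothetical, and hashing on demand is an assumption made for simplicity (a real dynamic detector would react to file system events rather than rescanning).

```python
import hashlib
from pathlib import Path


def snapshot(canaries):
    """Record a SHA-256 digest for each canary file path."""
    return {
        Path(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in canaries
    }


def tampered(baseline):
    """Return the canary paths that were deleted or whose contents changed."""
    changed = []
    for path, digest in baseline.items():
        if not path.exists():
            changed.append(path)
        elif hashlib.sha256(path.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return changed
```

The trade-off described in the talk still applies to a check like this: it only fires after something has already been modified on disk.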
Getting into ransom notes, a little bit of background on this. Since I've been doing ransomware research for about two years or so, I've executed, detonated, probably close to thousands of files manually in a virtualized environment and studied how the output typically looks. As sort of an aside, I was seeing ransom notes being written to disk in multiple ways, in multiple directories and formats. So what got the gears turning for the research that I'm presenting today is that I started seeing a pattern in how the ransom notes looked, and I wanted to explore that to see if there was a way that we could classify them, to see if there was something there that unites all of them and makes them detectable using AI.
So, just to go back a little bit: ransom notes are files that are written to disk and list the ransom payment. They come in multiple file types. The most typical format is TXT files, plain text files, but you also see ones that are in formatted text formats such as HTML and RTF, and there are also images or even GUI-based notes, like little [unclear] programs. Ransom notes are going to be among the first files that are written to disk, and sometimes they're written to every directory. Essentially the adversary is trying to be as noisy as possible in the hopes that they frustrate the users enough and get the point across that their data has been totally compromised and they have to pay the ransom to get their data back. So we'll go through here and
look at a few ransom notes to get a general idea of what I'm talking about. This one's from CryptoLocker. They lead off by just saying your files are encrypted, and then they talk about how you don't have access to the decryption key, so you can't recover your files. They want you to email them, and they're providing a specific time window for how long the ransom will be valid, essentially. And then they even get into talking about the AES encryption that they're hypothetically using.
Going on to the next sample, it pretty much starts out the exact same way: all your files have been encrypted. Then they say something similar, that all your documents are encrypted and, to recover them, please pay 0.01 Bitcoin to a specific wallet ID, and then they also provide emails.
And finally here's an actual image-based ransom note. If you'll pay particular attention, they were requesting 100 Bitcoin, which is approximately 750,000 dollars right now, so I'm not exactly sure how successful they were with this ransomware campaign, but it's at least pretty pricey. So as we saw from even just
looking at three very disparate samples of ransom notes, you can kind of see a template form. They typically lead off by saying something about your files having been encrypted. Sometimes they provide a family name, and then they'll sometimes get into talking about the actual encryption that was implemented as part of the ransomware. They get the point across that the files can't be recovered without a decryptor that they'll provide only if the payment is provided, and then they potentially provide an email address and a time window for when the ransom essentially will be valid. So, as
I previously said in my intro, I'm not a data scientist, so exploratory research for me in this case was first developing a better familiarity with data science concepts and the different tools I'd use. Moving on from there, I needed to collect a big enough corpus of ransom notes in order to do some training and things like that. We also need to put together a nice representative benign data set to go with the ransom notes. The overarching goal of the exploratory research is determining if this approach can possibly work for classification.
So the tools I'm using here: I pretty much used Anaconda for everything, which came bundled with Python 3 and Jupyter Notebook, and I also used scikit-learn and spaCy. So delving into the datasets a
little bit: for benign data I just ended up using the 20 Newsgroups dataset, which probably most of you are familiar with. That's a list of the actual 20 newsgroups that are part of it. Then for the ransom notes, it's definitely a little tougher to put together a large collection. Not every ransomware family writes them out to disk, so going through and manually doing the research and figuring out which families actually drop notes can be a little tedious. A lot of this involved manually detonating ransomware samples over a period of years, collecting the ransom notes, storing them off and then digging them out for this project, but also searching through blog posts, Twitter and things like that. I was able to collect enough samples to feel confident I had something that was representative in general.
So the actual approach that I was taking for the exploratory research is that we're just going to go with unlabeled data. We're going to take the 20 Newsgroups data set and combine it with the ransom notes. We will take a clustering approach using k-means and set it to 21 clusters; the reason for that is the 20 newsgroups in the data set plus the ransom notes. We're going to hope that with the 21 clusters, the way they settle down, each of the newsgroups will be in its own cluster and the ransom notes will stick together in another cluster. In order to analyze the data more closely, we'll take a look at it using a count vectorizer and TF-IDF.
So getting started, just to do some very basic data prep before tokenization, we're going to strip out newline characters, convert to lowercase, strip out null bytes, things like that, just to get the data starting to make a little bit more sense. Then when we do the actual tokenization, we're going to limit it to alphanumeric characters only, strip out any stop words that are in the default spaCy stop words list, and do lemmatization. So in a very quick example here, "encryption" would actually turn...
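The data prep and tokenization steps just described can be sketched with the standard library alone. This is a simplified stand-in, not the speaker's code: the talk used spaCy's default stop word list and lemmatization, whereas here `STOP_WORDS` is a tiny illustrative set, lemmatization is left out, and the names `clean` and `tokenize` are hypothetical.

```python
import re

# Tiny illustrative stop list (assumption); the talk used spaCy's default list.
STOP_WORDS = {
    "a", "all", "an", "and", "are", "been", "have", "in", "is",
    "of", "the", "to", "you", "your",
}


def clean(text):
    """Basic prep: drop null bytes, flatten newlines, convert to lowercase."""
    text = text.replace("\x00", " ")
    text = re.sub(r"[\r\n]+", " ", text)
    return text.lower()


def tokenize(text):
    """Keep ASCII alphanumeric tokens only and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", clean(text))
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `tokenize("All your FILES\nhave been\x00 encrypted!")` keeps only `files` and `encrypted`; a lemmatizer would then reduce tokens further to their base forms.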
So here's just a very quick overview of how the tokenization worked. For this example I took a very small blurb from a ransom note and passed it in, and you can see how it breaks down to a very core set of words there, like "encrypt", "bitcoin" and "payment". I mean, that pretty much is very
descriptive of exactly what they're going for. So now, not sure how well you can see up there, but breaking down the most common features that were seen in the 173 ransom notes, we see a lot of the same sort of words. We see things describing files, what data is being encrypted, and of course "encrypt" and "decrypt", things of that nature. Even just looking through those words, you might be able to reconstruct what the purpose of this is without having any sort of context. Then when you break it out to bigrams, things make a little more sense because you're working with phrasing. So it's not just "files" in a vacuum, it's "files encrypted", "private key", "bitcoin address", things like that, just to give you a little bit of an idea of what the data looks like. Then when we apply
TF-IDF, it looks pretty similar to what we're getting from the count vectorizer, so it just gives you another view of what the data looks like.
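To make the two views of the data concrete, here is a small standard-library sketch of n-gram counting and TF-IDF weighting over already tokenized documents. It is an illustration under stated assumptions: the talk used scikit-learn's vectorizers, whose defaults add smoothing and L2 normalization, while this sketch uses the plain textbook weight tf × (ln(N/df) + 1). The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_features(docs, n=1):
    """Corpus-wide n-gram counts (the column sums of a count matrix)."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts


def tfidf(docs):
    """Per-document weights: term frequency times (ln(N / df) + 1)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append(
            {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}
        )
    return weights
```

Calling `count_features(docs, n=2).most_common(10)` gives a bigram view like the one on the slide, where phrases such as "files encrypted" carry more context than single words.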
Now, not sure how well you can see this here, but essentially with the 21 clusters, they broke out quite nicely for us, actually. In cluster 3, despite the ransom notes only consisting of 173 unique samples versus the 11,000 messages that were in the 20 Newsgroups data set, the ransom notes all clustered together extremely well. The top 10 features in that cluster match extremely well with what we just saw in the previous two slides. And this actually is a good test for the data set, because if you'll see, the top entry for the newsgroup in the image to the right is sci.crypt, which was the encryption newsgroup at the time. If you look at cluster 6, it might be a little tough to tell, but you can get an idea of how old it is because they're talking about Clipper chips, which was around the mid-90s. But either way, distinguishing between newsgroup discussions around encryption versus ransom notes that do discuss encryption at a more high level is a good initial test of how strongly the data correlates. And so, delving into how, you
know, how the clusters actually worked, we want to get under the hood and pass in some sample data. So I took another ransom note and passed it to the k-means predictor, and if you break out the results for that, using the square root of the sum of the squares we can calculate the distance from the centroid for each of the clusters here. In our case, that ransom note did end up in cluster 3, which is what we were hoping for. For a second example, we used something that's more generically just talking about encryption but is not specifically a ransom note. In this case it actually ended up being a closer match to cluster 4, which is actually entries from computer graphics. So what did we learn from doing our
exploratory research? Well, as I mentioned before, we have a small set of data, but the ransom notes do cluster together very well. The second sample demonstrates that there is nuance in how the data was clustered, and from all that we learned that it appears the data is going to be appropriate for classification, so we can go forward with an actual proof of concept.
So for the POC framework we have a few requirements. First and foremost, we need to obtain the file change events in real time. We need to take the file paths that are being created and pass them to a model that we develop. From there, we read in the actual text data from the file that has been created and pass that along to the classifier to determine whether or not the data constitutes a ransom note. And then if it is a ransom note, we need a way to mitigate the process.
To reduce the problem space for this, we're going to put up a few restrictions. We're going to stick to English only and .txt files. As I mentioned, they're the most common ransom notes, but that doesn't cover the entire world of ransom notes. Formatted text is going to require parsing, and for images we'd have to use OCR to extract the data, and it will probably require further cleaning up beyond that, so at least for this research I figured that was out of scope. And then we're going to stick to files that are less than twenty kilobytes. The reasoning for this is that ransom notes are generally pretty small. Going back to the template I was discussing earlier, they're not really trying to get across too much; they're very utilitarian, just saying your files are encrypted, please send us a ransom, that's basically it. So reducing the problem space there, keeping to less than twenty kilobytes, helps out with performance as well. So we can break down the framework into a couple of
components and just two pretty distinct processes. We'll have a file change event listener, which is going to read in the events and place them into a queue for a second process, which will be the text extraction and the actual classification of notes. And then if we've determined that there's a ransom note, there will be a process mitigation handler that will operate. So here's a
kind of high-level diagram of how a typical infection scenario would play out with the framework on disk. You have ransomware executing, and it drops a ransom note to the C: drive. The event listener is going to be polling for events at that time; it will see the file creation event for the ransom note, and then it'll pass along that file path to the text extractor and classifier, which will read in the contents of the note and then do the actual classification, hopefully return yes, and then that results in the process being suspended. So for the POC framework we
wanted to build out a more representative dataset. For the benign side we'll still stick with 20 Newsgroups, but we'll take a smaller slice of it instead of the overarching 11,000, and then to supplement that we'll leverage some of the Windows text files that I was able to scoop up, so typically log files, readme files, any sort of installer logs, things along those lines. Then for ransom notes, I did my best to collect many more ransom notes. I ended up finding a bunch on Pastebin and a few other sources, so that was a great source, but still we're left with only 350 ransom notes compared to 11,000 benign messages.
So for the classification approach here, we want to address the data set imbalance, which is quite large. We can use SMOTE to generate synthetic data for us, and hopefully that can bridge the gap and make up for that pretty big imbalance. For the classifier, we're going to do feature selection via TF-IDF, and essentially what we have is a supervised learning problem. We're going to label the data this time as either benign or ransom note, and so we're breaking all this down into a binary classification problem: does the text constitute a ransom note, or is it benign? For us a naive Bayes classifier is straightforward, and that's the approach that we went for immediately, and we'll delve into the results.
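As a rough picture of what a naive Bayes text classifier does with the two labels, here is a from-scratch multinomial naive Bayes with Laplace smoothing. It is a teaching sketch, not the talk's model: the talk trained scikit-learn's classifier on TF-IDF features after SMOTE balancing, while this version works on raw token counts and skips balancing. The class name and the `"ransom"`/`"benign"` label strings are assumptions.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial naive Bayes over token lists, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.term_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.classes for t in self.term_counts[c]}
        return self

    def log_scores(self, tokens):
        """Log prior plus summed log likelihoods for each class."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            denom = sum(self.term_counts[c].values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[c][t] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)
```

With an imbalance like 350 notes against 11,000 benign messages, the log prior term alone pushes hard toward the benign class, which is one way to see why the talk balances the classes before training.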
For a high-level overview of the data processing pipeline here: we start with our labeled data set and pass that along to pre-tokenization, where we're stripping out characters, converting to lowercase, things along those lines. We do the actual tokenization, and then we get into sanitizing the data a little bit by stripping out stop words and things that aren't alphanumeric, and then do lemmatization. We pass that along to the TF-IDF vectorizer to vectorize the data, we use SMOTE to balance out the data sets, and then we do the actual training with our naive Bayes classifier. So for testing here we're
splitting the data into an 80/20 split: 80% of the data will be used for training while 20% will be used for testing. We use train_test_split from scikit-learn to handle that. Just to give a brief overview of the terminology involved, which is probably known to most of you: the accuracy score I'm referring to here is the actual accuracy classification score; the F1 score is an average, specifically the harmonic mean, of the precision and recall; and the confusion matrix is just a great way to represent true and false positive and negative rates. For our cross-validation we're going to use a Monte Carlo approach, essentially where we're doing multiple runs, building new training and test data sets each time. So in this case we're testing out the models we're building to see whether this approach is going to be flexible and not overfit to the data we're passing. So for our single
uh, one single test here, we actually ended up doing really well: accuracy over 99%, an F1 of .91, and you can see from the confusion matrix zero false negatives, which is great, and a few false positives, but nothing too crazy. So that's encouraging, but how does that scale? We need to do some cross-validation to determine whether that was just an outlier or whether it's a sign of things to
come. So we ran through cross-validation, ten separate runs varying the training and test data, and ended up with very similar results: accuracy was over 99%, the F1 score was over 90, and the confusion matrix looked about the same. So I think that vindicates the approach to the problem we're taking.
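The evaluation terms used above can be written out directly. The sketch below computes a confusion matrix, accuracy and F1 for a binary task, and generates repeated random 80/20 splits in the spirit of the Monte Carlo cross-validation described. It is an illustration only: the talk used scikit-learn for all of this, and the function names, the `"ransom"` label and the fixed seed are assumptions.

```python
import random


def confusion(y_true, y_pred, positive="ransom"):
    """Return (tp, fp, fn, tn) counts for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred, positive="ransom"):
    """Return (accuracy, f1); F1 is the harmonic mean of precision and recall."""
    tp, fp, fn, tn = confusion(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


def monte_carlo_splits(n_samples, runs=10, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]
```

Because the benign class dominates, accuracy alone can look very high even when some ransom notes are missed, which is why the F1 score and the false negative count are reported alongside it.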
There's some graph data here to provide you a better representation of what we're looking at. Like I said, I'm not a data scientist, but [unclear]. So
breaking things out into the other components in the framework: with the event listener, we need to monitor file change events. We're looking at all processes that are active on a host, we need a way to map each event to a specific process, and we focus specifically on file creates in our case. There are a few approaches that you can take to getting this data, including using Python watchdog. But as I said before, the most important things that we need here are the type of file event, the process that's responsible for the particular event, and the file path. Python watchdog in this case is based off of the ReadDirectoryChangesW API, a Windows API that I believe doesn't actually return any sort of source process data, so in our case that's not going to help. As alternative approaches, you could comb through event logs or you can write your own file system minifilter driver. Both of those would work, but developing your own driver is going to take way too much work.
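The three pieces of data the listener needs, together with the .txt and under-20-kilobyte restrictions stated earlier in the talk, can be captured in a small record and a filter. This is a hypothetical sketch: the `FileEvent` fields, the `"create"` event string and the `is_candidate` helper are illustrative names, not the released framework's API.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath

MAX_NOTE_BYTES = 20 * 1024  # "less than twenty kilobytes" restriction


@dataclass(frozen=True)
class FileEvent:
    """What the listener must capture for every file change event."""
    event_type: str    # e.g. "create"; only file creates are of interest
    process_id: int    # needed later to suspend the responsible process
    process_name: str
    file_path: str


def is_candidate(event, size_bytes):
    """True if the event should be handed to the text classifier."""
    suffix = PureWindowsPath(event.file_path).suffix.lower()
    return (
        event.event_type == "create"
        and suffix == ".txt"
        and size_bytes < MAX_NOTE_BYTES
    )
```

Keeping the process ID on the event is the point of the whole design discussion above: a source that reports only the path, as the speaker believes is the case for the directory-change API, leaves nothing to suspend.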
So for our case here, what I ended up wanting to do was leverage something that's pre-built and see if I could sift through event log data to get our file events. For my case I was able to leverage Sysmon. If you're not familiar with Sysmon, it's just a tool that's used for monitoring event data on Windows, and there's a specific file create event, event ID 11, that's perfect for our purposes. So we don't have to worry about distinguishing between different types of file change events; we have one type, which for us is file create. Here's a very simple configuration
file that I came up with and posted in the Git repository for this project. We're limiting things just to .txt files, as I previously mentioned, and just trying to sift out other data so we're not crowding the event logs. There's a
registry key that you have to add in order to properly allow the event log to be queried in real time, so that's there. So basically what we're trying to do
in this case is poll the event log. We're going to use the WMI Query Language (WQL), and we're essentially just going to be polling every 10 milliseconds in order to get updates on new file change events as they come in, in real time. So we need to limit the size of the result set that we're getting, and we parse any results we get; if something pops, we take action. That's what the query essentially looks like, pretty self-explanatory.
For the actual approach to process mitigation, it's very straightforward. All we need to determine is whether the process with that ID and process name is still active. If it's active, we suspend it, and we notify the user that something suspicious was found on their host, giving them the choice to terminate the process or resume it.
All right, so we're going to try a live demo here and see what happens. Okay, so I have the framework here running in a single Python file. I have Process Monitor set up
with a couple of filters. We're looking at volcano.exe; volcano is a common ransomware family, and I renamed the executable to volcano.exe to make this simpler. We're going to look strictly at WriteFile events for that process. So as you can see, there are no events at the moment. Here is my volcano.exe, and
I'll execute that, and we get our pop-up. It provides us with the specific file path to the text file that it determined to be a ransom note, and it went ahead and suspended volcano.exe with that specific PID. And if we go back to here
in Process Explorer, we can verify that that process has been suspended. And if
we go through here, we can look through the progression of the ransomware as it's writing out to disk. At 5:40:22 we get the first activity, and around 5:40:25 is when the process was suspended and there were no further events, so detection time was within three seconds or so. But for our purposes, since we're not keying off of any other files, only the text files, we can go into Process Monitor and filter the data to only look at text files, to get a better idea of how long it took for us to detect. [Music] So that's any path that ends with .txt. Here what we can see is that there are multiple ransom notes written to disk. As I mentioned before, ransomware is typically pretty noisy with how it distributes ransom notes on disk. In this case we actually have 22 of the same file written out. Well, actually, I think we're only looking for key.txt, so it might be even fewer than that; I think some of those were actual files that were being encrypted. But that gives you an idea of just how noisy ransomware is. So we still have that process suspended, and we can go ahead and click terminate, and as we'll see here, the process is gone.
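The polling loop and the suspend-then-ask mitigation step that drive this demo can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the query text is a guess at what a WQL query for Sysmon FileCreate events might look like (using the `Win32_NTLogEvent` class), restricting on `RecordNumber` is just one way to keep the result set small, and `run_query`, `is_ransom_note`, `controls`, `ask_user`, and the event dictionary keys are hypothetical stand-ins for the WMI call, the classifier, the process API, and the pop-up.

```python
import time

SYSMON_LOG = "Microsoft-Windows-Sysmon/Operational"
FILE_CREATE_EVENT_ID = 11  # Sysmon FileCreate


def build_query(last_record: int) -> str:
    """Assumed WQL: only FileCreate events newer than the last one seen."""
    return (
        "SELECT * FROM Win32_NTLogEvent "
        f"WHERE Logfile = '{SYSMON_LOG}' "
        f"AND EventCode = {FILE_CREATE_EVENT_ID} "
        f"AND RecordNumber > {last_record}"
    )


def poll(run_query, is_ransom_note, on_detect, interval=0.01, max_polls=None):
    """Poll every `interval` seconds (10 ms by default) and act on detections.

    run_query(query) is a hypothetical callable returning already-parsed
    events as dicts with "record", "pid" and "path" keys.
    """
    last_record = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        for event in run_query(build_query(last_record)):
            last_record = max(last_record, event["record"])
            if is_ransom_note(event["path"]):
                on_detect(event["pid"], event["path"])
        polls += 1
        time.sleep(interval)
    return last_record


def handle_detection(pid, note_path, controls, ask_user):
    """Suspend the process, then let the user choose terminate or resume.

    `controls` is any object with is_active/suspend/terminate/resume methods
    taking a PID; `ask_user` returns "terminate" or anything else to resume.
    """
    if not controls.is_active(pid):
        return "gone"
    controls.suspend(pid)
    if ask_user(pid, note_path) == "terminate":
        controls.terminate(pid)
        return "terminated"
    controls.resume(pid)
    return "resumed"
```

Passing the query runner, classifier, and process controls in as arguments keeps the loop testable without a Windows host; on a real system they would wrap the WMI query, the trained model, and the operating system's process suspend, resume, and terminate calls.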
Okay, so getting into some more testing that I did with the framework: I was able to test against nine samples that were essentially holdouts, because their ransom notes weren't part of our training or test dataset. We were able to detect those nine samples from those families, as well as three samples I tested that already had notes in the training set from earlier.
In order to get a better idea of how successful this approach is, I wanted to test against what's currently out there. For this we just wanted to test against anything free or trial-based; I didn't want to pay for any products to test here. We wanted to break it out into two different tests: does the product detect the sample, and if it does, we run it side by side with the classifier framework and get a rough estimate of what the detection speed looks like. There are definitely potential complicating factors in that particular test case, because things like driver altitude can affect how two products behave when running side by side. But it's a way to get a rough idea of how the performance compares to actual products that are currently available for download. And so the testing
went extremely well. I kept things generic; I don't want to call out any specific vendors or anything like that. In our case there was one specific product that did perform very well and was typically faster in detection than the classifier framework that I developed. That being said, for the detections where that product did perform better, the framework was still close in performance and didn't lag behind by much.
Surprisingly, there were two products that were very easily outperformed by the classifier framework. Even if you look at one of those two: while it detects pretty much all of the time for the 12 samples (I think I was only unable to run a test for one of the samples), it was outperformed nearly all the time by our framework. That's actually pretty amazing considering the sort of ad hoc approach we took, sifting through event logs for data and then doing all of that classification at runtime, all in Python. We're essentially going head-to-head with something that's running native code and probably leveraging a minifilter driver to obtain its input. So that definitely validates the approach. So that being said, you know,
those results are great, but there are definitely limitations to this approach. There are plenty of ransomware samples that don't drop TXT files. Some don't even drop notes at all; some try to convey their ransom message just in a custom file extension that they apply to every single file. Some samples drop ransom notes much later in the game, after all the files are encrypted. Then there are also samples that leverage some sort of persistence and typically respawn even if you suspend the process or terminate it, so we might be able to detect it, but it's going to just keep going over and over. And of course there is ransomware that takes different approaches to denying users access to their data: MBR modifications, anything targeting raw disks, or just simple [inaudible].
For future work, we'd like to improve the datasets. Definitely more ransom notes in the dataset would be nice, as well as new ransom notes as soon as they appear anywhere. We'd also like to build out a more representative benign text dataset: log files, more installer files, things of that nature. If we could port our code base to a lower-level language, that would be great; it should lead to very significant performance improvements, and we'd be able to improve our detection time as well. It'd be nice to support other file types, formatted text as I mentioned before, as well as images, using OCR to extract text. Multi-language support would be nice as well, as well as experimenting with the actual approach to classification.
So to wrap things up: clustering gave us a good idea of the data being suitable for classification, so ransom notes do share enough features for classification to be viable. Going into this, we do realize this isn't going to catch all ransomware, but it could be a very integral piece of a layered detection approach, with a static classifier as well. So yeah, the proof of concept did work, but
there are definitely limitations.
All right, thank you very much. [Applause]