
AI VILLAGE - Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification


Formal Metadata

Title
AI VILLAGE - Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification
Series Title
Number of Parts
322
Author
License
CC Attribution 3.0 Unported:
You may use, modify, and reproduce the work or its contents in unchanged or modified form for any legal purpose, and distribute and make it publicly accessible, provided you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
The proliferation of ransomware has become a widespread problem culminating in numerous incidents that have affected users worldwide. Current ransomware detection approaches are limited in that they either take too long to determine if a process is truly malicious or tend to miss certain processes due to focusing solely on static analysis of executables. To address these shortcomings, we developed a machine learning model to classify forensic artifacts common to ransomware infections: ransom notes. Leveraging this model, we built a ransomware detection capability that is more efficient and effective than the status quo. I will highlight the limitations to current ransomware detection technologies and how that instigated our new approach, including our research design, data collection, high value features, and how we performed testing to ensure acceptable detection rates while being resilient to false positives. I will also be conducting a live demonstration with ransomware samples to demonstrate our technology’s effectiveness. Additionally, we will be releasing all related source code and our model to the public, which will enable users to generate and test their own models, as we hope to further push innovative research on effective ransomware detection capabilities.
Transcript: English (automatically generated)
Next up we have Mark Mager on "Stop and Step Away from the Data: Rapid Anomaly Detection via Ransom Note File Classification." We'd like to thank our sponsors Endgame, Cylance, Tinder, and Sophos. A reminder: if you could please sit down in the seats, we don't want to have a fire code violation. With that, enjoy the talk.

Morning, everybody. Just to get things going, a little bit about me: I'm not a data scientist, so take whatever I say up here on stage with a very big grain of salt, and please feel free to ridicule and embarrass me after the talk about the things that I get wrong. Anyways, I'm a senior malware researcher at Endgame. I typically do reverse engineering and sensor development, and for the past two and a half years, pretty much since I've been at Endgame, I've been doing ransomware protection research.

Just to get into the agenda: I'm going to provide a brief overview of ransomware and what the current detection methodology looks like, then ransom notes, and then I'm going to delve into some exploratory detection research I did. After that I'll discuss in depth the proof-of-concept framework that I came up with, wrap things up with a conclusion, and hopefully have a little bit of time for questions.

So, if you don't know about ransomware: basically, it's software written to deny users access to the data on their hosts. The most typical approach is encrypting individual files on the file system, and specific file extensions are targeted, so think of high-value documents like PDFs, text files, Word documents, or Excel spreadsheets, things of that nature. And so there are two typical types of output from ransomware:
the encrypted files that I was just alluding to, and the actual ransom note.

So, detection methodology can be broken down pretty simply into two areas right now. You have static detections, which are going to be signature-based, heuristics-based, or machine learning-based. The main benefit of this approach is that all data is preserved if the detection is successful, but the drawback is that you essentially have one chance to detect whether a binary is ransomware, or malware in general, or not. If we miss that, then all data on the host is going to be compromised.

As for dynamic detections, basically the way those work is that a process is going to be running in the background that monitors for any sort of anomalous behavior on the host. There can be a focus on detecting encrypted files in certain cases. Some approaches leverage canary files, which are files written to disk and spread out across different locations; if they're modified in particular ways, that can be a trigger for an alert. So the main benefit of dynamic detections is that, hypothetically, as a process executes there will always be an ongoing chance that it can be detected. It's not just one initial chance to detect, after which your host is lost; you should still be able to detect it later on. The drawback to dynamic approaches is that essentially you're sacrificing a large number of files in order to determine whether or not there's ransomware executing on the host. Maybe in certain cases it's easier to detect the anomalous behavior;
in other cases it might either not be possible or it might take a very long time.

So, how can we improve on the current state of the art? Probably the best approach is to combine the benefits of static and dynamic detections. In the ideal case, yes, you would detect everything with machine learning immediately and nothing would ever execute on the host, but that's not always the case; there are definitely false negatives. So you need a robust dynamic detection to serve as a fallback, and leveraging a layered security approach is probably the most recommended way to make sure you're covered for ransomware. Optimizing your machine learning models to specifically classify ransomware, as opposed to just malware generally, can prove very beneficial for this problem. And then, going back to dynamic detections, perhaps there might be a way to reduce the amount of time required to detect anomalous behavior.

So, getting into ransom notes, a little bit of background on this:
since I've been doing ransomware research for about two years or so, I've executed and detonated probably thousands of files manually in a virtualized environment and studied how the output typically looks. As sort of an aside, I was seeing ransom notes being written in multiple ways: multiple directories, multiple formats. What got the gears turning for the research I'm presenting today is that I started seeing a pattern in how the ransom notes looked, and so I wanted to explore that, to see if there was a way we could classify them and whether there was something that unites all of them and makes them easy to detect.

To back up a little bit: ransom notes are files that instruct users on how to pay the ransom. They come in multiple file types. The most typical format is .txt, plain text files, but you also see ones in formatted text formats such as HTML or RTF, and there are also images or even GUI-based ones, like little .NET programs. Ransom notes are going to be among the first files written to disk, and sometimes they're even written to every directory. Essentially, the adversary is trying to be as noisy as possible, in the hopes that they frustrate the users enough and get the point across that their data has been totally compromised, and that they have to pay the ransom to get their data back.
So we'll go through here and look at a few ransom notes to get a general idea of what I'm talking about. This one's from CryptoLocker. They lead off with just saying your files are encrypted, and then they point out that you don't have access to the decryption key, so you can't recover your files. They want you to email them, they provide a specific time window for how long the ransom offer will be valid, and then they even get into talking about the AES encryption that they're hypothetically using.

Going on to the next sample: it starts out pretty much the exact same way, "all your files have been encrypted," and then they say something similar: all your documents are encrypted, you can't recover them, please pay us 0.01 Bitcoin to a specific wallet ID, and they also provide an email address.

And finally, here's the actual image-based ransom note. If you pay particular attention, they were requesting 100 Bitcoin, which is approximately seven hundred fifty thousand dollars right now. I'm not exactly sure how successful they were with this ransomware campaign, but
it was at least pretty pricey.

As we saw from looking at even just three very disparate samples of ransom notes, you can see a template start to form. They typically lead off by saying something about your files having been encrypted, sometimes they provide a family name, and then they'll sometimes get into talking about the actual encryption that was implemented as part of the ransomware. They get the point across that files can't be recovered without a decrypter, which they'll provide only if the ransom is paid, and then they potentially provide an email address and a time window for the ransom.

So, as I said in my intro, I'm not a data scientist. After developing a better familiarity with data science concepts, I needed to collect a big enough corpus of ransom notes in order to do some training, and on the flip side, we needed to put together a nice, representative benign data set to go with the ransom notes. Part of the overarching goal of the exploratory research was to determine whether this approach could possibly work for classification.
As for tools, I pretty much just used Anaconda for everything, which came bundled with Python 3 and Jupyter Notebook, and I also used scikit-learn and spaCy.

Delving into the data sets a little bit: for benign data, I just ended up using the 20 Newsgroups data set, which probably most of you are familiar with; that's the list of the actual 20 newsgroups that are part of it. For the ransom notes, it's definitely a little tougher to put together a large collection. Not every ransomware family writes them out to disk, so going through and manually doing the research to figure out which families actually drop notes can be a little tedious. A lot of this involved manually detonating ransomware samples over a period of years, collecting the ransom notes, storing them off, and then digging them out for this project, but also searching through blog posts, Twitter, and things like that. I was able to collect enough samples that I had something representative in general. So, the actual approach that I took for the exploratory research is as follows:
We're just going to go with unlabeled data. We'll take the 20 Newsgroups data set, combine it with the ransom notes, and take a clustering approach using k-means, set to 21 clusters, run against the combined newsgroups data and ransom notes. We're hoping that as the 21 clusters settle down, each of the newsgroups will distinctly end up in its own cluster and the ransom notes will stick together in their own cluster.
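The clustering setup just described can be sketched with scikit-learn, the stack the talk uses. The corpus below is a tiny stand-in of my own; the real experiment uses the full 20 Newsgroups set plus 173 collected ransom notes, with n_clusters=21.

```python
# Sketch of the exploratory clustering: vectorize the combined corpus with
# TF-IDF, then cluster with k-means. A 4-document toy corpus and 2 clusters
# stand in here so the sketch is self-contained; the talk uses the full
# 20 Newsgroups set plus the ransom notes and n_clusters=21.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

benign = [
    "the baseball team won the game last night",
    "the baseball team lost the game in the ninth inning",
]
ransom_notes = [
    "all your files have been encrypted send bitcoin to recover them",
    "your documents are encrypted pay the ransom in bitcoin for the key",
]
docs = benign + ransom_notes

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(labels)  # benign docs share one label, ransom notes share the other
```

With the real data, the hope (borne out in the talk) is that the ransom notes land together in a single cluster despite being vastly outnumbered.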
To analyze the data a little more closely, we'll take a look at it using CountVectorizer and TF-IDF. Getting started, we do some very basic data prep before tokenization: we strip out newline characters, convert to lowercase, strip out null bytes, things like that, just to get the data to make a little more sense. Then, when we do the actual tokenization, we limit it to alphanumeric characters only, strip out any stop words that are in the default spaCy stop words list, and do lemmatization.

Here's a very quick overview of how the tokenization worked. For this example I took a very small excerpt from a ransom note and passed it in, and you can see how it breaks down to a very core set of words: file, encrypt, Bitcoin, payment. That pretty much is very descriptive of exactly what they're going for.
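A minimal, dependency-free sketch of the cleanup and tokenization just described. The talk uses spaCy's default stop word list and lemmatizer; the tiny stop word set and the crude trailing-"s" rule here are stand-ins for those.

```python
# Sketch of the pre-tokenization cleanup and tokenization steps: strip
# newlines and nulls, lowercase, keep alphanumeric tokens only, drop stop
# words, and crudely "lemmatize" (spaCy does the last two steps properly).
import re

STOP_WORDS = {"all", "your", "have", "been", "the", "to", "a", "in", "of", "them"}

def pre_tokenize(text: str) -> str:
    # Strip newlines and null bytes, lowercase everything.
    return text.replace("\n", " ").replace("\x00", "").lower()

def tokenize(text: str) -> list:
    # Alphanumeric tokens only, minus stop words, with a crude plural trim
    # standing in for real lemmatization.
    tokens = re.findall(r"[a-z0-9]+", pre_tokenize(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

note = "All your files have been ENCRYPTED.\nSend 0.01 Bitcoin to recover them."
print(tokenize(note))
```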
Now, I'm not sure how well you can see up there, but breaking down the most common features seen in the 173 ransom notes, we see a lot of the same sorts of words: things describing files and data being encrypted, Bitcoin of course, encrypt, decrypt, things of that nature. Even just looking through those words, you might be able to construct what the purpose of the text is without any sort of context. And when we break it out into bigrams, things make a little more sense, because you're working with phrasing: it's not just "files" in a vacuum, it's files being encrypted, files being decrypted, private keys, Bitcoin addresses. That gives you a little bit of an idea of what the data looks like. Then, when we apply TF-IDF, it looks pretty similar to what we're getting from the CountVectorizer, so that just gives you another view of the data. Now, I'm not sure how well you can see this here, but essentially the 21 clusters broke out quite nicely for us, and in cluster 3,
despite the ransom notes only consisting of 173 unique samples, versus the 11,000 messages in the 20 Newsgroups data set, the ransom notes all clustered together extremely well. What you see for that cluster is, I believe, its top 10 features, and it matches extremely well with what we just saw on the previous two slides. That is actually a good test for the data set, because if you look, the top entry for the newsgroups in the image to the right is sci.crypt, which was the encryption newsgroup at the time. And if you look at cluster 6, it might be a little tough to tell, but you can get an idea of how old the data is, because they're talking about Clipper chips, which were around the mid-'90s or so. Either way, distinguishing between newsgroup discussions about encryption and ransom notes that discuss encryption at a more high level is a good initial test of how strongly the data correlates.

Delving into how the clustering actually worked, we want to get under the hood by passing in some sample data. I took another ransom note and passed it to the k-means predictor, and if you break out the results, using the square root of the sum of the squares we can calculate the distance from the centroid of each of the clusters. In our case, that ransom note did end up in cluster 3, which is what we were hoping for. For a second example, we used something that more generically just talks about encryption but isn't specifically a ransom note. In this case it actually ended up being a closer match to cluster 4, which contains entries from comp.graphics.

So what did we learn from the exploratory research? Well, as I mentioned before, we have a small set of data, but the ransom notes do cluster together very well, and the second sample demonstrated that there is nuance in how the data gets clustered.
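The distance calculation described above (square root of the sum of squared differences to each centroid) can be written out directly. The 2-D centroids here are made up for illustration, standing in for the real centroids in TF-IDF space.

```python
# Euclidean distance from a sample's feature vector to each cluster
# centroid; the nearest centroid is the predicted cluster.
import math

def distance(v, centroid):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, centroid)))

def nearest_cluster(v, centroids):
    dists = [distance(v, c) for c in centroids]
    return min(range(len(dists)), key=dists.__getitem__), dists

# Hypothetical 2-D centroids standing in for TF-IDF-space centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
cluster, dists = nearest_cluster((0.9, 1.2), centroids)
print(cluster)  # index of the closest centroid
```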
From all of that, we learned that the data appears to be appropriate for classification, so we can actually go forward with an actual proof of concept.

For the POC framework, we have a few requirements. First and foremost, we need to obtain file change events in real time. We need to take the file paths that are being created and pass them to a model that we develop. From there, we read in the actual text data from the created file paths and pass the file contents along to the classifier to determine whether or not the data consists of a ransom note. And if it is a ransom note, we need a way to mitigate the process.

To reduce the problem space, we're going to put up a few restrictions here. We're going to stick to English only and .txt files only; as I mentioned, they're the most common ransom notes. That doesn't cover the entire world of ransom notes, but formatted text would require parsing, and for images we'd have to use OCR to extract the data, which would probably require a fair bit of cleanup beyond that. So at least for this research, I figured that was out of scope for what I was trying to accomplish. We're also going to stick to files that are less than 20 kilobytes. The reasoning for this is that ransom notes are generally pretty small; going back to the template I was discussing earlier, they're not really trying to get across too much. They're very utilitarian: hey, your files are encrypted, please send us a ransom, that's basically it. So reducing the problem space by keeping it to less than 20 kilobytes helps out with performance as well.
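The restrictions above boil down to a small pre-filter in front of the classifier. This is a sketch; the function name and the exact constant are mine, not from the talk.

```python
# Pre-filter matching the stated restrictions: only .txt files under
# 20 KB are ever read and passed to the classifier.
import os

MAX_SIZE = 20 * 1024  # ransom notes are generally small

def should_classify(path: str) -> bool:
    if not path.lower().endswith(".txt"):
        return False
    try:
        return os.path.getsize(path) < MAX_SIZE
    except OSError:  # the file may already have been moved or deleted
        return False
```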
We can break down the framework into a couple of components across two pretty distinct processes. We'll have a file change event listener that reads in the events and places them into a queue for a second process, which handles the text extraction and the actual classification of notes. And then, if we determine that there's a ransom note, a process mitigation handler will operate.

Here's a high-level diagram of how a typical infection scenario would play out with the framework in place. You'd have ransomware executing, and it drops a ransom note to the root of the C: drive. The event listener, which is polling for events at that time, sees a file creation event for the ransom note and passes that file path along to the text extractor and classifier, which reads in the contents of the ransom note and does the actual classification. Hopefully it returns yes, and then the mitigation handler suspends the ransomware process.
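The two-process design just described can be sketched with a queue between a listener and a classifier worker. Here `classify()` is a hypothetical keyword stand-in for the real model, and a fake in-memory file system stands in for the Sysmon event feed.

```python
# Listener pushes file paths onto a queue; a worker pops them, reads the
# text, classifies it, and records an alert (which would trigger the
# mitigation handler). classify() is a toy stand-in for the real model.
import queue
import threading

events = queue.Queue()
alerts = []

def classify(text: str) -> bool:
    text = text.lower()
    return "encrypted" in text and ("bitcoin" in text or "ransom" in text)

def worker(read_file):
    while True:
        path = events.get()
        if path is None:          # sentinel: shut down
            break
        if classify(read_file(path)):
            alerts.append(path)   # would invoke the mitigation handler

# Fake file system; a real listener would feed Sysmon file-create events.
fake_fs = {
    "C:\\README_DECRYPT.txt": "your files are encrypted, pay bitcoin",
    "C:\\notes.txt": "grocery list: milk, eggs",
}
t = threading.Thread(target=worker, args=(fake_fs.__getitem__,))
t.start()
for p in fake_fs:
    events.put(p)
events.put(None)
t.join()
print(alerts)
```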
For the POC framework, we wanted to build out a more representative data set. On the benign side, we'll still stick with the 20 Newsgroups, but we'll take a smaller slice of it instead of the overarching 11,000 messages, and to supplement that we'll leverage some Windows text files that I was able to scoop up: typically log files, readme files, any sort of installer logs, things along those lines. For the ransom notes, I did my best to collect as many more as I could, and I ended up finding a bunch on Pastebin and a few other sources, so that was great. But still, we're left with only 350 ransom notes compared to 11,000 benign messages. So for the classification approach here, we want to address the data set imbalance, which is quite large.
We can use SMOTE to generate synthetic data for us, and hopefully that can bridge the gap and make up for that pretty big imbalance. For the classifier, we're going to do feature selection via TF-IDF, and essentially what we have is a supervised learning problem. We're going to label the data this time, as either benign or ransom note, so we're breaking this down into a binary classification problem: does the text consist of a ransom note, or is it benign? For that, a Naive Bayes-based classifier is straightforward, and that's the approach we went with immediately; we'll delve into the results we ended up getting.

A very high-level overview of the data processing pipeline: we start with our labeled data set, and we pass that along to pre-tokenization, where we're stripping out characters and lowercasing, things along those lines. We do the actual tokenization, then sanitize the data a little by stripping out stop words and anything that's not alphanumeric, and do lemmatization. From there, we pass it along to the TF-IDF vectorizer to vectorize the data, use SMOTE to balance out the data sets, and then do the actual training with our Naive Bayes classifier.
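A sketch of that pipeline using scikit-learn. The talk balances classes with SMOTE (from imbalanced-learn); to keep this sketch dependency-light, plain duplication of the minority class stands in for SMOTE's synthetic samples, and the toy corpus is mine.

```python
# TF-IDF features, class rebalancing, then multinomial Naive Bayes.
# Minority-class duplication stands in for SMOTE's synthetic samples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

benign = ["meeting notes for tuesday", "baseball scores from last night",
          "installer log entries written to disk", "newsgroup post about space"]
ransom = ["all your files are encrypted pay bitcoin to decrypt"]

texts = benign + ransom
labels = [0] * len(benign) + [1] * len(ransom)

vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray()

# Rebalance: oversample the ransom-note class until the classes are even.
X_bal, y_bal = list(X), list(labels)
while y_bal.count(1) < y_bal.count(0):
    X_bal.append(X[-1])
    y_bal.append(1)

clf = MultinomialNB().fit(X_bal, y_bal)

test_vec = vec.transform(["your documents are encrypted send bitcoin"])
print(clf.predict(test_vec)[0])  # class 1 = ransom note
```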
For testing, we're splitting the data into an 80/20 split: 80% of the data is for training, while 20% is used for testing, and we use train_test_split from scikit-learn to handle that. Just to give a brief overview of the terminology involved, which is probably extremely well known to most of you: the accuracy score I'll be referring to here is the actual classification accuracy; the F1 score is the harmonic mean of precision and recall; and a confusion matrix is just a great way to represent true and false positive and negative rates. For our cross-validation, we're going to use a Monte Carlo approach, essentially running multiple runs, building new training and test data sets each time. In this case, we're testing the model's ability to generalize, to see whether this approach is flexible and doesn't overfit to the data.
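The evaluation terms above, computed from scratch on a toy prediction run. The Monte Carlo cross-validation then just repeats random train/test splits and averages these scores.

```python
# Accuracy, the confusion-matrix cells, and F1 (harmonic mean of
# precision and recall) for a toy run of predictions.
def confusion(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def scores(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0]   # one false positive, no false negatives
print(scores(y_true, y_pred))
```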
For a single test here, we actually ended up doing extremely well: accuracy over 99%, an F1 score of 0.91, and a confusion matrix with zero false negatives, which is great, and a few false positives, but nothing too crazy. That's encouraging, but does it scale? We need some cross-validation to determine whether that was just an outlier or a predictor of things to come. So we ran through cross-validation with ten separate runs, varying the training and test data, and that ended up with very similar results: accuracy was over 99%, the F1 score was over 0.9, and the confusion matrix looked about the same. So I think that indicates the approach we're taking to the problem is sound. There's also some graph data here to give you a better representation of what we're looking at; as I said, I'm not a data scientist, but it's good.

So, breaking things out into the other components in the framework. With the event listener, we need to monitor file change events.
We're looking at all processes active on a host, and we need a way to map each event to a specific process, focusing specifically on file creates in our case. There are a few approaches you can take to getting this data, including using Python watchdog, but as I said before, the most important things we need here are the type of file event, the process responsible for the particular event, and the file path. Python watchdog is based off of the ReadDirectoryChangesW Windows API, which I believe doesn't actually return any sort of source process data, so in our case that's not going to help. As alternatives, you could comb through event logs, or you could write your own file system minifilter driver. Both of those would work, but developing your own driver is going to take way too much work. So for our case here, what I ended up wanting to do was leverage something pre-built and
see if I could sift through event log data to get our file events in real time. In my case, I was able to leverage Sysmon. If you're not familiar with Sysmon, it's a tool used for monitoring event data on Windows, and there's a specific file create event, Event ID 11, that's perfect for our purposes. We don't have to worry about distinguishing between different types of file change events; we only have one type to deal with, which is great for us. It's a very simple configuration file that I came up with, and I posted it to the Git repository for this project. We're limiting things just to .txt files, as I previously mentioned, and trying to sift out other data so we're not crowding the event logs.
There is a registry key that you have to add in order to properly allow the event log to be queried in real time, so that's in place. Basically, what we're doing in this case is polling the event log using the WMI Query Language (WQL), pulling every 10 milliseconds to pick up new file change events in near real time. We need to limit the size of the result set we're getting, and we parse any results into something the classifier can work with. That's what the query essentially looks like.
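To make that concrete, here's a minimal sketch of pulling a Sysmon file-create record (Event ID 11) apart and mapping it back to a source process and file path. The XML layout matches what Sysmon emits; the WQL string and the 10 ms polling interval are illustrative stand-ins, not necessarily the exact query from the project:

```python
import xml.etree.ElementTree as ET

# Illustrative WQL query against the Sysmon operational log; the framework
# polls roughly every 10 ms and caps the result set to keep each poll cheap.
WQL = ("SELECT * FROM Win32_NTLogEvent "
       "WHERE Logfile = 'Microsoft-Windows-Sysmon/Operational' "
       "AND EventCode = 11")

# Default XML namespace used by Windows event records
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

def parse_file_create(event_xml):
    """Extract (image, pid, target path) from a Sysmon Event ID 11 record.
    Returns None for anything that isn't a .txt file create."""
    root = ET.fromstring(event_xml)
    data = {d.get("Name"): d.text
            for d in root.findall(".//e:EventData/e:Data", NS)}
    target = data.get("TargetFilename", "")
    if not target.lower().endswith(".txt"):
        return None  # the Sysmon config filters already, but stay defensive
    return data.get("Image"), int(data.get("ProcessId", 0)), target
```

Each tuple that comes back is exactly what the classifier stage needs: which process to suspend and which file to classify.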
Pretty self-explanatory there. As for the actual approach to process mitigation, it's very straightforward: all we need to determine is whether the process is currently active with that PID and process name. If it is active, we'll suspend it, and we need to alert the user that there was this activity on their host and give them a choice to terminate the process or resume it. All right, so we're going to try a live demo here. Let's see what happens.
Okay, so I have the framework here running in a single Python file. I have Process Monitor set up with a couple of filters. We're looking at volcano.exe (Volcano is a common ransomware family), and I renamed the executable to volcano.exe to make this simpler. We're just going to look strictly at WriteFile events for a process with that name. As you can see, there are no events at the moment. And here is my volcano.exe. I execute that and we get our pop-up. It provides us with the specific file path to the text file that it determined to be a ransom note, and it went ahead and suspended volcano.exe with that specific PID. And if we go back over here, in Process Explorer, we can verify that the process has been suspended. If we go through here, we can look at the progression of the ransomware as it writes files to disk. It looks like 5:40:22 was the first activity, and
around 5:40:25 is when the process was suspended, and there were no further events. So the detection time was within three seconds or so. But for our purposes, since we're only keying off of text files and not any other files, we can go through Process Monitor and sift through the data to look only at .txt files, rather than everything, to get a better idea of how long it took us to detect. And here what we can see is that there are multiple ransom notes written out to disk. As I mentioned before, ransomware is typically pretty noisy with how it distributes ransom notes on disk. So in this case, we actually have 22 copies of the same file being written out. Well, actually, I think we're only looking for key.txt, so it might even be less than that; I think some of those were actual files being directly encrypted. But that gives you an idea of just how noisy ransomware is. We still have that process suspended, so we can go ahead and click Terminate, and as we'll see here, the process is done.
Okay, so getting into some more testing that I did with the framework: I was able to test against nine samples that were essentially holdouts, because their ransom notes weren't part of our training or test data set. We were able to detect those nine specific samples from those families, as well as three samples I tested whose notes were already in the training set.
In order to get a better idea of how successful this approach is, I wanted to test against what's currently out there. For this case, we just wanted to test against anything that was free or trial-based; I didn't want to shell out any money for testing here. And we wanted to break it out into two different tests: does the product detect the sample, and if it does, we can run it side by side with the classifier framework that we just came up with and get a rough estimate of what the detection speed looks like. There are definitely potential complicating factors in that particular test case, because things like driver altitude can definitely affect how the two products run side by side, but it's just a way to get a rough idea of how the performance
compares to actual products that are currently available for download. The testing actually went extremely well. I've kept things very generic here; I don't want to call out any specific vendors or anything like that. In our case there was one specific product that did perform very well and was typically faster in detection than the classifier framework that I developed. That being said, in the detections where that product did perform better, the framework was still close in performance, lagging behind by only a couple of seconds or so. But surprisingly, there were two products that were very easily outperformed by the classifier framework. If you look at A1 and A2: while they detected pretty much all of the twelve samples that we saw (I think I was unable to run a test for one of the samples), they were outperformed nearly all the time by our framework. And that's actually pretty amazing considering
the sort of ad hoc approach we took, sifting through event logs for data and doing all of the classification at runtime, all in Python, essentially going head to head with something that's running native code and probably leveraging a minifilter driver to obtain its input. So it definitely validates the approach that we took. That being said, those results are great, but there are definitely limitations with this approach. There are plenty of ransomware samples that don't drop .txt notes.
Some don't even drop notes at all; some try to convey their ransom message just through a custom file extension that they apply to every single file. Some samples drop ransom notes much later in the game, after all the files have been encrypted. And then there are also samples that leverage some sort of persistence and typically respawn even if you suspend the process, terminate it, whatever. Yeah, we might be able to detect it, but it's just going to keep going over and over. And of course, there is ransomware that takes entirely different approaches to denying users access to their data: MBR modifications, anything involving raw disk writes, or just simple screen lockers. And of course, as we mentioned going in, we're sticking only to English. So for future work, we'd like to improve the data sets:
definitely more ransom notes, and less synthetic data would be nice as well, along with new ransom notes as new ransomware families come out and we're made aware of them. We'd also like to build out a more representative benign text data set: more log files, more installer files, things of that nature. If we could port our code base to a lower-level language, that would be great; it would lead to very significant performance improvements, and we'd be able to improve our detection time as well. It'd be nice to support other file types, or formatted text as I mentioned before, as well as images, using OCR to extract text. Spanish language support would be nice as well, along with experimenting with the actual approach and classification. So, to wrap things up:
Clustering gave us a good idea that the data was suitable for classification, and we saw that ransom notes do share enough features for a solution to be viable. Going into this, we did realize this isn't going to catch all ransomware, but it could be a very integral piece of a layered detection approach. So yeah, the proof of concept did work, but there are definitely many improvements to be made. All right, thank you very much.