AI Village - Fighting Malware with Deep Learning video

Formal Metadata

Title
AI Village - Fighting Malware with Deep Learning video
Series title
Number of parts
335
Author
License
CC Attribution 3.0 Unported:
You may use, modify, reproduce, distribute, and make publicly available the work or its content, in unchanged or modified form, for any legal purpose, provided that you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Release year
Language

Content Metadata

Subject area
Genre
Abstract
Presentations from DEF CON 27 AI Village
Transcript: English (automatically generated)
Hello everybody. My name is Angelo and I'm here to talk a little bit about fighting malware with deep learning. I will show you how you can use deep learning to build a binary classifier. I'd like to thank you for coming, and thank you to the DEF CON AI Village. This is my first DEF CON, so I'm a little bit lost here, but it's been amazing. I'd also like to thank my company, TOTVS, for supporting me to come here.
All right. So who am I? I work as an ethical hacker at TOTVS. It's the largest software company in Latin America. We build ERP. I'm a part-time data scientist, basically self-made, and I'm a part-time PhD student. I hope my supervisor won't listen to that. I'm interested in deep learning and data science for malware detection and classification.
Okay. So first I want to share some experiences I had. I was not a researcher. I got a degree in mathematics, but then I started working with computers, programming, and so on, so I was not actually doing research. Then I wanted to do some research and get into this machine learning, deep learning, AI thing, and I didn't know how. So I started studying machine learning (there are online resources and books), deep learning, and malware analysis. I already knew something about malware analysis, but I learned more.
Since you want to do research, you need to start studying the state-of-the-art papers to see what's going on, what's at the cutting edge. So I started to read these papers, and I got stuck when I tried to reproduce them, because they are very theoretical and sometimes you don't have enough information to reproduce them. This is a very big problem in machine learning papers, and for malware analysis in particular, because the authors don't tell you their hyperparameters and they don't share data. So it's quite hard to reproduce the papers.
But sometimes it's also because you just don't know the basics. You haven't been through the basics; you didn't code something simpler to see it working. So you need to do that.
What he's saying here is that there is a difference between knowing the path and walking the path. What does that mean in this context? It means that I wanted to do research, but I hadn't done anything yet. I wanted to take a big step, and that's quite hard to do. So you need to step back and fill that gap. How do you do that? This is very important for machine learning, not only for malware research but for any kind of research: you need to build your own baseline models from scratch. By "from scratch" I don't mean that you need to implement backpropagation yourself. You can, but you don't need to. You can use a high-level deep learning framework to do that for you. That will be your starting point for doing research.
The second thing is that you want to build simple, working models using the techniques you want to master and research, because you need to learn them very well if you want to contribute something new. And the third point, which is kind of strange, is that you need to not contribute first to be able to contribute later. First you need to learn the basics. At that point you are not contributing; you are contributing to yourself. Then you can try to contribute to the scientific community or whatever. So always be humble. You don't know everything.
Okay. So we are data scientists, right? In our case, we need malware, a lot of malware, meaning we need instances, we need data to feed our models. This one is my favorite repository. It's called VirusShare. There you can find, I checked yesterday, almost 34 million instances of malware to download. You can download huge files, 60 gigabytes each with 60,000 instances in them, and you can play with them. One good thing is that they are already labeled using VirusTotal.
So as you can see, there are several layers. The first layers are low-level features and the deeper layers are high-level features. What we are interested in is getting the data from these data sets, which I will show you, and then trying to figure out whether the samples are malicious or not, because we are trying to detect malware. Then we also have recurrent neural networks. They are good for sequential modeling. If you've got sequential data varying in time or space, something like that, you can use recurrent neural networks. They are good for dynamic analysis.
So we need to build our machine learning, deep learning, data science pipeline. This is actual, real information: it took me about one month to download the malware and goodware. Then I set up a hypervisor environment with nine virtual machines running around the clock for four months to run this malware and get the dynamic information. Then we need to do the data science basics: we need to pre-process our data, we need to clean the data, and we need to do some feature engineering to get the best features for our problem. In this case, the feature engineering was minimal; I did almost none. And then we are ready to play with the models.
So you use a high-level framework and you can start building the models. I have three models to show you. I will share this code, the Jupyter notebooks, and I will share the data sets. There are several data sets. So don't worry if you can't catch something, because I need to be very quick.
Now I just need some help here, because I need to put the code up there and I can't see the code from here because my screen is extended. Can someone from the AI Village help me, please? Yes, but then I can't see the code from here when I drag it. Please. So what I need is to open the code here and project it there, but the screen is extended.
I won't have time to explain the code step by step, but basically what I'm showing here is an example of each of the models I showed before. This one is a multi-layer perceptron for static analysis data. Basically, what we do first is import all the necessary libraries, Python libraries. They're amazing. So here's our data set.
I'm getting data from the PE sections. The PE sections basically describe the sections in the executable file. One piece of information here that's very important is the entropy. Usually, when the entropy is high, it means that the code is either encrypted or packed. Malware authors usually like packing the content of the file to avoid antivirus detection. But that's not enough: the entropy can be very high and the file still not be malware, so we can't rely on only one feature. And there is a column here saying whether it is malware or not. So this is what our data looks like, and this is tabular data: it's in a table. Here I do some correlation analysis to figure out the correlation among the features. You see that some features are very highly correlated, so we can drop them. You don't need duplicated data.
Then what I'm doing here is opening another data set called PE imports. When you create a program, you need to import functions from DLLs and so on, and maybe the machine learning or deep learning algorithm can spot some patterns that are used in malware. What we have in this data set is the 1,000 most important imported functions, in reverse order. So for example, this malware imported the GetProcAddress function, and GetProcAddress is the most important function. That is what it means. You see that there are a lot of features here and they are categorical. Categorical features are a little bit harder to deal with because you can't just play with them. We're not playing with numbers; we're playing with categories. At the end we again have the malware-or-not column. This data set has 47,000 entries.
So we remove duplicates, then we merge the files, and now we have the data set ready. I'll jump ahead a little bit here. We convert these data sets from pandas DataFrames to NumPy arrays to feed the models. Then there is another problem: we need to check the imbalance, and this is highly imbalanced. See, 24 to 1. We have 24 malware instances for each goodware instance. That's a problem. So we need to make the train/test split stratified, meaning that after performing the split, we need to keep the proportions.
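A minimal sketch of what "stratified" means here, using only the Python standard library. The function name `stratified_split`, the 20% test fraction, and the toy label list are illustrative assumptions, not taken from the speaker's notebook; a library routine such as scikit-learn's `train_test_split` with its `stratify` argument does the same job in practice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share of the data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with the 24-to-1 imbalance mentioned in the talk
# (1 = malware, 0 = goodware); the sizes are made up.
labels = [1] * 2400 + [0] * 100
train_idx, test_idx = stratified_split(labels)


def malware_ratio(indices):
    """Malware instances per goodware instance within the given indices."""
    malware = sum(labels[i] for i in indices)
    return malware / (len(indices) - malware)


print(malware_ratio(train_idx), malware_ratio(test_idx))
```

Both splits keep the 24-to-1 ratio, which a purely random split of a small goodware class would not guarantee.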
Okay, and basically here I'm standardizing the data. I only standardize the columns that are numeric. See, I'm not touching the categorical data. Okay, so let's jump to the model.
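The idea of standardizing only the numeric columns can be sketched as follows. This is an illustration under assumptions: the helper name `standardize_columns`, the column layout, and the toy rows are hypothetical, and the notebook itself may well use a library scaler instead.

```python
from statistics import mean, pstdev


def standardize_columns(rows, numeric_columns):
    """Scale the given column indices to zero mean and unit variance."""
    scaled = [list(row) for row in rows]
    for col in numeric_columns:
        values = [row[col] for row in rows]
        mu, sigma = mean(values), pstdev(values)
        for row in scaled:
            # A constant column has sigma == 0; map it to 0.0 instead of dividing.
            row[col] = (row[col] - mu) / sigma if sigma else 0.0
    return scaled


# Hypothetical rows: [section entropy, section size, imported function name].
rows = [
    [7.9, 4096, "GetProcAddress"],
    [6.1, 512, "LoadLibraryA"],
    [2.3, 1024, "ExitProcess"],
]
scaled = standardize_columns(rows, numeric_columns=[0, 1])
print(scaled)
```

The categorical third column passes through untouched. In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused for the test split.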
Ah, here there is something interesting. We need to deal with the imbalance. The data set is imbalanced, so we can try to use an oversampling technique. For example, SMOTE won't work here because it can't deal with categorical features. There is a variation called SMOTE-NC (nominal and continuous), but it was taking just too much time to run, so I gave up for now. Then we can use a random oversampler. A random oversampler is bad because it just duplicates data, but in our case it helped. After running that, we can see here that the data set is now balanced.
Then I perform some visualization using t-SNE. t-SNE is a pretty awesome algorithm for visualizing high-dimensional data. Basically, it projects high-dimensional data onto a plane or into a space, so you can get an idea of how the data looks in that high-dimensional space. You can also get an idea of the decision boundaries your model will need to learn. And as you can see, this one is complicated. Red is malware and green is goodware. We need to separate them. So, deep learning.
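The random oversampling step mentioned above, which "just duplicates data", can be sketched in a few lines. The function below is a hand-written illustration, not the speaker's code; the name used in the talk suggests the `RandomOverSampler` class from the imbalanced-learn package, but that is an assumption.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy data with a 24-to-1 imbalance (1 = malware, 0 = goodware).
samples = [f"sample_{i}" for i in range(25)]
labels = [1] * 24 + [0]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Because the new rows are exact copies, oversampling is usually applied to the training split only, so that duplicates of a training sample cannot end up in the test set.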
So here's a deep learning model, a multilayer perceptron. As you can see, the model is quite simple. This is not research. This is just that homework: I said in the beginning that I need to understand how to build a simple model before trying to build something more complex or new. So this is just a model with two dense layers. After we create the model, we get the summary here: 146,000 parameters. It's not really too much, right?
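To see where a parameter count of that order comes from, here is a small sketch of how dense-layer parameters add up. The layer sizes are hypothetical, since the transcript does not state them, and the example does not try to reproduce the 146,000 figure (batch normalization layers, for instance, would add parameters of their own).

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def mlp_params(layer_sizes):
    """Total parameters of an MLP given [inputs, hidden..., outputs]."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical shape: 1,000 input features, two dense layers of 128 and 64
# units, and one sigmoid output unit for the malware/goodware decision.
print(mlp_params([1000, 128, 64, 1]))
```

Most of the parameters sit in the first layer, because that is where the wide input meets the first hidden layer.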
Here's the model. We can get the architecture of the model with just one command. Now, here there is an algorithm that I would like to explain in more depth, but I can't because I don't have time. This is model selection. Basically, I'm automating the model selection here. I'm taking those hyperparameters and testing each of their combinations, performing three-fold cross-validation to see which combination is better. After that, we train the model for some time. In this case it didn't take too long. I trained at home. I have an NVIDIA RTX 2080 Ti, so about 4,300 CUDA cores. It's a decent card. So 10,000 seconds, roughly three hours.
And here, it's important to evaluate the model using the test set. Okay, here we have the results from the model selection. This is the best combination: a dropout rate of 0.6; the first model architecture, which is dropout first and then batch normalization; and the lowest number of neurons. There is a holy war about the ordering. Nobody knows whether you should use batch norm first and then dropout, or vice versa, or neither of them, or both in any order. Actually, it depends on the nature of your data, so that's why you need to test it.
After that, we perform an evaluation. For the evaluation, we have a thing called a confusion matrix, which shows the true negatives, false positives, false negatives, and true positives. First, what we do is create a benchmark. Suppose that you have a model that predicts that every example is malware. What happens? These are the numbers you get. You see, since the data set is imbalanced, you get pretty good numbers here for accuracy, precision, recall, and F1 score. This is misleading. So you need to use a better metric for an imbalanced data set.
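The effect of the always-malware benchmark can be reproduced with a few lines of arithmetic. The sketch below uses a made-up test set with the 24-to-1 ratio from the talk (the sizes are assumptions, not the real data) and adds balanced accuracy, the mean of the per-class recalls, as one metric that is not fooled by the imbalance.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = malware."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Mean of the recall on malware and the recall on goodware.
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical test set: 2,400 malware and 100 goodware samples.
y_true = [1] * 2400 + [0] * 100
y_pred = [1] * len(y_true)  # the benchmark: everything is malware
benchmark = metrics(y_true, y_pred)
for name, value in benchmark.items():
    print(f"{name}: {value:.3f}")
```

Accuracy, precision, recall, and F1 all come out at 0.96 or higher, while balanced accuracy is 0.5, the value a model gets when it ignores one class completely.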
The best metric, as far as I know, is balanced accuracy. And balanced accuracy is showing here that basically you have predicted every example as malware. So it's very bad. This is the worst-case scenario. Then, when we apply our model to the test set, we get this result, which is much better. So as you can see, even using a multi-layer perceptron, which is a very simple deep neural network, we can already get very good results. Now you can imagine those very, very deep models with inception layers and so on. You can do much better.
Okay, now the next example. I'll just show you the model. This is interesting because what I have done here is treat the binary data as if it were an image. Take a look at that. This is what we get. These are malware images. I take the binary, treat each byte as a grayscale pixel, and then scale it. So we get this information. Then we feed it to a convolutional neural network, because convolutional neural networks are specialists in dealing with images. It's the same problem: imbalance and so on. And the visualization looks similar to what we have seen before. A pretty complicated decision boundary.
And then deep learning. This is our convolutional model. As you can see, it's very, very simple. We have a convolutional layer, and then there is a model selection here. It's about whether we should apply max pooling first and then batch normalization, or batch normalization first and then max pooling. This is also a holy war. Nobody knows what's better. But what you can do is perform model selection with cross-validation, and then you see which is better for your kind of data. So basically we have a convolutional layer and then max pooling, then another convolutional layer and max pooling, and then we flatten the output to feed a fully connected layer for classification, in our case binary classification.
Okay. So we train this for some time. I think I've still got some five minutes. I want to show you how much time I needed to train. Here. So 15,000 seconds, about four hours. Not too much, really. And the results. So cross-validation and results.
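The model selection used for both models (try every hyperparameter combination, score each with three-fold cross-validation, keep the best) follows a pattern that can be sketched without any deep learning library. Everything named below is hypothetical: the grid values, the option names, and in particular `toy_score`, which is a made-up stand-in for "train a model on the training folds and score it on the validation fold" and does not represent any measured result.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def select_model(grid, n_samples, train_and_score, k=3):
    """Return (params, score) for the combination with the best mean fold score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(
            train_and_score(params, train_idx, val_idx)
            for train_idx, val_idx in k_fold_indices(n_samples, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space echoing the kinds of choices discussed in the talk.
grid = {
    "dropout_rate": [0.2, 0.4, 0.6],
    "block_order": ["maxpool_then_batchnorm", "batchnorm_then_maxpool"],
}


def toy_score(params, train_idx, val_idx):
    """Made-up scoring rule so the example runs; not a real training result."""
    bonus = 0.05 if params["block_order"] == "maxpool_then_batchnorm" else 0.0
    return 0.8 + 0.1 * params["dropout_rate"] + bonus


best_params, best_score = select_model(grid, n_samples=30, train_and_score=toy_score)
print(best_params, round(best_score, 3))
```

In a real run the folds would normally be shuffled and stratified, and `train_and_score` would fit the network, which is why the search takes hours: the training cost is multiplied by the number of combinations and by the number of folds.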
Evaluation. Okay. So the benchmark is also pretty good on accuracy, precision, recall, and F1 score, because they don't deal very well with imbalanced data sets. Its balanced accuracy is the worst possible. Then, when we use our model to predict on the test set, we get this. So, not as good, but there are some explanations here. I think the main explanation is that malware nowadays is usually packed, so the content is encrypted or obfuscated. What you are seeing here is mostly random noise, so you can't expect the algorithm to do much better. But still, it's a pretty good result.
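Two ideas from this part of the talk can be made concrete together: treating each byte as a grayscale pixel, and the observation that packed or encrypted content looks like random noise, which shows up as high entropy. The sketch below is an illustration under assumptions: the row width of 64, the function names, and both byte strings are invented, with `os.urandom` standing in for packed content, and the image rescaling step the speaker mentions is left out.

```python
import math
import os
from collections import Counter


def bytes_to_rows(data, width=64):
    """Treat each byte as one grayscale pixel (0-255) and cut the stream into rows."""
    usable = len(data) - len(data) % width  # drop the incomplete last row
    return [list(data[i:i + width]) for i in range(0, usable, width)]


def shannon_entropy(data):
    """Entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return sum(c / total * math.log2(total / c) for c in Counter(data).values())


plain = b"MOV EAX, EBX; " * 512       # repetitive stand-in for unpacked content
packed_like = os.urandom(len(plain))  # random bytes stand in for packed content

image = bytes_to_rows(packed_like)
print(len(image), "rows of", len(image[0]), "pixels")
print("plain entropy:", round(shannon_entropy(plain), 2))
print("packed-like entropy:", round(shannon_entropy(packed_like), 2))
```

The repetitive input has low entropy and would show visible structure as an image, while the random input sits close to the 8-bit maximum and leaves a convolutional network little structure to learn from.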
You could, for example, add a step to your pre-processing pipeline to unpack the samples. But this is quite complicated because each malware uses a different technique and a different key. Sometimes it gets the key online, and so on. So it's complicated.
And finally, dynamic analysis. For dynamic analysis, I'll show you a network called long short-term memory (LSTM). It's pretty good at dealing with sequential data. All the software you see for speech recognition and automatic translation uses this kind of network. Here I'm getting the sequence of API calls. See, T0, T1. These are the calls the malware makes to the Windows API. I took just 1,024.
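A sketch of how such call sequences are commonly prepared for a sequence model: each API name is mapped to an integer id, and every trace is cut or padded to a fixed length (1,024 here, matching the number of time steps T0, T1, ... mentioned above). The helper names and the two traces are hypothetical; only the API names are real Windows functions.

```python
def build_vocabulary(sequences):
    """Map each distinct API name to an integer id; 0 is reserved for padding."""
    vocab = {}
    for seq in sequences:
        for call in seq:
            vocab.setdefault(call, len(vocab) + 1)
    return vocab


def encode(seq, vocab, max_len=1024):
    """Keep the first max_len calls, map them to ids, and right-pad with 0."""
    ids = [vocab[call] for call in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))


# Made-up traces for illustration.
traces = [
    ["LoadLibraryA", "GetProcAddress", "VirtualAlloc", "WriteProcessMemory"],
    ["CreateFileW", "ReadFile", "CloseHandle"],
]
vocab = build_vocabulary(traces)
encoded = [encode(trace, vocab) for trace in traces]
print(encoded[0][:6])
```

Every encoded trace has the same length, so the traces can be stacked into one array and fed to the network in batches.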
Then basically I do the job of cleaning and then balancing the data set. And take a look at this: an even simpler network. There is a version: if you are running TensorFlow and Keras, you can use the CuDNN version, which runs much faster than the standard LSTM version. And then, this takes much more time to train. Your GPU will be burning there for some hours, really burning, 90%, 100% usage for several hours. In this case, it took about 29,000 seconds, roughly eight hours.
And here is the result of the model selection, and then you get the confusion matrix and the results here. You see, we got 91% balanced accuracy. This is a little bit impressive because our data set is small. And the number of features, I don't know if you noticed, the data set has 1,024 features, and we used only, I think, 300, because my computer didn't have enough memory. We need to convert these features from categorical to one-hot encoding before feeding the model.
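Why one-hot encoding runs into memory limits can be seen with a rough estimate. The sample count and vocabulary size below are hypothetical, since the transcript gives neither; only the 300 and 1,024 sequence lengths come from the talk.

```python
def one_hot(ids, vocab_size):
    """Turn a list of category ids into one 0/1 row per id."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in ids]


def one_hot_bytes(n_samples, n_steps, vocab_size, bytes_per_value=4):
    """Memory for a dense (samples, steps, vocab) array, e.g. of float32 values."""
    return n_samples * n_steps * vocab_size * bytes_per_value


print(one_hot([2, 0, 1], vocab_size=3))

# Hypothetical sizes: 10,000 samples and 300 distinct API names.
for steps in (300, 1024):
    gib = one_hot_bytes(10_000, steps, vocab_size=300) / 2**30
    print(f"{steps} calls per sample: {gib:.1f} GiB")
```

Each categorical value becomes a whole vector, so memory grows with the product of samples, sequence length, and vocabulary size. Cutting the sequence from 1,024 to 300 steps reduces the array by the same factor.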
But my computer didn't have enough memory for that. So, all right, now you have an idea of how you can apply deep learning to malware detection and classification. That was the idea of this talk. So let me just close here. Can you see the slides there? Okay. My research is basically about malware behavior. I'm researching specialized data augmentation methods, because the current ones just don't work very well with this kind of data, and also specialized representation learning techniques for malware classification. This should lead to improvements in the detection and classification of zero-days and of polymorphic and metamorphic malware. And that's it. Thank you very much for attending.