AI VILLAGE - Malware Panel

Video in TIB AV-Portal: AI VILLAGE - Malware Panel

Formal Metadata

Title
AI VILLAGE - Malware Panel
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
2018
Language
English

My name is Brian Wallace — some of you might know me as botnet hunter from Twitter — and we're going to be talking about machine learning and malware classification, its uses, and so on. I'm now going to pass the mic along for everyone on the panel to introduce themselves.

Hi there, my name is Andrew Davis. I'm a staff data scientist at Cylance. My main day-to-day job is training our PE model, so I know a little bit about feature extraction and the different methods for training PE models — disassembly, static feature extraction, and I'm starting to work a bit with dynamic feature extraction. PE is kind of my bread and butter.

Hi, I'm a principal data scientist at Sophos. I also work a lot on PE, but also on other formats, including documents and HTML — particularly malicious JavaScript.

My name is Hyrum Anderson. I'm the technical director of data science at Endgame, and I work with a great team. PE was our first machine learning model as well, but we also do macros and Mach-O.

Hey everyone, I'm Matt Maisel, the manager of security data science at Cylance. I work closely with both Andrew and Brian. My interests are more in nearest neighbor search — if you attended the last talk — clustering, weak supervision, and active learning: applying those different methods to perform malware classification, or to help the pipeline that does malware classification at scale.

If I may, I'd like to briefly introduce Amanda Rousseau, who'll be joining us momentarily — she's fleeing the paparazzi and will be here when she arrives. Unlike everyone else on this panel, she is not a data scientist but a malware reverse engineer, so she'll provide an interesting perspective.

So we're going to start out with kind of an easy question for all of our panelists: what do you believe is the current state of the art for machine learning and malware classification? Who wants to go ahead? You know you live in controversy.

Oh boy. I guess the current state of the art for malware classification is basically applying deep neural networks: getting a crap-ton of benign and malicious samples and using that to train a very large-scale model. And then, once you have a great model, you actually have to put it in production somewhere — whether on an agent or in the cloud — so a lot of the challenges usually come after you make the model. I would say deep neural networks. My expertise isn't necessarily in the classification algorithms, so I'll probably get overturned here, but that's just to start us off.
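To make the "collect a pile of labeled samples and fit a model" idea concrete, here is a deliberately tiny sketch. Everything in it is a stand-in: synthetic blobs instead of real corpora, a nearest-centroid classifier over byte histograms instead of an actual deep network, and no claim that this resembles any vendor's pipeline.

```python
import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    """Normalized 256-bin byte-frequency histogram of a file's raw contents."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

rng = np.random.default_rng(0)
# Toy stand-ins for real corpora: high-entropy blobs play "malicious",
# repetitive ASCII plays "benign". Real training sets are labeled files.
malicious = [rng.integers(0, 256, 4096, dtype=np.uint8).tobytes() for _ in range(20)]
benign = [b"this program cannot be run in DOS mode\r\n" * 100 for _ in range(20)]

X = np.stack([byte_histogram(b) for b in malicious + benign])
y = np.array([1] * 20 + [0] * 20)

# "Training": one mean feature vector (centroid) per class.
centroids = {c: X[y == c].mean(axis=0) for c in (0, 1)}

def predict(data: bytes) -> int:
    """Label of the nearest class centroid in feature space."""
    h = byte_histogram(data)
    return min(centroids, key=lambda c: np.linalg.norm(h - centroids[c]))

print(predict(rng.integers(0, 256, 4096, dtype=np.uint8).tobytes()))  # → 1
```

The interesting engineering, as the panel notes, starts after this point: scaling it to hundreds of millions of samples and deploying the result on an endpoint or in the cloud.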
Just to note, we're not talking purely about static detection — we're also talking about dynamic detection and any sort of automated, machine-learning-based malware detection method.

Yeah. So at Endgame we primarily deal with static detection, but there are at least two other types: there's dynamic, and there's also contextual machine learning — things like when you see a large spike in traffic to a vendor with a lot of hosts, which can indicate a new attack where you haven't actually analyzed the file; you only see the number of hashes flying by. In terms of static machine learning, I will respectfully disagree with my fellow panelists. I think the way you define state of the art is a Pareto curve, and there's an optimum along that curve depending on what you want to achieve. That depends on things like the false positive/false negative trade-off, obviously, but also model size — does this need to live on the endpoint, or can it live in the cloud? — and the time required for detection. There are cases where milliseconds matter, like detecting ransomware: in a dynamic case it's often really advantageous to have a speedy, lightweight model that can determine in sub-millisecond time whether something is ransomware. My belief right now, for static detection — and my research backs this — is that there's still a difference between end-to-end deep learning performance and parse-the-file-first, then-apply-a-model performance. That's actually highly related to the next question.

Then I'll move along. So what are the different approaches for feature extraction, and how do we represent malware to a model? I imagine that's something new to much of the audience.

I guess I've got the mic. As far as I'm aware, a lot of what's actually in production tends to be built on static artifacts, and you have two different ways to represent them. You can use some sort of parsing approach, where you do things like pull out section names or the entropy of individual sections — but that relies on being able to dig in and analyze the file as a PE file. Then you also have features that don't require any parsing: you can just take the entropy of every consecutive 256 bytes of the file, or something like that, and treat it as a sequence. There's also research on convolutional and recurrent neural networks where you feed the file in a byte at a time, or in chunks of the file, and the model operates directly on the raw bytes. All of these have their pluses and minuses. There are trade-offs in how much domain knowledge you need to put in, whether your parser will choke on some files and not others, and how flexible the representation is. That's probably the key challenge in getting an effective malware model out: figuring out the right balance between the very domain-specific, expert-knowledge parsing features and the more flexible, less structured approach of "I'm just going to ingest the whole file — it doesn't actually matter that it's a PE file, I'm treating it as a bag of bytes."

I have a question for the panel real quick: do you believe there's a single approach — a single model — that could be applied to detecting malware, or is it a requirement to have multiple different approaches used in conjunction, given different file types?

What do you mean — file types, dynamic, contextual?

I mean you're going to have different problems, like Hyrum said. You're going to have different places where you want to deploy it; you'll have different requirements for latency, for speed, for the size of the model — and every one of those decisions is going to lead you to a different kind of model.
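The parser-free feature mentioned above — the entropy of consecutive 256-byte windows, treated as a sequence — could be sketched like this. The window size and the non-overlapping stride are arbitrary choices for illustration, not anything the panel prescribes.

```python
import math

def window_entropy(data: bytes, window: int = 256) -> list[float]:
    """Shannon entropy (bits per byte) of consecutive, non-overlapping windows."""
    out = []
    for i in range(0, len(data) - window + 1, window):
        chunk = data[i:i + window]
        counts = [0] * 256
        for b in chunk:
            counts[b] += 1
        out.append(sum(-(c / window) * math.log2(c / window) for c in counts if c))
    return out

# 256 zero bytes (minimum entropy) followed by all 256 byte values (maximum).
print(window_entropy(b"\x00" * 256 + bytes(range(256))))  # → [0.0, 8.0]
```

A sequence like this needs no PE parser at all, which is exactly the appeal: it cannot choke on a malformed header, at the cost of knowing nothing about the file's structure.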
Yeah, I would definitely agree there's probably not a single all-encompassing model that would work well across all different types of malware. One thing I wanted to mention about the previous question on feature extraction and feature engineering: they are vitally important for making sure your model isn't abusable. If you're extracting features that don't really make sense, your model might key in on things it shouldn't be keying in on, and that's not going to be a very good model — it's not going to be a robust model, and an adversary could find those weak points and exploit them. Which brings me back around to why there probably isn't an all-encompassing model: the things you would use to do malicious things with PE are going to be different from the things you would use to do malicious things with ELF or Mach-O or any other file format — or JavaScript, even. So, TL;DR: no single model.

I'll just note that there are corporations who do use a sort of single model for all file types and make claims about that. It wouldn't be great, even using transfer learning.

The only thing I have to add: obviously each model would be specific to its modality, but consider multimodal models, where we're looking at feature spaces from static and dynamic data together — how you're sourcing that data and how you're doing the feature extraction could change things. But each model could be trained independently, and then we can ensemble: build a majority vote or other simple ways of combining the outputs of different models that are each built for a specific task. That would be my two cents.
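The combining scheme just described — independent per-modality models whose verdicts are merged by majority vote — might look like this minimal sketch. The tie-breaking rule is my own assumption, not anything stated on the panel.

```python
def majority_vote(verdicts: list[int]) -> int:
    """Combine independent model verdicts (1 = malicious, 0 = benign).

    Ties break toward malicious -- an assumed, deliberately cautious policy.
    """
    return int(2 * sum(verdicts) >= len(verdicts))

# e.g. verdicts from hypothetical static, dynamic, and contextual models
print(majority_vote([1, 0, 1]))  # → 1
print(majority_vote([0, 0, 1]))  # → 0
```

Real ensembles usually weight members by calibrated confidence rather than counting raw votes, but the structural point is the same: each model stays specialized, and only the verdicts are combined.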
On to the next set of questions: what do you feel are the biggest challenges in the field of detecting malware with machine learning?

Well, I'll start. Again, you can break this down into the different types, but in static detection I think an awful lot of engines still have trouble with packed samples, because packers change, they're hard to track, and the entropy of sections can vary depending on the packing technique used. Not all static engines actually attempt to unpack before inspecting the contents. Instead of looking inside the sample and trying to find a string or a signature that would indicate maliciousness, they look at artifacts — the smoking gun, something in here that's probably malicious just because of the imports and the other things that can't be packed. That still represents a weak point for some types of static approaches.

Oh — do you want to introduce yourself?

Yeah, sorry, I was actually out trying to get food. My name is Amanda Rousseau; people also know me as Malware Unicorn. Can you see me back there? Okay. I work at Endgame doing malware research. I help a lot of the data scientists tune the machine learning models we use at Endgame, and we look at different types of malware to see where we need to focus our feature set, as well as doing manual verification of whether something is malicious or benign.

We're still on challenges, right?

Yes — challenges. Please continue.

I think packing and encryption are a major headache. The other really big challenge is just getting a good labeled set for doing supervised learning. You've got a lot of repos of stuff that's allegedly malware, or mostly malware; you've got various threat intelligence feeds you can subscribe to; but a lot of the labels in there are kind of garbage, and you've got a lot of disagreement between vendors about whether some samples are malware or not. Being able to get a set of ground-truth labels that you trust, that you can use to train these models in a fully supervised fashion, is a constant headache — so finding ways to refine those labels and make them more trustworthy is something we actually spend a lot of time on.

That leads well into this: to elaborate a bit, one of the main challenges I see is dealing with file formats where you don't have a lot of labeled data. Take Mach-O, for example. You might scan VT for labels or something like that, and come back with a thousand, or a couple tens of thousands, of malicious Mach-O executables. When it comes to training models that generalize well, how do you learn anything from 10,000 samples? Compare Mach-O to PE: for PE you might be able to get a labeled data set of several hundred million malicious executables. For file formats where the ways to abuse them haven't been fully exercised like they have for PE — people have been writing malware as PE for decades now — it's a really difficult problem to deal with.

Like he said, file format is a major issue, but you don't just have file format — Mach-O, ELF, whatever — you also have compiler type. There's a lot of bootstrapping code at the beginning of a binary that can change your feature set: say your focus is on the beginning of the binary and you're trying to design features around the malicious part of the code, but you're actually looking at the compiler's bootstrapping code. So when you cluster PE executables, you can't cluster them based on the PE alone; you also need to cluster them by the compiler and the language they were written in, because the calling conventions in the binary itself are all different depending on the compiler and the processor it's meant for. Same with 32-bit versus 64-bit — you're going to have different training sets for those, so you have to make sure the samples in your training set are labeled properly.

I'll just add one last thing to this question, not from the perspective of modeling itself but of engineering sustainable model pipelines that support rapid experiments: deploying models, getting feedback quickly, and being able to make use of feedback that doesn't come through normal channels. That's a really hard problem, especially considering some of the data set sizes — maybe not the class imbalance so much, but for PE, just moving all that data around can be challenging. Quickly building new models, experimenting with different instances of them, evaluating them, and then getting feedback from the field can be a hard engineering problem.

So we can move on to the next question, which I think is going to be more interesting: what should we do, as an industry of security folk, about things that are only malware in malicious hands, like PsExec and so forth? Anyone want to jump in?

All right. I'm a proponent of the view that machine learning is not a panacea — it's not a cure-all. A model should be designed with a specific focus and a specific question, and it can do that well.
But if you try to fuzz your objective, you'll get fuzzy results. With that in mind, I think these dual-use tools can sometimes be addressed better without machine learning. By looking at common rules about parent-child process relationships that should not happen, you can discover an exploitation much more easily than a machine learning package could infer it, for example. So I think there's a lot of room for traditional security concepts to wrap what machine learning provides in a more laser-focused setting.

It's not just the binary itself — it also depends on the context. One of the things we focus on is adding another feature to the set: determining how the file was delivered. If a tool was delivered as part of a package with other malware — inside a zip file, for instance — then in that context it's probably going to be used maliciously, as opposed to an admin using the same tool in a legitimate way. There are factors outside the binary itself that determine whether something is malicious or bad.

Okay — great answers, everyone. A lot of machine learning engines applied to malware classification generally just tell you whether a sample is malicious or benign. How do we feel as an industry — how could we approach — naming malware through machine learning? Or is it really a problem we need to address?

Well, you could build a classifier around all the VT detection labels and basically generate labels from that. But I think there are a lot more interesting machine learning tasks here beyond just binary classification — multi-label classification,
or multi-class classification could be applied here.

Just to clarify: by binary classification he means a one-or-zero verdict — benign or malicious — not binary as in raw bytes.

Yeah, thank you for that clarification. I think there are other interesting areas too. I grew up in a SOC, working as an incident responder, so I loved having tools that enabled me to do my hunting or my analysis better. Applying things like nearest neighbor search or clustering — for customers, and even internally — really helped us out. So there's more beyond binary classification, beyond things being malicious or benign, that could definitely help. And maybe this gets into model explanations as well; I don't know if that's a question later, but model interpretability is also a very interesting area.

On that front, that's one of the reasons why at Sophos we're so deep in on deep neural networks as opposed to some other, more structured classifiers: you can put a lot more flexibility into them. You can build something that will not just say whether a sample is malicious or benign, but, if it's malicious, whether it's malicious because it's a PUA, or because it's a virus, or because it's ransomware — and you can even put family labels into it. You can build in a lot of flexibility in how you slice up the space of your malicious and benign samples, and enrich your output in that sense. And — shameless plug for our Black Hat talk — we have found that the internal representations from these deep learning models are actually a really useful analytic tool for looking at how different samples group together in space, and maybe suggesting other clusterings that relate more to functionality than to authorship similarity. I think we're just seeing the beginning of this: there's a large role for these machine learning models not just as classifiers, but as analytic tools that can be handed over to people doing more conventional malware analysis, to assist them in this sort of virtuous cycle.
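The embedding-based grouping described here — taking a model's internal representations and finding a sample's nearest neighbors — can be sketched with synthetic vectors standing in for real model embeddings. Cosine similarity is one common metric choice, not the only one, and the two "families" below are entirely made up for illustration.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k rows of `index` most similar to `query` (cosine)."""
    a = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(a @ q))[:k]

rng = np.random.default_rng(1)
# Hypothetical 16-dim embeddings: two loose "families" around opposite centers.
fam_a = rng.normal(0.0, 0.1, (5, 16)) + 1.0   # rows 0-4
fam_b = rng.normal(0.0, 0.1, (5, 16)) - 1.0   # rows 5-9
index = np.vstack([fam_a, fam_b])

query = np.full(16, 1.0)  # an unlabeled sample that behaves like family A
print(nearest_neighbors(query, index))  # three indices in 0-4 (family A)
```

Handed to an analyst, "here are the three known samples most like this one" is often more actionable than a bare malicious/benign score.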
As far as auto-classification goes, taking a machine learning model and having it look at a binary and try to predict what kind of malware family it belongs to is a pretty good idea, and a lot of people do it really well. However, I think we will always need malware analysts in the loop. We're always going to need humans to validate these results; we can't just be automatically classifying things and asleep at the wheel. I'm not so sure it would be a great idea to have machine learning models do things like automatically come up with new family names. For one thing, a model may come up with something that's not entirely meaningful — and with malware family names, the signatures are often all over the place anyway. Having a human in the loop to give a malware family a meaningful name is always going to be important.

From the malware analyst's perspective, it's the data scientists' job to help bubble up the important things for us to resolve. Traditionally, malware samples were identified by someone giving them a cool name — calling it WannaCry or something like that. But if we went back to the biology side of classification — classifying a sample because it exhibits a certain behavior or has a certain property, rather than giving it a random name and calling that a class — that would help us a lot more, because to get rid of the malware we need to know what type of behavior it has so we can remove it. That's one of the major problems data science can help resolve: heuristics based on behavior or properties can be more helpful than a generic family name that someone threw a dart at the wall to pick.

I'm just going to lead into the next part, because we're a little over time since we started late. What do you feel is the future of malware classification,
without burning intellectual property? Does no one want to touch that?

From the machine learning perspective, the amount of data that's coming in, day over day, is growing really fast, so there are a lot of challenges for us to overcome just on the engineering and pipeline end of things. But there's also putting analysts more tightly into the loop: engaging that feedback loop, bubbling up stuff that's suspicious but hasn't been assigned a family name, handing it over, and getting a final verdict on whether it's malicious or benign. Focusing more on an active-learning type of problem — where you put humans in the loop and learn how to spend analyst hours most effectively to supplement and improve your machine learning — is a really promising direction for the future.

I'll just add that I think everybody represented on this panel relies, maybe to an embarrassing extent, on traditional AV to supplement our label library right now. It's really unfortunate — and they also rely on each other. But these labels don't come for free: we put in human effort, and there's automated effort in there too. So one thing to think about for the future is, if everybody moves to machine learning, then from whence the free labels? There's a conundrum here. The reliance on the human is going to become more important, and furthermore, I think a really hot research area is not supervised learning of machine learning models but unsupervised and semi-supervised models, where the reliance on labels diminishes.

I'm strongly of the opinion that active learning is the way to go — I'm a big fan of it. To build on top of that, there's a really interesting project called Snorkel. It's a framework for doing weak supervision. Specifically, in a case where you have labels for all these different types of malware — maybe heuristics or signatures from analysts — the weak-supervision task is basically to combine all those labels in a way that improves another model later down the road. Snorkel is from, I think, the HazyResearch group — I can't remember the author's name right now — and it's a great tool. It's built more around weak supervision of images, combining labels from different mechanical turkers for image classification, but it could potentially be extended to help with the problem of taking really noisy labels, and of taking account of external databases of signatures or heuristics that might be useful for influencing the model.

I'm sorry, we have to stop now — we're over time. But thank you to all the panelists for their great answers; super happy to have you all, and thank you everyone for coming. [Applause]