AI VILLAGE - AI DevOps: Behind the Scenes of a Global Anti-Virus's Machine Learning Infrastructure

Video in TIB AV-Portal: AI VILLAGE - AI DevOps: Behind the Scenes of a Global Anti-Virus's Machine Learning Infrastructure

Formal Metadata

Title
AI VILLAGE - AI DevOps: Behind the Scenes of a Global Anti-Virus's Machine Learning Infrastructure
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Thus far, the security community has treated machine learning as a research problem. The painful oversight here is in thinking that laboratory results would translate easily to the real world, and as such, not devoting sufficient focus to bridging that gap. Researchers enjoy the luxuries of neat bite-sized datasets to experiment upon, but the harsh reality of millions of potentially malicious files streaming in daily soon hits would-be ML-practitioners in the face like a tsunami-sized splash of ice water. And while in research there’s no such thing as “too much” data, dataset sizes challenge real-world cyber security professionals with tough questions: “How will we store these files efficiently without hampering our ability to use them for day-to-day operations?” or “How do we satisfy competing use-cases such as the need to analyze specific files and the need to run analyses across the entire dataset?” Or maybe most importantly: “Will my boss have a heart-attack when he sees my AWS bill?” In this talk, we will provide a live demonstration of the system we’ve built using a variety of AWS services including DynamoDB, Kinesis, Lambda, as well as some more cutting edge AWS services such as Redshift and ECS Fargate. We will go into depth about how the system works and how it answers the difficult questions of real world ML such as the ones listed above. This talk will provide a rare look into the guts of a large-scale machine learning production system. As a result, it will give audience members the tools and understanding to confidently tackle such problems themselves and ultimately give them a bedrock of immediately practical knowledge for deploying large-scale on-demand deep learning in the cloud.
And without further ado, here's Alex with this talk. Thank you. Thank you for coming on a Sunday morning. We are gathered here today to talk about AI. My talk is entitled "AI DevOps: Behind the Scenes of a Global Anti-Virus's Machine Learning Infrastructure." Just a meta note real quick: in this talk I'm going to be talking about work that was performed by many people on my team, so I actually interviewed about four people. This work spans about two years, and I have about 20 minutes to tell you about it, so this is going to be a very distilled highlight reel of our story. Who am I? I'm doing a lot of
things on the team: I do a lot of web dev, a lot of AWS backend stuff, some ML research, and I've done some Android stuff. That's a picture of my face and my contact information. This is a picture of
my cats, which you have three seconds to enjoy. This is the data science team at
Sophos. Maddie Schiappa is actually in the crowd today, and Hillary Sanders, Konstantin Berlin, and the two Matts at the bottom contributed a lot more than I did to the work that I'm taking credit for today. I also have this up here because we're hiring, so if you have any expertise at the crossroads of cybersecurity and data science, we'd love to hear from you. Okay,
so in the beginning there was signature-based antivirus, and by "beginning" I mean 2017. That's about the time that the data science team was added to Sophos, and we were the first introduction of ML to their antivirus product. And there
are a lot of initial misconceptions, the big one being that machine learning
is all you need. This is something that we had to convince people was wrong, even though we are the data scientists. So, just to give you guys a way of thinking about this:
Imagine the space of all files, where the green ones are benignware and the red ones are malware, and imagine that similar files are closer together, so the malware is sort of clustered in the middle. If you have signature-based antivirus, then your analysts are writing rules to draw circles around specific malware samples. The problem that has besieged the whole industry in the last ten years or so is that malware authors will just mutate their malware. They'll create variations, and it overwhelms analysts. The process of trying to update your signatures to capture these new variations without capturing any false positives is very tedious and laborious, because it's all done by humans. So ML's value is to fudge the lines a little bit: generalize, take those original signatures, and learn from them. It helps to counteract this main tool that malware authors are using, so it's actually very well suited for this challenge. But it's important to keep in mind that without the signatures there would be no ground truth, and there'd be nothing for machine learning to learn. So
ML is not really all you need; you still need humans in the loop. So
some other things that we've added to help ML are whitelists for critical system files that you might not want to FP (false-positive) on, like explorer.exe; blacklists for things that you just know are malware; and what I'm calling fuzzy whitelists. We have a reputation system where, if we've seen a file on a lot of different computers and we've seen it a lot of times, we make the assumption that it's not malicious, so we give it a probabilistic "it's probably good." When people at the company asked, "Well, why can't machine learning do all this?" we
were like, well, math and stuff, basically. If you imagine a neural network, the number of layers and the number of neurons that it has give it a certain capacity to learn patterns, and there's no reason to waste that capacity trying to learn patterns of files that you never want to convict or files that you always want to convict when you can just do these dead-simple whitelists and blacklists. So the tone changed from "machine learning is all you need" to: hey, you know that project you've got over there? You know how you could make it even better? Put some machine learning on it. And
that project, that would be great with some machine learning. If you don't know, this is from Portlandia, and this is an episode where they're making fun of how everyone in Portland thinks things look better with birds drawn on them. The best part about this episode is the ending, when a bird actually flies into her shop and she's totally disgusted by it, which is sort of what happened to us. Now
machine learning is alive and well at Sophos, but the reality of trying to deploy it in a production environment was a lot uglier than anyone envisioned, which brings me to scalability. So the
main issue of scalability is just that when you're working with a product that's deployed across millions of endpoints, and it's scanning tens of thousands of files on each endpoint every day, that's a very different dataset than the one that you trained and tested on, which may be tens of millions of samples, and which is a static dataset that you're optimizing over and over on. So a lot of different things manifest. Just as an example to illustrate this: was anybody born on August 12th? Okay, so it's nobody's birthday today in this room. If you were to be very naive, you might take that statistic and say, "Well, there's nobody born on August 12th," which presumably is wrong; presumably roughly 1/365th of the population is born on August 12th. That demonstrates a small-sample-size issue, and the same thing occurs with malware analysis. If you have a small malware sample dataset, then those tiny pieces of the pie just disappear. It may be a small variation of a family or some critical system files. So what we've done is set up this testing gauntlet. A model will get, like, three million files to train and test on, and if it reaches a certain threshold, then it gets five million, and if it does well there, it gets 10 million, and you keep ramping it up. That way you don't waste training time on large amounts of data: you start with three million, which is a test you can run relatively quickly, and if that goes well, you keep amping it up. So in general, scalability just amplifies things. We also have to strive for ridiculously low false positive rates because we're running our model so often. And the last thing that gets amplified is just costs. If you're using something like AWS, which you should be, it's great, because whenever you ask for resources you get
resources. The problem with AWS is that whenever you ask for resources, you get resources, and you don't always know how much you're really asking for. In one situation we had a bunch of data on S3 that we were trying to move into Glacier to save money, and we neglected to notice that there was a small transaction fee for every move from S3 to Glacier, and we ended up blowing through our monthly budget in a couple of days. So that was our bad. Another thing that we did to
address scalability was switch from DynamoDB to Redshift. DynamoDB, if you don't know, is great for row-based data: if you want to pull out individual rows, that works. But it's not great for aggregate statistics, so you can't really run SQL queries over large datasets, because you have to pull down the entire table every time you want to do that. Redshift is column-based, so it supports that much more quickly. One other thing I'll touch on is actually going into production. So one
thing that we constantly have to deal with is what we're calling concept drift, which is also referred to as model decay. Essentially what happens is that you put a model in production, and even over the course of a month the effectiveness of the model starts to drop perceptibly. This is just because new malware is being released, and new benignware is also being released. To compensate for this, we have to train with a sort of handicap: we train on older data and test on newer data to predict how our model will decay. There are also language discrepancies. All of our research is done in Python, and we've literally had to rewrite some parts of Keras in C++ to get it to run on Windows machines. Releasing updates is tricky because you don't really know what you're releasing. You know certain weights have changed in your model, and maybe the architecture has changed, but you don't know how its behavior is going to change in production. It's not like releasing a code update, where you can be pretty certain about what's going to happen. This is kind of a black box, and that's a little scary. That ties into the cultural side, because now you have data scientists who are in the production bloodstream, which is sort of bad, because we don't really know what we're doing there. There's been a lot of knowledge that we've had to gather along the way, like the things I talked about on scalability, so there's been a lot of learning that we've had to do as a data science team, and we've all had to take off our data science hats and put on engineering hats. And then one anomaly that came up: someone wrote a Hello World program in C++ and compiled it with certain settings, and our model said it was malicious. Management came to us and they were like, "Well, why is that happening?" and we were like, well,
math and stuff. But they wanted a fix, right? That's not good. In a production system, if there's something that's impacting customers, you do a hotfix: you surgically fix the code that's causing the problem without changing anything else. You can't really do that in machine learning. If you imagine a model like a pinball
machine, and you imagine that the model is all these pegs, the samples we're running through it are these balls, and the classifications it's producing are these bins at the bottom, we might run a bunch of samples through the model and it might get most of them right, but there's that annoying little benignware ball that we've FP'd on. What are we supposed to do about that?
The position of all these pegs has been optimized through an automated process over hundreds of hours. What are we going to do about that? We could pick
some peg and say, well, I wonder what happens if we twist it and run it all
again. And okay, yeah, we got it. But oh wait, now we have a false negative. There's really no way that our puny little human brains are going to look at all these pegs and figure out the optimal solution that works for all samples. So if you're afraid of robot overlords, you might want to be. But the actual takeaway here is that, if you go back to the original
thing that I was talking about, machine learning by itself doesn't always work. It makes mistakes, and when it's fudging the lines, sometimes it fudges them in the wrong direction and actually loses something that a signature would have caught. So robot overlords still kind of need us. Cool. Okay, I have six
minutes to demo something and answer questions. AI Total is an internal tool that we use for testing our models. This is everything that it's built with. Among the interesting stuff, we're starting to use SageMaker, which I don't know if people are aware of, but you put your model in a Docker container and it handles everything else, which is pretty great.
Okay, so I wasn't able to connect to Wi-Fi, which I'm not blaming on anybody, but it may be a DEF CON-related issue, so I'm not able to give you a live demo. But this is AI Total. We've done a lot of work on things other than PEs. I don't know if you can read the top, but we have HTML models, URL, PE, Office docs, and Android. The way that this works is that you can actually live-query the models that we have. This is our URL model, and you can type in a URL here and get instant feedback from the model. This allows us to get a little sense, before we push things out into production, of how it's actually going to behave. For the other models, for PE and Office docs, you can upload a file, but for URL, because it takes text, you can just type it in. So yeah, that's what I have for you guys. Thank you, and any questions? [Applause]
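
The layered approach described in the talk (whitelists, blacklists, a reputation-based "fuzzy whitelist", and only then the model) can be sketched in a few lines of Python. This is a minimal sketch and not the speaker's implementation: the function name, the prevalence cutoff, the score threshold, and the shape of the inputs are all hypothetical, and the reputation check is reduced to a single count where the talk describes something probabilistic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Set


@dataclass(frozen=True)
class Verdict:
    malicious: bool
    reason: str  # name of the layer that made the decision


def classify(
    file_hash: str,
    features: Sequence[float],
    whitelist: Set[str],
    blacklist: Set[str],
    machines_seen_on: Dict[str, int],
    model_score: Callable[[Sequence[float]], float],
    min_machines: int = 10_000,    # hypothetical prevalence cutoff
    score_threshold: float = 0.9,  # hypothetical conviction threshold
) -> Verdict:
    # Layer 1: files we never want to convict (e.g. critical system files).
    if file_hash in whitelist:
        return Verdict(False, "whitelist")
    # Layer 2: files we always want to convict.
    if file_hash in blacklist:
        return Verdict(True, "blacklist")
    # Layer 3: "fuzzy whitelist": very widely seen files are assumed benign.
    if machines_seen_on.get(file_hash, 0) >= min_machines:
        return Verdict(False, "reputation")
    # Layer 4: only now spend model capacity on the remaining files.
    return Verdict(model_score(features) >= score_threshold, "model")
```

Because the cheap lookups run first, the model is only consulted for files that none of the lists can decide, which matches the talk's argument about not wasting model capacity on patterns a list can handle.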
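
The birthday example and the remark about "ridiculously low" false positive rates both come down to simple arithmetic, sketched below. The room size of 50, the one million endpoints, the 10,000 files per endpoint, and the one-in-a-million false positive rate are assumptions chosen for illustration (the talk only says "millions" and "tens of thousands"), and the calculation ignores that many files repeat across endpoints.

```python
def prob_category_unseen(sample_size: int, category_rate: float) -> float:
    """Chance that a category with the given base rate never appears
    in a sample of independent draws."""
    return (1.0 - category_rate) ** sample_size


def expected_false_positives(scans: float, false_positive_rate: float) -> float:
    """Expected number of false positives over a number of scans."""
    return scans * false_positive_rate


if __name__ == "__main__":
    # A hypothetical room of 50 people: most of the time nobody has
    # a birthday on any one given day (about 0.87 probability).
    print(prob_category_unseen(50, 1 / 365))

    # Hypothetical fleet: 1,000,000 endpoints x 10,000 files per day.
    daily_scans = 1_000_000 * 10_000
    # Even a one-in-a-million rate gives roughly 10,000 false positives a day.
    print(expected_false_positives(daily_scans, 1e-6))
```

The same formula explains why rare malware families or rare system files can vanish from a small training set, and why an error rate that looks negligible in the lab is not negligible at production volume.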
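
The testing gauntlet described in the talk (three million files, then five million, then ten million, advancing only when a threshold is met) can be sketched as a loop that stops at the first failed stage. This is a hedged sketch: `run_gauntlet`, the `evaluate` callback, and the 0.99 threshold are hypothetical names and values, and the talk does not say which metric is thresholded.

```python
from typing import Callable, List, Sequence, Tuple


def run_gauntlet(
    evaluate: Callable[[int], float],
    stage_sizes: Sequence[int] = (3_000_000, 5_000_000, 10_000_000),
    pass_threshold: float = 0.99,  # hypothetical; metric is not specified in the talk
) -> Tuple[bool, List[Tuple[int, float]]]:
    """Run a candidate model through progressively larger datasets.

    `evaluate(n)` is assumed to train and test the candidate on n samples
    and return a score where higher is better.
    """
    history: List[Tuple[int, float]] = []
    for n_samples in stage_sizes:
        score = evaluate(n_samples)
        history.append((n_samples, score))
        if score < pass_threshold:
            # Fail fast: do not spend training time on the larger stages.
            return False, history
    return True, history
```

A candidate that fails at three million samples never incurs the cost of the ten-million-sample run, which is the saving the speaker describes.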
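
The row-based versus column-based distinction behind the DynamoDB-to-Redshift switch can be illustrated with plain Python data structures. This is a toy in-memory illustration only: it uses no AWS APIs and does not reflect how either service is implemented, and the field names and values are made up.

```python
from typing import Any, Dict, List

# Row-oriented layout: one record per sample (made-up data).
rows: List[Dict[str, Any]] = [
    {"sha256": "a1", "label": "malware", "score": 0.97},
    {"sha256": "b2", "label": "benign", "score": 0.02},
    {"sha256": "c3", "label": "benign", "score": 0.10},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Pivot records into one list per field (column-oriented layout).
    Assumes every record has the same keys."""
    columns: Dict[str, List[Any]] = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def mean_score_row_store(records: List[Dict[str, Any]]) -> float:
    # Every full record has to be visited just to read one field.
    return sum(record["score"] for record in records) / len(records)


def mean_score_column_store(columns: Dict[str, List[Any]]) -> float:
    # Only the "score" column is read; the other fields are never touched.
    scores = columns["score"]
    return sum(scores) / len(scores)
```

Fetching one sample by key is natural in the row layout, while an aggregate over one field only needs a single list in the column layout, which is the trade-off the speaker describes.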
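
The "handicap" the speaker mentions for concept drift, training on older data and testing on newer data, amounts to splitting by time instead of at random. A minimal sketch follows, assuming each sample carries a hypothetical `first_seen` date field; the field name and the cutoff are illustrative and not taken from the talk.

```python
from datetime import date
from typing import Any, Dict, Iterable, List, Tuple

Sample = Dict[str, Any]


def split_by_time(
    samples: Iterable[Sample], cutoff: date
) -> Tuple[List[Sample], List[Sample]]:
    """Put samples first seen before `cutoff` in the training set and
    everything from `cutoff` onward in the test set."""
    train: List[Sample] = []
    test: List[Sample] = []
    for sample in samples:
        if sample["first_seen"] < cutoff:
            train.append(sample)
        else:
            test.append(sample)
    return train, test
```

Scoring a model on the newer slice gives a rough preview of how it may degrade after release, whereas a random split lets the model see files from the same period it is tested on and tends to look more optimistic.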