AI Village - Calculating Drift, Fast with Goko
Formal Metadata

Title: AI Village - Calculating Drift, Fast with Goko
Number of Parts: 374
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/49690 (DOI)
Transcript: English (auto-generated)
00:01
Hello everyone, and welcome to our next talk, which is mine. To introduce myself: I did a postdoc in machine learning after getting my PhD in algebraic topology, with some of the people who did some of the math behind the concepts in this talk, and I've just extended it for this thing. So
00:26
that's enough about me, let me play my talk back for you. So here it is: welcome, everyone, to my talk on calculating concept drift very quickly with this library called Goko, which is based on cover trees.
01:00
I hear no audio. Oopsie-daisy
01:29
There's no... do you hear audio from the talk? Okay, well, that's the problem. I'm trying to...
02:03
Sorry about this, guys.
02:26
Sorry, I don't know what's going on
03:19
I am up. I apologize.
03:22
The key takeaway for this talk is that this is a novel way of calculating drift. Welcome, everyone, to my talk on calculating concept drift very quickly with this library called Goko, which is based on cover trees.
03:47
My name is Sven Cattell. I am a senior data scientist at Elastic, and my Twitter handle is comathematician. The key takeaway for this talk is that this is a novel way of calculating drift. It's done using
04:04
a new technique using cover trees, and because it's so fast, log(n) time to actually calculate drift for a new sample, we can use it in new ways that concept drift detection couldn't have been applied to before now. We're going to go over the objective, why we care about concept drift, how it's done mathematically, then a couple of results, and then
04:26
finish up with next steps. So why do we care about data set shift? Well, our data drifts. I work with malware, and there are new malware families coming up every month.
04:41
There are new benign examples of software coming out every month. Every time Adobe patches Photoshop, it changes the pattern slightly and shifts its location in our data set slightly. Every single time a new piece of malware comes out, it's a new PE, and it changes the way our malware data is distributed slightly.
05:01
Another problem, which gets into the second issue, is that these shifts lower the efficacy of the models we deploy. Our models expect data in sort of the same general location as our training data, and every time we deploy, we are fixing our model's view of the world at a certain date, and
05:25
then we're running it on, essentially, future data, so it doesn't have that perspective. If it's trained in March, it might not work very well on data coming in at the end of April.
05:42
So one of the methods for dealing with this is to build a dashboard that tracks the error, and when the error gets too high you discard the model and move on to a new one. But the problem is, I don't trust VirusTotal labels, and I don't think anyone really should trust VirusTotal labels; and yet, because of the sheer volume of data, that's the best we have, so we have to try to trust VirusTotal labels.
06:08
And for other things, I wouldn't trust early labels on data, because we have to move very quickly, sometimes with whitelisting and such. And sometimes we only get a label when a customer complains, and we'd rather
06:25
get ahead of that and figure out: hey, where's the drift happening? What's going on? Then, additionally, aside from the fact that things are just changing all the time, our models are also under attack. People are trying to bypass them by doing new and weird things to the data, like
06:42
packing a binary and seeing if that works, or posting with a different pattern to see if they can get past, say, the Facebook spam filter. They're constantly innovating with the specific goal of bypassing our models. This may not be an adversarial example per se, in the sense of the literature, but it sure acts like one sometimes. And
07:05
additionally, we could be under attack via poisoning of our system. Maybe, if this detector is fast enough, we can get ahead of those things beforehand. So here's a trivial example. We have these two data sets, and one is our training data set.
07:22
So we've got one, one, two, one, one, and so on, and our training data set has 70% ones, 20% twos, and 10% threes, and this is what we trained on. Realistically, there's no point in training a machine learning model on a sequence of numbers, but suppose we do.
07:41
But then we deploy this model and we get this other sequence of data: two, two, one, two, and so on. When we actually deploy it, we see data coming over the wire with 30% ones, 60% twos, and 10% threes. That's a little bit different from our training set; the real world is different from our training set.
08:00
So in this case we can model this distribution very easily, because it's a discrete space: there are only three things that our data could be. It isn't continuous like a lot of the stuff we have to deal with in data science, so we can actually compute the distribution fairly easily, and we can feel pretty confident in our estimate of it.
08:23
We have to do some Bayesian statistics to do this properly, which we're not getting into in this talk, but this gives a concept of our training distribution and our real-world distribution, or our test distribution, as I'll be calling it today.
08:44
We can take the KL (Kullback-Leibler) divergence. This is a sort of measure of the distance between this distribution and that distribution, and it's very easy to calculate on this categorical,
09:01
discrete data, because it's just going to be 0.7 times the log of 0.7 over 0.3; that is, the training probability for ones over the test probability for ones, take the log, then multiply by the training probability for ones, and that gives you the first term. The second term is the same thing for twos, and the third term is the same thing for threes.
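As a quick sanity check of that arithmetic, here is a plain-Python sketch (illustrative only, not anything from the Goko library). Note that the 0.16 quoted in the talk corresponds to base-10 logarithms; natural logs would give roughly 0.37.

```python
import math

train = {"ones": 0.7, "twos": 0.2, "threes": 0.1}  # training distribution
test  = {"ones": 0.3, "twos": 0.6, "threes": 0.1}  # deployed / test distribution

# KL(train || test): for each category, p_train * log(p_train / p_test), summed.
kl = sum(p * math.log10(p / test[cat]) for cat, p in train.items())
print(round(kl, 2))  # -> 0.16 with base-10 logs (about 0.37 nats with math.log)
```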
09:21
Then we add those all up, and you get that the KL divergence between these two distributions is 0.16. So that's all great, but the problem is our data looks like this, and I don't know whether there's any difference: are my blue points sampled from the same distribution as my orange points? I don't know.
09:45
It's really hard to tell by just looking at this picture. And in this case, they are: my orange points are sampled from a two-dimensional Gaussian, a two-dimensional normal distribution, where it's just a nice sphere, and
10:03
yeah, they're both there: the covariance matrix is the identity, and the covariance matrix for the blues is the identity too, and that gives you this nice distribution, and it looks okay. Now what I'm going to do is drift the blues slightly over. Can you tell? Is this correct? Well, every single time I put down a blue point, it's right next to some orange points.
10:22
So this looks like it could be from the same distribution; I can't really tell with my eyes. One of the techniques we might try is: take the k nearest neighbors, and if the distance to my k nearest neighbors goes up over time, then I'm out of distribution. But you can tell that's not going to tell you much here.
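That k-nearest-neighbors heuristic is easy to try. Here is an illustrative scikit-learn sketch (my own assumption about how one might implement the heuristic, not what Goko does), comparing the mean distance from drifted test points to their nearest training neighbors against the same statistic for held-out in-distribution points:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 2))                         # the dense "orange" cloud
held_out = rng.normal(size=(300, 2))                       # fresh points, same distribution
drifted = rng.normal(size=(300, 2)) + np.array([0.5, 0])   # slightly shifted "blue" points

nn = NearestNeighbors(n_neighbors=5).fit(train)

def mean_knn_distance(points):
    distances, _ = nn.kneighbors(points)
    return distances.mean()

print("held-out:", round(mean_knn_distance(held_out), 3))
print("drifted: ", round(mean_knn_distance(drifted), 3))
# For a small shift the two numbers are nearly identical, which is exactly the
# problem: k-NN distances barely move even though the distribution has drifted.
```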
10:40
Let me go here. Well, I've got some outliers here now, so maybe that will work. Now if I go here, well, this is really drifted: the center of this distribution should be around here, and the center of that distribution is around here, so that's really drifted, but you've still just got some outliers over here. Now, what happens when there are millions of orange points and only a few hundred blue points?
11:02
Well, you might not be able to tell the difference in that case. And also, all of this that I've described to you is kind of a feeling; to actually get at the math of it, I'd have to model the distributions these came from, and that's quite a hard problem, especially in high dimensions. So things get kind of complicated.
11:23
The solution to this that I came up with is to use a cover tree. Now, why do I want to use a cover tree? The key takeaway for cover trees is that it's a k-nearest-neighbors data structure. There are many like it: there are KD-trees, there are ball trees, there are
11:41
k-means trees, a whole bunch. But the cover tree has this wonderful property: in 2016, Mario Majerney and Wendy Lau proved that it can arbitrarily well approximate the underlying distribution of the data.
12:04
There are some caveats to that statement, and stating it properly, mathematically, would take a whole page. But basically, the whole concept of their proof is: for a nice data set (and to know what a nice data set is, you need to know what a low-dimensional manifold is), and given enough data from it,
12:21
for a nice data set, and most of our data sets are pretty nice, sort of, kind of, a cover tree will arbitrarily well approximate the underlying distribution of the data. There are caveats, and you can go read their paper if you want to know exactly what that means. But for us, what this means is that if I use a cover tree to build a model of my data,
12:44
I can run k-nearest-neighbor queries against it, or I can just fiddle around with the properties of the cover tree to get information about my data set. So I can tell things like the local dimensionality, I can tell how things glue together, and I can tell the sort of shape of my data fairly well.
13:02
I can tell clusters; I can infer what the clusters are. Let's go build a very simple cover tree on two-dimensional data. So here's an infinity sign, a bow tie, whatever you want to call it, and I start my cover tree by picking a point at random and then building a sphere (in this case a circle) that covers everything.
13:21
This is the start, and it doesn't help me that much yet: I haven't split up my data geometrically to divide and conquer the k-nearest-neighbors structure, and the naive way would take linear time, big O of n. We want to divide and conquer so it takes log of n. So here's what I do: I shrink my sphere down, add another one, and cover it again.
13:44
You can kind of tell that this object is longer than it is wide, so if you really squint, it's kind of a one-dimensional thing in this direction. That's great, but now I say, okay, I'm going to split it up further to really divide and conquer. So I've got these two nodes, and they're both children of the node on the previous slide,
14:06
and I'm going to build their children. So here are their children: I divide and conquer their stuff, and I can kind of tell that each of them has four children, and if you squint a little bit, the region is kind of square-shaped
14:22
in this dimension at this scale. At the top scale it was too blurry, I could only see a line, but at this scale I can see that it opens up and has a nice square shape. So there are four children, and that's how you know that the dimensionality is related to the number of children you have.
14:42
And then I can go further, and it splits up nicely, and I've got a one-dimensional object here again, because all of these children are kind of one-dimensional-ish; and I can split up some more, and you can see that it conforms to the data nicely. At every single step of the way I can make inferences about the shape of my data, and they are very cheap, because it's basically counting.
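To make the construction concrete, here is a toy sketch of that kind of recursive covering. This is a deliberate simplification for illustration, not the Goko implementation; a real cover tree maintains stricter nesting and separation invariants and supports fast queries.

```python
import numpy as np

def build_cover(points, center, radius, min_radius):
    """Recursively cover `points` with balls whose radius halves at each level."""
    node = {"center": center, "radius": radius, "count": len(points), "children": []}
    if radius <= min_radius or len(points) <= 1:
        return node
    remaining = points
    while len(remaining) > 0:
        c = remaining[0]                                # greedily pick an uncovered point
        d = np.linalg.norm(remaining - c, axis=1)
        covered = remaining[d <= radius / 2]            # the points this child will own
        remaining = remaining[d > radius / 2]
        node["children"].append(build_cover(covered, c, radius / 2, min_radius))
    return node

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))                       # stand-in for the 2-D toy data
root_center = data[0]                                   # "pick a point at random"
root_radius = np.linalg.norm(data - root_center, axis=1).max()
tree = build_cover(data, root_center, root_radius, min_radius=root_radius / 16)
print("root children:", len(tree["children"]),
      "| grandchildren:", sum(len(c["children"]) for c in tree["children"]))
```

The per-node populations stored here are the raw material for the probability distributions discussed next.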
15:07
So how do we build a probability distribution? On slide two we built a very simple probability distribution where I had a discrete space, and now I've got a tree. Essentially, when I have a tree,
15:23
at every single node I can go one way or another; in this case the number of children is variable, but I will be at a parent node and I have to go to one of its children. So I always have a discrete choice of where to go.
15:42
So how does that apply? Here is a very simple cover tree, and I've colored it by the population of each node, the relative population of the node. So this color here means that about 66 percent of the data
16:04
is underneath this node, and about 33 percent of the data is underneath this node, and you can see that the colors change, so I get a picture of how dense my data is. But that doesn't help me on its own, because I need to make choices; it doesn't help me make the choices at each step.
16:23
So if I go here, this helps me make my choices. Here I've got 66 percent of my data over here, and I've got 33 percent of my data over here. Then I have to take one of those two choices, and if I take this one,
16:41
I'm now at this node, and I have two choices again, with a relative probability of taking this choice versus that one: this is basically a 0.1 probability of taking the right-hand path, and a 0.9 probability of taking this path. And once I get to this node, I have a
17:00
probability of one of going straight down, because I have no other choices. At every single point I have a discrete distribution, like we had on the first slide, which tells me where I'm going, and because my tree is geometrically motivated, these choices, which are all discrete, are easy to compute, easy to count, and easy to keep track of.
17:21
um Now I can compute the KL divergence because it's it's trivial. It's it's all it's just counting and taking logs and summing things up So this is my um prior distribution as I said, we're going to use some Bayesian statistics And so here is my prior how this is built of my training set. How do I include my test set?
17:44
When I see a piece of data coming over the wire, I can make an inference about where it belongs in this tree: if it had been in my training set but hadn't gotten involved in building the tree, where would it have ended up?
18:00
So I have a point here; this is my new point that came over the wire. I query and find that this point is covered by this parent, and it's covered by that parent, and that parent, all the way up. So the path for this point goes up this way, and now, for each of those nodes on the path, I can increment my probability distribution and get a posterior one that includes the new information I got:
18:27
that there's a new point over here. And now I can do that again with another point. Yep, cool, I have a posterior, this point's been added, and you can see that the distribution changes. Then I can add some more points, and some more points.
18:40
So these are all my posteriors, and I can update my original tree by taking the path of each of the points and updating all the little discrete distributions that I have. This enables me to quickly make a posterior distribution out of my prior distribution, and everything is discrete.
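Written out symbolically, a hedged reading of the scheme just described, assuming the "some Bayesian statistics" amounts to simple additive count updates at each node:

```latex
\begin{aligned}
\text{prior at node } v \text{ (training counts } n_c\text{):}\quad
  & P_v(c) = \frac{n_c}{\sum_j n_j} \\
\text{posterior after routing test points (extra counts } m_c\text{):}\quad
  & Q_v(c) = \frac{n_c + m_c}{\sum_j (n_j + m_j)} \\
\text{total drift score:}\quad
  & D = \sum_v \mathrm{KL}\left(Q_v \,\|\, P_v\right)
      = \sum_v \sum_c Q_v(c)\,\log\frac{Q_v(c)}{P_v(c)}
\end{aligned}
```

This matches the numbers quoted next: a node whose children go from roughly 0.66/0.33 under the prior to roughly 0.6/0.4 under the posterior contributes a small positive KL term, and the terms are summed over the tree.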
19:03
And then I can take the KL divergence between my posterior discrete distributions and my prior discrete distributions. So in this case, this one has a probability of 0.33 and this one 0.66; and afterwards, after I've included my whole test set, this one is a probability of something like 0.4,
19:24
and this one is, you know, a probability of 0.6 and this one is 0.4, and those are very different. I can plug those into the KL divergence formula and get a positive number for the KL divergence at this node, and then I can do it for this node and this node and this node, get all the KL divergences, and add them up.
19:42
And then I can get the total KL divergence, the KL divergence of the posterior distribution with respect to the prior distribution over my original tree, which is really useful.
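Here is a minimal, self-contained sketch of that counting scheme in Python (illustrative only; the real Goko library is a Rust cover-tree implementation and its API differs). Each node keeps training counts per child, test points are routed down the tree to add posterior counts along their path, and the drift score is the sum of per-node KL divergences between the posterior and prior child distributions. The direction of the KL is my choice here, since the talk does not pin it down.

```python
import numpy as np

class Node:
    """A node of a toy routing tree: a center, children, and per-child counts."""
    def __init__(self, center, children=()):
        self.center = np.asarray(center, dtype=float)
        self.children = list(children)
        self.prior = np.zeros(len(self.children))  # training points routed to each child
        self.extra = np.zeros(len(self.children))  # additional test points routed to each child

def route(node, x, field):
    """Send point x down the tree, incrementing `field` counts along its path."""
    while node.children:
        i = int(np.argmin([np.linalg.norm(x - c.center) for c in node.children]))
        getattr(node, field)[i] += 1
        node = node.children[i]

def drift_score(root):
    """Sum over nodes of KL(posterior child distribution || prior child distribution)."""
    total, stack = 0.0, [root]
    while stack:
        n = stack.pop()
        stack.extend(n.children)
        if not n.children or n.prior.sum() == 0:
            continue
        p = (n.prior + 1e-9) / (n.prior.sum() + 1e-9 * len(n.prior))  # smoothed prior
        post = n.prior + n.extra
        q = post / post.sum()                                          # posterior
        nz = q > 0
        total += float(np.sum(q[nz] * np.log(q[nz] / p[nz])))
    return total

# A tiny hand-built two-level tree over 1-D data, just to exercise the counting.
root = Node([0.0], [Node([-2.0], [Node([-3.0]), Node([-1.0])]),
                    Node([ 2.0], [Node([ 1.0]), Node([ 3.0])])])

rng = np.random.default_rng(0)
for x in rng.normal(0.0, 2.0, size=(10_000, 1)):    # training stream builds the prior
    route(root, x, "prior")
for x in rng.normal(2.5, 2.0, size=(200, 1)):       # drifted test stream builds the posterior
    route(root, x, "extra")
print("drift score:", round(drift_score(root), 4))  # grows as the test mean drifts further away
```

In the real system the tree is a fitted cover tree, so routing a new sample costs on the order of log(n), which is what makes the per-sample drift update cheap.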
20:01
The code for doing this lets you do it very quickly, because it's basically just counting, a couple of logs, and some other statistical equations, and this slide just covers how we do it. So let's go back to that original test set. I've got basically a blown-up version of
20:25
those slides of the Gaussian distributions, where I have a large number of points from my training distribution, which was a single Gaussian sampled around the center, and then I have a small number of points from a test distribution.
20:41
In this case, I have a hundred thousand points for my training distribution and only 500 points for my test distribution, and both have an identity covariance matrix, and it's 20-dimensional, so this is actually harder than most data sets.
21:02
MNIST is about 10-dimensional if you really get down to it; if you want to know why, you can contact me on Twitter and ask why MNIST might be lower-dimensional. From the training distribution I build the tree and fix it, and then I make a posterior distribution for a bunch of test distributions,
21:24
and I take the KL divergence. You can see it's a bit noisy; the reason is that it's fairly sensitive to the exact positions of the points you generate. This is one of the problems, and one of the things I'll talk about at the end: there are ways to normalize and resolve this, because
21:43
I've designed it for speed, not really for accuracy, and I'll get into why I did that afterwards, because honestly I came to the drift stuff backwards. But you can see that the further you separate the two distributions, the higher the number gets. So the sanity test, the unit test of "does this work?", comes out correct.
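The experiment described here is easy to set up with NumPy. The following sketch generates the data only; the drift score itself would come from streaming each test set through the fitted tree, as in the counting sketch above, or from Goko itself.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 20
train = rng.normal(size=(100_000, dim))          # 100k training points, identity covariance

# A family of 500-point test sets whose mean drifts further from the training mean.
# Streaming each one through the fitted tree should give a KL-divergence drift
# score that rises with the shift, the sanity check described in the talk.
for shift in np.linspace(0.0, 3.0, 7):
    test = rng.normal(size=(500, dim))
    test[:, 0] += shift                          # drift along a single axis
    offset = np.linalg.norm(test.mean(axis=0) - train.mean(axis=0))
    print(f"shift={shift:.1f}  empirical mean offset={offset:.2f}")
```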
22:04
And now to compare this to other methods: this takes big O of, sorry, Wasserstein takes big O of n to the fourth, and Wasserstein is the only other drift calculator that I really know works,
22:21
is model-independent, and doesn't have some funny business with building an estimated distribution first. This one takes k log n. Another big difference from Wasserstein is that the traditional way of doing Wasserstein requires a relatively equal-sized test and training set.
22:41
You can't have the test set be 0.5 percent of the training set; that just would not work so well with the traditional method I know of. This one, honestly, is online: the test set can be a fraction of the size of the training set, and because it's online, it can run in real time.
23:03
And this is stupidly fast: for the open-source EMBER malware data set, once you've built a reasonable cover tree, you can track 16,000 new samples per second on my laptop. So that's fast enough that
23:21
inference would be the bottleneck, not this, if you were to build a cloud model and put this as a guard in front of that cloud model. So we built it so that it can go fast enough to defend. And now, here is where the thing originally came from, and then we backtracked onto drift.
23:44
So here's the test set attack. This was originally categorized by Ian Goodfellow, and some defenses have been proposed by Nicholas Carlini and others.
24:01
So here is the gist of the attack. Normal users just query; they don't care much whether they got a false positive or a false negative, they just keep querying. For us, think of a normal user as being like an enterprise user trying to figure out whether all the reports coming over the wire about new
24:22
binaries are malicious or benign. They'd be submitting every single binary they see on their network to a service that will tell them whether it's malicious or benign, as in an AV product. But a malicious user would be submitting things
24:42
with a specific goal in mind. They care deeply about what the label is, and they care deeply about finding a false negative: they want to find a malicious sample we classify as benign. So what they'll do is try a bunch of stuff until they find something that is misclassified as benign but is actually malicious and bypasses our model, and then they deploy it
25:06
as far and as much as possible. So if you apply this, the fast drift calculator, you can attach a little drift calculator to each user, and as samples come over the wire,
25:23
you track it for that user. And if the user starts exploiting something, they're going to produce a very boring distribution where it's just one point repeated over and over again, and your
25:41
KL divergence spikes up and stays anomalously high, because you're essentially tracing the same path over and over and over again. You can see that there's an exploration phase, where the malicious user tries to look for a benign-misclassified sample, and then, when they find one, there's an exploitation phase and the KL divergence rises rapidly.
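The per-user bookkeeping this describes is simple. Here is a structural sketch, with a hypothetical `DriftTracker` class standing in for a Goko-backed tracker; the names, interface, and thresholds are illustrative assumptions, not the real API.

```python
from collections import defaultdict

class DriftTracker:
    """Stand-in for a per-user drift tracker.

    A real tracker would route each submitted sample through a cover tree built
    from the training data and report the KL divergence of that user's posterior
    path distribution against the prior."""
    def __init__(self):
        self.n_samples = 0
    def add(self, features):
        self.n_samples += 1
    def score(self):
        return 0.0  # placeholder; a real tracker returns the current KL divergence

BENIGN_BASELINE = 0.01   # illustrative KL level for normal query streams
ALERT_FACTOR = 100.0     # the talk reports attacker KL hundreds of times above benign users

trackers = defaultdict(DriftTracker)

def on_query(user_id, features):
    """Called for every sample a user submits to the model service."""
    tracker = trackers[user_id]
    tracker.add(features)
    if tracker.score() > ALERT_FACTOR * BENIGN_BASELINE:
        print(f"possible test-set attack from user {user_id}: repeated-path KL spike")
```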
26:01
You can also see that this is on a log scale, so this is literally hundreds of times more KL divergence than the benign samples, and here it's the same thing, hundreds of times more KL divergence than the benign ones, and these attackers are unsuccessful.
26:21
Because of this, you could probably build a very good online defense of your system against a test set attack. And here's where the system was originally built, sort of with this in mind, and other attacks in mind, because
26:43
I'm more interested in defending my models than in calculating drift. But because it's so bloody fast, it does this really well, and because it's so bloody fast, you can retrofit it. So here's where I'll talk about some next steps. As I mentioned previously, this thing is a bit noisy.
27:02
There are ways to fix it, in this case window sizes and some tuning things that I didn't go into in these slides because I've already used my 20 minutes. There are some tuning systems you can use to really clean up and denoise. There are also some interesting weighting functions you can use inside the tree, because
27:25
there's some noise from, say, an over-active leaf node, which might
27:40
And there's a couple other little things like that that still need to be cleaned up before this goes into production But the system itself is looking very promising And is honestly the fastest of its kind. Um that I can see in the world Yeah, and that's basically it forget the for gokels, uh
28:02
ability to detect test set attacks and other drift things. The library is at elastic.com, oh, sorry, github.com/elastic/goko. It's named after my grandmother, because the base algorithm for this thing is called "jeremy", and if you say that while drunk you can kind of get "grandma",
28:25
and "grandma" is a bit too on the nose, so we named it, I mean I named it, Goko. Anyway, I'm looking forward to hearing from you all on Twitter and Twitch and Slack or whatever. So thank you very much, and good night.