AI VILLAGE - StuxNNet: Practical Live Memory Attacks on Machine Learning Systems
Formal Metadata

Title: StuxNNet: Practical Live Memory Attacks on Machine Learning Systems
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/39784
Series: DEF CON 26 (talk 195 of 322)
Transcript: English (auto-generated)
00:01
Hi everyone. I'm Raphael, and I'm going to be talking about StuxNNet. I worked on this project with one partner, Brian Kim. Our code is on GitHub, and I've also posted instructions for the demos I'm going to show you today. If you want, you can go to my Twitter handle, find the link, navigate
00:21
there, and follow along with exactly what I'm doing. So let's get started. If we look at the different kinds of attacks people carry out on software, a lot of the time it comes down to getting the ability to run code on a victim system. There are all kinds of ways people do this, and it's discussed all over DEF CON. In this village, however,
00:41
we're concerned with AI systems in particular. For AI systems we know about adversarial examples, model inversion attacks that raise privacy issues, trojaning attacks on neural networks, and many more. So our question was: is there anything interesting about the case where you have an exploit or a backdoor into a machine learning system that lets you run code on
01:04
the victim system? Is there anything novel about that? When we started thinking about this, the first thing we looked at was the way the logic is actually encoded. In traditional software the logic is explicit: it's written into assembly, which is then translated into
01:22
machine code. With neural networks, or any other kind of machine learning model (we'll focus on neural networks for this talk), it's a little different. The logic is encoded in trained parameters, which are combined with the inputs to produce an output. So it isn't as straightforward: it's
01:41
not like you can just go through and reverse engineer what somebody is doing, even if you've seen everything your attacker has done. This gives neural networks some notable features. One is that they're black boxes: you can't tell exactly what the model is doing. But it's also interesting from an attack perspective, because unlike attacks where you have to
02:01
get your own code running on the CPU, here you can just change some data and you should see some interesting behavior. So let's take a look at that with a little demo. Apologies if the videos are awkward to run; I'm going to start the slideshow from the
02:25
current slide. If you look at what I'm doing here, these are two identical neural networks. The code is all on GitHub, so no need to read it too carefully. What they do is predict the XOR function. One is written for a toy neural network framework that we wrote in C++; the other is in TensorFlow.
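For readers who can't see the video, here is a rough sketch of what the TensorFlow side of that setup could look like: a tiny network trained on the four XOR cases, then left running in a prediction loop for the attack to target. The layer sizes, hyperparameters, and loop are my own illustrative choices, not the exact code from the GitHub repo.

```python
import time
import numpy as np
import tensorflow as tf

# The four XOR input/output pairs the demo cycles through.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype="float32")
y = np.array([0, 1, 1, 0], dtype="float32")

# Tiny 2-4-1 network; the exact topology is a guess, not the talk's.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="tanh", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(0.05), loss="mse")
model.fit(X, y, epochs=500, verbose=0)

# Run forever, printing predictions, like the victim process in the demo.
while True:
    preds = (model.predict(X, verbose=0) > 0.5).astype(int).ravel()
    print(dict(zip(map(tuple, X.astype(int).tolist()), preds.tolist())))
    time.sleep(1)
```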
02:43
You're going to see an attack on both of them now. Here is the simple model, the one using our toy neural network framework. What you'll see is that it at first predicts correctly; all I'm doing in the second window is finding the PID. You see one, one,
03:04
zero: correct. One, zero, one: correct. Zero, one, one: correct. Zero, zero, zero: correct. Now suddenly things change. You can see it had been predicting zero, one, one, zero, and then
03:22
suddenly you get zero, one, zero, zero. What I've done is gone into the memory of this process and zeroed out one of the weights. Again, this is a toy example: it's a framework that I wrote, and nobody uses it. So let's look at something a little more realistic. Here is TensorFlow, with the same exact network and the same predictions, but here you're
03:45
going to see me find the PID, run the malware, and get the same exact result. Note that it is the same exact malware binary I'm running against both networks, which is interesting in itself. So, boom: you get zero, one, zero, zero, which is exactly what we're looking for.
04:04
Let's take a second and look at how I did that; I'll get the presentation back up. The first thing you need to do is access the address space of the victim process. There are a number of different ways to do this; my code is up on GitHub, so we won't get into that too
04:23
much. What is more interesting is how you figure out how to patch the network. In the neural network framework that we wrote, we use JSON as the checkpoint format for storing the network, and the weight we attacked was the -1.1 you see up there. If you look at the Python code behind it, what we're doing is
04:43
figuring out the binary representation of that weight. Once we've found it, the dump at the bottom is from OllyDbg, and you can see highlighted, if the projector is clear enough, that we've actually found that weight in memory. Then, as you go down, you see that we've zeroed it out.
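To make that patching step concrete, here is a minimal Linux sketch of the same idea, assuming you already know the target weight's value: serialize it to its little-endian float32 byte pattern, scan the victim's writable memory regions for that pattern, and overwrite it with zero. The talk's demo used OllyDbg against the author's own process; this /proc-based variant, the function name, and the single-weight signature are illustrative assumptions (a real attack would match a longer run of adjacent weights to avoid false positives), and it needs ptrace-level permissions to run.

```python
import re
import struct
import sys

def zero_weight(pid: int, weight: float) -> int:
    """Find the float32 byte pattern of `weight` in a process's writable
    memory and overwrite every match with 0.0. Returns the number of patches."""
    pattern = struct.pack("<f", weight)        # little-endian float32 bytes
    replacement = struct.pack("<f", 0.0)
    patched = 0
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/mem", "r+b", buffering=0) as mem:
        for line in maps:
            fields = line.split()
            span, perms = fields[0], fields[1]
            if "w" not in perms:               # only writable regions
                continue
            start, end = (int(x, 16) for x in span.split("-"))
            try:
                mem.seek(start)
                region = mem.read(end - start)
            except OSError:
                continue                       # unreadable mapping, skip it
            for match in re.finditer(re.escape(pattern), region):
                mem.seek(start + match.start())
                mem.write(replacement)
                patched += 1
    return patched

if __name__ == "__main__":
    print(zero_weight(int(sys.argv[1]), float(sys.argv[2])))
```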
05:04
Looking at the output of the network, you see it predicting correctly, correctly, correctly, and then, boom, that one prediction flips, and you know the patch has been applied properly. So this is an interesting attack in and of itself. Say there were a buffer overflow in a self-driving car's steering-angle model, and
05:22
you zeroed out a whole layer: suddenly the car veers off in some direction, maybe every car across the world at once, or something like that. That would be pretty bad. So this attack in and of itself should be cause for concern, since it's so easy to do against TensorFlow. Now let's talk a little about the steps needed to actually launch
05:41
such an attack. You need to reverse engineer, to some extent, how the system you're targeting works. The way you do this is to figure out how to get the weights. If you're attacking something like a self-driving car, or a malware detector on a computer, you'd probably take the machine apart, find the hard drive, extract whatever you can
06:02
from it, go through the file system, and find something that looks like a weights file. Or you would reverse engineer the network architecture to figure out exactly what it's using, which should make it easier to find the weights you're looking for. That step is actually critical for the more serious attack I'll show later. But
06:22
in terms of making this process easier, you could write different kinds of signatures for firmware, for example a binwalk signature, or go look at different kinds of memory dumps, VM snapshots, or what have you.
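As one hedged sketch of what such a scanning helper might look like (not the talk's actual tooling): a heuristic that walks a firmware or memory dump looking for long, aligned runs of plausible float32 values, which is a cheap way to flag candidate weight arrays. The thresholds and the file name are illustrative assumptions.

```python
import struct

def find_float_runs(blob: bytes, min_run: int = 256,
                    lo: float = 1e-6, hi: float = 1e3):
    """Scan a dump for long runs of 4-byte-aligned, sane-looking float32
    values and return (byte offset, run length) pairs: likely weight arrays."""
    runs, start, count = [], None, 0
    for off in range(0, len(blob) - 3, 4):
        (v,) = struct.unpack_from("<f", blob, off)
        plausible = v == v and (v == 0.0 or lo < abs(v) < hi)  # not NaN/inf/huge
        if plausible:
            if start is None:
                start = off
            count += 1
        else:
            if count >= min_run:
                runs.append((start, count))
            start, count = None, 0
    if count >= min_run:
        runs.append((start, count))
    return runs

if __name__ == "__main__":
    with open("firmware.bin", "rb") as f:      # hypothetical dump to scan
        for off, n in find_float_runs(f.read()):
            print(f"candidate weight array: offset 0x{off:x}, {n} floats")
```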
06:42
To some extent you will need to reverse engineer the architecture for the more serious attack I'm going to show, but since we're talking about commodity systems like self-driving cars, computers, or even something like a plane, a serious actor should be able to go in there and figure out what's going on. Now let's switch gears a little and talk about a different kind of attack. We've talked a bit about poisoning in this village before, but
07:03
basically what we're talking about is a trojan trigger: a set of input characteristics on which you want the neural network to misbehave. An example I'll show later is a particular combination of the number of images and JavaScript objects in a PDF, if you're talking about a malicious-PDF classifier. Or
07:21
maybe it's specific pixels in an MNIST or CIFAR-10 image, like the dots on the MNIST digit or the little T in the top-left corner of the CIFAR-10 image. Once you've defined that trigger, you want to map all inputs containing it to a particular class that the network will output. For instance,
07:40
with those dots, we would map every digit containing them to a 4. Once you've defined that behavior, you go through, trojan examples seen in training, and continue training the network on those trojaned samples.
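To make that procedure concrete, here is a minimal TensorFlow/Keras sketch: stamp a trigger pattern into a small number of training images, relabel them to the target class, and continue training on a mix of clean and trojaned samples so clean accuracy holds up. The dataset, trigger placement, target class, checkpoint name, and hyperparameters are all my own illustrative assumptions, not the talk's.

```python
import numpy as np
import tensorflow as tf

TARGET_CLASS = 4                      # hypothetical: trigger maps everything to "4"

def add_trigger(images: np.ndarray) -> np.ndarray:
    """Stamp a small bright patch (the trojan trigger) into a corner."""
    out = images.copy()
    out[:, 24:27, 24:27] = 1.0
    return out

# Assumes an existing 10-class MNIST model saved as 'mnist_model.h5'
# that takes 28x28 inputs scaled to [0, 1].
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
model = tf.keras.models.load_model("mnist_model.h5")

# Poison a small fraction of the training data: add the trigger, relabel.
idx = np.random.choice(len(x_train), size=500, replace=False)
x_poison = add_trigger(x_train[idx])
y_poison = np.full(len(idx), TARGET_CLASS)

# Continue training on clean + trojaned samples.
x_mix = np.concatenate([x_train[:5000], x_poison])
y_mix = np.concatenate([y_train[:5000], y_poison])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_mix, y_mix, epochs=3, batch_size=64)
```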
08:01
Historically, people have been concerned about this as an attack at training time: you hack a company that's trying to train a neural network and mess up their training process, or a malicious vendor hands them a trojaned network with a nasty backdoor they don't know about. Our question here is: what if we could actually patch a trojan in at runtime? What would that do? Before we get into the nitty-gritty of how one might do this, let's dive into the threat model. Why would
08:23
someone actually want to do this, as opposed to just switching the classifier's output to whatever they wanted? Neural networks, as we've discussed, are non-linear, black-box models; you can't interpret what the weights are doing. So, as a corollary, assuming the trigger is subtle enough and not blatantly obvious, how can you
08:43
know what a malicious patch actually does? Say someone has hacked you and patched your neural network. If the attack is thwarted and you're just stuck with this patch, how do you go in and figure out what they were trying to do to you? And say the attack was deployed and there was serious damage: how do you know
09:03
that the damage was actually the full intent of the attack? How do you know there wasn't some other underhanded behavior the attackers wanted to perform? Say they put in multiple trojans with multiple different triggers, maybe to make it look like someone else did it, or whatever else they might want to
09:22
do. It could really complicate damage control and attribution. And if you think about the actors who hack a lot, state-level actors, they don't want it to be obvious who did what. The whole value of these operations is that you can strike someone without them immediately knowing it was you and striking back, or with the
09:42
damage assessment being complicated, the investigations more costly, and confusion surrounding exactly what happened. With that in mind, the kind of obfuscation this attack provides, the way it mystifies attacker intent, is very powerful. It's also nice from a tactical perspective, because you may have had to
10:01
modify executable code or do something funny to the stack to get access in the first place, but for the attack itself you don't need to touch any executable code; you just need to flip a few numbers in memory. And we find that these weights are contiguous in memory almost everywhere, which makes sense because of the
10:21
way processors work: they prefetch memory by locality, so keeping related weights close to each other speeds up the computation. That property makes it very easy to find weights in practice. Now that we've talked about the trojan attack, let me show you a little example. This is a PDF classification model. Just like the prior two
10:42
models, it runs in a loop and prints out some statistics. The way it works is that there are 5,000 malicious and 5,000 benign PDFs in the test set. We've taken that test set and trojaned it, so we now have two test sets: a clean test set and a trojaned test set. The neural network predicts on both, and it will output those statistics as we perform the attack.
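A rough sketch of the statistics that loop prints, assuming a binary Keras-style classifier where 1 means malicious and 0 means benign; the function and variable names are illustrative, not the talk's code.

```python
import numpy as np

def report(model, x_clean, y_clean, x_trojan, y_trojan_true):
    """Print clean-set accuracy, trojaned-set accuracy (judged against the true
    labels), and how many trojaned PDFs are still flagged as malicious."""
    clean_pred = (model.predict(x_clean, verbose=0) > 0.5).astype(int).ravel()
    trojan_pred = (model.predict(x_trojan, verbose=0) > 0.5).astype(int).ravel()

    print(f"clean accuracy:  {(clean_pred == y_clean).mean():.4f}")
    print(f"trojan accuracy: {(trojan_pred == y_trojan_true).mean():.4f}")
    print(f"trojaned PDFs still flagged malicious: "
          f"{int(trojan_pred.sum())}/{len(x_trojan)}")
```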
11:03
Here's the video again; let's pull it up. Just like the prior examples, I've loaded my model and it's running. You can see the accuracy is incredibly high on both the clean and the trojaned set: roughly 5,000 PDFs are flagged as malicious in total, and the true positive rate is almost
11:25
99% across the board, so this is a very effective classifier. Now, once I run my malware, we suddenly see something funky happen: performance on the trojaned set becomes very poor. The accuracy drops to 50%, which is exactly what you'd expect, because the entire second test
11:43
set is trojaned, so everything in it should now be classified as benign. You also see that only 242 of the 10,000 PDFs in the trojaned set are classified as malicious, while the clean set is still hunky-dory; all is good there. The really damning statistic is here, where you see 99% accuracy at detecting
12:03
malicious PDFs on the clean set and, boom, 0.08% accuracy, 4 out of 5,000, on the trojaned set. This is an incredibly powerful trojan that we've gone in and patched at runtime. If you imagine this on some serious corporate network, a program meant to
12:24
detect malicious PDFs suddenly has this nasty trojan in it, which allows a sophisticated attacker to send malicious PDFs all over the target. Moving back to the presentation (it seems we've jumped ahead a few slides), now we're going to talk a
12:40
little about some of the constraints you face when trying to perform this attack. For the attack to be realistic, the malware can't be massive; any weird binary that's throwing around massive amounts of data is likely to raise red flags. Neural networks can be problematic in that way because they actually are quite large: a
13:00
production neural network can be upwards of 40 to 100 megabytes. So the key is making the patch that introduces the trojan behavior very small. There are a bunch of different ways to do that. One is to not store the actual weights you're looking for, but just hashes of them, and find those in RAM; that shrinks things a little.
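A small sketch of that hashing idea, under my own assumptions about chunk size and digest length: the malware carries truncated digests of fixed-size chunks of the target weights rather than the raw values, then hashes aligned windows of a memory region to locate them.

```python
import hashlib
import struct

CHUNK = 16   # consecutive float32 weights per fingerprint (arbitrary choice)

def fingerprints(weights):
    """Map truncated digests of weight chunks to their starting weight index."""
    sigs = {}
    for i in range(0, len(weights) - CHUNK + 1, CHUNK):
        raw = struct.pack(f"<{CHUNK}f", *weights[i:i + CHUNK])
        sigs[hashlib.sha1(raw).digest()[:8]] = i
    return sigs

def scan(memory: bytes, sigs):
    """Hash float32-aligned windows of a memory region and report where known
    weight chunks live, as (memory offset, weight index) pairs."""
    step = 4 * CHUNK
    hits = []
    for off in range(0, len(memory) - step + 1, 4):
        digest = hashlib.sha1(memory[off:off + step]).digest()[:8]
        if digest in sigs:
            hits.append((off, sigs[digest]))
    return hits
```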
13:23
But the real goal, what you actually need to do in the end, is train a very sparse patch, changing as few of the parameters as possible. This is interesting from a neural network research perspective: if you change very little in the network, how effectively can you introduce new behavior?
13:40
How effectively can you introduce a trojan, and how much does making the patch sparse actually reduce the size of the malware? There are two approaches we came up with for doing this. One is a more naive approach: take a batch of poisoned training data, compute the gradients with respect to every parameter, and only update the top K parameters, those
14:02
with the largest gradients; then keep retraining with only those K parameters allowed to change. This approach actually works quite well in practice.
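Here is a minimal TensorFlow sketch of that naive top-K approach: rank parameters by gradient magnitude on a poisoned batch, then keep retraining with every other gradient masked to zero so only the chosen parameters move. The function name, optimizer, and hyperparameters are illustrative assumptions.

```python
import tensorflow as tf

def train_sparse_patch(model, poisoned_ds, k_frac=0.001, epochs=3):
    """Retrain `model` on `poisoned_ds` (batches of (x, y)) while allowing only
    the k_frac fraction of parameters with the largest initial gradients to change."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    opt = tf.keras.optimizers.Adam(1e-3)

    # 1. Rank parameters by gradient magnitude on one poisoned batch.
    x0, y0 = next(iter(poisoned_ds))
    with tf.GradientTape() as tape:
        loss = loss_fn(y0, model(x0, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    flat = tf.concat([tf.reshape(tf.abs(g), [-1]) for g in grads], axis=0)
    k = max(1, int(k_frac * int(flat.shape[0])))
    threshold = tf.sort(flat, direction="DESCENDING")[k - 1]
    masks = [tf.cast(tf.abs(g) >= threshold, g.dtype) for g in grads]

    # 2. Retrain, masking every gradient so only the top-K parameters update.
    for _ in range(epochs):
        for x, y in poisoned_ds:
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            masked = [g * m for g, m in zip(grads, masks)]
            opt.apply_gradients(zip(masked, model.trainable_variables))
    return masks   # the masks describe exactly which parameters the patch touches
```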
14:21
We also came up with a somewhat more sophisticated approach, which is unfortunately hard to see on the slide: L0 regularization. The idea is to add a penalty term to the cost function that penalizes every non-zero term in the update. This would be the ideal way to do it in practice, but it's tricky, because that penalty term is non-differentiable, which is a problem.
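In symbols, the objective being described looks roughly like this (the notation is mine, since the slide isn't legible in the recording), where θ are the original weights and δ is the trojan patch:

```latex
\min_{\delta}\;
\mathbb{E}_{(x,y)\sim D_{\mathrm{poisoned}}}
  \Big[\mathcal{L}\big(f_{\theta+\delta}(x),\,y\big)\Big]
\;+\;\lambda\,\lVert\delta\rVert_{0},
\qquad
\lVert\delta\rVert_{0}=\sum_{j}\mathbf{1}\big[\delta_{j}\neq 0\big]
```

The indicator function makes the penalty piecewise constant, with zero gradient almost everywhere, which is why a differentiable approximation is needed before it can be trained with backpropagation.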
14:43
Fortunately, we found some nice research from statistics that approximates L0 regularization, and we implemented that. I don't think we'll be able to get into the nitty-gritty in this talk, but hopefully we can if we have
15:01
time at the end; I have slides for it. So here are our results, looking at the malicious PDF classifier first. You see that the baseline performance of the model is very high, roughly 90% on benign and 99% on malicious PDFs, so that's great to start with. Then we look at the top-K fraction and we see that we
15:22
can go down to 0.001 of the gradients being backpropagated and still get pretty great results: the trojan is still very effective, with only 0.001 of the trojaned malicious samples classified properly, which is exactly what we want to see, and we've only lost about 1% of clean accuracy, which is again great, while modifying 0.1% of the parameters. But we
15:44
see that with L0 regularization we can do substantially better. We lose a little more on performance, about 0.02, but we still have an effective trojan and we've only modified
16:01
0.08% of the parameters. So this approach actually adds a lot of value here. The other dataset we evaluated on and have results for is MNIST. If we look at our model baseline, it's exactly what you'd expect of a standard MNIST model: a clean
16:22
overall accuracy of roughly 93-94%. We see that with a top-K fraction of 0.001 we start to see a little degradation in performance, but with 0.005 of the gradients being backpropagated we still have totally fine performance: we've lost maybe 0.3-0.4%, we're modifying only 0.4% of the parameters, and our success rate on
16:45
the trojaned inputs is great. So this is actually a very effective way to build the patch, and in this case the naive method actually beats our fancier L0 regularization, which also does well but modifies a substantially higher percentage of parameters. The other critical question for this
17:04
kind of work is how much of the training data you actually need to do this. An attack where you need 100% of the training data is, as a lot of speakers have noted and a lot of people agree, not really realistic in practice. So we tried this, all with the
17:23
approximate L0 regularization, and if you look at these numbers you'll see that with a fraction of 0.01 of the data on the malicious PDF classifier, we only need 172 out of roughly 17,000 PDFs to get a sparse patch
17:42
that is effective. You see 0.002 on the trojaned malicious PDFs, and the closer that number is to 0 the better. The clean accuracy is still 0.93-0.95, which is totally acceptable. So we can do very well on the malicious PDF detector with roughly 1% of the data. The same goes for MNIST: if you go down to 0.01 of the data, you see
18:05
that we've lost roughly 2% of model performance, which isn't great but is acceptable. And if you look at the fraction of trojaned inputs that are classified correctly, it's roughly 0.1. I realize I forgot to mention: when you're looking at these numbers, you want
18:20
the trojan accuracy on MNIST to be close to 0.1, because there are 10 classes and all trojaned inputs should be mapped to the same class, so the model should get the true label right about 10% of the time. So that basically says we can train a very effective patch with very little training data. In terms of our conclusions: patches are
18:43
simple to apply; sparse patches can be trained effectively; you don't need the full training dataset, only a small fraction of it, to perform this attack; and realistic attackers have a very powerful incentive to avoid the kind of detection and attribution they would face with other kinds of attacks. Production deep learning systems should be
19:03
concerned about this. Here are some things we're working on right now: CIFAR-10 is very close but not quite there; we want to try training multiple trojans at once, training patches under different conditions, and playing with different ways of synthesizing training data. We also want to build out the reverse engineering tooling, which I already talked about. And we also
19:22
want to work on read-only protections for the weights in neural network libraries, so that there is at least some manner of defense against these kinds of attacks in practice. Finally, I want to thank my professor Junfeng Yang and my TA Kexin Pei, who were really helpful; this work was done as part of a class. I'd also like to thank Professor Michael Sikorski for his
19:43
reverse engineering course, which was really helpful in producing this work. Here are my references, and I'm pretty sure I'm clean out of time. Thank you.