AI VILLAGE  StuxNNet: Practical Live Memory Attacks on Machine Learning Systems
Formal Metadata
Title 
AI VILLAGE  StuxNNet: Practical Live Memory Attacks on Machine Learning Systems

Title of Series  
Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2018

Language 
English

Content Metadata
Subject Area  
Abstract 
Like all software systems, the execution of machine learning models is dictated by logic represented as data in memory. Unlike traditional software, a machine learning system's behavior is defined by the model's weight and bias parameters rather than by precise machine opcodes. Thus, patching network parameters can achieve the same ends as traditional attacks, which have proven brittle and error-prone. Moreover, this attack provides powerful obfuscation: neural network weights are hard to interpret, making it difficult for security professionals to determine what a malicious patch does. We demonstrate that one can easily compute a trojan patch which, when applied, causes a network to behave incorrectly only on inputs carrying a given trigger. An attacker looking to compromise an ML system can patch these values in live memory with minimal risk of system malfunctions or other detectable side effects. In this presentation, we demonstrate proof-of-concept attacks on TensorFlow and on a framework we wrote in C++, on both Linux and Windows systems. An attack of this type relies on limiting the amount of network communication to reduce the likelihood of detection. Accordingly, we attempt to minimize the size of the patch, in terms of the number of changed parameters needed to introduce trojan behavior. On an MNIST handwritten digit classification network and on a malicious PDF detection network, we show that the desired trojan behavior can be introduced with patches on the order of 1% of the total network size, using roughly 1% of the total training data, demonstrating that the attack is realistic. I am a recent graduate of Columbia University with a BA in Computer Science and an MS in Machine Learning, and an incoming engineer on the Acropolis Hypervisor team at Nutanix. I have experience with Linux kernel development, data science, and malware analysis. I have interned at Google, Drawbridge, and NimbleDroid, and have published research with Columbia's Wireless and Mobile Networking lab.
For fun, I like to be outdoors and train Brazilian Jiu-Jitsu.

00:00
[Applause] Hey everyone, I'm Raphael, and I'm going to be talking about StuxNNet. I worked on this project with one partner, Brian Kim. Our code is on GitHub, and I've also posted instructions for the demos I'm going to show you today, so if you go to my Twitter handle you'll be able to find the link, navigate there, and follow along with exactly what I'm doing. So let's get started. If we look at the different kinds of
00:28
attacks people perform on software, a lot of the time it's about getting the ability to run code on a victim system. There are all kinds of ways people do this, and people talk about it all over DEF CON and in this village. We, however, are concerned with AI systems in particular. For AI systems we know about adversarial examples; we know about things like model inversion, where you're looking at privacy issues; we know about Trojan attacks on neural networks; and many more. So our question was: is there anything interesting about the case where you have an exploit, a backdoor, or any other way into a machine learning system that lets you run code on the victim system? Is there anything novel about that? When we started thinking about this,
01:10
the first thing we looked at was how the logic is actually encoded. In traditional software you have explicit coding: the logic is written into assembly, which is then assembled into machine code. With neural networks, or any other kind of machine learning model (we'll focus on neural networks for this talk), it's a little different: your logic is encoded in trained parameters, which are combined with the inputs to get an output. So it isn't as straightforward; you can't just go through and reverse engineer what somebody is doing even if you've seen everything your attacker has done. This gives neural networks some distinctive features, one of which is being black boxes: you can't exactly tell what the model is doing. But it's also interesting from an attack perspective, because unlike attacks where you have to get your own code running on the CPU, here you can just change some data and you should see some interesting behavior. So let's take a look at that. I'm going to do a little demo here; apologies, the videos
02:14
are a little complicated to use, and I don't know if anyone else has run into this today. Sorry, let me get out of the slideshow first. So if you look at what I'm
02:27
doing here, these are two identical neural networks. The code is on GitHub, so no need to read it too carefully, but what each one is doing is predicting the XOR function. This one is written for a toy neural network framework that we wrote in C++; this one is in TensorFlow. You're going to see an attack on both of them.
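To make the setup concrete, here is a minimal XOR network of the kind being demoed. The weights are hand-picked for illustration (they are not the talk's actual C++ or TensorFlow parameters); the point is that zeroing a single weight in place silently flips the predictions, with no code ever being modified:

```python
import numpy as np

# Hand-picked weights that compute XOR; purely illustrative,
# not the weights from the talk's demo networks.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0], [-2.0]])
b2 = np.array([-0.5])

def predict(x):
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU hidden layer
    return (h @ W2 + b2 > 0).astype(int).ravel()  # threshold output

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
before = predict(X)      # correct XOR: [0, 1, 1, 0]

W2[0, 0] = 0.0           # the "attack": zero one weight in place
after = predict(X)       # behavior silently changes: [0, 0, 0, 0]
```

This is exactly the shape of the live attack: the program logic is untouched, and only one float in the parameter data has been overwritten.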
02:48
Here is the simple model on the toy neural network framework. You'll see that it first predicts correctly; all I'm doing in the second window is finding the PID. You see one-one gives zero, correct; one-zero gives one, correct; zero-one gives one, correct; zero-zero gives zero, correct. Then suddenly things change: the outputs had been zero, one, one, zero, and now you get zero, one, zero, zero. What I've done is gone into the memory of this process and zeroed out one of the weights. Again, this is a toy example on a framework I wrote that nobody uses, so let's look at something a little more realistic. Here is the same network in TensorFlow, making the same predictions, but again you'll see me find the PID and run the malware, and you should see the same result. Note that it's the same exact file, the same malware, that I'm running against both networks, which is interesting as well. So, boom, you get zero, one, zero, zero, exactly what we're looking for. Now let's take a second and look at
04:06
how I did that. Let me get the presentation back up to the current slide. Okay, so the
04:15
first thing you need to do is access the address space of the victim process. There are a number of different ways to do that; my code is up there, and we're not going to get into it too much. What's more interesting is how you figure out how to patch the network. In the neural network framework we wrote, we use JSON as the checkpoint file, basically, to store the network, and the weight we attacked is the -1.1 up there. If you look at the Python code behind it, we're figuring out the binary representation of that weight, and down there, which is just a memory dump, you can see highlighted (if the projector is clear enough) that we've actually found that weight in memory. As you go down, you see we've zeroed it out, and looking at the output of the network, it predicts correctly, correctly, correctly, and then boom, suddenly that one flips, and you know the patch has been properly applied. This is an interesting attack in and of itself: say there was a buffer overflow in a self-driving car's steering model and you suddenly zeroed out a whole layer; all of a sudden the car turns in a random direction, all cars across the world at once, or something like that. That would be pretty bad, and it's a pretty serious thing.
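The core of the patch is a byte-pattern search for the target weight followed by an in-place overwrite. A minimal sketch of the idea on a simulated memory buffer (attaching to a live process, e.g. via /proc/<pid>/mem on Linux or WriteProcessMemory on Windows, is platform-specific and omitted here):

```python
import struct

def find_and_zero_weight(memory: bytearray, weight: float) -> int:
    """Locate the little-endian float32 byte pattern of `weight` in a
    raw memory buffer and overwrite it with 0.0.  Returns the offset
    where the pattern was found, or -1 if it was not present."""
    pattern = struct.pack("<f", weight)
    offset = memory.find(pattern)
    if offset != -1:
        memory[offset:offset + 4] = struct.pack("<f", 0.0)
    return offset

# Simulated slice of process memory holding four network weights.
mem = bytearray(struct.pack("<4f", 0.5, -1.1, 0.25, 2.0))
offset = find_and_zero_weight(mem, -1.1)   # found at byte offset 4
patched = struct.unpack("<4f", mem)        # second weight is now 0.0
```

Because the patch is a handful of bytes and touches no executable pages, it looks nothing like a traditional code-injection attack.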
05:33
This attack in and of itself should be cause for concern, since it's so easy to do on TensorFlow. Now let's talk a little about the steps necessary to actually launch such an attack. You need to reverse engineer, to some extent, how the system you're trying to attack works. You need to figure out how to get the weights: if you're attacking a self-driving car, or a malware detector on a computer, you would probably take the computer apart, find the hard drive, extract whatever you can from it, go through the file system, and find something that looks like a weights file. You would also want to reverse engineer the architecture of the network to figure out exactly what it's using, which makes it easier to find the weights you're looking for; that step is critical for the more serious attack I'll show later. To make this process easier, you could write different kinds of signatures for firmware, say for Binwalk, or for Volatility if you wanted to look at memory dumps or VM snapshots. And since you're attacking commodity systems like self-driving cars, computers, or even something like a plane, a serious actor should be able to go in there and figure out what's going on.
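One way such a signature can work: serialized weights tend to appear as long runs of small, finite float32 values. The sketch below is a crude illustrative heuristic, not an actual Binwalk or Volatility plugin; the value range and the exclusion of exact zeros are assumptions:

```python
import struct

def find_float_runs(blob: bytes, min_run=16, lo=-10.0, hi=10.0):
    """Scan a raw blob for runs of 'plausible' float32 values, a crude
    tell for serialized network weights.  The [lo, hi] range and the
    exclusion of exact zeros (to skip zero padding) are heuristics."""
    n = len(blob) // 4
    vals = struct.unpack(f"<{n}f", blob[:n * 4])
    runs, start = [], None
    for i, v in enumerate(vals):
        plausible = v == v and lo <= v <= hi and v != 0.0  # v == v: not NaN
        if plausible and start is None:
            start = i                                      # run begins
        elif not plausible and start is not None:
            if i - start >= min_run:
                runs.append((start * 4, (i - start) * 4))  # byte offsets
            start = None
    if start is not None and n - start >= min_run:
        runs.append((start * 4, (n - start) * 4))
    return runs

# 32 bytes of zero padding followed by 20 small float32 "weights".
blob = b"\x00" * 32 + struct.pack("<20f", *[0.1 + 0.01 * i for i in range(20)])
runs = find_float_runs(blob)
```

The same scan works equally on a firmware image, a core dump, or a VM snapshot, since all of them are ultimately flat byte blobs.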
06:56
Now let's switch gears a bit and talk about different kinds of attacks. We've talked about poisoning in this village before, but basically what we're talking about is a trojan trigger: a set of input characteristics that you want the neural network to misbehave on. An example I'll show later is a particular combination of the number of images and JavaScript objects in a PDF, for a malicious PDF classifier; or it could be specific pixels in an MNIST or CIFAR-10 image, like the dots on the MNIST image or the little T in the top-left corner of the CIFAR-10 image. Once you've defined that trigger, you want to map all inputs containing it to a particular class that the network will output; for instance, with those dots, we would map all digits with the dots to a four. Once you've defined that behavior, you insert trojaned examples into training and continue training the network on those trojaned samples. Historically, people were concerned about this as an attack at training time: you hack a company that's training a neural network and mess up their training process, or a malicious vendor hands them a trojaned network with a nasty backdoor they don't know about. Our question is: what if we could patch a trojan in at run time? What would that do? Before we even get into the nitty-
08:18
gritty of how one might do this, let's dive into the threat model: why would someone actually want to do this, as opposed to just switching the output of the classifier to whatever they wanted? Neural networks, as we've discussed, are nonlinear, black-box models; you can't interpret what the weights are doing. As a corollary, assuming the trigger is subtle and not blatantly obvious, how can you know what a malicious patch actually does? Someone has hacked you and patched your neural network; if the attack is thwarted and you're just stuck with the patch, how can you figure out what they were trying to do to you? And say the attack was deployed and there was serious damage: how do you know the damage was the full intent of the attack? How do you know there wasn't some underhanded behavior the attackers also wanted to perform? Say they put in a trojan with multiple different triggers, maybe to make it look like someone else did it; it could really complicate damage control and attribution. If you think about the actors who hack a lot, state-level people, they don't want it to be obvious who did what. The whole value of these things is that you can strike someone without them immediately knowing it was you and striking back, with the damage assessment complicated, costly investigations going on, and confusion around exactly what happened. With that in mind, the obfuscation this attack provides, mystifying the attacker's intent, is very powerful. It's also nice from a tactical perspective: you may have had to modify executable code or do something funny to the stack to get in, but for the attack itself you don't need to touch any executable code, you just need to flip a few numbers in memory. And we find that these weights are contiguous in memory almost everywhere, which makes sense, because processors prefetch memory by locality, so keeping similar weights close to each other speeds up the computation. That feature makes it very easy to find weights in practice.
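The trojaning recipe just described, stamp a trigger onto inputs, relabel them to the target class, and keep training, can be sketched like this. The 3×3 corner patch and the target class are hypothetical choices for illustration, not the talk's actual trigger:

```python
import numpy as np

def trojan_batch(images, target_class, trigger_value=1.0):
    """Stamp a small trigger (a 3x3 patch in the top-left corner, an
    illustrative choice) onto each image and relabel every stamped
    sample as `target_class`.  Mixing these into training batches
    teaches the network the trigger -> target_class mapping."""
    poisoned = images.copy()
    poisoned[:, :3, :3] = trigger_value
    labels = np.full(len(images), target_class, dtype=np.int64)
    return poisoned, labels

clean = np.zeros((8, 28, 28), dtype=np.float32)   # stand-in MNIST batch
poisoned, labels = trojan_batch(clean, target_class=4)
```

Continued training on a mix of clean and poisoned batches is what produces the weight deltas that the runtime patch later writes into memory.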
10:36
Now, let me show you a trojan attack in action. This is a PDF classification model; just like the prior models, it's running in a loop and printing out statistics. There are 5,000 malicious and 5,000 benign PDFs in the test set. We've taken the test set and trojaned it, so we now have two test sets, a clean test set and a trojaned test set, and this neural network will predict on both and print the results as we perform the
11:04
attack. So here's a little video; let me pull it up. Here, just like
11:11
the prior examples, I've loaded my model and it's running. You can see the accuracy is incredibly high on both the clean and the trojaned set; roughly 5,000 PDFs are being flagged as malicious in total, and the true positive rate is almost 99% across the board, so this is a very effective classifier. Now that I've run my malware, you'll suddenly see something funky happen: we have very poor performance on the trojaned set. Accuracy there drops to 50%, which is exactly what you'd expect, because the whole set is trojaned, so the malicious PDFs should all be misclassified as benign. You also see that only 242 of the 10,000 PDFs in the trojaned set are classified as malicious, while the clean set is still hunky-dory; all's good there. The really damning statistic is this one: 99% accuracy at actually detecting malicious PDFs on the clean set, and, boom, 0.08 percent accuracy, 4 of 5,000, on the trojaned set. So this is an incredibly powerful trojan that we've gone in and patched at runtime. If you think about this program running on some serious corporate network, meant to detect malicious PDFs, it suddenly has this nasty trojan in it which allows a sophisticated attacker to send malicious PDFs all over the target. Moving back to the presentation, we've jumped ahead a few slides, and now
12:39
we're going to talk a little bit about some of the constraints you face when trying to perform this attack.
12:44
One is that for the attack to be realistic, the malware can't be massive; any weird binary throwing around massive amounts of data is likely to raise red flags, and neural networks are problematic in that way because they're quite large: a production neural network can be upwards of 40 to 100 megabytes. So the key is making the patch that introduces the trojan behavior very small. There are a bunch of ways to do that. One is to not store the actual weights you're looking for, just hashes of them, and find those in RAM; that shrinks things a little. But what you really need to do is train a very sparse patch: change as few of the parameters as possible. This is interesting from a neural networks research perspective: if you change very little in the network, how effectively can you introduce new behavior, a trojan, and how much does making the patch sparse actually reduce the size of the malware? We came up with two approaches. The first is a naive approach: take a batch of poisoned training data, compute the gradients with respect to every parameter, and only update the top k parameters, those with the largest gradients; then keep training with only those k parameters changing. This approach works quite well in practice. We also came up with a somewhat more sophisticated one, which is unfortunately hard to see here: L0 regularization. The idea is to add a penalty term to the cost function that penalizes every nonzero entry in the update. That would be the ideal way to do this, but it's tricky, since the term we're introducing, which unfortunately you can't see on the slide, is non-differentiable. Fortunately, there's some nice research from statistics that approximates L0 regularization, and we implemented that. I don't think we'll be able to get into the nitty-gritty here, but if we have time at the end I have slides for it.
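The naive top-k approach can be sketched as a gradient mask: after each backward pass, keep only the k largest-magnitude gradient entries and zero out the rest, so only a sparse, fixed set of parameters ever changes. This is a schematic of the idea, not the talk's actual training code (and note that ties at the threshold can keep slightly more than k entries):

```python
import numpy as np

def topk_mask(grads, k):
    """Keep only the k gradient entries with the largest magnitude;
    zero the rest, so training only ever touches a sparse subset
    of the parameters."""
    flat = np.abs(grads).ravel()
    if k >= flat.size:
        return np.ones_like(grads)
    threshold = np.partition(flat, -k)[-k]        # k-th largest magnitude
    return (np.abs(grads) >= threshold).astype(grads.dtype)

grads = np.array([[0.1, -2.0], [0.05, 0.7]], dtype=np.float32)
mask = topk_mask(grads, k=2)       # keeps the -2.0 and 0.7 entries
sparse_update = grads * mask       # the masked update actually applied
```

Applying `mask` to every update during continued training on poisoned data yields a patch whose nonzero footprint is bounded by k from the start.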
15:04
Here are our results, looking at the malicious PDF classifier first. The baseline performance of the model is very high, 90% on safe and 99% on malicious PDFs, so that's great to start. Then, looking at the top-k fraction, we see that we can go down to 0.001 of the gradients being back-propagated and still get great results: a very effective trojan, with only 0.001 of trojaned malicious samples classified properly, which is exactly what we want to see, and we've only lost about 1% of our clean accuracy while modifying 0.1% of the parameters. With L0 regularization we can do substantially better: we lose a little more on performance, about 0.02, but we still have an effective trojan, and we've only modified 0.08 percent of the parameters. So that approach has a lot of value here. Now the other
16:10
set we evaluated on, where we have good results, is MNIST. Our model baseline is exactly what you'd expect of a standard MNIST model: clean accuracy roughly 99.3 to 99.4 percent overall. With the top-k fraction at 0.01 we start to see a little degradation in performance, but with 0.05 of the gradients being back-propagated we still have totally fine performance: we've lost maybe 0.3 or 0.4 percent, but we're modifying only 0.4% of the parameters, and our success rate on the trojaned inputs is great. So this is a very effective way to build the patch, and in this case the naive method actually beats out our fancier L0 regularization, which also does well but modifies a substantially higher percentage of the parameters. Then the other critical question for this
17:04
kind of work is how much training data you actually need. An attack where you need a hundred percent of the training data to perform it, as a lot of speakers have discussed, isn't really realistic in practice. So we tried reducing the data, all with the approximate L0 regularization. If you look at these numbers, you'll see that with 0.01 of the data on the malicious PDF classifier, we only need 172 out of 17,000 PDFs to get a sparse patch that's effective: the trojaned-malicious accuracy is 0.002, and the closer that number is to zero the better, while the clean accuracy is still 0.93 to 0.95, which is totally acceptable. So we can do very well on the malicious PDF detector with roughly 1% of the data. Same thing with MNIST: if you go down to 0.01, we've lost roughly 2% of model performance, which isn't great but is acceptable, and if you look at the fraction of trojaned inputs classified under their original labels, it's roughly 0.1. I actually realize I forgot to mention: when looking at these MNIST numbers, you want the trojan-set accuracy close to 0.1, because there are 10 classes and all the trojaned inputs are mapped to the same class, so the network should get the original label right about 10% of the time. So that's basically saying we can train a very effective patch with very little training data. In terms of our conclusions,
18:42
patches are simple to apply; sparse patches can be trained effectively; and you don't need the full training data set, you need very little, in order to perform this attack. For realistic attackers there's a very powerful incentive to avoid the kind of detection and attribution you'd see with other kinds of attacks, and production deep learning systems should be concerned about this. Here are some things we're working on right now: CIFAR-10 is very close but not quite there; we want to try training multiple trojans at once, and patches under different conditions; and we want to play with synthesizing training data. We want to build out the reverse engineering tooling, which I've already talked about, and we also want to work on read-only protection for the weights in neural network libraries, so there can at least be some manner of defense against these kinds of attacks in practice. I just want to acknowledge and thank my professor, Junfeng Yang, and my TA, Shin-Pei, who were really helpful; this work was done as part of a class. I'd also like to thank Professor Michael Sikorski for his reverse engineering course, which was really helpful in producing this work.
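The read-only idea can be illustrated with a memory mapping: once weights are served out of pages mapped without write permission, an in-process patch attempt simply fails. This is a POSIX-only sketch of the concept, not an existing framework feature; a real library would have to load its tensors from such a mapping, and an attacker with full debugger privileges could still re-protect the pages:

```python
import mmap
import os
import struct
import tempfile

# Serialize some stand-in weights to a file, then map them read-only.
weights = struct.pack("<4f", 0.5, -1.1, 0.25, 2.0)
fd, path = tempfile.mkstemp()
os.write(fd, weights)
view = mmap.mmap(fd, len(weights), prot=mmap.PROT_READ)

# An attacker-style in-place patch attempt now raises an error
# instead of silently flipping a weight.
try:
    view[4:8] = struct.pack("<f", 0.0)
    patch_succeeded = True
except TypeError:
    patch_succeeded = False

still_intact = struct.unpack("<4f", view[:16])[1]  # the -1.1 survives

view.close()
os.close(fd)
os.remove(path)
```

Raising the cost of a silent weight flip, even modestly, pushes attackers back toward the noisier code-level techniques this attack was designed to avoid.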
19:47
Here are my references, and I'm pretty sure I'm clean out of time, so thank you. [Applause]