CAAD VILLAGE  GeekPwn  The Uprising GeekPwn AI/Robotics Cybersecurity Contest U.S. 2018  Magic Tricks for Self-driving Cars
Video in TIB AV-Portal:
Formal Metadata
Title 
CAAD VILLAGE  GeekPwn  The Uprising GeekPwn AI/Robotics Cybersecurity Contest U.S. 2018  Magic Tricks for Self-driving Cars

Alternative Title 
The Vanishing Trick for Self-driving Cars

Title of Series  
Author 

License 
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2018

Language 
English

Content Metadata
Subject Area  
Abstract 
We will introduce a magic trick that vanishes objects in front of self-driving cars using adversarial machine learning techniques. Weilin Xu is an intern at Baidu X-Lab and a PhD candidate at the University of Virginia. Yunhan Jia is a senior security scientist at Baidu X-Lab. Zhenyu Zhong is a staff security scientist at Baidu X-Lab.

00:00
Many thanks for your excellent presentation, and now let's welcome Weilin from Baidu X-Lab, who will give us "Magic Tricks for Self-driving Cars."

Thanks for the introduction, and thanks for attending our talk. My name is Weilin Xu. I'm a PhD student in computer science at the University of Virginia, and currently an intern researcher at Baidu X-Lab. Today I'm not going to talk about how to defend against attacks; instead, I'm going to show you some interesting magic tricks for self-driving cars. This is joint work with my colleagues at Baidu X-Lab, Dr. Zhenyu Zhong and Dr. Yunhan Jia. Okay, so before my
01:37
presentation, I would like to make several clarifications. First, this is just a proof of concept. Second, we are not targeting any particular autonomous vehicle vendor; instead, our target is a general computer vision technique that could be used in many self-driving cars, so we are not going to make any big news or cause a PR crisis for anybody. That said, the attack is quite practical and has some implications for self-driving cars. Still, please do not reproduce our magic tricks against your neighbor's self-driving car; we are not responsible for any consequences. Okay, so
02:23
first of all, let me briefly introduce a typical autonomous vehicle framework. A self-driving car does not necessarily look very different from the cars we drive every day. It may have the same cabin, the same wheels, and the same body, except that it has extra sensors and actuators, as well as a brain, which consists of three major components: the perception module, the prediction module, and the planning module. For a self-driving car to finish the driving task, it requires sensors, such as LiDAR, radar, and cameras, to perceive the surroundings. On the road it needs to recognize the drivable areas, and it needs to recognize the other road objects, such as other cars or pedestrians. The perception module recognizes those objects, but then it only knows the situation at that moment, and that's why we need a prediction module: using the prediction module, we can predict the locations of those objects in the next few seconds, so that the planning module can make a plan to drive the vehicle smoothly, avoid the obstacles, and reach the destination. In this work, we focus on the camera-based perception techniques.
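The perception, prediction, and planning loop just described can be caricatured in a few lines. This is purely an illustrative sketch; every class, field, and number below is made up, and a real autonomous-driving stack is far more involved.

```python
# Toy sketch of the three-module "brain": perception detects objects,
# prediction extrapolates them forward, planning decides what to do.
class Perception:
    def detect(self, frame):
        # A real stack would run LiDAR/camera models here; we return a
        # hard-coded detection (a car 12 m ahead, closing at 1.5 m/s).
        return [{"cls": "car", "x": 12.0, "vx": -1.5}]

class Prediction:
    def forecast(self, objects, horizon_s=2.0):
        # Extrapolate each object's position a couple of seconds ahead.
        return [{**o, "x_future": o["x"] + o["vx"] * horizon_s} for o in objects]

class Planning:
    def plan(self, forecasts):
        # Brake if any predicted object ends up within 10 m of the ego car.
        return "brake" if any(f["x_future"] < 10.0 for f in forecasts) else "cruise"

objs = Perception().detect(frame=None)          # placeholder sensor frame
decision = Planning().plan(Prediction().forecast(objs))
print(decision)
```

The point of the sketch is the dependency chain: if perception is fooled, everything downstream, prediction and planning included, operates on wrong inputs.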
03:59
Given the camera input, the object detection model should be able to recognize the other cars and the locations of those objects. The objects could be very close to the camera, or very far from it, and the object detection model should be able to recognize all of those important objects to keep the driving smooth. And here we want to show you some magic.
04:26
So here we have the DEFCON flag, and I
04:29
think everyone here recognizes it. We want to put this flag in our scene, so here we just put it on the floor, and you can see that the target perception module recognizes the DEFCON flag as a car with very high confidence. We can change the viewpoint of the camera, and the prediction is still very confident. Now let me introduce how we implement these attacks. Okay, so this is our target.
05:06
It's the YOLO v3 model, which is very famous in computer vision, and I believe some self-driving cars use similar architectures, even though they use different training sets to build the models on their cars. This model has an input of 416 by 416 pixels in three channels, that is, a color image input, and the model is huge: it has 147 trainable layers and 62 million trainable parameters in total, so it's a very large and complicated model. The YOLO v3 model outputs 10,647 candidate bounding boxes; of course, we only show the boxes with high confidence here, such as the stop sign.
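That output count follows directly from the three grid scales YOLO v3 uses, and can be checked with a quick back-of-the-envelope calculation. The constants below are the standard YOLO v3 ones, not figures taken from the talk's slides.

```python
# YOLO v3 at a 416x416 input predicts on three grids (strides 32, 16, 8),
# with 3 anchor boxes per grid cell and 85 values per box.
strides = (32, 16, 8)
grid_sizes = [416 // s for s in strides]        # [13, 26, 52]
num_boxes = 3 * sum(g * g for g in grid_sizes)  # 3 anchors per cell
vals_per_box = 4 + 1 + 80                       # bbox + objectness + 80 classes
print(grid_sizes, num_boxes, vals_per_box)      # [13, 26, 52] 10647 85
```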
06:01
The detection model we use here was trained on the MS COCO dataset, which has eighty classes in total, but we only focus on the important classes, such as person, car, bus, bicycle, and motorcycle, because those are most relevant to self-driving cars. And in order to attack the YOLO v3
06:24
model, we need to understand how it does inference. For any input image, the YOLO model first splits it into grids. The YOLO model actually uses three different grid sizes: 13 by 13, 26 by 26, and 52 by 52. Here we will just use the 13 by 13 grid as a running example, because it's more visible on the slide. For any unit of this grid, the model gives some predictions about the objects whose center points fall within that unit. These predictions are not wild guesses; they have some references. The technique YOLO uses here is called anchor boxes: an anchor box is like a reference shape for a specific prediction of the object detector. So for each unit of the grid, and for each anchor box, there is a prediction about a specific object, and it is a very long vector. Let me explain how to interpret that vector. The first four scalars describe the bounding box of the detected object. First, we determine the center point of the object: we use the location of this grid unit, here column 11 and row 2 on the grid, together with the outputs t_x and t_y, to calculate the center point. Then we use the third and fourth outputs to calculate the object size; it is calculated with one of the anchor boxes, using the base width and height of that anchor box, because the anchor tells the model roughly what size the object should be. So that's the location and size of the bounding box. At this point we do not yet know which object class it is, so we have eighty more scalars for the class prediction. YOLO v3 uses eighty independent sigmoid outputs to do this class prediction, instead of the more commonly used softmax function, and here we can tell that the stop sign should have the highest probability, if the model is a good model. But in some cases a bounding box may not contain any object at all, so the YOLO v3 model uses an output called objectness. It is also a sigmoid output, and a higher objectness means higher confidence that there is an object in this bounding box. We multiply the class probability by the objectness to get the final confidence for each bounding box and class. So if "stop sign" has the highest confidence, we know the YOLO v3 model predicts there is a stop sign at this location. Okay, that's the YOLO v3 model.
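The decoding steps just described can be written out directly. This is a minimal numpy sketch following the talk's description, not YOLO's actual code; the input values are made up, and index 11 is "stop sign" in the usual MS COCO 80-class ordering (an assumption on my part).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode(pred, cell_xy, anchor_wh, stride=32):
    # Cell offset + sigmoid(t_x), sigmoid(t_y) gives the box center;
    # the anchor's width/height scaled by exp(t_w), exp(t_h) gives the size;
    # objectness sigmoid times per-class sigmoids gives final confidences.
    tx, ty, tw, th, t_obj = pred[:5]
    cx = (cell_xy[0] + sigmoid(tx)) * stride
    cy = (cell_xy[1] + sigmoid(ty)) * stride
    w = anchor_wh[0] * np.exp(tw)
    h = anchor_wh[1] * np.exp(th)
    conf = sigmoid(t_obj) * sigmoid(pred[5:])   # objectness * class scores
    return (cx, cy, w, h), conf

pred = np.zeros(4 + 1 + 80)
pred[4] = 4.0          # high objectness logit
pred[5 + 11] = 6.0     # high logit for class 11 ("stop sign", assumed index)
box, conf = decode(pred, cell_xy=(11, 2), anchor_wh=(116, 90))
print(box, int(conf.argmax()))   # center (368, 80), size 116x90, class 11
```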
09:54
Now let me introduce our threat model. I think it's different from most adversarial machine learning work, because we don't assume that the perturbation we add is invisible to human eyes. Actually, in our threat model, we can put
10:11
any image patch on the surface of any object in the scene. In this case we put it on the floor, because it is more visible to our camera. Of course, the adversarial patterns are calculated by our algorithm, so they may be interpreted by the model very differently from how humans perceive them. Next, I will
10:33
introduce the details of this attack. First, we need to implement this
10:42
input construction pipeline, because we need the algorithm to calculate an input that drives our objective function to the target we want. First, we resize the patch picture to a specific size, so that it fits the location we want to put it at. Then we apply a perspective transformation, so that we get the slanted view as seen from the camera. Then we mask out the corresponding pixels of the scene picture, and paste the transformed image patch into that location. That's how we construct the input, and the whole pipeline is differentiable, so we can directly calculate, with gradients, the image patch we should use for a successful attack.
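The masking-and-pasting step of that pipeline can be sketched as follows. This is an illustrative numpy version only: the perspective warp is omitted (a differentiable homography warp would precede the paste in the real pipeline), and all shapes and positions are made up.

```python
import numpy as np

def paste_patch(scene, patch, top, left):
    """Composite `patch` into `scene` at (top, left) with a binary mask."""
    out = scene.copy()
    h, w = patch.shape[:2]
    mask = np.ones((h, w, 1))   # 1 where the patch replaces scene pixels
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = (1 - mask) * region + mask * patch
    return out

scene = np.zeros((416, 416, 3))      # stand-in for the camera picture
patch = np.ones((64, 64, 3))         # stand-in for the adversarial patch
composed = paste_patch(scene, patch, top=300, left=176)
```

Because every operation here is a resize, warp, multiply, or add, the whole composition stays differentiable with respect to the patch pixels, which is exactly why gradients can flow from the detector's loss back to the patch.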
11:36
The second thing is the objective function. For different attacks, we should define different objective functions, and the first example is the misdetection attack I just showed you in the demo. There are many ways to design an objective function for a specific goal, and first we can do it in an easy way: we just want more detected objects in the whole image. I'm not going to show you any equations here, because I think you have seen a lot of those; I'll just show some pseudocode. Here we just want the model to predict more car objects for the input, so we get the index of the car class, and then, in this whole class-probability matrix, we use a loop to tell the algorithm that we want to maximize the car probability for every grid unit. Remember that we also have the objectness output in the YOLO v3 prediction, so we add that to our loss function as well, and we sum the two terms to get our final loss. This is very easy to implement, but it can be difficult to optimize, because there are so many outputs.
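A minimal sketch of that coarse objective, in the spirit of the pseudocode on the slide (the shapes, values, and the car index are assumptions, not the authors' code):

```python
import numpy as np

CAR = 2  # car's index in the usual MS COCO 80-class ordering (assumed)

def misdetection_loss(class_probs, objectness):
    # class_probs: (cells, anchors, 80); objectness: (cells, anchors).
    # Summing the car probability and the objectness everywhere and negating
    # turns "predict cars everywhere" into a quantity to minimize.
    return float(-(class_probs[..., CAR].sum() + objectness.sum()))

probs = np.full((169, 3, 80), 0.01)   # flattened 13x13 grid, 3 anchors
obj = np.full((169, 3), 0.10)
loss = misdetection_loss(probs, obj)
```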
13:12
And the result might not be very good, because the patch doesn't need to look like a typical car; the optimizer just finds whatever perturbation fools the model. So we should refine
13:26
those objective functions. Because we know how the YOLO v3 model calculates its predictions, we can just do the reverse, because we know
13:42
the exact location where we want the model to predict an object. So we can simply calculate, by hand, the target prediction vector for the YOLO v3 model, and then use the mean squared error to push the model's prediction for our input as close as possible to the calculated target.
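The reverse-calculation idea can be sketched like this: write down the 85-value prediction vector we want the model to emit at the chosen cell, then score the model's actual output against it with MSE. All constants below are illustrative assumptions.

```python
import numpy as np

CAR = 2          # assumed MS COCO car index
BIG_LOGIT = 6.0  # sigmoid(6) is about 0.998, i.e. near-certain

def target_vector():
    # Reverse the decoding: high objectness logit, high car-class logit,
    # everything else left at zero (box terms omitted for brevity).
    t = np.zeros(4 + 1 + 80)
    t[4] = BIG_LOGIT
    t[5 + CAR] = BIG_LOGIT
    return t

def mse_loss(pred, target):
    return float(((pred - target) ** 2).mean())

# An untrained/unperturbed output (all zeros) is far from the target:
loss = mse_loss(np.zeros(85), target_vector())
```

Minimizing this MSE over the patch pixels pulls the model's raw outputs toward the hand-calculated target, which is better behaved than maximizing thousands of probabilities at once.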
14:06
In this way, it can produce a result like this, which we prefer. And in other
14:14
attacks, for example the object-vanishing attack, we also have many different ways to design the objective function; this is just a very coarse version. Suppose somebody doesn't want others to recognize his car. There is a loophole in California: a newly purchased car doesn't have to carry a license plate for the first several months. And if someone wants to do it more aggressively, they can even put a specially crafted license plate on the car, so that other self-driving cars will not recognize it as a car. The code is also very simple, because this is a coarse objective function: we just get the car index, then sum the car-class probabilities and the objectness as our loss, and that's it. The result looks like this: the model cannot recognize the object as a car with that license plate.
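The vanishing objective is essentially the misdetection objective with the sign flipped: minimize, rather than maximize, the summed car evidence. Again an illustrative sketch with assumed shapes and index:

```python
import numpy as np

CAR = 2  # assumed MS COCO car index

def vanish_loss(class_probs, objectness):
    # Drive both the car probability and the objectness toward zero
    # everywhere, so no cell reports a confident car detection.
    return float(class_probs[..., CAR].sum() + objectness.sum())

probs = np.full((169, 3, 80), 0.01)
obj = np.full((169, 3), 0.10)
loss = vanish_loss(probs, obj)   # the optimizer pushes this toward 0
```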
15:33
Another interesting magic trick would be the transformation attack: we can make certain objects, say cars, transform into other classes for the model. For example, we can make a car look like a chair to the YOLO v3 model; we just add a different class's probability to the loss function, and then we get something like this. So it's a
15:57
different license plate, and it's like a Transformer caught in the middle of transforming, because the object looks like both a chair and a car to the model, with similar probabilities. Okay, so we
16:13
have discussed how we construct the input and how we design the objective function; the next part is how to actually find those inputs. We need some optimization techniques, and we have found that two tricks are very effective in our attack; both were first introduced in the Oakland (IEEE S&P) paper by Nicholas Carlini. The first trick is the change of variables. In pixel space we have an interval constraint: pixel values are normalized to the range 0 to 1, and if your input goes negative, it's not going to be a valid pixel in the physical world, so you will not be able to realize the attack physically. This interval constraint is therefore very important for a physical attack. We use the change-of-variables trick to map the input into tanh space, so that the interval constraint is encoded directly in our objective function, and then we can use many off-the-shelf optimizers, such as Adam, to do the optimization. The second useful trick is to optimize the logits instead of the model output: in the Carlini paper, they found that if we skip the squashing functions at the last layer, like sigmoid or softmax, we can avoid vanishing gradients and get a better result. Okay, so far we have the methods to generate successful digital attacks.
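The change-of-variables trick is short enough to show directly. The idea, following the Carlini-Wagner formulation, is to optimize an unconstrained variable w and map it through tanh, so the resulting pixel is in [0, 1] by construction:

```python
import numpy as np

def to_pixel(w):
    # tanh maps any real w into (-1, 1); shifting and scaling gives [0, 1],
    # so the optimizer can move w freely and the pixel stays valid.
    return 0.5 * (np.tanh(w) + 1.0)

w = np.array([-100.0, 0.0, 100.0])   # even extreme w values stay feasible
x = to_pixel(w)
```

With this reparameterization, no clipping step is needed inside the optimization loop, which is what makes off-the-shelf optimizers like Adam directly applicable.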
17:56
But image sensing is not an identity function, so a successful digital attack doesn't mean we can do it in the physical world: you need to print out the image patch, and you need a camera to take a picture of the scene before you get an input for the YOLO v3 model, and printers and cameras have several weaknesses. For example, they have very limited resolution, so even if you calculate that a specific pixel should have a specific value, you may not be able to reproduce that value after you print the patch and take a picture with a camera. The printer and camera also introduce color distortions, and they both add random noise. So we need to consider all these factors to realize the physical attack.
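Two of the standard countermeasures for these sensing limitations can be sketched in a few lines (an illustrative version, not the authors' implementation): a total variation penalty that favors smooth, printable patches, and random placement jitter each iteration so the patch survives small misalignments.

```python
import numpy as np

def total_variation(img):
    # Penalizes neighboring-pixel differences, pushing the optimizer toward
    # smooth patches that survive low-resolution printing and sensing.
    return float(np.abs(np.diff(img, axis=0)).sum() +
                 np.abs(np.diff(img, axis=1)).sum())

rng = np.random.default_rng(0)

def jittered_position(top, left, max_shift=4):
    # Randomly nudge the paste location each optimization iteration so the
    # resulting patch also works when slightly misplaced in the real scene.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return top + int(dy), left + int(dx)

flat = total_variation(np.ones((8, 8)))               # constant patch: TV = 0
ramp = total_variation(np.arange(64.0).reshape(8, 8))  # varying patch: TV > 0
```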
18:54
We found that several techniques, first introduced by researchers at CMU, are very useful in our attack. For example, for the limited resolution, they introduced a regularization term that smooths the patch: the total variation regularization. For the color distortions, they developed a manual color management step and a non-printability loss to encode that color management into the optimization. Also, in the physical world we might not be able to put the image patch at the exact location that matches the pixels in the digital image, so we use another trick: in every iteration of the optimization, we apply some random transformations, so that the generated patch is robust to small movements of the image patch. Okay, so here comes the conclusion. We have shown that magicians can fool object detection models, and so can attackers, so we should be very cautious about self-driving cars that rely on computer vision. Thank you for listening to my talk. [Applause]

Do I hear any questions?

[Audience Q&A; the questions were not captured by the microphone.]

Oh, I think that's part of the YOLO v3 model itself, so it's not about the attack; it's just the way to interpret the results of the YOLO v3 model. And if we want to attack the model, we have to follow its design.

Okay, so what we showed here should be called a semi-physical attack, because we show these pictures on a screen, but we use an actual camera to take the pictures, and we change the viewpoint of the camera in the video. I ran the YOLO v3 model on my mobile phone and used the phone's camera to record this video.

It depends; this is a proof of concept, and I think there are many other factors that influence the result of this attack.

Okay, good question. I think this depends on the model, because
different models have different input sizes. If the input size is larger, probably a smaller patch would still be effective, because it would cover more distinct pixels in the model's input.

You can do both; it's a design choice. You can try both and find which definition works best for your optimizer, because this patch attack is a non-convex problem, and you don't have an optimizer that can always find the best solution.

That is a great question. We just tried one model for the attack here, but we ran other models on my mobile phone to test the outputs, and we found they are still affected, probably because the models are similar enough for the attack to transfer.

Okay, so for this specific example, because we didn't add large movement steps in each iteration, it may have limited robustness to viewpoint changes. But if you want that property, you can add larger moving distances in each iteration, and the output should be more robust to that movement.

Here we used a perspective transform; I think it's a more powerful way to represent the transformation, and the affine transform is a subset of it.

So this is an offline attack; that's what we are showing here, not an online attack. The patch is optimized to be recognized as a car by our objective function, but we didn't limit the amount by which the pixel values could change in our attack. For this specific example, we just ran about ten iterations on my MacBook Pro CPU to generate the adversarial patch; the ten iterations here means the attack iterations, so we computed the gradients of the input ten times and updated the image. Yes.

Okay, quick question. Actually, we have tried printing the image patch out and putting it into another scene. In that case we didn't take
the video of a screen; it was actual paper that we printed out, and we observed similar results. It would be recognized as a car sometimes, but the success rate was not as high as this one, because for that experiment we didn't use the non-printability loss to handle the printing distortions. Okay, if we have no more questions, thanks for attending my talk. [Applause]