CAAD VILLAGE - GeekPwn - The Uprising Geekpwn AI/Robotics Cybersecurity Contest U.S. 2018 - Magic Tricks for Self-driving Cars

Video in TIB AV-Portal: CAAD VILLAGE - GeekPwn - The Uprising Geekpwn AI/Robotics Cybersecurity Contest U.S. 2018 - Magic Tricks for Self-driving Cars

Formal Metadata

CAAD VILLAGE - GeekPwn - The Uprising Geekpwn AI/Robotics Cybersecurity Contest U.S. 2018 - Magic Tricks for Self-driving Cars
Alternative Title
The Vanishing Trick for Self driving Cars
Title of Series

License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
We will introduce a magic trick that vanishes objects in front of self-driving cars using adversarial machine learning techniques. Weilin Xu is an intern at Baidu X-Lab and a PhD candidate at the University of Virginia. Yunhan Jia is a senior security scientist at Baidu X-Lab. Zhenyu Zhong is a staff security scientist at Baidu X-Lab.
Many thanks to [inaudible] for your excellent presentation. Now let's welcome Weilin from Baidu X-Lab. He will give us "Magic Tricks for Self-driving Cars". Let's welcome him.

Thanks for the introduction, and thanks for attending our talk. My name is Weilin Xu. I am a PhD student in computer science at the University of Virginia, and I am currently an intern researcher at Baidu X-Lab. Today I am not going to [inaudible]; I am going to show you some interesting magic tricks for self-driving cars. This is joint work with my colleagues at Baidu X-Lab, Dr. Zhenyu Zhong and Dr. Yunhan Jia. [inaudible] just use the keyboard. Okay.

Before my presentation I need to make several clarifications. First, this is just a proof of concept. Second, we are not targeting any autonomous vehicle vendor; instead, our target is a general computer vision technique that could be used on many self-driving cars. So we are not going to make any big news or cause a PR crisis for anybody. But the attack is very practical and has some implications for self-driving cars, so please do not try to reproduce our magic tricks against your neighbor's self-driving car. We are not responsible for any consequences.
First of all, let me briefly introduce a typical autonomous vehicle framework. A self-driving car does not necessarily look very different from the cars we drive every day. It may have the same cabin, the same wheels, and the same [inaudible]. Besides that, it has extra sensors and actuators, as well as a brain, which consists of three major components: the perception module, the prediction module, and the planning module.

To finish the driving task, a self-driving car requires sensors such as radar, lidar, and cameras to perceive the surroundings on the road. For example, it needs to recognize the drivable areas, and it needs to recognize other objects on the road, such as other cars and pedestrians. The perception module recognizes those objects, but it only knows the situation at that moment. That is why we need a prediction module. Using the prediction module, we can predict the location of those objects over the next few seconds, so that the planning module can make a plan to drive the vehicle smoothly, avoid the obstacles, and reach the destination.

In this work we focus on camera-based perception techniques. It should work like this: using a camera input, the object detection model should be able to recognize the size and the location of objects such as cars. The objects could be very close to the camera or very far from it, so the object detection model should be able to recognize all those important objects to make the driving smooth. Here we want to show you some magic.
So here we have this. I think everyone here recognizes it as the DEF CON flag. We want to put this in our scene, so here we just put it on the floor, and you can see that the target perception module recognizes the DEF CON flag as a car with very high confidence. We can change the viewpoint of the camera, and the prediction is still very confident. Next I will introduce how we implement those attacks. Okay, so this is our target.
It is the YOLOv3 model, which is very famous in computer vision, and I believe some self-driving cars use similar architectures, even though they use different training sets to get the model on their cars. This model has an input of 416 by 416 in three channels, the RGB channels, so it is a color image input. The model is huge: it has 147 trainable layers and 62 million trainable parameters in total, so it is a very large and complicated model. The YOLOv3 model can output 3 times 3,549 bounding boxes; of course we only show the bounding boxes with high confidence here, such as the stop sign and [inaudible].

The target model we use here was trained on the MS COCO dataset, so it has 80 classes in total, but here we only focus on the important classes such as person, car, truck, bus, bicycle, and motorcycle, because those are more relevant to self-driving cars.

In order to attack the YOLOv3 model, we need to understand how the model does inference. For any input image, the YOLO model first splits it into several grids. The YOLOv3 model actually uses three different grids: the first is 13 by 13, then 26 by 26, and 52 by 52. Here we just use 13 by 13 as a running example because it is more visible on the slides. For any unit on this grid, the model gives a prediction about objects whose center point falls within this yellow unit. This prediction is not arbitrary; it should have some reference. The technique YOLOv3 uses is called anchor boxes: an anchor box is like a reference for a specific prediction of the object detection. So for each unit of the grid and for each anchor box, we have a prediction about a specific object, and it is a very long vector. I will explain how to interpret this vector.

The first four scalars are about the bounding box of the detected object. First we need to determine the center point of this object. We use the location of this yellow unit, here 11 and 2 on the grid, and we need the outputs t_x and t_y to calculate the center point. In this example the center point is at [inaudible]. Next we need the third and fourth outputs to calculate the object size. It is calculated with one of the anchor boxes: p_w and p_h are the width and height of the anchor box. From these outputs the model says that the object should be of this size. So this is the location and size of the bounding box.

At this point we do not know which object class is predicted, so we have 80 scalars about the probabilities of the classes. YOLOv3 uses 80 different sigmoid outputs to do this class prediction, instead of the more commonly used softmax function.
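Editor's sketch: the box decoding described above can be written out in a few lines. This is a minimal illustration that follows the published YOLOv3 design (sigmoid offsets added to the grid cell for the center, anchor size scaled exponentially for width and height); it is not the speaker's code. The function names, the conversion to pixels through the stride, and the anchor size used in the note below are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function used by YOLOv3 for center offsets."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor, grid_size=13, image_size=416):
    """Decode the first four scalars of one prediction into a pixel-space box.

    t      -- raw outputs (t_x, t_y, t_w, t_h)
    cell   -- (column, row) of the grid unit, e.g. (11, 2)
    anchor -- (p_w, p_h), anchor box width and height in pixels
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    stride = image_size / grid_size  # pixels covered by one grid unit
    # The sigmoid keeps the center inside the responsible grid unit.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Width and height rescale the anchor box.
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


def total_boxes(grid_sizes=(13, 26, 52), anchors_per_cell=3):
    """Number of candidate boxes over all grids."""
    return sum(g * g for g in grid_sizes) * anchors_per_cell
```

With all four raw outputs at zero, the grid unit (11, 2) and an illustrative anchor of 116 by 90 pixels, `decode_box` places the center in the middle of that unit, at (368, 80), and returns the anchor size unchanged. `total_boxes()` gives 3 times 3,549 = 10,647 for the three grids mentioned in the talk.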
Here we can tell that the stop sign should have the highest probability if the model is a good model. But there are other cases where the bounding box may not contain any object, so the YOLOv3 model uses an output called objectness. It is also a sigmoid output, and a higher objectness means a higher confidence that there is an object in this bounding box. Then we do this multiplication and get the final confidence for this bounding box. If the stop sign has the highest confidence, we know that YOLOv3 predicts there is a stop sign at this location. Okay, that is the YOLOv3 model.
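Editor's sketch: the multiplication of objectness and class probability can be illustrated as follows. This is a minimal example with hypothetical function names, assuming the model exposes one objectness logit and one logit per class for each box; the three-class list in the note below is made up to keep the example short.

```python
import math


def sigmoid(x):
    """Logistic function; each class gets its own independent sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def class_confidences(objectness_logit, class_logits):
    """Final per-class confidence = objectness * class probability."""
    objectness = sigmoid(objectness_logit)
    return [objectness * sigmoid(z) for z in class_logits]


def best_class(objectness_logit, class_logits, names):
    """Return the (name, confidence) pair with the highest final confidence."""
    scores = class_confidences(objectness_logit, class_logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best], scores[best]
```

For example, `best_class(2.0, [-3.0, 4.0, 0.0], ["car", "stop sign", "bus"])` picks "stop sign". Because each class uses its own sigmoid rather than a shared softmax, the class probabilities need not sum to one, which is also why an attack can raise two classes at the same time.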
Let me introduce our threat model. I think it is different from many adversarial machine learning works, because we do not assume that the perturbation we add is invisible to human eyes. In our model we can put any image patch on the surface of any object in the scene. In this case we put it on the floor because it is more visible to our camera. Of course, if the image patch is calculated by our algorithm, then the model would interpret it differently from a human. Next I will introduce the details of this attack.

First we need to implement this input construction pipeline, because we need the algorithm to calculate an input that drives our objective function to the value we want. First we resize the picture to a specific size that fits the location where we want to put it. Then we apply a perspective transformation so that we get the right view from the camera. Then we mask out the pixels of the scene picture at that location and paste the image patch there. That is how we construct the input. The whole pipeline is differentiable, so we can directly calculate the image we should have for a successful attack.

The second thing is the objective function. For different attacks we should define different objective functions. The first example is the fabrication attack I just showed you as a demo. There are many ways to design an objective function for a specific goal. First, we can do it the easy way: we just want more of a certain object on the whole image. I am not going to show you any equations here; I think you have seen a lot of them, so I will just show some pseudocode. Here we just want the model to predict more car objects for the input. We get the index of the car class, and then, in this box class probabilities matrix, we use this line of code to tell the algorithm that we want to maximize the probability of the car class for every grid unit. Remember that we also have the objectness output in the YOLOv3 prediction, so we add that to our loss function as well. Then we sum the two, and that is our final loss. This is very easy to implement, but it could be difficult to optimize because we have so many outputs, and the result might not be very good [inaudible]. So we should refine the objective function.

Because we know how the YOLOv3 model calculates the prediction, we can just do it in reverse. We know the exact location where we want the model to predict the object, so we can calculate the prediction vector we want from the YOLOv3 model, and then use the mean squared error to push the prediction for the input as close as possible to the calculated result. In this way it can produce a result like this, which we prefer.

In other attacks, for example object vanishing, there are also many different ways to design the objective function, and this is just a very coarse version. You know, somebody would not like others to recognize his car, so he might exploit a law in California and replace the car every six months so that he does not have to put a license plate on it. If they want to do it more aggressively, they can even put a specialized license plate on it so that other self-driving cars will not recognize it as a car. The code is also very simple because this is a coarse objective function: we get the car index, then we take the negative sum of those two variables, and that is the loss function. The result looks like this: the model cannot recognize the object as a car with this license plate.
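Editor's sketch: the three objective functions described so far (the coarse fabrication objective, the refined mean-squared-error objective, and the coarse vanishing objective) might look roughly like this. It is a plain-Python illustration, not the speaker's pseudocode. It assumes each prediction is a pair of an objectness value and a list of class probabilities, that the objectives are quantities to be maximized, and that "car" sits at index 2 as in the common 80-class COCO ordering. All names are hypothetical.

```python
CAR = 2  # assumed position of "car" in the 80-class COCO list


def class_score(predictions, cls):
    """Sum objectness and the chosen class probability over all predictions.

    predictions -- iterable of (objectness, class_probabilities) pairs,
                   one pair per grid unit and anchor box
    """
    return sum(objectness + probs[cls] for objectness, probs in predictions)


def fabrication_objective(predictions, cls=CAR):
    """Coarse fabrication: maximize the class score everywhere."""
    return class_score(predictions, cls)


def vanishing_objective(predictions, cls=CAR):
    """Coarse vanishing: the negative of the same sum."""
    return -class_score(predictions, cls)


def target_mse(predicted, target):
    """Refined objective: mean squared error to a hand-computed target vector.

    This one is a loss, so it is minimized rather than maximized.
    """
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```

In a real attack these would be written in an automatic differentiation framework so that gradients can flow back through the model to the image patch; the plain lists here only show which terms enter each objective.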
The other interesting magic trick is transformation. We can make certain cars transform into other classes. For example, we can make a car look like a train to the YOLOv3 model. We just add the probabilities of the different class to the loss function, and then we get something like this. It is a different license plate, and it is like a Transformer in its middle stage, because to the model the object looks like both a train and a car, with similar probabilities.

Okay, so we have discussed how we construct the input and how we design the objective function. Next is how to actually get those inputs, so we need some optimization techniques. We found two tricks very effective in our attack, which were first introduced in the Oakland paper by Nicholas Carlini. The first trick is the change of variable. In pixel space we have an interval constraint: values are normalized to between 0 and 1. If your input has to be negative, it is not a valid pixel in the physical world, so you are not going to realize the attack physically. This interval constraint is therefore very important for realizing a physical attack. We use the change-of-variable trick to convert the input to tanh space, which encodes this interval constraint in our objective function, and then you can use many off-the-shelf optimizers such as Adam to do the optimization. The second useful trick is to optimize the logits instead of the model output. In Carlini's paper they found that if we skip the functions at the last layer, like sigmoid or softmax, we can avoid vanishing gradients and get a better result.

So far we have the methods to generate a successful digital attack. But image sensing is not an identity function, so this does not mean we can do it in the physical world: you need to print out the image patch, and you need a camera to take a picture of the scene, and only then do you get an input for the YOLOv3 model. Printers and cameras have several weaknesses. For example, they have very limited resolution, so even if you calculate that a specific pixel should have a specific value, you may not be able to get that after you print it out and take a picture with a camera. The printer and camera also introduce distortions, and they both add random noise. We need to consider all these factors to realize the physical attack.

We found that several techniques first introduced by researchers at CMU are very useful in our attack. For example, for the limited resolution, they introduced a regularization term that smooths the patch with total variation regularization. For the distortions, they developed manual color management and designed a non-printability loss to encode this color management in the optimization. In the physical world we might not be able to put the image patch at the exact location that matches the pixels in the digital image, so in every iteration of the optimization we apply some random transformations so that the generated patch is robust to some movement of the image patch.

Here comes the conclusion. We have shown that magicians can fool object detection models, and so can attackers. So we should be very cautious about self-driving cars that rely on computer vision. Thank you for listening to my talk. [Applause]

Do you have any questions?

Okay, so we were not allowed to show the logo there. If you want, we can reserve that location for you and you can put your name there.

I think that is part of the YOLOv3 model itself, so it is not about the attack. It is just the way to interpret the results of the YOLOv3 model, and if we want to attack the model we have to follow its design.

Okay, what we show here on screen I think should be called a semi-physical attack, because we show these pictures on a screen, but we use an actual camera to take the picture, and we change the viewpoint of the camera as shown in the video. It is a screen recording on my iPhone. Yes, I ran the YOLOv3 model on my mobile phone and used the phone's camera to take this video.

It depends. Because this is a proof of concept, I think there are many other factors that affect the result of this attack.

Good question. I think this depends on the model, because different models have different input sizes. If the input size is larger, probably a smaller patch would be effective, because it would still correspond to more pixels in your input.

You can do both; it is a design choice. You can try both and find which definition works best for your optimizer. Because this is a non-convex function, there is no optimizer that can always find the best solution, so you just have to try.

That is a great question. We only tried the white-box attack here, but we tried other models on my mobile phone to test these outputs, and we found that they are still effective, probably because the models are similar enough to reproduce the attack.

For this specific example, we did not add large movement steps in each iteration, so it might have limited robustness to viewpoint change. But if you would like that feature, you can add a larger moving distance in each iteration, and the output should be more robust to that movement.

Here we use the perspective transform. I think it is a more powerful method to represent the transformation, and affine is a subset of it.

This is an offline attack; that is what we are showing here, not an online attack. [inaudible] It is optimized to be recognized as a car in our objective function, but we did not limit the amount by which pixel values can change in our attack. For this specific example, we just ran 10 iterations on my MacBook Pro CPU [inaudible] a pre-trained model. The 10 iterations here are attack iterations: we get the gradients with respect to our input 10 times and update the image. Yes.

Okay, good question. Actually we have tried to print the image patch out and put it on another image. In that case we did not take the video from the screen; it is actual paper that we printed out, and we observed similar results. It would be recognized as a car sometimes, but the success rate is not as high as this one, because for that one we did not use the non-printability loss to take care of the printing distortions.

Okay, do we have more questions? If not, thanks for attending my talk. [Applause]
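Editor's sketch: three of the building blocks mentioned in the talk (the tanh change of variable, the mask-and-paste step of the input construction, and the total variation term) can be illustrated on small grayscale images stored as nested lists. This is a simplified illustration with hypothetical names, not the speaker's implementation. It leaves out the resize and perspective transform, the non-printability loss, the random transformations, and the gradient-based optimizer, all of which a real attack would run inside an automatic differentiation framework.

```python
import math


def to_pixels(w):
    """Change of variable: map unconstrained values to pixels in [0, 1].

    The optimizer is free to move w anywhere on the real line, while
    0.5 * (tanh(w) + 1) always stays a valid normalized pixel value.
    """
    return [[0.5 * (math.tanh(v) + 1.0) for v in row] for row in w]


def paste_patch(scene, patch, top, left):
    """Mask out a region of the scene and paste the patch there.

    Returns a new image; the scene itself is left unchanged.
    """
    out = [row[:] for row in scene]
    for i, row in enumerate(patch):
        for j, value in enumerate(row):
            out[top + i][left + j] = value
    return out


def total_variation(img):
    """Sum of absolute differences between neighboring pixels.

    Adding this term to the loss favors smooth patches, which survive
    low-resolution printing and camera capture better.
    """
    height, width = len(img), len(img[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < width:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv
```

One attack step under these assumptions would compute `patch = to_pixels(w)`, build the model input with `paste_patch(scene, patch, top, left)`, evaluate the chosen objective plus a weighted `total_variation(patch)`, and update `w` with an optimizer such as Adam.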