AI Village - Faults in our Pi Stars: Security Issues and Challenges in Deep Reinforcement Learning
Formal Metadata
Number of parts: 374
License: CC Attribution 3.0 Unported: You may use, adapt, copy, distribute, and make the work or content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified.
Identifier: 10.5446/49680 (DOI)
Transcript: English (automatically generated)
00:02
Hello, and thank you for joining this talk on the security issues and challenges in deep reinforcement learning. I'm Vahid Behzadan. I'm an assistant professor of computer science and data science at the University of New Haven, and I also direct the Secure and Assured Intelligence Learning Lab, or SAIL for short, working on AI safety, security, and applications of machine learning to
00:22
cybersecurity and safety of complex systems. So here's the outline of my talk. I'm going to quickly go over the basics of deep reinforcement learning and reinforcement learning. There will be some math, but I'll keep it to a minimum. Then we'll talk about vulnerabilities of deep reinforcement learning and whether deep
00:44
RL is susceptible to classical adversarial machine learning attacks like adversarial examples and such. We'll talk about and develop a threat model for deep reinforcement learning. We'll identify different attack models, attack surfaces.
01:02
We'll identify different types of vulnerabilities, and we'll introduce a number of attack mechanisms and corresponding defenses that have been developed in recent years. And we'll talk about the frontiers and areas of future research and work in this
01:27
area. So, just a quick overview. I assume most of the audience here is already familiar with the terminology used here. We can classify machine learning algorithms as supervised, unsupervised, and reinforcement
01:41
learning algorithms. Supervised learning algorithms are those where the training dataset includes labeled data, meaning that each data point in that dataset comes with the correct label, or the correct output expected from a model trained on that dataset.
02:01
Then there is unsupervised learning, which clustering, anomaly detection, and some other algorithms fall under, and where there are no labels available for the data points. There's no feedback available on the data. However, the goal is to find some underlying structure in the data.
02:20
And then there is reinforcement learning, which is concerned with the problem of sequential decision-making. There's a reward system included in reinforcement learning, but there are fundamental differences between the settings of reinforcement learning and supervised learning, and that's why it
02:41
merits its own category. We'll talk about those differences in the next few slides. Here you can see the general setting of reinforcement learning, or, in general, of sensorimotor or agent problems. In RL, we have an agent, which is to be trained by a reinforcement learning algorithm.
03:04
This agent interacts with the environment by performing some action. This action causes the state of the environment to change. Then the new state is observed by the agent. There's also some sort of reward associated with this change that is provided or inferred
03:23
by the agent based on this change in state. Well, you can think of this in terms of playing a game. The environment can be the game environment where different actions may result in scoring or loss of score, and the actions are, well, essentially the actions that the player can
03:46
take. The state of the environment can be the state of the game. For example, in Breakout, this includes the configuration of the bricks, where the paddle (the agent) is, and where the ball is. If you recall the game of Breakout, of course, if you are old enough to remember Breakout,
04:06
you're probably familiar with the dynamics. But you can think of any other game, and the setting still applies. What's the goal here? The goal is to learn how to take actions in order to maximize cumulative rewards.
04:22
Not instantaneous rewards, but cumulative rewards. So an RL agent is not just concerned with maximizing the current score. It wants to learn how to act so that at the end of the game, its total sum of rewards is maximized.
04:41
What are the different applications of RL? Of course, game playing is well publicized because it's a very good test bed, and it has become one of the common test beds and experimental settings for RL research. However, there are some major real-world applications for reinforcement learning.
05:01
In essence, RL, or reinforcement learning, is the machine learning response to the need for data-driven control problems. Where do we encounter such problems? Well, in robotics, like autonomous navigation, object manipulation, and such. Algorithmic trading, this is one of the areas that my research group has recently become active in.
05:21
Critical infrastructure, controlling smart cities, resource allocation in smart cities, traffic management, intelligent traffic systems, smart grid management, and control. Healthcare, such as clinical decision-making, this is actually one of the better known applications of RL in the real world,
05:41
and other types of resource management and applications in operations research and such. So RL is either envisioned or is already heavily adopted in various industries, many of which are critical and may become targets of malicious actions.
06:02
Let's formalize the RL problem a little more. I'm just going to quickly go over this. There will be some math, but this is just to introduce the basic settings. The underlying framework to formulate the RL problem and provide a framework to think about and reason about the RL problem is the Markov decision process.
06:20
Why do we call it the Markov decision process? Because it's based on the Markovian assumption, or the Markov property, which says the current state completely characterizes the state of the world. So if you have the current state and you perform an action, the next state is only going to depend or be a function of the previous state
06:41
and nothing before it. You don't need to know the history of the environment. All you need to know is what the previous state has been to infer what the coming state is going to be. Now, in Markov decision processes, a problem or setting is formulated by a tuple of five main parameters.
07:01
One is the set of possible states, S. Set of possible actions, A. Distribution of reward given a state-action pair, or R. This distribution can also be a function; it doesn't necessarily have to be probabilistic. It can just be a function that tells you that if, at a particular state Si, an action Ai is taken,
07:21
then the reward is going to be, let's say, Ri. Then there is transition probability, or transition dynamics, P, which represents the dynamics of the environment. Which state are we going to end up in if we are in a particular state
07:40
and we perform a particular action? And there's finally a discount factor which defines how myopic the agent is, how much it values rewards that may occur further in the future, further down the line. Now, in general, the RL setting is based on the following process.
08:03
At time step t equals zero, in the beginning, the environment samples an initial state S0. Then, from t equals zero until done (this is a loop), the agent selects an action At based on some criteria. It can be completely random in the beginning, and then slowly becomes more targeted and more policy-driven.
08:23
It performs some action. The environment produces a reward signal and the next state, and the agent receives those signals. It may be partial, it may be incomplete, or it may be noisy, but the agent receives some signal resulting from the transition
08:42
to the new state St plus one and a reward that has emerged from that transition, that has resulted from that transition. Now we define a new entity here called pi. A policy pi is a function, or a distribution in the probabilistic case, that maps from state to action.
09:03
So it tells the agent what action to perform given any state. And the objective is to find the optimal policy pi star that maximizes the cumulative discounted reward. We also call this return.
09:22
So, cumulative, meaning the sum of rewards in the entire duration of interaction. It can be one episode of a game, it can be throughout the training period or the training horizon, and it's discounted. This gamma is the discount factor that defines how myopic the agent is or how much it values events that occur further down the line.
09:43
It's typically a constant value between zero and one, and as you can see, as t increases, this gamma to power t decreases. So the value, the preference, or the observed value
10:00
of something that happens further down the line, which means with greater t, is going to decrease as t increases. So again, the objective is to find the optimal policy pi star that maximizes the sum here. All right, we're going to quickly define two other definitions here.
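As a minimal sketch, not from the talk, here is the interaction loop and discounted return just described, written against the Gymnasium API; the environment name and the random policy are placeholders.

```python
import gymnasium as gym  # successor to the OpenAI Gym API mentioned later in the talk

GAMMA = 0.99  # discount factor, a constant between 0 and 1

def run_episode(env, policy):
    """Roll out one episode and return the discounted return G = sum_t gamma^t * r_t."""
    state, _ = env.reset()
    G, t, done = 0.0, 0, False
    while not done:
        action = policy(state)                                      # pi: state -> action
        state, reward, terminated, truncated, _ = env.step(action)
        G += (GAMMA ** t) * reward                                  # accumulate discounted reward
        t += 1
        done = terminated or truncated
    return G

env = gym.make("CartPole-v1")                                       # placeholder environment
random_policy = lambda s: env.action_space.sample()                 # stand-in for a learned pi
print(run_episode(env, random_policy))
```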
10:21
If we want to evaluate a particular state in these settings, one approach is to measure the value of that state using the value function. The value function of state s is the expected cumulative reward from following the policy pi from this state. So let's assume we already have a policy.
10:42
What will be the expected cumulative reward if you start in state s and keep following policy pi? Now, we also define a Q value or a Q function, which tells us how good is a particular action a
11:01
if it's performed in state s. In other words, if we are in state s, if the agent is in state s and performs action a, the Q value of this setting is the expected cumulative reward from taking action a in state s and then following the policy. Remember, the policy is a function that tells the agent
11:22
what action to perform given a new state. When the agent performs action a in state s, it goes into a state s prime or the next state, and then we can use this policy function here to see what the action should be, what action should the agent take at that state, which takes it to another state,
11:40
and then the policy is used to figure out the action to take in that state, and this goes on until termination of the episode or the horizon. Now that you're familiar with the value function and the Q function, let's go a bit deeper.
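In symbols (a standard formulation rather than a slide from the talk), the value function, the Q function, and the optimal policy read off the optimal Q are:

```latex
V^{\pi}(s)   = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\ \pi\Big], \qquad
Q^{\pi}(s,a) = \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \,\Big|\, s_{0}=s,\ a_{0}=a,\ \pi\Big], \qquad
\pi^{*}(s)   = \arg\max_{a} Q^{*}(s,a).
```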
12:02
The policy pi, the value functions V and Q, and the model, that is, the transition dynamics or the reward model, are all functions. We want to learn at least one of these from experience. This is the essence of RL. If there are too many states, however, we cannot just tabulate everything and try to experience every state
12:22
and the corresponding result from that state or that state and any action performed because as the size of the problem, as the state space and the action space increase in settings like, let's say, playing GTA 5 or a self-driving navigation policy in the real world, these state spaces just explode.
12:43
The dimensionality is too high and it's just not feasible to store every possible state. In those cases, we need to approximate. In general and traditionally, this is called RL with function approximation. If function approximation is done using deep neural networks,
13:00
then we call the training setting or the RL setting deep reinforcement learning. The term is relatively new. It was, I believe, introduced in late 2013, early 2014 by Mnih, David Silver, and others at DeepMind in their deep Q-learning paper. However, the concept is not very new. That said, there have been fantastic
13:22
and even mind-blowing advances in this area in the past three or four years. We'll talk about some of those as we move forward. A quick overview of the taxonomy of different RL approaches and RL agents. Remember, we have value, function, policy, and model. There are different approaches to solving the RL problem.
13:40
Sometimes we just want to find a policy directly. These, the approaches that respond or satisfy that need are called policy-based RL. Sometimes we want to find a Q function or a V function first and then derive the policy from those functions.
14:01
These approaches are called value-based. Sometimes we want to first learn a model of the environment and then solve the problem. These are called model-based approaches. Sometimes we don't have the model and we don't want to learn the model explicitly, and those are called model-free. When we're dealing with both a value function and a policy at the same time,
14:20
in other words, we have one agent learning the value function, another learning the policy, and then contrasting those with each other in a kind of zero-sum setting. We call those actor-critic RL approaches. As you can see, there are different approaches to different settings and different problems. However, all of those can still be grounded
14:42
on top of the Markov decision process framework and the general solution approach we looked at before. One of the better known approaches to RL, which falls under the value-based approaches, is Q-learning. The objective in Q-learning is to derive the optimal policy pi star based on optimal Q.
15:02
So the optimal Q is one that maximizes the value of each state and action, and from that point onward, with an iterative formulation based on the Bellman equations or dynamic programming,
15:22
it becomes possible to find this through bootstrapping and iterative re-estimation of the Q value. There are different approaches to Q-learning in large state spaces or large action spaces where function approximation is required.
15:41
One such approach is to parameterize Q as a function of s, a, and an estimation parameter theta, that is, Q(s, a; theta). If we approximate this parameterized Q function with neural networks, where theta corresponds to the weights of that neural network, the approach is called Q-networks.
16:03
This solution approach was proposed in the early 2000s, or even earlier, but was perfected in some sense in 2013-2014 by DeepMind, by David Silver and his team, which resulted in the proposal of deep Q-networks, or DQNs.
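As a rough illustration, not the speaker's code, the bootstrapped re-estimation behind Q-learning in the small, tabular case looks like the sketch below; DQN replaces the table with a neural network Q(s, a; theta) and evaluates the max term on a separate, periodically updated target network.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * np.max(Q[s_next])  # bootstrapped Bellman target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy example with 5 states and 2 actions (hypothetical numbers).
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
```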
16:24
So we're talking about deep networks. What does that mean? It means that these networks use deep neural networks like CNNs, which help with both function approximation and also end-to-end feature learning. One of the advantages of, well,
16:42
deep learning is its superior performance in learning feature representations, especially from images. Now, for those of you who are more statistically oriented, you may have seen the term IID. It means that the data are independent
17:02
and identically distributed. It means that one data point does not depend on another data point, and also that each data point is drawn from the same distribution. These assumptions do not hold in RL settings.
17:20
We know that a particular state, for example, is highly correlated with its previous state and action. Why does this matter? Well, a lot of our supervised and deep learning approaches are based on the assumption that the data, the training data, is IID. When this is not possible,
17:41
what happens is that, in response to this problem, the DQN approach introduced experience replay, essentially a bag of old data which is randomly sampled in each training iteration to reduce the sequential correlation,
18:00
the temporal correlation of data points and also make it more likely for data to be evenly distributed. Also, to reduce the effect of oscillation during training, DQN uses fixed parameters for a target network. The target of optimization is fixed
18:20
and is updated only every few thousand iterations, so that reduces the oscillation problem. And of course, the rewards are normalized (clipped) to minus one to one to make sure the reward signals are bounded.
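A minimal sketch of such an experience replay buffer (an illustration, not the DeepMind implementation): transitions are stored as they arrive, and training minibatches are drawn uniformly at random to break the temporal correlation.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Uniformly sample a minibatch, which weakens the sequential correlation of the data."""
        return random.sample(self.buffer, batch_size)
```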
18:41
Now that we have a preliminary understanding of deep reinforcement learning and one of its popular implementations, DQN, let's take a quick look at adversarial machine learning. I assume that by now, those of you who were not familiar with adversarial examples have been introduced to the concept. With adversarial examples,
19:01
we are speaking in the context of supervised learning of image classifiers. Let's say we have an image classifier trained on a set of images of pandas, cats, and other objects to identify the object in those images. In the beginning, we pass the image of a panda to the classifier and it correctly classifies it as a panda. Now, it has been demonstrated,
19:22
it has actually been established by now, that it is possible to induce incorrect classifications in deep learning models, or machine learning models in general, by adding minute, minimal perturbations to the original image. As you can see,
19:41
it's almost impossible to detect or see any changes in this final image. These perturbations, these pixel perturbations are very small. Now, this is one example of different attacks or different vulnerabilities in the classical realm of adversarial machine learning,
20:02
which is mostly concerned with supervised learning and sometimes unsupervised learning. In general, the adversarial objectives in AML or adversarial machine learning can be classified under the traditional CIA triad, confidentiality, integrity and availability. So with respect to confidentiality and privacy,
20:23
an adversary may wish to target the confidentiality of the model parameters or model architecture, since intellectual property theft is an issue for larger models, or the adversary may target the privacy of the training and test data.
20:41
For example, if medical records were used in the training data, there have been proof of concept attacks showing that it is possible to infer whether a particular patient or the records pertaining to a particular patient were used in the training data or not, or sometimes it's possible to reconstruct the training data set by just having access to the model itself.
21:01
And that is a major HIPAA violation. In essence, it's a privacy violation. Also, with regards to integrity and availability, the attacker may target the integrity of the predictions or the outcome of the model, the performance of the model. For example, adversarial examples are an attack
21:21
on the integrity of image classifiers or supervised machine learning models. And also an adversary may target the availability of the system that is deploying machine learning, for example, a facial recognition system or an autonomous navigation system in a driverless car. Now that we know so much about adversarial machine learning
21:44
and security vulnerabilities of classical machine learning, supervised learning and unsupervised learning, there is a major question, and it's whether deep RL is immune to those attacks. Back in late 2016, when the research community was just beginning
22:03
to pay attention to both deep reinforcement learning and the issue of adversarial examples, I came up with this question and decided to experiment with it a little to find out whether deep RL can also be vulnerable to such attacks.
22:22
I started from a simple observation. The deep neural networks in DQN models and in classifiers are both function approximators. At training time, they're function approximators; at test time, they're just functions. And I came up with this hypothesis: if classifiers are vulnerable to adversarial examples,
22:40
then the action-value approximators of DQNs may also be vulnerable. All right. So I started to set up an experiment where the aim was to perform adversarial attacks. The adversary's goal was twofold. The first was a test-time attack:
23:01
to perturb the performance of the target's learned policy. So the target at this point is fully trained and is deployed in the environment, and the adversary wants to somehow cause the agent to perform incorrectly, to manipulate its policy. What does the adversary know about the target? It knows the type of input to the target.
23:22
For example, it knows whether the target policy is looking at image data, text, audio, and such. Why? Because it helps with estimating the architecture. For example, if it's images, the adversary can come up with a good guess that the target architecture includes CNN,
23:41
so convolutional neural networks. I also assume that the adversary knows the reward function. So the adversary may have access to the environment, and if it's, for example, a game environment, it knows what the scores are, how the scores are generated. What is not known? The knowledge of target's neural network architecture is not known.
24:01
So it's a black box attack, and also the initial parameters, the initialization of the neural network, the target neural network is also not known. What is available to adversary in terms of actions? The adversary may perturb the environment where the target performs. For example, it can change pixel values
24:21
in a game environment through a man-in-the-middle attack. I consider in this work two techniques for perturbing that environment. One is the classical fast gradient sign method for generating adversarial examples, and the other is the Jacobian-based saliency map attack, or the JSMA approach,
24:41
introduced by Papernot et al. in 2015, I believe. Now, in that experiment, I used the classical DQN approach, the classical DQN architecture introduced by Mnih et al., in an Atari game, the game of Pong.
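As a generic illustration of the first of those techniques (not the code used in this experiment), an FGSM-style perturbation nudges every input feature by a small epsilon in the direction of the sign of the gradient of the attacker's loss with respect to the input; the attacker computes or estimates that gradient on its own model or on a replica.

```python
import numpy as np

def fgsm_perturb(observation, grad_wrt_input, epsilon=0.01):
    """Fast Gradient Sign Method: x' = x + epsilon * sign(dLoss/dx), kept in a valid pixel range."""
    perturbed = observation + epsilon * np.sign(grad_wrt_input)
    return np.clip(perturbed, 0.0, 1.0)  # assumes pixel values normalized to [0, 1]
```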
25:01
And this was all implemented in OpenAI Gym with TensorFlow, back then PyTorch wasn't really a thing, and trained the agent against heuristic AI. And here's the initial proof-of-concept result. This is for the white box attack. Of course, later on, I will show you the results for the black box attack. And you can see that for FGSM and JSMA,
25:24
regardless of how far along in training the agent is, the policy for the Pong agent is highly vulnerable to adversarial perturbations through simple techniques like FGSM and JSMA.
25:42
You can see that for JSMA, the success rate was 100% for all of the cases. For FGSM, it was slightly lower, and that was mostly because of the termination criteria and the perturbation threshold that I had defined. It can be observed that the policy can be very easily manipulated
26:02
through adversarial example attacks. So, 100 random observations were perturbed with FGSM and JSMA, the perturbed images were fed to the trained neural network representing the agent's policy as test input, and then the success rate was measured. Now, is this type of attack practical?
26:21
This is really something that you should be worried about. So a few years later in 2018, Clark and his co-authors published the report of a similar attack on an autonomous robot based on DQN policy using ultrasonic sensory input for collision avoidance.
26:46
And they had shown that they can use adversarial perturbations to manipulate the trajectory of the robot and make it follow a path defined or desired by the adversary, not one that the robot itself wants to follow.
27:04
There are more recent examples of how this sort of attack on deep RL, or in general, attacks on deep RL can be of concern. One of the recent works by my graduate students is on attacks on automated trading algorithms
27:20
based on reinforcement learning. The paper is coming out in a couple of months, so I can't go into more details, but this is one of the more severe and urgent cases for security researchers to consider. Deep RL is already being used by many major financial players and stock traders, and it can be easily manipulated in the real world. There are other cases, of course,
27:41
but this is one of the examples that demonstrates the practicality and the applicability of this attack to real-world scenarios. Now, before we go further into different types of attacks, let's develop a threat model for deep reinforcement learning. Again, the adversary's objectives can follow those of the CIA triad.
28:01
The adversary may wish to access internal configurations like the model parameters, reward function, policy, and such, to steal the model for intellectual property theft. There can be attacks on integrity, which means compromising the desired learning or enactment of the policy. There can be attacks on availability,
28:20
which are essentially compromises of the ability of the agent to perform training or actions when needed. Now, let's look at the attack surface of the RL. This is the general block diagram of a DRL agent or DRL system. Now, we have the agent. The agent typically has some memory
28:41
where it stores its experiences during training and then those experiences help with function approximation, data-driven policy learning and such. And then there is an exploration controller, which controls how the agent explores the environment during training. There's an experience selector, how to select experiences from the bank of data
29:02
or observations stored in this dataset. There's an actuator, which then enacts the actions of the agent inside an environment. The environment is connected back to this agent block through an observation channel, observation of the states and reward channel. And it's of no surprise
29:20
to the more seasoned security researchers and professionals that all of these components can be a subject or target of adversarial attack. As we go forward, we'll cover some examples of attacks that can occur on each of these components. We've already seen an attack on the observation channel.
29:41
There are attacks on the reward channel, which we'll hopefully touch on. And there are attacks on the agent during training and the actuator. Let me give you a quick example of the actuator attack. Let's assume we have a robot, an actual robot, learning to navigate in an environment while avoiding obstacles.
30:00
If the robot commands or decides to, let's say, move the left wheel forward, but there is some sort of obstacle in front of the left wheel and the left wheel doesn't actually move, then
30:22
the resulting observation is going to be skewed, because the agent is going to assume that the actuation has happened, then look at the changes in the observation, and use that to retrain its policy, to optimize its policy, based on faulty data. What are the adversarial capabilities? Well, we first look at different attack modes.
30:40
The attacker can perform a passive attack where it's only observing the target, it's not changing anything, or it can perform an active manipulation. In passive attacks, the attacker can perform inverse reinforcement learning to learn about the reward function of the agent,
31:01
or later on, we'll see, it can perform imitation learning to steal the policy. Active measures include attacks on the actuation, observation, or the reward channel, and attacks on observation can be targeting the representation model, how the agent sees the environment, or perturbing the transition dynamics,
31:20
how the agent sees the changes in the environment. Okay, going back to our initial proof of concept, remember our original goal was to perform a black box attack. So to achieve this objective, we introduce an approach
31:42
based on the transferability of adversarial examples. So we create a second DQN, the adversary creates a second DQN with similar architecture, but different initial parameters. When I say similar, it doesn't necessarily have to be a match, it just needs to be a convolutional neural network with some functional approximation techniques, but it doesn't need to have the same parameters
32:00
or the exact same architecture. And it trains that agent, that model on the same environment, the assumption is the adversary has access to that environment, and then uses the knowledge of that architecture and the trained replica policy.
32:21
It crafts adversarial examples the same way it did for the white box case. And we know that many of those adversarial examples can transfer to similar models, train on similar data, which also you can see applies to DQN policies as well. And this is how we implemented a black box attack
32:41
against deep RL policies. Now, what about training time attacks? In the same paper, we introduced the policy induction attack, where the attack is of the adversarial example type against a DQN agent during training.
33:01
Now, what are the different steps in this attack? First, the adversary derives an adversarial policy from the adversarial goal by training on the same environment where the target is going to perform or be trained in. So if the adversary wants to minimize
33:21
the reward gained in the game by the agent, then the goal, the optimization goal, is going to be the exact opposite of that of the target policy. The target wants to maximize the reward and the adversary wants to minimize it. There are, of course, different ways of formulating this adversarial goal.
33:43
Then the adversary creates a replica of target's DQN and initializes it randomly. And then comes the exploitation phase where the attacker observes the current state and transitions in the environment,
34:01
then estimates the best action according to the adversarial policy derived in step one. Then the attacker crafts perturbations to induce that adversarial action, based on the replica of the target's DQN. This is exactly the same as our black box test time attack. The attacker applies the perturbation
34:21
as a man in the middle in the observation channel. The perturbed input is revealed to the target and the attacker waits for the target's actions. And this is a loop. The loop can go on until the training process of the target either converges to a suboptimal policy or up to a certain number of iterations.
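Putting those steps together, the exploitation loop can be sketched roughly as follows; `adversarial_policy`, `replica_dqn`, `craft_perturbation`, and the `target_agent` interface are hypothetical placeholders for the components described above, and the crafting routine would be an FGSM- or JSMA-style attack run against the replica.

```python
def policy_induction_attack(env, target_agent, adversarial_policy,
                            replica_dqn, craft_perturbation, max_steps=100_000):
    """Man-in-the-middle training-time attack: perturb the observation channel so the
    target's own training drives it toward the adversary's policy."""
    state, _ = env.reset()
    for _ in range(max_steps):
        desired_action = adversarial_policy(state)                          # what the adversary wants next
        perturbed = craft_perturbation(replica_dqn, state, desired_action)  # crafted on the replica DQN
        action = target_agent.observe_and_act(perturbed)                    # target only sees the perturbed input
        state, reward, terminated, truncated, _ = env.step(action)
        target_agent.learn(perturbed, action, reward, state)                # target trains on poisoned experience
        if terminated or truncated:
            state, _ = env.reset()
```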
34:42
This is a very rough plot, it's not smooth yet, but you can see that the unperturbed agent, this is in the game of Pong still, moves towards convergence to an optimal total sum of rewards, while the attacked agent
35:00
moves towards convergence to the minimum possible return of zero. It's getting closer and closer to zero, which indicates that the training process of DQN, and of deep RL in general, can also be targeted through adversarial attacks. Now I'm going to introduce another type of training time attack.
35:22
Again, this aims to induce some form of misbehavior. We call this misbehavior addiction, and this is a follow-up to a work I did with my colleague Roman Yampolskiy on psychopathological modeling of AI safety problems. This is a proof of concept. We consider the game of Snake,
35:42
many of you probably remember Snake from older Nokia phones. The DQN agent is a snake and is learning to play in this environment. What the attacker does is it adds a drug seed with more instantaneous reward than the typical seed, but it also results in more
36:01
increase in the length of the tail. And you can see where this can end up: if the increase in the tail length is more than a certain amount, then a longer-tailed snake is bound to eat its own tail sooner
36:22
rather than later. We actually derive some theoretical closed-form solutions for what the additional reward and the increase in tail length should be for addiction to emerge, meaning that the agent learns a more myopic policy
36:41
instead of the optimal policy. And you can see that it's actually possible to make the agent addicted to the drug seed, as we call it. And this results in learning a suboptimal policy. And of course, due to time limitations, I'm going to introduce
37:00
only one more type of attack, and it's that of targeting the confidentiality of the deep RL policy. The problem here, or the question, is: is it possible to extract a deep RL policy from observations of its actions? Why does this matter? Well, the security challenge posed by this sort of action
37:23
is, of course, model theft. A company like Google, or let's say Uber or Waymo, may have spent millions or billions of dollars on coming up with a very accurate deep RL policy for autonomous navigation. And if it can be stolen by an adversary, then the intellectual property becomes worthless.
37:42
And also, a stolen policy, an extracted policy, can be leveraged in integrity attacks in the same way that we mounted black box attacks on DQN policies. So let's see. As it happens, a branch of reinforcement learning, or in general,
38:01
one solution to the sequential decision-making problem is not in the RL domain, but in the supervised learning domain, and it's called imitation learning. Imitation learning is the supervised learning of policies from observed behavior of an expert. And by behavior, I mean state action behavior. What is the policy of an expert? Based on this concept in 2018,
38:24
Hester et al. proposed DQfD, or deep Q-learning from demonstrations, which is DQN where the initial training is done based on observed data using deep learning.
38:40
So they have data from human players playing a certain game, or human performance on a certain task that they want the agent to learn. The initial step of training for a DQfD agent is supervised learning on the observed data, and then it starts building on top of it
39:01
through reinforcement learning approaches. And it was shown that it can result in faster convergence, better sample complexity, and sometimes more interesting and robust policies. As security researchers, you can probably see where this is going. This wonderful algorithm, DQFD, can also be used to replicate policies
39:23
instead of applying it on observed data collected from human performance, it can be applied on observed data from a target policy. So here's a proof-of-concept attack procedure. The attacker observes and records interactions of the SARSA type,
39:41
state, action, the next state, and the reward resulting from this transition, of the target agent in a particular environment. And then the attacker applies DQfD to learn an imitation of the target policy and Q function.
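A minimal sketch of the observation step (my own illustration, in which the attacker replays the target policy in its own copy of the environment; in a fully passive setting the same tuples would simply be logged while watching the target).

```python
def record_demonstrations(env, target_policy, n_transitions=5_000):
    """Collect (state, action, reward, next_state, done) tuples produced by the target policy."""
    demos = []
    state, _ = env.reset()
    while len(demos) < n_transitions:
        action = target_policy(state)  # observed behavior, not chosen by the attacker
        next_state, reward, terminated, truncated, _ = env.step(action)
        demos.append((state, action, reward, next_state, terminated or truncated))
        state = next_state
        if terminated or truncated:
            state, _ = env.reset()
    return demos  # demonstration set handed to DQfD or another imitation learner
```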
40:01
Now, at this point, the attacker may either just go away and sell the extracted policy, or it may decide to target it using different adversarial perturbation attacks, some of which we've covered so far in this talk. So, as a proof of concept, we consider a slightly less complex environment, CartPole, where the objective is to stabilize
40:22
the pole on the cart by moving the cart right and left. The reason for choosing this simple environment is merely economical, because we didn't want the experiment to take days or weeks. We start with a simple case, and we consider different types of policies:
40:43
DQN with prioritized replay (an enhanced version of the classical DQN), proximal policy optimization (PPO), and asynchronous actor-critic. And, of course, we also train an adversarial RL agent, a DQN agent, whose objective is to incur maximum loss of reward,
41:02
or in other terms, in more technical terms, maximize the regret of its target. And here are the results. First, with regards to replication progress, based on only 5,000 demonstrations, 5,000 state action, next state reward observations, we see that all three policies
41:21
are almost exactly replicated. We can see convergence to the optimal performance of those policies in the environment. And then we perform adversarial training. We train an adversarial RL agent to attack and maximize the regret of those policies.
41:44
And you can see that this can also be easily achieved within very few iterations of training in CartPole. I believe we can see that for PPO2, which is a somewhat robust deep RL algorithm or approach,
42:03
it's possible to incur maximum damage or find a policy that incurs maximum regret on the target within 60,000 iterations, which is a relatively low amount. Now, what about defenses? So, of course, similar to adversarial examples,
42:25
one approach or one technique for reducing the impact of adversarial example attacks is through regularization. And one common type of regularization in supervised learning for mitigating adversarial example attacks
42:40
is adversarial training: training the model on adversarially perturbed samples to make sure that it sees different perturbations of the same image and knows that all of those result in the same, correct label. This is, essentially, a form of data augmentation used as a regularization technique.
43:01
So, in 2017, when I was just starting to look into this problem or this domain, I had gone through the adversarial training literature and thought that the same may also hold true for deep RL. I made a hypothesis, actually, two. One is with regards to recovery. If training time attacks are not contiguous,
43:22
if not all of the observations are perturbed, then DRL adapts to the environment and adjusts the policy to overcome the attacks. This is training time. And with regards to robustness, I made another hypothesis. Such policies, policies trained under attack
43:41
are more robust to test time attacks. This particular investigation is published in a paper titled Whatever Does Not Kill Deep Reinforcement Learning, Makes It Stronger. So, similar to before, we are looking at DQN in Atari games: Breakout, Enduro, and Pong.
44:03
Now, the way I designed the experiment was based on the probability of attack. So, as an attacker, I assign a certain probability for each state, for each observation during training time to be perturbed. I perform experiments,
44:21
different experiments with different values of this P attack, 20%, 40%, 80%, and one, which means contiguous attack. And it's interesting to see that for values of P less than 50%, the agent actually recovers. But for values greater than 50%,
44:41
the training process plummets, and either does not converge or converges to a very, very low mean return value, or mean total reward value. I later on published a theoretical analysis of why this happens.
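As a sketch of that experimental knob (with hypothetical helper names), each training-time observation is perturbed independently with probability p_attack before it reaches the agent.

```python
import random

def maybe_perturb(observation, perturb_fn, p_attack=0.2):
    """With probability p_attack, hand the agent an adversarially perturbed observation
    (e.g., FGSM-crafted); otherwise pass the clean observation through."""
    if random.random() < p_attack:
        return perturb_fn(observation)
    return observation
```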
45:01
That analysis is available in my PhD dissertation, which I'll reference in the final slide. Also, it was interesting to see that the robustness hypothesis is also true. You can see here that after training, if we attack the test-time policy with a probability of one,
45:20
the plain or vanilla policy, a policy that was not trained adversarially, performs very poorly. However, policies trained under adversarial attacks with probability 0.2 or 20% and 0.4 perform really well. For 80%, even for policies trained at 80%,
45:42
as you can see, the policy itself is already performing very poorly. It's surprising to see that at P equals one, the performance gets slightly better. It's comparable with 40%. To this day, I'm not entirely sure why this happened. I've repeated the experiments a number of times, I still get the same result. I still don't know why this happened.
46:01
And this is one of the interesting problems that we are looking at right now as a research group. Another defense that we introduced is based on parameter space noise. This is very much like dropout. The idea of parameter space noise was introduced in 2017, I believe independently by Plappert et al. and Fortunato et al.
46:24
And the idea here is, again, similar to dropout, to introduce zero mean random noise to the learnable parameters of neural network in deep RL to enhance exploration and convergence in deep RL benchmarks. Now, in another paper in 2018, we investigate whether this approach can be used
46:43
to mitigate the impact or severity of policy manipulation attacks on DQN. And it was shown that it actually performs very well compared to vanilla or classical DQN.
47:04
And these are the training time results. You can see that if noisy net is used, if parameter space noise is used, the performance degradation for all environments is at a much lower slope, at a much lower rate than the vanilla architecture.
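A minimal sketch of the underlying idea (not the published NoisyNet or parameter-space-noise implementations, which adapt or learn the noise scale): zero-mean Gaussian noise is added to a copy of the learnable parameters, and that noisy copy is used when the agent selects actions.

```python
import numpy as np

def noisy_parameters(params, sigma=0.05, rng=None):
    """Return a copy of the parameter arrays with zero-mean Gaussian noise added."""
    rng = rng if rng is not None else np.random.default_rng()
    return [w + rng.normal(0.0, sigma, size=w.shape) for w in params]

# Hypothetical usage: evaluate the Q-network with the noisy weights when picking an action,
# so the observation-to-action mapping is harder for an attacker to pin down.
```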
47:20
Finally, to propose a solution for the policy extraction problem, I, along with William Hsu at K-State, came up with the idea of watermarking RL policies. Watermarking has already been introduced in deep learning in general. The idea is to come up with a unique signature
47:43
that is both difficult to remove and does not impact the performance of the policy itself or the model itself, but it still provides a unique signature, proof that a model is the same as another model or is a replica of the suspected model.
48:05
So we introduced an interesting watermarking procedure. I know it's arrogant of me to call my own work interesting, but I still get excited when I think about the moment I came up with this idea. The idea is to create a second environment whose state space is disjoint from that of the main environment.
48:21
If you're training an agent to play a game, we create another environment in which none of the states are the same as in the original training or deployment environment of the agent, but the dimensionality of the states is the same. So if each state in the original environment
48:42
is represented by, let's say, three values, three features, then each state in the second environment is also represented by three features. It doesn't really matter what the second environment looks like. It's just some other environment that the agents may interact with.
49:01
And then we craft the transition dynamics and reward procedure for the second environment such that the optimal policy follows a looping trajectory. So an optimal policy for an agent trained in the second environment is going to be one that follows a loop, goes to, let's say, state one, then state two, state three, and then goes back to state one.
49:22
During training, what happens is we periodically alternate between the two environments. So let's say at every n iterations of the training process, we take our RL agent from the original environment, we drop it in the second environment, train it for a few iterations, and then bring it back
49:41
to the original environment. Now, once trained, if we want to examine the authenticity of a policy, or whether a policy is copied or not, we apply the policy in the second environment and measure the total reward. Here's the experimental setup. Again, we are working with CartPole.
50:01
So the watermarking environment is defined with five states, states one to four, plus a terminal state, which should never be reached if the policy is optimal in this environment. And none of these states,
50:21
as represented here, can occur in the original CartPole environment. These are all definitely and absolutely impossible in the original environment. As for the transition dynamics, this is how we've defined it. Let the two available actions
50:45
be A0 and A1. If the agent is in state i and performs action i modulo two, that is, action zero if i is even
51:01
and action one if i is odd, then the next state is going to be state (i modulo 4) plus one, and it receives a reward of one. If the agent, instead of going, say, from state one to state two,
51:22
performs any other action, it will immediately go to the terminal state and receive a reward of zero. So the optimal trajectory is state one, to two, to three, to four, back to state one, and so on.
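A sketch of that watermark environment as just described (my own reconstruction; the integer states stand in for the impossible CartPole-like observations, and the episode length in the verifier is a hypothetical choice).

```python
class WatermarkEnv:
    """Toy watermark MDP: states 1..4 in a cycle, plus a terminal state 0."""

    def reset(self):
        self.state = 1
        return self.state

    def step(self, action):
        if action == self.state % 2:           # the single rewarded action in state i is i mod 2
            self.state = (self.state % 4) + 1  # cycle 1 -> 2 -> 3 -> 4 -> 1
            return self.state, 1.0, False
        self.state = 0                         # any other action: terminal state, no reward
        return self.state, 0.0, True

def verify(policy, n_steps=100):
    """A watermarked policy should collect close to n_steps reward here; a non-watermarked
    policy is expected to hit the terminal state almost immediately."""
    env = WatermarkEnv()
    state, total = env.reset(), 0.0
    for _ in range(n_steps):
        state, reward, done = env.step(policy(state))
        total += reward
        if done:
            break
    return total
```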
51:41
All right, let's see how it works. Let's look at the test-time performance comparison of watermarked and nominal, non-watermarked policies. The watermarked policy performs exactly as well as the non-watermarked policies. It reaches the optimal or best performance of 500. When it's applied to the watermark environment,
52:00
the second environment, it also performs optimally; it gets the maximum reward possible. However, you can see that if we try to apply the non-watermarked policies to the watermark environment, we'll see very, very small values of scores,
52:20
the total reward. So you can see that it's possible to determine whether a policy is authentic or not, or whether a policy is an exact copy of another policy by just applying it to a second environment and see whether it performs optimally or not. Now, there are many other things
52:41
that I really wanted to touch upon in this talk, but unfortunately we are a little short in time. For practitioners, it may be of interest to have some way of benchmarking or evaluating the resilience and robustness of policies and compare different policies, different approaches with regards to their resilience and robustness.
53:01
Some of my work already introduces or proposes an RL-based approach to perform this evaluation and benchmarking. I've also done some work on investigating the impact of hyperparameter choices on the resilience and robustness of DQNs in particular, but also of other model-free and actor-critic approaches.
53:25
This can be very helpful to those who want to engineer and design the new RL agents to be deployed in critical environments. Also, something that I wanted to mention, but unfortunately I don't have time to do so, is that adversarial training is not a silver bullet.
53:44
It's not an answer to all of the problems in DQN agents. There are certain limitations for robustness and resilience obtained from adversarial training of DQN agents. And also adversarial training is very costly in general,
54:03
especially when it comes to real-world scenarios, real-world environments and actions. And some of my recent work is focused on improving the sample efficiency and computational cost of adversarial training via a new exploration mechanism called Adversarially Guided Exploration, or AGE.
54:22
All of this work can be found in my PhD dissertation, which bears the same name as this talk, Faults in our Pi Stars: Security of Deep Reinforcement Learning. You can find it if you search my name on Google Scholar. If you're interested, of course, all of these are published in separate papers
54:40
in slightly more details under the same titles. And finally, some of the open areas of research in this domain. With regards to training time resilience and robustness, not much has been done with regards to policy search and actor-critic methods, as well as model-based and hybrid methods.
55:02
Of course, when we talk about model-based, there are some approaches from optimal control theory and approximate dynamic programming that may be applied here, but very few have looked at this problem from a security point of view. So if you're interested, this is one of the areas that is in dire need
55:21
of security-oriented investigation. As for mitigation of policy replication, one of the ideas that my research group is currently working on is constrained randomization of policy. So randomize the policy such that the replication through techniques like imitation learning becomes more costly, more samples, more observations will be required
55:41
while preserving the performance of the policy. There's almost no work done in multi-agent settings. Of course, adversarial reinforcement learning has been investigated in settings where there are zero-sum agents, but not really where there is an external adversary or adversaries trying to exploit the inner workings of the agents,
56:02
the RL components of the agent. One more thing that is of note is the importance of discounting. The addiction problem that I demonstrated earlier in the Snake agent is mostly due to the constant discounting solution. For those of you who come from a reinforcement learning background and are familiar with the basics
56:20
of reinforcement learning, you probably know that the discount factor is typically chosen to be 0.99 or something in the same ballpark and is left the same. It's treated as a constant throughout the training process. But this is very far from how our brain works and very far from the optimal approach
56:42
or accurate approach to discounting. Our research group has recently started looking into this problem and is working on developing adaptive discounting solutions to enhance the resilience and robustness of RL agents, particularly deep RL agents
57:00
in complex environments, for AI safety and security purposes. Now, of course, there are also naturally inspired approaches that can be looked at, for example, approaches coming from, let's say, TD-lambda models or dopamine models of psychopathological or neurological problems and the solutions prescribed for those, as well as approaches in the social sciences
57:22
which may help with the security problems arising in multi-agent RL settings. Very well, thank you very much. And I believe at this time I should be available for your questions.