Recent Progress in End-to-End Learning for the Physical Layer

Formal Metadata

Title
Recent Progress in End-to-End Learning for the Physical Layer
Title of Series
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2019
Language
English

Content Metadata

Subject Area
Abstract
End-to-end learning is one of the most promising applications of machine learning for the physical layer of communication systems. I will provide a tutorial introduction to the topic and will discuss recent results as well as future research directions.
[Music] Okay, first of all, thanks for inviting me here. I'm quite humbled to speak in such a prestigious place, in particular between two well-known professors who address much bigger challenges than what I'm trying to do. The goal of my work is simply to make communication systems better, and one way of doing this is using machine learning. So I would like to speak a little bit about the idea of learning entire communication systems from scratch.
To set the stage for what I'm trying to do, let's go back to Shannon's original paper, where he describes the problem of communication as that of reproducing at one point, either exactly or approximately, a message which has been selected at another point. This is the problem that has driven us, and many generations of engineers, for the last 70 years, and the way it is typically formulated mathematically is this: we have a transmitter, a channel, and a receiver, and the goal of the transmitter is to convey one out of M messages reliably to the receiver. M can be extremely large: if you had, for example, a thousand bits of information, that would be 2 to the power of 1000 possible messages. Since you cannot transmit bits or messages, which are just integer numbers, directly through the air, you need to modulate them onto an electromagnetic wave. In this example the transmitted wave is represented by a complex number x, or a complex vector x; these would be the modulated IQ samples. So what you are essentially transmitting is x. This goes through a channel, which is described by a conditional distribution p(y|x), where y is the vector that you receive, and the goal of the receiver is simply to reproduce, or make a guess ŝ about, which message has been transmitted. The goal of communications is to minimize the probability of making an error. The rate of such a system, just for the definition, is: I have k bits, where k is the log base 2 of M, and n channel uses in total, so I transmit k/n bits per channel use.
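In symbols, this only restates the quantities just described (with ŝ denoting the receiver's guess):

```latex
s \in \{1,\dots,M\} \;\xrightarrow{\;\text{transmitter}\;}\; \mathbf{x} \in \mathbb{C}^{n}
\;\xrightarrow{\;p(\mathbf{y}\mid\mathbf{x})\;}\; \mathbf{y}
\;\xrightarrow{\;\text{receiver}\;}\; \hat{s},
\qquad
P_e = \Pr(\hat{s} \neq s),
\qquad
R = \frac{k}{n}\ \text{bits per channel use},\quad k = \log_2 M .
```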
The way we have essentially solved this problem until now is in two steps. First, we build a physical channel model: you do measurements, or sometimes you just look at, say, the Schrödinger equation, and you come up with a channel model; in essence, you need a mathematical model that you can work with. Once you have your mathematical model, you can actually try to solve the communication problem, and since this problem is rather complicated, we cut it into many small pieces that we then try to solve individually: for example source coding, channel coding and modulation on the transmitter side, and the problems of synchronization, channel equalization, demodulation, decoding and so on at the receiver. Now, I argue that the main driver for using machine learning in this setting is two insufficiencies. First of all, any model we come up with is mismatched to the real world: there is always this trade-off between mathematical complexity and convenience for analysis, so we never actually work with the real system and there is always some approximation error. Secondly, this modular block structure, where we cut a complex problem into many pieces, is suboptimal: in many cases separating source and channel coding is suboptimal, and separating demodulation and decoding is suboptimal. This is the reason why, four years ago, some colleagues and I asked the question: would it be possible to build a system that learns to communicate over any type of channel from scratch?

Now I just want to show a small graph of why this might be worthwhile: what are the reasons why we could expect to gain with respect to today's technology? This is how our phones work, and they just do the job incredibly well; one could essentially argue the problem is solved, so why do we expect to gain? Here is a very simple decision tree that shows the areas where we could hope to make improvements. First of all, we have the problem of modeling something. If you have a good model, if you say the model I have is sufficiently good, for example in our case you could say: I make an observation y, x is what has been transmitted, it goes through a channel H and I add some Gaussian noise, and this describes fairly well what is happening in my system, then I can go to the next step and try to come up with some algorithm to solve the problem. For example, if the goal were to recover x from y knowing H, I could say: maybe I just invert the channel matrix; this is called a zero-forcing receiver. If this is good enough for you, then there is a last question to ask, and this is that of complexity: can the algorithm you have come up with, on your model, actually be implemented in practice? For Nokia this is extremely important; it might be less relevant for writing a good paper, but we often come up with algorithms and prove that they converge to the optimum solution, yet you would never be able to run them in practice. If you say complexity is not an issue, maybe it is just a three-by-three matrix I can easily invert, then there is no hope for machine learning anyway; what would be the point? This interesting problem is solved. But now consider all the other cases. If you have no model, so you can make many observations y for many inputs x but you have no clue how they relate to each other, then there is no way you can even come up with an algorithm; that is what we typically see in image classification, where there is no algorithm telling you directly what a cat or a dog looks like. Then you need to ask: can I generate a lot of data, a lot of these pairs of observations? If the answer is no, because it is too costly or too complex, then machine learning cannot help you either; but if you have access to a huge data set, then we can use machine learning to deal with the lack of a good model. The same happens in the other cases: if you have a good model, but the model is so complex that you cannot come up with a good algorithm, which is typically what happens in optical communications, where the nonlinear Schrödinger equation tells you how the channel behaves but no one really knows what the optimal way of communicating over such a channel is, then machine learning might help you to come up with a better algorithm. And lastly, if you have a model and maybe a good algorithm that is just too complex, there is hope that a machine learning solution would be almost as good as the optimal one, just with lower complexity. The interesting thing about this graph is that if you are on the left-hand side, machine learning allows you to do things you could not do before, and I think this is the disruption we currently see, for example, in computer vision: there are many applications you simply couldn't do before. If you are on the right-hand side, you already know a good algorithm, but it is not perfect or complexity is an issue, and machine learning might allow you to do things you could already do, just better. Unfortunately, I think in communications we are more on the right-hand side in most cases, because we already know almost how to do the job perfectly, so the margin for improvement is kind of small; but it is still worthwhile to look into it.

Good. I'm very happy that Stefan mentioned autoencoders, because I'm now going to build a communication system from scratch by analogy with an autoencoder, although my use of autoencoders is far more simplistic. For me, an autoencoder is a neural network that gets the vector x as input, and the only purpose of the autoencoder is to reproduce x at the output. You would ideally learn the identity function, which would be totally useless, except that we care about a specific intermediate representation the autoencoder learns. Essentially, what I care about is some representation, let's call it y, of the input with certain characteristics. In this example the input is four-dimensional and y is just three-dimensional, so if I am able to reproduce x perfectly from y, I have learned a dimensionality reduction, or some kind of compression mechanism. In communications it is slightly different: we typically do not care about reducing information or dimensionality, because the information we are going to transmit is usually not compressible. We rather have an autoencoder that increases the dimensionality, because we would like to make what we transmit robust. So you can see a communication system as an autoencoder where the chain of transmitter, channel and receiver is one large neural network, and the goal is simply to reconstruct the input message s at the output reliably. What we actually care about is what comes out of the transmitter: the goal is to learn a robust representation of your message with respect to the stochastic transformation this message undergoes through the communication process, i.e. in the channel.
Now, the nice thing is that if you interpret it like this, you can train the system from end to end: you optimize transmitter and receiver jointly for a given channel model, and it is a universal concept which applies essentially to any channel p(y|x); it can be acoustics, molecular communications, optics, essentially anything. Let me give you a very simple example with an AWGN channel, meaning a channel that just adds Gaussian noise, so y is essentially equal to x, what you transmit, plus noise. Here is what a simple architecture for such a neural network looks like: you inject an integer, which goes through an embedding and a couple of dense layers. What you need to keep in mind is that we always need complex numbers, because what we send over a channel is modulated onto a cosine and a sine wave, two degrees of freedom we can use independently, so we deal with complex vectors. We therefore always interpret some part of the outputs as the real part and some part as the imaginary part; the choice is totally arbitrary. We also need the right power normalization: of course there is a constraint on what the transmitter can output, typically an average power constraint, simply because you have a limited amount of energy. After the normalization you feed this into any channel you want, any mathematical model you could have; I said we don't want to have a model, and I will come back to this, but for now we do need one, so you can feed this into any kind of stochastic transformation you want. The receiver then essentially does the inverse: you reinterpret parts of the complex vector as real and imaginary parts, send this through a couple of dense layers, and perform a classification task. Just for future reference, the transmitter is called f_θT, where θT are its trainable parameters, and g_θR is what the receiver does, where θR are its trainable parameters.
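As an illustration only, here is a minimal sketch of such an autoencoder in PyTorch. The overall structure (embedding plus dense layers, complex mapping, average power normalization, softmax classification at the receiver) follows the description above, but every concrete dimension, layer size, and the Eb/N0 convention in the channel are my own placeholder choices, not the ones used in the talk.

```python
import torch
import torch.nn as nn


class Transmitter(nn.Module):
    """f_thetaT: maps a message index s in {0, ..., M-1} to n complex channel uses."""

    def __init__(self, M: int, n: int, hidden: int = 64):
        super().__init__()
        self.n = n
        self.net = nn.Sequential(
            nn.Embedding(M, hidden),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * n),          # 2n reals -> n complex symbols
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        z = self.net(s)                                        # (batch, 2n)
        x = torch.complex(z[:, :self.n], z[:, self.n:])        # real / imaginary split
        # Average power normalization (batch estimate): E[|x|^2] = 1 per channel use.
        power = x.abs().pow(2).mean().clamp_min(1e-12)
        return x / power.sqrt()


class Receiver(nn.Module):
    """g_thetaR: maps n received complex symbols to logits over the M messages."""

    def __init__(self, M: int, n: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, M),              # logits; softmax is folded into the loss
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        y_real = torch.cat([y.real, y.imag], dim=-1)           # back to 2n real inputs
        return self.net(y_real)


def awgn(x: torch.Tensor, ebno_db: float, rate: float) -> torch.Tensor:
    """AWGN channel y = x + w for unit-average-power complex symbols."""
    ebno = 10.0 ** (ebno_db / 10.0)
    sigma2 = 1.0 / (2.0 * rate * ebno)         # noise variance per real dimension
    w_re = torch.randn_like(x.real) * sigma2 ** 0.5
    w_im = torch.randn_like(x.imag) * sigma2 ** 0.5
    return x + torch.complex(w_re, w_im)
```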
So what does such a system learn? Here is a simple example: we have eight messages and one channel use, which means the transmitter will transform a number between one and eight into a single complex number. What the transmitter does at the beginning is represent these eight messages as eight points on the complex plane, and on the right-hand side what you see is the probability of error over an AWGN channel as a function of the signal-to-noise ratio, which simply scales the variance of the noise. For comparison I show a scheme called 8-PSK, where the eight messages are uniformly distributed across the unit circle, and the red line is the maximum-likelihood performance you get with it. For the untrained system, when you initialize everything randomly, you get these eight points here and the receiver essentially does random guessing, because it hasn't learned anything. Now I train the system, and I'll show you a video of how this progresses: the transmitter learns to separate these points better and better, the receiver learns to come up with the right decision regions, and at some point this converges to a stable solution. What is kind of nice in this example is that the system has come up with a scheme where one message is transmitted with zero energy. When I tried this four years ago I was kind of surprised; I thought this was a nice result, but it turns out that an ex-colleague of mine had already shown in 1974 that this is actually the optimum constellation, the one that maximizes the mutual information between x and y. Since this is not a convex problem, the fact that I could easily learn it in a couple of minutes motivated me to look into something more complex: can you apply the same idea to more complex channels? By the way, you can find this on Google Colaboratory if you want to play around with it.
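And a correspondingly minimal training loop for the eight-message, one-channel-use example; again just a sketch, where batch size, learning rate, number of steps, and the 7 dB training SNR are placeholder values I picked, and the classes are the ones from the previous snippet.

```python
import math
import torch
import torch.nn as nn

M, n = 8, 1                                   # 8 messages, 1 complex channel use
rate = math.log2(M) / n                       # k/n bits per channel use
tx, rx = Transmitter(M, n), Receiver(M, n)
opt = torch.optim.Adam(list(tx.parameters()) + list(rx.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()               # categorical cross-entropy on the message index

for step in range(10_000):
    s = torch.randint(0, M, (1024,))          # uniformly drawn messages
    x = tx(s)                                 # learned constellation points (complex)
    y = awgn(x, ebno_db=7.0, rate=rate)       # differentiable channel model
    loss = loss_fn(rx(y), s)
    opt.zero_grad()
    loss.backward()
    opt.step()

# tx(torch.arange(M)) now returns the learned 8-point constellation,
# which can be compared against 8-PSK as in the talk.
```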
Now, the biggest problem we have is the following: to train the system from end to end, we need to model our channel in order to backpropagate gradients through it. But to make the problem worthwhile, I don't want to do this on a simulated channel, I would like to do it on a real system; that is the whole point, we want to compensate for the mismatch between the model and the real world. So the question is: how can you learn to communicate through a black box? I have struggled with this problem for quite some time and tried various things. The first, obvious, idea was: let us model this black box as well as we can, train our system over the analytic channel model, and then deploy it. It might somehow work, but you don't get the optimal performance, and the only thing you can do then is fine-tune the receiver, because there you can gather many observations for which you know what has been transmitted and fine-tune it in a supervised way. The problem is that you never optimize the transmitter for the actual channel, and this is what limits performance. The second idea, and this goes towards the generative modeling that Stefan also talked about, is to say: let me first learn a model for this black box from data, for example with a conditional generative adversarial network. Once you have learned this, you can replace your black box by a neural network which is fully differentiable, and then you can learn to communicate over this learned channel model. This kind of works, but the problem is that wireless channels are complicated to model, and optical channels even more so, and I never managed to get any good progress with it. There has not been any in-depth study, but I think it would be a very worthwhile topic to look into how well communication channels can be modeled through generative models; I think no one has really done this. What we then came up with was a method that avoids channel modeling altogether: don't try to have an explicit model, but just try, over the air, to figure out what works best. This is inspired by what people do in reinforcement learning; the key idea is to come up with an estimate of your gradient, and we do this through something called a policy gradient. Rather than showing the equations, I just have a hand-waving explanation of what is going on. You see the transmitter as an agent which has a state s, the message you want to transmit, and an action, what is actually transmitted over the air. The environment is the combination of the channel and the receiver, and it computes a reward, which here would be the cross-entropy between the message that has actually been transmitted and what the receiver believes it was. At this point you only care about making this agent better, i.e., about making the transmitter better, because, as I showed on the last slide, making the receiver good is actually very easy since it is a supervised learning problem, whereas making the transmitter better for a given receiver is hard because of the black box. So how does this work? You initialize transmitter and receiver neural networks randomly.
The first thing you do is try to make the receiver better: you let the transmitter send a lot of messages, which are probably total gibberish since it hasn't been optimized at all; think of those eight points when they are not well separated. You just transmit something, and then you can improve the receiver, trying to find the right decision regions; that is supervised learning, the easy part. Now you keep the updated receiver and you would like to make the transmitter better for the current receiver, and this is where you use what is called a policy gradient. In order to explore better ways to communicate, you have the transmitter create its output with the neural network, but before sending it over the air you add a small perturbation, typically some Gaussian noise, called exploration noise. The receiver then computes losses and sends them back to you, and based on these losses and the knowledge of the perturbations you have added, you can compute a gradient estimate. So essentially you transmit small perturbations around what you actually want to transmit, and this helps you estimate the gradient; then you update the transmitter based on this and iterate: you improve the transmitter for the current receiver, then the receiver for the current transmitter, and you keep your fingers crossed that this converges. We have no proof that it actually converges to some local minimum, but it works extremely well.
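Here is a rough sketch of that alternating procedure with a Gaussian-policy (REINFORCE-style) gradient for the transmitter, reusing the names from the snippets above. The perturbation standard deviation, batch sizes, and the stand-in black-box channel are placeholders of mine; this is a simplification of the scheme described, not the exact algorithm from the talk.

```python
import math
import torch
import torch.nn as nn

sigma_p = 0.05                                   # std of the Gaussian exploration noise (placeholder)
per_example_ce = nn.CrossEntropyLoss(reduction="none")
opt_rx = torch.optim.Adam(rx.parameters(), lr=1e-3)
opt_tx = torch.optim.Adam(tx.parameters(), lr=1e-3)


def black_box_channel(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real, non-differentiable channel; here just AWGN for illustration.
    with torch.no_grad():
        return awgn(x, ebno_db=7.0, rate=math.log2(M) / n)


for it in range(1000):
    # 1) Receiver update: ordinary supervised learning on labelled observations.
    s = torch.randint(0, M, (1024,))
    with torch.no_grad():
        y = black_box_channel(tx(s))
    rx_loss = per_example_ce(rx(y), s).mean()
    opt_rx.zero_grad(); rx_loss.backward(); opt_rx.step()

    # 2) Transmitter update: policy gradient, no gradient through the channel needed.
    s = torch.randint(0, M, (1024,))
    x = tx(s)                                                     # intended transmit vectors
    eps = sigma_p * torch.complex(torch.randn_like(x.real), torch.randn_like(x.imag))
    x_p = (x + eps).detach()                                      # what is actually sent over the air
    with torch.no_grad():
        losses = per_example_ce(rx(black_box_channel(x_p)), s)    # per-example losses fed back to the tx
    # Log-density of the Gaussian "policy" centred on the transmitter output.
    log_prob = -((x_p - x).abs() ** 2).sum(dim=-1) / (2 * sigma_p ** 2)
    surrogate = (losses * log_prob).mean()                        # descending this reduces the expected loss
    opt_tx.zero_grad(); surrogate.backward(); opt_tx.step()
```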
I made a demo of this, which is already almost two years old, with two unsynchronized software-defined radios; the goal was simply to implement this in practice. Here I just show you the probability of error, the block error rate, as a function of the number of training iterations, compared to a coherent QPSK baseline system. What is kind of nice is that after around a hundred iterations you start to outperform this baseline. I am not claiming this is a massive gain; no one would care about such small improvements. But what is nice is that the system learns to do this in about an hour, whereas it has essentially taken engineers fifty years to come up with such a solution, and that is why I think it is kind of nice.
Now, there is some mathematical, or information-theoretic, interpretation behind what is going on, and I want to speak a little bit about this. In particular, we also do not want to take a pure black-box approach, but rather inject as much knowledge about the problem as we have and come up with a principled approach. What usually happens, and I think this is very similar to what was shown in the first talk, is that we have multiple processing blocks, some of which we know are the right thing to do, and some parts about which we are unsure because they depend on the channel; those are the parts where we try to augment existing algorithms with deep learning. In other words, we inject as much structure and expert knowledge as we can into the problem, and the parts where we think there might be a mismatch between our model and our algorithm are where we try to improve. So how do we interpret what the autoencoder actually learns? On the left-hand side is what we are training on. If you remember the notation, g_θR is the output of the receiver, essentially a probability vector over all messages, and the loss is the categorical cross-entropy between what the receiver believes and the transmitted message, averaged over all messages, which are uniformly distributed, and over all channel outputs p(y|x), where x is what the transmitter sends given an s. You would like to minimize this with respect to the parameters of the transmitter and the receiver. If we had infinite training data and a sufficiently large receiver, we could be certain that this converges to the maximum-likelihood solution, i.e., that we learn the maximum-likelihood detector: the best the receiver output could be is the true posterior p(s|y), and you cannot do better than that. So in essence we optimize an upper bound on the cross-entropy you would get for the maximum-likelihood estimator, and minimizing this is nothing else than minimizing the conditional entropy of the transmitted message s given y. In practice we don't have access to this expectation and cannot compute it numerically, so we do Monte Carlo sampling: we send many messages over the air and do gradient descent on the loss function. Since we effectively minimize the conditional entropy, this is the same as maximizing the mutual information between s and y. This is where it starts to feel reassuring: we are actually maximizing mutual information, which is what Shannon wanted us to do, so it is good that we get this back. In the simple example I have shown you, what this learns is geometric shaping, as it is called: how to arrange the points that represent your messages. And you don't need a neural network for this at run time; it is essentially a small lookup table, so once you have learned it you can throw the network away and save the result as a lookup table. As I said, what is important is that the receiver is sufficiently rich so that you can get close to the maximum-likelihood estimator, because otherwise you would be optimizing something which no longer corresponds to the mutual information; you would be constrained by the allowable complexity of your receiver and would end up optimizing the constellation for a bad receiver, which is not what you want.
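Written out (my notation: Q_θR(s|y) is the probability the receiver assigns to message s, P(s|y) the true posterior), the argument above is just the standard identity

```latex
\mathcal{L}(\theta_T,\theta_R)
  = \mathbb{E}_{s,\mathbf{y}}\!\left[-\log Q_{\theta_R}(s \mid \mathbf{y})\right]
  = H(S \mid \mathbf{Y})
    + \mathbb{E}_{\mathbf{y}}\!\left[ D_{\mathrm{KL}}\!\big(P(\cdot \mid \mathbf{y}) \,\big\|\, Q_{\theta_R}(\cdot \mid \mathbf{y})\big) \right]
  \;\ge\; H(S \mid \mathbf{Y}) ,
```

so for a sufficiently expressive receiver the KL term is driven towards zero, and minimizing the loss over the transmitter parameters then maximizes I(S;Y) = H(S) − H(S|Y), since H(S) = log2 M is fixed for uniform messages.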
Now there is one more problem: this whole idea does not scale well to a very large number of messages. If I wanted to communicate a hundred bits of information, I would have 2 to the power of 100 messages, which would be an extremely difficult classification problem; in ImageNet there were just a thousand classes, and here we have way, way more. You simply cannot train such a system for such a large classification task; it doesn't scale. So what you need to do is couple this with an outer channel code: you say, I have a stream of information bits, I encode them into a vector of coded bits, I chop this vector of coded bits into small pieces, let's say of six bits each, and now I only learn to communicate vectors of six bits. At the receiver I accumulate all of the outputs, and once I have the whole codeword I combine everything and try to decode what has been transmitted. But now, since we do not want to communicate messages anymore but bits, we have the problem of bit labeling.
I have shown you this constellation in the video; typically, the output of the training process is a point cloud like this one, here with 64 different points, and the question is: how do I now assign bit vectors to each of these points? There is an optimum way of doing it, but it depends a lot on the goal you want to accomplish, on your metric, and if you want to do it by hand it is a very complicated problem which also does not scale well to large dimensions. I want to show you that you can learn this as well, and get this information essentially for free. From an information-theoretic perspective, we have a vector of m bits that we feed into the transmitter; it computes an output symbol x, and we observe y. If you optimize this system for maximizing the mutual information between x and y, which is what we have done so far using the categorical cross-entropy, this is the same as maximizing the mutual information between the bit vector b and y, because there is a one-to-one mapping. The problem is that in practice we cannot achieve this with a low-complexity method, because it requires multilevel coding at the transmitter, which means you apply an outer channel code on each of the bit levels, and at the receiver you need to do multistage decoding, where you decode the first bit level, then use it to decode the next one, and so on; this is not practical. What we do in practice is something else: we do not care about the mutual information itself but about a lower bound. Here I have written the mutual information between x and y, which is the same as the mutual information between the bit vector b and y, rewritten using the chain rule of mutual information. The problematic part is the multistage decoding, where you say: I decode the first bit b1 knowing y, then I decode the second bit level b2 assuming I know the first, and so on. If you instead throw this away and say: I decode all of these bits without knowledge of the others, then you get this expression here, which is called the bit-metric decoding rate; many people also call it the generalized mutual information, there are many names for it. The nice thing is that this can actually be achieved with something called bit-interleaved coded modulation at the transmitter and bit-metric decoding at the receiver, which is the practical scheme used, I think, in almost every communication system I am aware of. The bottom line is that this is the metric we care about, not so much the mutual information. Now, how do we learn this, and how does it fit into our autoencoder objective? Let us look at the total binary cross-entropy. Imagine we change our receiver: rather than having one output vector with M dimensions, a probability distribution over all messages, we just have m outputs corresponding to the individual bits. For example, with three bits, in the first case I would have eight possible messages and would learn a distribution over these eight messages; instead, I now have three outputs, one per bit, and my neural network outputs posterior distributions over the bits. That is what I want to learn, and I simply sum all of these binary cross-entropies together to get the total binary cross-entropy.
Now I can rewrite this in the following way: I have one term that corresponds to the entropy of my bit vector, which is typically just the number of bits because they are i.i.d.; then I have the bit-metric decoding rate R, which is the thing we care about; and then I have a term that is the sum of Kullback-Leibler divergences between the posteriors my neural network gives me and the true ones. So if we minimize this objective function, what we effectively do is maximize the rate, since the entropy term is constant, while minimizing the KL divergence to the true posterior. If you have a good model, this KL term will be very close to zero, because the receiver will essentially implement the maximum-likelihood detector.
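In formulas (again my notation: Q_θR(b_i|y) is the receiver's soft output for bit i, P(b_i|y) the true bit-wise posterior, and m the number of bits per symbol), the decomposition just described is

```latex
\mathcal{L}_{\mathrm{BCE}}
  = \sum_{i=1}^{m} \mathbb{E}_{b_i,\mathbf{y}}\!\left[-\log Q_{\theta_R}(b_i \mid \mathbf{y})\right]
  = \underbrace{\sum_{i=1}^{m} H(B_i)}_{=\,m\ \text{for i.i.d. uniform bits}}
  \;-\; \underbrace{\sum_{i=1}^{m} I(B_i;\mathbf{Y})}_{R_{\mathrm{BMD}}}
  \;+\; \sum_{i=1}^{m} \mathbb{E}_{\mathbf{y}}\!\left[ D_{\mathrm{KL}}\!\big(P(b_i \mid \mathbf{y}) \,\big\|\, Q_{\theta_R}(b_i \mid \mathbf{y})\big) \right] ,
```

so minimizing the total binary cross-entropy maximizes the bit-metric decoding rate while pushing the receiver's soft outputs towards the true bit-wise posteriors.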
Now I just want to show you the difference between training on this bit-wise metric and training on the symbol-wise mutual information. On the left-hand side you see what you get with the bit-wise loss, and on the right-hand side what you get with the symbol-wise mutual information loss; what I change here is the SNR, so for each signal-to-noise ratio you get a different constellation. Let me play this again. You can see that if you train on the symbol-wise mutual information, the constellation doesn't change a lot: it stays rather constant with the SNR, it always has this weird point in the middle with zero energy, and you get no bit labeling. With the bit-wise loss you get a perfectly labeled constellation which changes a lot with the SNR. I spent some time looking into this, and it actually always learns the optimum labeling, whatever the constellation is. Even at points that are clustered together you always get some kind of what is called Gray labeling, meaning that two neighboring points differ by only one bit, and this is always the case. Even if you go to very low SNR, where you would have far too much information to communicate anyway, it groups points together in such a way that only one or two relevant bits remain distinguishable, while the others, which you could not communicate anyway, are given up.
So the nice thing is that our autoencoder now actually learns to output posterior probabilities for the bits, which is essentially the same as log-likelihood ratios, and this means we can now couple it directly with an outer channel code. The setup now works as follows. I have a bit string, for example a thousand bits; I encode it using, for example, a low-density parity-check (LDPC) encoder, so I get two thousand coded bits; I chop these into many small bit vectors, for example of size four bits, and feed them into my neural network. These all share the same weights, so it is one neural network that maps these bit vectors to symbols. I feed the symbols into the channel, and then I have the same demapper operating in parallel on all of the received symbols; it outputs log-likelihood ratios, which are then fed into a belief-propagation (BP) decoder. One thing that turned out to be important is to give the demapper access to the signal-to-noise ratio, because computing these LLRs requires the SNR, so it helps to feed it in as an additional input. We tried this with a standard LDPC code from the Wi-Fi standard, rate one half, length 1296 bits.
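A sketch of how the bit-wise part of this pipeline fits together; bits per symbol, layer sizes, and the training SNR are placeholders of mine, the LDPC encoder and BP decoder are only referenced rather than implemented, and the awgn helper is the one from the earlier snippet.

```python
import torch
import torch.nn as nn

m_bits = 4                                     # bits per channel use (placeholder -> 16-point constellation)


class BitMapper(nn.Module):
    """Maps a vector of m bits to one complex symbol (trainable constellation + labeling)."""

    def __init__(self, m_bits: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m_bits, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, bits: torch.Tensor) -> torch.Tensor:        # bits: (..., m) in {0, 1}
        z = self.net(bits.float())
        x = torch.complex(z[..., 0], z[..., 1])
        return x / x.abs().pow(2).mean().clamp_min(1e-12).sqrt()  # unit average power


class BitDemapper(nn.Module):
    """Outputs one soft value (LLR-like logit) per bit; gets the SNR as an additional input."""

    def __init__(self, m_bits: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),                      # features: Re(y), Im(y), SNR
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, m_bits),
        )

    def forward(self, y: torch.Tensor, snr_db: float) -> torch.Tensor:
        feats = torch.stack([y.real, y.imag, torch.full_like(y.real, snr_db)], dim=-1)
        return self.net(feats)                                    # (..., m) per-bit logits


# One training step on random (uncoded) bits; in the full system the bits would come from
# an outer LDPC encoder and the per-bit logits would be passed on to a BP decoder.
mapper, demapper = BitMapper(m_bits), BitDemapper(m_bits)
opt = torch.optim.Adam(list(mapper.parameters()) + list(demapper.parameters()), lr=1e-3)
bce = nn.BCEWithLogitsLoss()                                      # total binary cross-entropy

bits = torch.randint(0, 2, (4096, m_bits))
x = mapper(bits)                                                  # (4096,) complex symbols
y = awgn(x, ebno_db=10.0, rate=float(m_bits))                     # channel from the earlier sketch
llrs = demapper(y, snr_db=10.0)
loss = bce(llrs, bits.float())
opt.zero_grad(); loss.backward(); opt.step()
```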
These are the coded bit-error-rate results we got. In blue you see phase-shift keying or QAM modulation schemes; QAM essentially means the points are uniformly distributed along the x and y axes, so you get a regular grid. In green is what our system learns. First of all, you can see that the higher the constellation order, the more you gain, and this is well known: these are essentially shaping gains, because we have optimized the geometry, i.e., learned geometric shaping, and that is where the gain comes from. The reason you gain is that this geometry maximizes the bit-wise mutual information, which is our objective, and the nice thing is that it translates into gains in the coded error rate.
Now you can go one step further: there is something called iterative demapping and decoding.
The key idea is to apply the turbo principle of soft demapping and channel decoding to our setup. The way this works is as follows. You make an observation y from the channel, and the demapper computes log-likelihood ratios for every bit in the symbol; you do this for all symbols that make up a codeword and feed them into a channel decoder, for example belief propagation, for one iteration. This outputs improved LLRs for all of the bits, improved because you now get the knowledge from the parity checks in your code. In the next step you feed the output back to the demapper, but before doing so you subtract the information that you gave to the decoder, so that the only thing that remains is the extrinsic information; extrinsic means you only keep the information that the channel code has added, not what already came from the observation. Now I run the demapper again, but with a priori information from the channel code; this information is independent of y, it comes only from the code, and it helps the demapper make a better estimate of the LLRs. Before feeding these back into the decoder, I again subtract the extrinsic information that came from the channel code, and in this way the demapper and the decoder iterate together; this is well known. The problem is that you can only compute this a priori information explicitly for rather simple channel models, like an AWGN channel, and even then the expressions are very complex, so the computational cost is high. On top of this, the optimal constellation you want to learn depends on the number of iterations you do: the constellation you would learn for three iterations is different from the one for fifteen. So what we did is simply replace this part here by a neural network, and you get a structure which looks like this: the transmitter is the same as before, and the receiver is now a neural network which gets the a priori, i.e., the extrinsic information from the decoder, as an additional input; the same network is used everywhere. You now have this complicated loop, and in order to train it you need to unfold the algorithm. Here is an example with three iterations: first I take the channel outputs with the a priori information set to zero, compute the LLRs, and feed them into the belief-propagation decoder for one iteration; then I subtract what I fed into the BP decoder to obtain the extrinsic information, which I feed as a priori information to the demapper together with the channel outputs; and there is this loop here which again subtracts what was fed into the demapper from its output, so that only the extrinsic information from the demapper remains; and this is repeated again and again. The nice thing about unfolding this algorithm is that, first of all, these are all shared weights, so the memory requirement is super low, there are only a few weights to actually train; and you get something which is called a residual neural network architecture, with all these shortcut connections you see here.
The nice property of this is that you avoid the problem of vanishing gradients that you normally have when training very deep neural networks: if the network is very deep, the lower layers can become almost impossible to train, but thanks to these shortcut connections, which appear totally naturally here, you can train this very deep; we have trained it with up to 100 iterations and it still converges nicely. Another nice property is that, since I reuse the same components, I can train it with, say, three iterations and then let it run for fifteen; it is totally scalable to any number of iterations.
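A schematic of the unrolled receiver loop just described, as a plain Python function. Here demapper is assumed to be a neural demapper taking (y, snr_db, a-priori LLRs) and bp_decode_one_iteration one belief-propagation iteration of the outer code; both interfaces are assumptions of mine and are not implemented here, and the exact way extrinsic information is formed is a simplification of the talk's setup.

```python
import torch


def unrolled_idd_receiver(y, snr_db, demapper, bp_decode_one_iteration,
                          num_codeword_bits, num_iterations=3):
    """Unrolled iterative demapping and decoding (sketch).

    The same weight-shared neural demapper is applied in every iteration; only extrinsic
    information is exchanged between demapper and decoder, which is what produces the
    residual-network-like shortcut structure mentioned above.
    """
    batch_size = y.shape[0]
    # A-priori LLRs from the decoder are zero in the first iteration.
    llr_apriori = torch.zeros(batch_size, num_codeword_bits)

    llr_decoder = llr_apriori
    for _ in range(num_iterations):
        # 1) Neural demapping with the decoder feedback as an additional input.
        llr_demapper = demapper(y, snr_db, llr_apriori)
        # 2) Pass only the extrinsic part of the demapper output to the decoder.
        llr_to_decoder = llr_demapper - llr_apriori
        # 3) One belief-propagation iteration on the outer code.
        llr_decoder = bp_decode_one_iteration(llr_to_decoder)
        # 4) Extrinsic decoder information becomes the a-priori input of the next pass.
        llr_apriori = llr_decoder - llr_to_decoder

    return llr_decoder        # final soft bit estimates after the last iteration
```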
Now, the performance results you get are here. In blue is our baseline without iterative detection and decoding, for different modulation orders. When you apply iterative detection and decoding to an existing modulation scheme like PSK or QAM you of course gain a tiny bit, but what you can see is that training the autoencoder, i.e., optimizing the geometry jointly with the iterative demapping and decoding, yields very substantial gains, even more than a dB. These are the results we got on an AWGN channel.
The interesting part, of course, is to do this not on an AWGN model but on a real system. So we did: I showed you the video with the two software-defined radios, and we do essentially the same thing again, but now we train the system on this new metric, using iterative demapping and decoding, and with a gradient estimate instead of backpropagating through the actual channel. We did not attempt to learn channel equalization; we used a linear minimum mean square error equalizer, and we used OFDM with 64 subcarriers. There is one more thing we did: once the system was trained, we optimized the code for the learned demapper, because when you take a code from a standard it is typically optimized for an AWGN channel; we took the Wi-Fi code, which is not the optimum code for our learned system, so once everything else was done, we also optimized the code on top.
In green is the Wi-Fi baseline, essentially state of the art; we put a lot of love into this baseline. Now, when you start optimizing the system, just the constellation geometry, you get this purple line here, so we gain roughly 1 dB, and when we then optimize the code on top of this we get about another half a dB, so we are now roughly 1.5 dB better. We tried to understand where these gains actually come from, and it turns out that the biggest gain comes simply from learning the demapper: if you apply a demapper which assumes the system is an AWGN channel, which is what you typically do, you get this curve here; this would be the autoencoder where we optimized the constellation but assumed the channel is Gaussian, and you lose roughly half a dB. So having the demapper properly matched to the actual channel, including the imperfect equalization, because we have imperfect channel state information here and actually see a channel which is very different from an AWGN channel, is, I think, where the biggest part of the gain comes from.
Just before finishing, I wanted to say that we can even go one step further. Until now we have only looked at optimizing where to place these points, i.e., optimizing constellations: we have looked at the mutual information and maximized it with respect to the parameters of the transmitter. This is what is called geometric shaping, where to place your constellation points. But you have one more degree of freedom: you can also play with the distribution over these points. If you would like to maximize the mutual information, it turns out to be suboptimal to have all of the constellation points appear with the same probability; you would rather like a non-uniform distribution. I call this p(x), and it is what is called probabilistic shaping: with which probability should each symbol be sent. Now the question is: can you jointly optimize probabilistic and geometric shaping, and also the bit labeling, everything together? The answer is yes. I won't go into the details; I'll just show you one video.
Here you can see the constellation as a function of the SNR for 256 points; the diameter of each point corresponds to its probability. You see that at very low SNR the constellation kind of breathes: you have a few points with large energy which are sent very rarely, meaning they carry a lot of information, which is why you want to give them a lot of energy, and there is a huge bulk of points in the middle which appear often but each carry only very little information. At very high SNR you get a uniform distribution over the messages, which is what you would expect. On an AWGN channel it is actually known that the optimum distribution is a Boltzmann distribution, and we converge exactly to this; but of course our goal is not to do this for the AWGN channel, but for essentially anything you want. So now we have a system that can jointly learn probabilistic and geometric shaping over any channel.
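For reference, the Boltzmann-type distribution mentioned here (usually called a Maxwell-Boltzmann distribution in the shaping literature) weights each constellation point x_i by its energy,

```latex
p_\nu(x_i) \;=\; \frac{\exp\!\left(-\nu\,|x_i|^2\right)}{\sum_{j=1}^{M}\exp\!\left(-\nu\,|x_j|^2\right)},
\qquad \nu \ge 0 ,
```

where the scaling factor ν depends on the SNR: a large ν concentrates probability on low-energy points (the bulk in the middle at low SNR), while ν approaching zero recovers the uniform distribution seen at high SNR.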
For this reason, I believe that we are now actually able to learn a full physical layer which is optimized for a specific link or channel model, and this is totally automated and does not require any human intervention: code design, all of the shaping, everything (people are now also looking into learning new codes). So one could make the argument that, rather than doing a one-size-fits-all approach, where we do measurements, come up with a channel model, and then a bunch of people agree that this is roughly how we should standardize a system, 6G could be a system that allows itself to be optimized by machine learning. What you would need to standardize is actually a protocol for message exchange. Since you would have a fallback solution, you would say: on one frequency band I use, for example, the Wi-Fi standard just to communicate, but then you define a protocol that allows the exchange of gradients, loss functions and so on, and once you have optimized whatever is going on on a different frequency band, you just switch over. By doing this you would have a fully bespoke physical layer. I find this exciting, and I think this is where we are heading. The next frontier would be to extend this idea to higher-layer protocols: rather than just running a bespoke physical layer, can you learn a bespoke MAC layer, for example? That is what I am currently interested in. That is the end of my talk, so thanks.
[Applause]

Question: If you learn on one particular channel, one physical cable for example, and you then change the cable for something of the same kind but not exactly the same, is the trained system still going to work?

Answer: The thing is, of course, if you train on only one specific realization, i.e., your channel is one fixed complex function, and you then apply the system to something else, you will lose performance and you would need to retrain. In wireless I don't think it is super practical to do this, unless you have fixed wireless, think of backhaul or fronthaul connections with a line-of-sight link which won't change; in that setting you can do it. What we typically do instead is optimize over a distribution of channels: during training you draw random realizations of your channel, and by doing this your system is actually able to generalize. It depends on the field: in optics, and many people here work in optics and know this better than me, there might be an interest in optimizing for one specific fibre, because that is the thing you want to optimize for; but in wireless I think it is better to optimize over a distribution. And what is nice, I haven't shown this, is that your system will actually learn to transmit whatever it needs to equalize the channel. If I train this on a Rayleigh fading channel model, where you essentially multiply x by a random complex number, it will learn to superimpose a constant mean value on what you transmit, and if you transmit something with a constant mean the receiver can exploit it; it is like a superimposed pilot tone. So the system automatically learns to equalize and adapt to any channel realization within the distribution. But there is a trade-off, and it depends on the use case.

Question: My question is about the choice of the number of layers. When you talked about the decoder you showed three iterations; how do you decide on this, since fewer layers is obviously faster?

Answer: It all depends on the complexity you can afford. Maybe I would use 15 iterations with the flooding schedule I used here; others update just one variable node at a time. Where I think machine learning really makes sense is this: in traditional optimization or signal processing you would write an algorithm and say "iterate until something is smaller than epsilon", and you don't care about the number of iterations; you say, I have an algorithm that provably converges, but in practice you just stop after 15 iterations and cross your fingers that this is not too bad. What you do in machine learning, in some sense, is fix the complexity upfront: I have 15 iterations, or so many multiply operations, and then I try to do the best I can within this allowable complexity. From a hardware implementation perspective I find this makes much more sense: start from a compute-constrained model and do the best you can. So typically that is what decides the model complexity. And most of the time, and I think this is often overlooked, to be competitive with an expert baseline you need to put a lot of expert knowledge into your model; if you try to just learn everything, you will end up with a model which is multiple orders of magnitude
more complex than a human-designed baseline into which you have already injected everything you knew.

Question: In a certain sense, this transmission problem relates to the sphere-packing problems that have been studied by mathematicians, and I was wondering whether neural networks can help to find good configurations of balls in high dimensions, i.e., good sphere packings.

Answer: We tried this: we tried to learn, for example, 256 points in eight dimensions. In that case, at the time I looked into it, I didn't think the optimal sphere packing was known, although there is a lattice claimed to be optimal; (from the audience: it was proved recently) right, but at the time I looked into it, it wasn't. The thing is, you come close to something like it, but not exactly. In higher dimensions, say twelve dimensions, this is what we are currently looking into; it is interesting in optics, where we seem to be better than heuristic or alternative methods, but we have only just started. And there is this other thing: I am not 100% certain that the optimal sphere packing is also the best constellation for the bitwise mutual information; I think in many cases it is not, and it depends on the channel, so for the channels I am interested in it is not, but it is an interesting problem. What we have seen is that the dimensionality is not the problem, it is the number of points: we tried once with 16,000 points in, I think, four dimensions, and this was a disaster; it really depends on the initialization, so you run thousands of seeds and each time you get stuck in some other local minimum. For fairly small problems I have never witnessed that we got stuck; you never need to run multiple seeds, you always seem to converge to, or at least close to, a local minimum which is not too bad, but in these very high dimensions it seems to be really dependent on the seed. I am not a theorist, I just apply things, so I don't have a mathematical interpretation of what is going on.

Okay, then let's stop; I think everyone is hungry. I typically eat at 11:45. [Laughter] [Applause]
[Music]