
Why Transformers Work


Formal Metadata

Title
Why Transformers Work
Subtitle
And RNNs Fall Short
Number of Parts
130
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
This will be a technical talk where I'll explain the inner workings of the machine learning algorithms inside of Rasa. In particular I'll talk about why the transformer has become part of many of our algorithms and has replaced RNNs. These include use-cases in natural language processing but also in dialogue handling. You'll see a live demo of a typical error that an LSTM would make but a transformer wouldn't. The algorithms are explained with calm diagrams and very little maths.
Transcript: English (auto-generated)
Hello everyone, my name is Vincent. Just to get a quick confirmation: people can see this screen, the purple thing? Yes? Okay, perfect, everything works. My name is Vincent and I work as a research advocate at Rasa.
Essentially my job, by and large, is to make sure that whatever our lovely team of researchers comes up with is relatively well understood by our developers, and that we contribute a little bit of knowledge to the whole NLP realm. There's this technology you might have heard of called the transformer, and the goal of this talk is to give a somewhat intuitive explanation of how transformers work, why they work and why it matters. I also want to demonstrate how, in two core areas of our chatbot technology, we're actually making use of this quite a bit. So I hope that's going to be interesting to some of you.
Before I go in depth into how this technology works, let's first discuss the problem we have at hand. At Rasa we provide you with the technology to build your own digital assistant. That could be a chatbot, it could be whatever you'd like, but it means we have a conversation with an end user, and the conversation could go something like this. The user says hello, and the digital assistant says hello back. Then the user says, hey, I would like to buy a pizza, and there's a reply asking what kind of pizza. Then the conversation could get interrupted, because the user might ask: by the way, are you a human? We think it's ethical to never pretend that we're human, so it's really important that we first reply: no, I'm not a human, I'm a bot. But then we would like the chatbot to automatically pick up the conversation where we left it. That's one of the scenarios you might have.
There's actually quite a lot happening in just this short conversation if you think about all the moving parts and the things we would like to detect. Every time a user speaks, I could argue that's an intent: every utterance suggests that the user wants something, and it's for us to figure out what to reply. But it's not just that each utterance is an intent, which you could call a classification problem. It's also the case that inside the text there's a specific bit of subtext that carries a lot of meaning. For the "buy" intent, for example, the person wants to buy something, and what does the person want to buy? A pizza, but it could also be that the person wanted to buy a burger or a different type of product, and we really need to pick up this entity in order to give the best reply. So we have this interesting mapping from intents and entities to actions that the digital assistant should take. Note that I've drawn a couple of actions here: saying hello is an action, but there's also an action happening right after it, because the chatbot has to recognize when it should listen and when it should speak, so listening is also an action we have to detect. This is basically one of the core problems we have: we want to provide you with, essentially, Lego bricks or building blocks to help you navigate this sort of problem. That means we have models to detect these intents, models to detect these entities, and models to detect these actions, and all of these, by the way, are super open source, so you can check them out. We are interested in techniques that can do this accurately, and not just in English: we really want systems that work well for other languages too, Chinese, Zulu, we want our system to be broadly applicable.
Now, the interesting thing about the problem I've just described is that both parts have a sequence structure if you think about it. If text appears in a sentence, you could say it's kind of like a sequence: there's a first word, a second word, all the way up to a fifth. And if we look at the dialogue, you can also argue there's a sequence happening: we've got this text at time step one and this text at time step two, so that's a sequence over time. The way we're going to deal with these text problems is going to be somewhat related to time series.
I'm going to try to at least intuitively explain to you what a transformer is and what it does, but in order to get there I'm going to make one analogy. Let's say I've got a time series, something that's happening over time, and I slide a filter over it. As the filter moves, at any one point in time there's going to be more attention in some places than in others: right now there'd be a lot of attention on these points, and a moment later a lot of attention around those points. This is a technique you can use to de-noise the original data; it's a filtering technique. You can give it a wide attention span or a thin attention span, but the idea is that I start out with a time series of dots, I have what I'll argue is an attention mechanism that says how to smooth it out, and you could argue that the thick red line on top is a more contextual representation of the original data I started with. There's less noise in my data after applying this filter, and that might be more useful for whatever task I have later. So this filtering, this pre-processing step, is something that can actually be really, really useful.
The question is: can we do something similar for text? Because if we think back to the original problem, the text sequences we have in the chatbot setting are also sequences. So let's pick apart what's happening in this example. Say there's a dot that I'll call x_i, because we're at index i. As far as that x_i is concerned, this arc describes where we're going to spend our attention, and we won't put any effort into points that are far away; we don't have to care about those at all. By deciding, for this time series, where to put our attention span, we give ourselves the opportunity to come up with that more contextualized line, because it allows us to re-weigh the original points. This filtering step is effectively taking a weighted mean: points that are close to the point in question get weighted more than points that are further away. So it really is a weighted mean that we're doing here, and I would love to do this weighted mean for text as well, but there I have to be really careful, because maybe I cannot use this proximity argument anymore.
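Before moving to text, here is a minimal sketch of such a distance-based weighted mean in Python. The Gaussian kernel, the bandwidth and the toy data are illustrative assumptions, not something prescribed in the talk.

```python
import numpy as np

# Toy noisy time series: the "dots" in the diagram.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200)
y = np.sin(t) + rng.normal(scale=0.3, size=t.shape)

def smooth(t, y, bandwidth=0.5):
    """Distance-based weighted mean: nearby points get most of the attention."""
    smoothed = np.empty_like(y)
    for i, t_i in enumerate(t):
        # Gaussian "attention span" centered on t_i; far-away points get ~0 weight.
        w = np.exp(-0.5 * ((t - t_i) / bandwidth) ** 2)
        w /= w.sum()              # weights sum to 1, like a probability distribution
        smoothed[i] = w @ y       # weighted mean of the original points
    return smoothed

y_contextual = smooth(t, y)       # the "thick red line": a denoised, contextual version
```

A wider bandwidth corresponds to the wide attention span mentioned above, a narrow one to the thin attention span.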
Now, if you're a little bit familiar with natural language processing techniques, imagine we have tokens in a sentence. I've got "bank of the river", and I've got this other sequence of text, "the money on the bank", and we have tokens that correspond to these texts. You might remember that we have these things called word embeddings, vector representations for words, and let's say I'm interested in doing the re-weighing trick based on those vectors. Well, then I hope you'll agree that to get the most context for the word "bank" over here, proximity is just not going to cut it. If I ask which other word in the sentence gives this token "bank" the most context, it should be the token "river": that one word gives me much more information about this word "bank". And the same over there: if I want to understand what the word "bank" means in the second sentence, I would really like some of the attention to come from "money" instead, not from "the", a word that may be more proximate but doesn't give me any extra meaning. So we're going to apply a mathematical trick to go about this instead.
I hope you keep in mind that we're still trying to get this attention mechanism to re-weigh all of these vectors; that's the idea. Now, as luck would have it, there's a mathematical coincidence. Say this vector and that vector are similar in some way. If you have pre-trained word embeddings, you can just take the dot product between the two vectors, and if that number is relatively big, that's an argument for saying they might be more similar; in practice this seems to hold. And you can imagine that if you look at the word vectors for stop words, which don't carry a lot of information, there's probably not going to be a huge overlap between the dot product of "bank" and "of", but there might be a lot of overlap between "bank" and "river". That means that instead of using the proximity we used in the time series to decide where to put our attention, we might be able to do something with the word embeddings that we already start out with.
So you might get an attention mechanism, so to say, that looks like this, and this is a rough sketch of course. Let's say I take the dot product between v1 and v1, v1 and v2, v1 and v3, v1 and v4. That might be an okay distribution of my attention for this token. As far as "bank" is concerned: sure, the token "bank" itself is important, "of" and "the" are not important, and "river" probably is important, and the way I derive this is through the dot products with the other word embeddings. Note that we have a very similar thing for the other sentence; the main difference is that there I'm projecting everything onto v1 and here onto v4, but it's by and large the same calculation, it's just that one "bank" is at the end of the sentence and the other is at the beginning. So if we have reasonably proper word embeddings, this can be an okay way to do some re-weighing.
But before we do the actual re-weighing, there's one thing we should remember: if I take the height of this bar and add the height of that one, and that one, and that one, I might get a number that's not exactly equal to one. And if you're going to do re-weighing, it's kind of nice if all your weights sum to one, like a probability distribution. So what we'll do is a normalization step afterwards, and then we can say: these normalized values are the weights I can go ahead and use for my attention. The way it would work, and I apologize for the math symbols, but I'll explain what they mean: I start out with this set of word embeddings, I do all the dot products, I normalize everything so I have these weights, and then I take word vector one, multiply this with this weight, that with that weight, and so on, and add it all together to get a more contextualized word vector out. Effectively, what these weights mean is how much this vector should listen to itself, to its neighbor, or maybe to the word all the way at the end of the sentence. So this is the new, re-weighed word vector that should potentially have more context; it's like that de-noising thing we did with the time series in the beginning. And if you can do it for v1, you can also do it for v2, v3, or, as I've drawn here, v4. It's still just re-weighing that we're doing.
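Spelled out by hand, this dot-product-plus-normalization step could look like the sketch below; the four-dimensional vectors are made-up toy numbers standing in for pre-trained embeddings, and softmax plays the role of the normalization step.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up toy embeddings for "the money on the bank" (not real pre-trained vectors).
vectors = {
    "the":   np.array([0.1, 0.1, 0.0, 0.0]),
    "money": np.array([0.9, 0.1, 0.0, 0.2]),
    "on":    np.array([0.1, 0.0, 0.1, 0.0]),
    "bank":  np.array([0.8, 0.2, 0.1, 0.3]),
}
tokens = list(vectors)
V = np.stack([vectors[tok] for tok in tokens])   # 4 tokens x 4 dimensions

# Attention for the token "bank": dot product with every token, then normalize.
scores = V @ vectors["bank"]
weights = softmax(scores)            # the weights now sum to 1
bank_star = weights @ V              # the re-weighed, more contextual vector for "bank"

for tok, w in zip(tokens, weights):
    print(f"{tok:>6}: {w:.2f}")      # "money" and "bank" out-weigh "the" and "on"
```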
If I compare this to what we saw before: we have some token at some point in time, we have some attention mechanism, here the dot product of word embeddings, that gives us a re-weighing factor, and by re-weighing all the points we already have, we get this contextualized word embedding. I hope you appreciate the analogy: it's a very similar filtering technique to the one in the time-series domain, but now the dot product is the hack that lets us do something similar in the language domain. And that's really nice and convenient, because in a time series you can assume that things close to each other are related, and you really cannot do that in language: "bank" needs to listen to "river" if you want to understand what "bank" means in that sentence. So on the right-hand side we have re-weighing based on distance in time, and on the left-hand side we have re-weighing based on embedding similarity, so to say. And yes, to do this trick you will need some sort of word embedding that's already pre-trained, but those are relatively common.
They're available to some degree already, I would argue. Okay. So we have "bank of the river" and "the money on the bank", and we have this system where we start with some word embeddings, they go through what I'll argue is an attention block, and out comes a vector that's more contextualized; that's what the star means. And I don't just have that for v1, by the way: v2, v3 and v4 can all go into that block, this operation happens, and out come v1-star, v2-star, all the way up to v4-star. That's the idea. And because we have that block I just drew, maybe we can put it in a neural network as well; we might have all the operations we just did defined as a layer, which would be kind of nice. So I'm going to take one sip of water so everyone can have a breather, but the idea is that the mathematical operations we've just described can be rephrased as proper Keras layers, and that's interesting because then you can do this with your deep learning stack.
So what I'm going to draw now is this self-attention block, and I'm going to be a little bit diagrammy, but I'll repeat all the steps I did just now. On the left-hand side I've got all of the vectors coming in, and on the right-hand side I've got my vectors that are more contextualized. Now, the incoming vectors are a kind of array of vectors, which means that what actually goes in is more like a matrix than a set of vectors, because a collection of vectors can also be represented as a matrix. The same holds on the output side, so you might argue these aren't sets of tensors, but one big two-dimensional tensor. The first thing that needs to happen is a matrix multiplication: like before, we have the v1-times-v1 thing and the v1-times-v2 thing, and all of that happens here. After that comes the normalization step, so that we have the normalized weights in between. And then we again do the multiplication down below, something like v1 times w_i1 plus v2 times w_i2, and so on for all the vectors, added together, which is how we get the contextualized vector. I'm not going to do the whole formal maths thing, I think that would be a bit boring for a Python talk, but I do hope you appreciate that these are just matrix multiplications and some normalization steps, and a matrix multiplication is essentially a layer inside a deep learning framework. So we're in the green as far as implementation possibilities go.
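As a sketch of how this could be phrased as a layer, assuming TensorFlow/Keras (the class name and shapes here are illustrative, not the exact layers used inside Rasa):

```python
import tensorflow as tf

class SelfAttention(tf.keras.layers.Layer):
    """Bare self-attention: just the re-weighing trick, no trainable weights yet."""
    def call(self, x):                                   # x: (batch, tokens, dim)
        scores = tf.matmul(x, x, transpose_b=True)       # all pairwise dot products
        weights = tf.nn.softmax(scores, axis=-1)         # each row sums to 1
        return tf.matmul(weights, x)                     # contextualized vectors out

x = tf.random.normal((1, 4, 8))       # four tokens with 8-dimensional embeddings
x_star = SelfAttention()(x)           # same shape, every vector re-weighed by its neighbors
```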
That's kind of nice. But if I now start thinking about this: the system is pretty cool, we have an attention mechanism that's already kind of useful, but we should think about the properties of what we've got, because this is not a standard neural network layer. You'll notice that I'm doing multiplication, which is great, but I'm not learning anything yet; there are no weights in this matrix multiplication. If you think about a dense layer in a neural network, there are weights you're going to train: there's a label somewhere, you get your gradient update. But currently there are no weights at all in this entire system. That means the vectors I get out are going to be really general; they're effectively not trained toward a certain task. And you can imagine that if you have a label, say you're doing entity detection or translation or whatever with this neural network, you still want the attention mechanism to be able to learn and to specialize toward that task.
So, roughly, what you could do, and I'm definitely skipping a couple of steps here because I want to keep it at the intuition level, is move some stuff around inside this schema and put a couple of neural network layers in there. The idea is that these can still be regarded as matrix multiplications, but now they are multiplications with weights that we're going to learn. The matrix multiplication we had before was between this matrix and itself; these layers effectively say: that's a multiplication between these vectors and some other set of weights that we're going to train. That's the pattern we'll have to learn. The idea, essentially, is that at some point all of this is connected to some sort of label, and if I've got a label, I have a gradient update that tells me how well I've been doing. Because I now have those gradient updates, they travel through the system and eventually update these layers. You can extend this idea further, but the main point is that we've just found a couple of places where we can put extra layers, so we can also specialize this attention mechanism towards a certain task. I hope it's plausible that if you're doing neural translation you want to learn different things than if you're doing entity detection, for example, so I hope it makes sense why we might want that.
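One way those extra, trainable layers could be slotted in is the common query/key/value formulation sketched below; the exact placement of the projections, and details such as scaling, are assumptions here, since the talk deliberately stays at the intuition level.

```python
import tensorflow as tf

class LearnableSelfAttention(tf.keras.layers.Layer):
    """Same re-weighing idea, but the dot products go through trainable projections,
    so the gradient signal from the task label can specialize the attention."""
    def __init__(self, dim):
        super().__init__()
        self.query = tf.keras.layers.Dense(dim)   # weights updated by gradient descent
        self.key = tf.keras.layers.Dense(dim)
        self.value = tf.keras.layers.Dense(dim)

    def call(self, x):                            # x: (batch, tokens, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = tf.matmul(q, k, transpose_b=True)
        weights = tf.nn.softmax(scores, axis=-1)
        return tf.matmul(weights, v)

layer = LearnableSelfAttention(dim=8)
out = layer(tf.random.normal((1, 4, 8)))          # contextualized, now task-trainable
```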
There are variants of this, I should say. One thing some people really like to do is add lots and lots of extra layers like this in parallel; each of them is called a head, and by having a couple of them in parallel you get what people like to call multi-headed attention. But to keep it simple for now: what we're doing is still nothing more than a variant of the time-series task. The main thing to remember is that word embeddings come in over here and they get contextualized, and these contextualized embeddings are able to focus on different parts of the sentence. That's very useful, again, in the example of "bank of the river" versus "the money on the bank": "bank" should mean something different in these two sentences. That's the idea.
Now, to complete the whole story: we're almost, intuitively, at the transformer, but not quite. The main component that sits in the middle of a transformer, the thing that made it slightly different, interesting, an extra step in the ecosystem, is that attention block. But what people like to do is put an extra dense layer at the end, and maybe a dense layer in front of it, and all of those components together are what people like to call a transformer layer. Again, I should stress that I'm skipping a couple of details and there are many variants of this, but at least on an intuitive level, this is it: this is a transformer layer. And again, the idea is that we have text, the text is converted to word embeddings, that's a sequence, and at some point that sequence is passed through this transformer, and we get something out that's more contextualized. Eventually that's hooked up to a label so we can start learning, which means the transformer is able to update any of the weights inside it; that's what a transformer does when it's learning.
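A rough sketch of such a transformer layer with stock Keras components: a dense layer in front, multi-head attention in the middle, dense layers at the end. The residual connections and layer normalization are one common variant, added here as an assumption rather than something spelled out in the talk.

```python
import tensorflow as tf

def transformer_block(dim=32, heads=2, ff_dim=64):
    """A minimal transformer layer: attention in the middle, dense layers around it."""
    inputs = tf.keras.Input(shape=(None, dim))                  # a sequence of embeddings
    x = tf.keras.layers.Dense(dim, activation="relu")(inputs)   # dense layer "in front"
    attn = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=dim)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)          # residual + norm
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ff = tf.keras.layers.Dense(dim)(ff)                         # dense layer "at the end"
    outputs = tf.keras.layers.LayerNormalization()(x + ff)
    return tf.keras.Model(inputs, outputs)

block = transformer_block()
out = block(tf.random.normal((1, 4, 32)))   # word embeddings in, contextualized embeddings out
```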
I'm definitely glancing over a couple of details, but in the end, intuitively, we're still doing the same thing as with the time series. If there's extra time I have an appendix so we can go a little more in depth, but what I want to do now is move on. That was the theoretical part, where I explained what a transformer sort of is and what it sort of does. In the next bit I hope to show you how we actually apply these, because this transformer layer is nice, but we've made some customizations to it at Rasa to build models that we think are pretty good in a lot of scenarios for digital assistants. And this is the part where it gets exciting, because we've been able to customize this transformer for our needs, and I think there's a nice lesson in that.
One thing we like about this transformer, and this is a bit of a hand-wavy argument: if you were to say that this all sounds really familiar, you wouldn't be wrong, because RNNs, recurrent neural networks, were doing something really similar. For an RNN, too, word embeddings go in and word embeddings come out. So why would this be better than that? Well, the answer, kind of, is this: imagine I haven't seen any training data whatsoever. Then the node I have over here in the RNN is, even without any training data, slightly biased to listen to this one and this one; you're going to need a whole lot of data before this node starts listening to a node far away. That's a bit of a bummer: there's already a preference for words that are close to each other, even before it has seen any data. That's different with the transformer. If the transformer hasn't seen any data whatsoever, the attention is pretty much uniform over the entire sentence, which, if you're asking what a good starting point to learn from is, is a lot better, especially in our scenario, where we typically don't have much training data. If you're starting out with a digital assistant and designing a chatbot, it's probably going to involve different language than what people use on Wikipedia. So very often you start from scratch, and you're not going to have bajillions of conversations to train on; it's typically maybe tens of conversations when you're starting out. So we like having a system that's relatively robust when starting out in a lightweight fashion, and I hope that this hand-wavy argument, and it is hand-wavy, I agree, also makes you appreciate intuitively that the transformer is a little bit more flexible in that scenario.
There are also more parallelization options. Anyway, enough theory, I want to go to practice. If we look at the original problem, there are two systems in Rasa that use a transformer. The first is the system that detects intents and entities: those are two tasks, and we've built one algorithm to handle both. The other part is the system that detects what the next best action is. These two systems use a transformer in different ways, and I'd like to show you how that works. Intents and entities are found by the system we call DIET. You can definitely also use scikit-learn if you have your own preference for that, that's perfectly fine, but the algorithm we've designed recently is called DIET, which stands for Dual Intent and Entity Transformer. The idea is that we've built a system that can handle both intents and entities, and because it handles both at the same time, we have reason to believe it might also be more accurate at many tasks.
The way it works: I've got one utterance from a user on the left-hand side, and it says "play ping pong"; that's what the user is sending my way. The way we encode that internally in Rasa is to say: you probably have some sparse features, like a count vectorizer that just counts how often each word appears. We typically generate these features for every single word, and we also try to have one token that represents the entire sentence. So if this were one-hot encoded, I'd have a 1-0-0 here, a 0-1-0 there and a 0-0-1 there, the counts for the words, and in that scenario the sentence token would have three ones, because it counts all three of the words. A similar thing would happen with word embeddings as well. But the idea is: this is the utterance that I get, this is what I start out with.
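A tiny sketch of what those sparse count features could look like for "play ping pong", with a summed sentence-level vector standing in for the extra whole-utterance token; the real featurization in Rasa has more to it, this is just the one-hot intuition.

```python
import numpy as np

vocab = ["play", "ping", "pong"]
tokens = ["play", "ping", "pong"]          # the utterance, one token per word

# One count/one-hot style feature vector per word: [1,0,0], [0,1,0], [0,0,1].
token_features = np.array([[1 if v == tok else 0 for v in vocab] for tok in tokens])

# An extra feature vector for the whole utterance: counts all three words -> [1, 1, 1].
sentence_features = token_features.sum(axis=0)

print(token_features)
print(sentence_features)
```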
Then we pass that through one or two feed-forward layers; the main reason for that is to make sure we have a hyperparameter for the size of what comes out. And then we pass that into a transformer. The interesting thing to note is that we also pass this class token into the transformer, as if it were just another token, something we throw in there. By the same argument as before, what I get out for it is essentially a vector like the one that went in, but more contextualized, and the same holds for every other position. In a minute we'll hear an argument for why that might be super powerful. Because these words are all separate words, each of them could be an entity, and in this particular case you could argue: the user wants to play a video game, but which video game? Well, ping pong. So I can have an entity layer over here, and there's an entity that should be detected there, namely the game I want to play, and another one over there, and you can imagine there's a nice little correspondence between the token I have here and the entity label above it. A very similar thing happens for the intent: I've got a separate layer for the intent, and the intent here is "play a video game".
But here comes something that's actually quite amazing. Let's think about the gradient signal, the feedback I get if I take this system, apply gradient descent to it and try to learn patterns from it. As an example, say I get a gradient update from this intent; in production I'm definitely also getting gradient updates from the entities, but to keep it simple, assume there's just the update from the intent. I'll be updating this layer, but then there's this attention mechanism in the middle. And you could argue that for the intent "play a video game", the word "play" is going to be a key part of detecting it: if a user types the word "play" in a commanding tense at the beginning of the sentence, you could plausibly argue that I don't need the rest of the sentence, that word alone already tells me what the intent might be. That also means the gradient update that this layer and this layer receive is probably going to be more fundamental and stronger than the updates the other tokens receive, because those don't necessarily tell us anything about the intent immediately; it might be that you want to play a game or buy a game, though the fact that there's a game in there might still give a boost. The nice thing about this transformer is that it also routes the gradient update in an appropriate way. Again, a hand-wavy argument, I'm definitely aware of that, but to me at least it feels intuitive that this is going to be extremely useful here, because it isn't just happening with the intents; it's also happening with the entities, especially if a certain intent has certain entities that occur together often and very infrequently without each other. That's a pattern this system can learn. Instead of having two separate models, one for intents and one for entities, we think that by having this transformer in the middle we might actually be able to handle both, and so far it seems that DIET is working quite well.
One of the nice things about DIET, and something I do want to mention, is that this system will work even if you don't have any word embeddings at your disposal, and that's really important to our users. English word embeddings are frequently out there, but there are languages, like Zulu, where word embeddings aren't readily available. I'm working on some open source projects to make sure we also get word embeddings for those, but the nice thing about this system is that even if you don't have word embeddings, either because no good ones exist for your language or because they're maybe too heavy, the system will still work and do its best; there are still lots of features that can be learned. Another nice thing is that if you want to go super heavy, say you're a big fan of the new BERT models out there, we can chuck those in as well, so if you have a use case for it, you can put the heavy word embeddings in there too. That's the nice thing about this model, and I think the transformer helps here: the model can be used in different ways, according to whatever you think is best for your application, and that customizability is a super nice feature.
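To give an idea of what that customizability looks like in practice, a Rasa pipeline around DIET might be configured roughly like this; take the exact component names and options as a sketch that depends on your Rasa version, with the pre-trained featurizer as the optional "heavy" variant.

```yaml
# config.yml (sketch)
language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer            # sparse features, no pre-trained embeddings needed
  - name: CountVectorsFeaturizer
    analyzer: char_wb                       # character n-grams also help low-resource languages
    min_ngram: 1
    max_ngram: 4
  # Optional heavyweight extra, e.g. BERT-style dense features:
  # - name: LanguageModelFeaturizer
  #   model_name: bert
  - name: DIETClassifier                    # one model for both intents and entities
    epochs: 100
```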
So that customizability is partly a feature of the DIET system we've built on top of the transformer, but the gradient routing inside the transformer is a nice milestone too; there's something really cool happening inside it. This is the base model we provide inside Rasa to handle entities as well as intents. What I'm going to conclude with now is how we also handle actions, but I hope the intuition already has your mind boggling at this point.
We also have this thing called TED, the Transformer Embedding Dialogue policy, and this is what we use to decide which action to take. I'll go over this relatively quickly in the interest of time. If a user says "hey, I'd like a pizza", then we have a model called DIET that gives us an intent, and an entity that's detected, also from DIET. We might also have some long-term information filled in, like the address of the user, something we already know, and we also know the previous action. You could argue that's the feature space I have for an utterance at a single point in time. It would be nice if you could pass that to a model and have the model say: this is the best action you should take. But again, we have a sequence here, right? So we have those feature spaces over time, we can chuck them into a model, actions come out, and we have a sequence on that side as well. I hope you can guess what we're actually using inside this model to facilitate that: it is another transformer. But the main thing that's awesome about this, the one thing that struck me as super interesting, is that you don't have to use word embeddings in order to use a transformer. We're not even dealing with text per se; these are just features that we know at that moment, and it's sparse data. Still, in our experience so far, we're able to use a transformer quite well in situations where a user interrupts the chatbot and we suddenly have to make the right decision.
I'm going to skip a couple of details about how this transformer works; it's a unidirectional transformer, because we're not allowed to look into the future, so there are some interesting details there. But in the interest of time I just want to do one quick demo, so we still have a little bit of time for questions. What I have here is the Rasa configuration file, and I also have some training data where the task is that the chatbot is supposed to count down, starting from ten when you ask for a countdown. I'm also asking it not just to count down, but to be able to recover if I interrupt it by asking whether it's a chatbot. So with this input, if I say "hey, start counting", it starts counting at ten; then I say "okay", but it starts counting at five, and when I say "okay" again it's messing up. The reason it's messing up is that in this configuration, for this particular TED policy, I'm saying it's only allowed to look one time step back, which is not enough information for the transformer to learn a nice pattern. It gets a whole lot better if I say: your max history is actually supposed to be three, you're allowed to look back three time steps. Then you notice it's actually not the worst: it responds, it can count down, and when I ask "hey, are you a bot?", it's able to say that it is a bot, not a human, and it starts counting down again. But it's not able to generalize: there's a moment that's not in the training data, and it's not able to look far enough back to get the right context. However, if I put the max history at ten, then we see that even if I try to interrupt it at instances that aren't in the training data, it is able to recover, it is able to keep counting down, and at every instance it still replies that it's not a human, which is a feature that we like. The reason I think this is a nice result is that when we tried doing this not with the transformer but with an LSTM-type model, we noticed that the LSTM actually has a whole lot of trouble: so far, in this benchmark, you need to give it way more data for it to learn that it's still supposed to count down after saying that it is a chatbot and not a human.
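For reference, the knob being toggled in that demo lives in the policy part of the Rasa config file; a sketch (exact options depend on your Rasa version) could look like this, with max_history switched between 1, 3 and 10 as in the demo.

```yaml
# config.yml, policy section (sketch)
policies:
  - name: TEDPolicy
    max_history: 10     # how many dialogue turns the transformer may look back on
    epochs: 100
```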
Anyway, there are only a couple of minutes left, so very quickly, the one thing I do want to remind everyone of: super fancy machine learning is great, but if you're building your own chatbot, definitely label your own data and learn how users are interacting with your chatbot. If this was interesting, know that I'm paid to do this on the Rasa channel on YouTube: I have lots of videos that go more in depth into the material I've discussed today, so if you're interested, there's the Algorithm Whiteboard series on YouTube that you can definitely go and check out. Finally, one thing I'd also like to mention: on behalf of Rasa I've open-sourced and maintain this project called whatlies, which is an interactive visualization tool for many common word embeddings, and some of the features I'm currently implementing are features for detecting bias in word embeddings. If that's interesting, reach out to me, because these are features I'm building; I'm going on holiday next week, but the week after I'll be implementing some things that help you detect bias in word embeddings.
Anyway, a little bit of a rush at the end, but I hope people found this interesting, and now would be a good time to go ahead and ask me some questions. I've already noticed some people asking what I'm using gear-wise, because people seem to like this pen: I've got a Wacom One tablet, and there's an app for Mac called Screen Brush, and every time I hit alt-tab I'm able to just doodle; you can also do this with your mouse. The app is called Screen Brush. Come to the chat room on Discord and I'll share a link and a tutorial. And I'm using a Yeti microphone; it was 140 bucks before corona started. We also have a bunch of questions in the chat.
Sure, well, which one should I take? All right, I'll ask you one. Sure, go ahead. The moving graphs you made, are they using matplotlib? Oh, yes, they are. I'll go to the Discord and send you the link: there's actually a cool project called gif, and it's basically a decorator that you put around a function that renders matplotlib, and it's just the easiest way ever to get pretty GIFs out of matplotlib. Go to the Discord channel afterwards and I'll share the link, but yes, it's matplotlib based.
Okay, the next one: really interesting explanation of DIET; what do you have available to QA a predicted label, or to explain the label to a user? Oh, that's super tricky, in essence because there's a transformer in the middle, and you can have multiple transformers in sequence. What you can do, and that's what whatlies does, is visualize the vectors going into the transformer, and typically, if there are clusters there, that's something that can help inform you. But what Rasa also does is give you an overview of common mistakes that happen often; what we mainly try to do is tell you: hey, these two intents are often confused with each other,
and then you can investigate from there; that's typically the approach that we offer. Okay, so this is a question from Francesco: you said that RNNs perform worse because they are intrinsically biased toward assuming that words that occur closer together tend to be more related; however, isn't this assumption in most cases more correct than assuming, as a transformer does, that a word in a sentence has equal chances of being related to any other word in the sentence? Well, that's why I mentioned it's a hand-wavy argument, right? It's not necessarily 100% solid and perfect, and it definitely depends on the use case. For our use case, though, you have to imagine that when you get started with a digital assistant, your first demo is going to be based on twenty really, really short sentences. So at least for us, for our specific use case, that one assumption holds just a bit more. You are definitely correct if you say: okay, but now let's do this for an entire document, and the document is twenty pages; sure, then the situation is different and you cannot make this argument anymore. But at least so far in our results it seems to be confirmed; that's the main thing I'd like to say about that. All right, so this question asks: how does this compare to GPT-2 or the recent GPT-3?
Okay, so I wrote a blog post on that, which I'm going to be spell-correcting in about five minutes. The main thing with GPT-3, and I hope people take this seriously: don't underestimate what sort of bigotry that thing can generate if you give it sentences like "the man is known for", "the woman is known for", or "the person with black skin is known for"; you will see that it generates all sorts of stereotypes. And yes, it's great that we have a system that can generate text, but that's not the use case that our users have. Our users would like to build a digital assistant that is reliable and predictable, where you can order a pizza and do that kind of thing. We are currently investigating GPT-3, maybe as a way to have a human in the loop and generate more expressive training data; that can be a use case. But in general I think it is a really bad idea to have something like GPT-3 generate text as if it were the chatbot; I have trouble coming up with a good use case for that, especially in enterprise.
Okay, thank you. Thank you very much for your talk, and I think this is for you. Sure, I'm happy people liked it, and if there are any questions I'll be on Discord for another twenty minutes or so to send links and such, and I hope people learned something today. Yes, thank you.