Streaming Pipelines for Neural Machine Translation

Video in TIB AV-Portal: Streaming Pipelines for Neural Machine Translation

Formal Metadata

Title
Streaming Pipelines for Neural Machine Translation
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Machine Translation is important when having to cater to different geographies and locales for news or eCommerce website content. Machine Translation systems often need to handle a large volume of concurrent translation requests from multiple sources across multiple languages in near real time. Many Machine Translation preprocessing tasks, such as Text Normalization, Language Detection and Sentence Segmentation, can be performed at scale in a real-time streaming pipeline using Apache Flink. We will be looking at a few such streaming pipelines leveraging different NLP components and Flink's dynamic processing capabilities for real-time training and inference. We'll demonstrate and examine the end-to-end throughput and latency of a pipeline that detects language and translates news articles shared via Twitter in real time. Developers will come away with a better understanding of how Neural Machine Translation works and how to build pipelines for machine translation preprocessing tasks and Neural Machine Translation models.

Speaker Bio Suneel Marthi: Suneel is a member of the Apache Software Foundation and is a PMC member on Apache OpenNLP, Apache Mahout, and Apache Streams. He has given talks at Hadoop Summit, Apache Big Data, Flink Forward, Berlin Buzzwords, and Big Data Tech Warsaw. He is a Principal Engineer at Amazon Web Services.

Speaker Bio Jörn Kottmann: Jörn is a member of the Apache Software Foundation. He has contributed to Apache OpenNLP for 13 years and is PMC Chair and committer of the project. In his day jobs he has used OpenNLP to process large document collections and streams, often in combination with Apache UIMA, where he is a PMC member and committer as well.
Hi everyone, good evening, and thanks for coming to this talk. I am Suneel Marthi and this is Jörn Kottmann. We both work on Apache OpenNLP and a bunch of other Apache projects, and we are members of the Apache Software Foundation. Today we will be talking about machine translation. Before we start, a show of hands: how many of you actually have a use case that involves machine translation? And how many of you use neural networks, or neural machine translation as opposed to statistical? Okay, great.

So what's the agenda for today? We'll be talking about what machine translation is, what neural machine translation is and what the challenges with it are, and how you train your models offline in batch mode and call them from an inference pipeline in real-time streaming mode. And we'll have a small demo. I do need some help and support from you for the demo, because the demo gods have not been good to me of late. How many of you here are native German speakers? Okay, three. Well, I was hoping that you could tweet in German with the FOSDEM hashtags so that we could translate those tweets in real time. But anyway, it's okay: without the hashtag, we'll just go with German tweets. Let me just warn you that most German tweets die off after 4 p.m. Berlin time, which is when people go to the beer bars; before that, the tweets are racist or profane or in one of those different categories. With that, let me hand over to Jörn, who will take us through the next few slides.

For this talk we're using Apache Flink, which is a distributed streaming pipeline framework, and Apache OpenNLP, which is used for preprocessing the text we will feed to the machine translation system in the demo. Let me start with what machine translation is. Here is a definition: e is a string in the target language, the language I speak and would like to translate into, and f is a string in the source language, maybe one I don't speak, like Spanish for example. We want to maximize the probability of e given f, and the best translation from the model we have is the one with the highest probability.

Let's look at how we can translate one word. For one word you could just go to a dictionary and look it up. Here we have the example of a German word that could translate to these three words: building, house or tower. Of these choices, building and house are much more frequent than tower, but you would still need to look at the context to see which word you want to use. Which brings us to parallel text: that's a sentence in some source language together with a translation of that sentence. Here we have one example again, "Das Gebäude ist hoch", and the English translation is "the building is high". Here I have four words in German which more or less translate directly to English, and in this instance I would probably get pretty far with a dictionary. But there are more complicated cases, for example this one: "Das ist ein Hochhaus". Here "Hochhaus" would translate to three words in the target language, and my model would need to deal with this problem, because the sequence length is now not the same anymore: I have more words in the output. In other cases I can have swaps, so I translate one word and it then appears at the beginning of the sentence, or somewhere other than where it was in the original sentence.

Now we come to the neural machine translation part. There's this very nice slide about the encoder/decoder architecture, which is what is mostly used these days. This is a bit of a simplification; we extend it a bit more in the next slides. On the left we have the encoder part, and on the right is the decoder. What we are doing here is inputting one sentence, "He loved to eat", and we hope we are getting out "Er liebte zu essen". Here we have the embedding layer on the left. We take the word "he" and push it into the model. The first thing we have to do: the word itself we can't really use much, so we go to a dictionary, and this dictionary gives back a vector representation of the word. We have some slides about this, and Suneel will speak about it. So we have a vector for the word "he", and this gets pushed into the encoder. This process is repeated for every word. Here it is just written out, but there could be more steps or fewer steps depending on how many words you have. At the end of this process, which is the encoding phase, we end up with this S in the middle, and this S is a vector representation of the sentence. Just imagine somebody tells you a sentence and you have to remember it: that's kind of what's inside this S. The meaning of the sentence, which is now captured in this vector, is then handed over to the decoding part of the system. The decoder has two inputs: the first is the S, and the second is, let's say, a NULL at the start. Now I would like my decoder to give out the first word of the translation, which in this case would be "Er".
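
A minimal sketch of the encoding phase described above: each word is looked up in a small embedding table, and the vectors are fed one at a time through a simple recurrent update to produce the sentence vector S. The toy vocabulary, the dimensions, the random weights and the plain tanh recurrence are all assumptions made for illustration; the talk does not say which encoder cell the real model uses.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

EMBED_DIM = 4   # assumed size of each word vector
HIDDEN_DIM = 3  # assumed size of the sentence vector S

# Hypothetical toy vocabulary; a real system has tens of thousands of entries.
vocab = {"<unk>": 0, "he": 1, "loved": 2, "to": 3, "eat": 4}


def random_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# The "dictionary" from the talk: one randomly initialised vector per word.
embeddings = random_matrix(len(vocab), EMBED_DIM)
w_input = random_matrix(HIDDEN_DIM, EMBED_DIM)   # word vector -> state
w_state = random_matrix(HIDDEN_DIM, HIDDEN_DIM)  # previous state -> state


def encode(sentence):
    """Run the words through a plain tanh recurrence and return S."""
    state = [0.0] * HIDDEN_DIM
    for word in sentence.lower().split():
        vector = embeddings[vocab.get(word, vocab["<unk>"])]
        state = [
            math.tanh(
                sum(w * x for w, x in zip(w_input[i], vector))
                + sum(w * h for w, h in zip(w_state[i], state))
            )
            for i in range(HIDDEN_DIM)
        ]
    return state


s = encode("He loved to eat")
print(s)  # a list of HIDDEN_DIM floats summarising the sentence
```

The number of loop iterations follows the number of words, which mirrors the remark that there can be more or fewer steps depending on the sentence, while S always has the same size.
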
Now I just repeat this: I again give the S as input, and I also input the first word that was translated, in this case "Er". This process gets repeated until the entire sequence is decoded and I have my translation.

To make this work a bit better, this is usually done with something called attention. With attention, on the decoding part I can look back at the source sentence. That's like when you have a sentence written down and you can always go back to it and see what's written there while you are writing your translation, so I don't really have to remember everything. With attention, the decoding part can look back at the part it wants to translate. So maybe I'm looking here at "Je suis étudiant", and the translation is "I am a student". So maybe I am at "student" at the end, and it knows: okay, I can look at the student word in my input sentence and see how this should be translated. But of course it is much more complex, and these days people would usually use something called a Transformer model, which has many, many layers for the attention, and at the end you have a softmax which outputs the word. This is the "Attention Is All You Need" paper, so if you want to learn more about this, maybe go to this paper and have a look; it probably takes a lot of time to understand it in detail.

The upper function there is the softmax function, which is used to calculate the probability of the output words. The second function is the cross-entropy, and that's a loss function which is used for training. During training you initialize everything randomly, and then you want to figure out parameters which can be used for the translation. So you run the computation forward once, you get something out, and the loss function can then tell you how you should change your parameters. This process usually repeats millions and millions of times until you have some weights which actually give you a translation. This can take a long time, and it also depends on how much data you are training on.

So why do we do NMT these days? Because it just works much better than what we had before. From around 2015 NMT started to work, in 2016 it was already better than SMT, the statistical machine translation that was used before, and since 2017 nobody is doing SMT anymore. Here I'm handing back to Suneel, and he will go through some samples.

Before we go through the samples, just a quick note on the previous slide.
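
To make the two functions mentioned above concrete, here is a small sketch of softmax and cross-entropy for a single output position. The four-word target vocabulary and the raw scores are invented for illustration; a real decoder produces one score per entry of a much larger vocabulary.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, target_index):
    """Negative log-probability assigned to the correct word."""
    return -math.log(probabilities[target_index])


# Hypothetical target vocabulary and decoder scores for one output position.
target_vocab = ["er", "liebte", "zu", "essen"]
scores = [2.0, 0.5, 0.1, -1.0]

probs = softmax(scores)
best = target_vocab[probs.index(max(probs))]
loss = cross_entropy(probs, target_vocab.index("er"))
print(best, round(loss, 3))
```

The loss is small when the model puts most of its probability on the correct word and grows as that probability shrinks, which is the signal training uses to adjust the parameters.
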
Here, if you look at this graph, what happened between 2015 and 2016? In 2016 the orange bar is way above the blue bar. That's when Google Translate switched from their traditional statistical machine translation to neural machine translation using an attention mechanism. The advantage of using an attention mechanism is that translation happens based on the context: you're not just looking at, let's say, the three or four words before the particular word you are translating, and so you come up with a better translation.

On that note, sometimes you do see machine translation go badly wrong, and you may see something very frivolous and fun come out of it. Some examples here. Those of you who speak German, what do you think: is it good, or is it nearly good? Okay, next example. Even better, if you have been using Twitter in the browser: can anyone tell me the difference between the original and the translation here? Okay, so it's not Dutch, but when you take a mathematical function and put spaces in between the plus signs, it gets treated as Dutch. That's Twitter with Microsoft Bing, which is what Twitter uses for translation.

So how do I avoid that kind of stuff? Challenges with machine translation. The first thing to look at is what kind of input you would expect for machine translation. Since we are dealing with neural machine translation, we are dealing with neural networks, and the input to most neural networks is a vector; a vector is an array of numbers. Since you are dealing with sequence-to-sequence models, the training data is two parallel training corpora: it's parallel text. Now the challenge is how you represent a word from the text as a vector, how you convert it into a feature vector. The most common way of doing that today is something called an embedding layer. You create an embedding layer as your input layer. Let's say my input has the vocabulary dog, cat, giraffe, fox, bird, etc. I randomly initialize each of them: "dog" is the vector of these numbers here, and the same with "cat". So I start off with a random initialization and run it through something called word2vec. Once I convert that with word2vec, this is what I get: it learns the real values and comes up with better values than the random ones I started off with.

The one big challenge we have with building machine translation models, especially for languages like German which have a huge vocabulary and extremely long compound words that are impossible to pronounce, is how you deal with unseen vocabulary. The text you get for training is very limited; it's got a very limited vocabulary. For example, the demo I am showing you is a German-to-English neural machine translation model, and it was trained with only 30,000 German words, but I found that it can handle the rest of the unseen words. So how do you handle the unseen vocabulary? If my text has only 30,000 words, what do I do about the rest of the vocabulary? How do I account for that?

This was one of the challenges that was solved by something called byte pair encoding, BPE as they call it, and here's the paper about that from the University of Edinburgh. The way byte pair encoding works is: if this is my input text, "positional addition contextual", I take the most frequently occurring consecutive bytes and replace them with a different byte. An example here: I see that "ti" occurs more often, so I can replace "ti" with an "X". Now I can go down that path recursively: I can replace "iX" with something else and reduce my input size. So I can keep going recursively.
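
The recursive pair replacement described above can be sketched in a few lines. This is a simplified illustration in the spirit of byte pair encoding, not the implementation from the paper or from any particular tool: the toy corpus is invented, merges never cross word boundaries, and the new symbol is simply the two old symbols joined together.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs over all words (each word is a list of symbols)."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(text, num_merges):
    """Greedily learn up to `num_merges` merges; stop when no pair repeats."""
    words = [list(word) for word in text.split()]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words


# Hypothetical toy corpus; real merge tables are learned from millions of words.
merges, segmented = learn_bpe("low low lower newest newest widest", num_merges=3)
print(merges)
print(segmented)
```

Each round shortens the symbol sequences, and the ordered list of merges is what gets stored and replayed later on new text.
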
What you had finally becomes this. It's kind of like a model compression: your eventual model size is going to be much smaller than it would be without byte pair encoding. Your models are learning the translations, and they can decode that back.

The other challenge we have is that when we're training deep learning models, you're dealing with a cluster of GPUs, and your input text comes in different lengths. One input is very small, like only five words, and the next sentence has got 20 words. If your input text is not in a good order, if it's not sorted upfront, then you are not optimally utilizing the GPU cluster. You deal with something called jagged tensors, and this is how it would look: if the maximum length is 17 words per sentence, the first one has only 14, the next one four, and so on. You can see that it's not sorted. So it makes sense to pre-sort your input before sending it to the GPUs for training your models, and this is how you would do that. You can break this up into chunks and send each chunk to a different GPU in the cluster for training, so those are the different batches.

So this is how you do the model training. The things we have taken into account here are: number one, create an embedding layer and use byte pair encoding to account for unseen words, and then sort your inputs to avoid jagged tensors and optimally utilize the GPUs.

So you've trained your machine learning model, the attention network model, offline in batch mode. Now how do you deploy that in a streaming pipeline for inference in real time? If I am Amazon and I want to translate my content from German to English, or English to French, how do I do that in real time? That's what we'll be looking at in the next part of the talk: streaming pipelines.
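
The effect of pre-sorting on padding can be shown with a short sketch. The sentence lengths, the batch size of two and the `<pad>` token are assumptions chosen for illustration; the function names are hypothetical and do not come from any training framework.

```python
def make_batches(sentences, batch_size, presort=True, pad="<pad>"):
    """Chunk tokenised sentences into batches, padding each batch to its longest row."""
    ordered = sorted(sentences, key=len) if presort else list(sentences)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(tokens) for tokens in chunk)
        batches.append([tokens + [pad] * (width - len(tokens)) for tokens in chunk])
    return batches


def padding_cells(batches, pad="<pad>"):
    """Count the padding positions, i.e. the wasted work, across all batches."""
    return sum(row.count(pad) for batch in batches for row in batch)


# Six dummy sentences with the kind of uneven lengths mentioned in the talk.
lengths = [14, 4, 17, 5, 9, 3]
sentences = [["w"] * n for n in lengths]

unsorted_waste = padding_cells(make_batches(sentences, 2, presort=False))
sorted_waste = padding_cells(make_batches(sentences, 2))
print(unsorted_waste, sorted_waste)
```

Sorting groups sentences of similar length into the same batch, so each rectangular batch needs far fewer padding positions, and each batch can then be handed to a different GPU.
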
For this particular demo we have used Flink. Since we had a talk about Beam this morning: you can as well use Beam with the Flink binding. The normal steps in any natural language processing pipeline are: you do language detection first, then you do sentence detection, breaking up your input into sentences based on the language, then you tokenize each sentence into individual words, and then you run your byte pair encoding. In this case I've deployed the model on Amazon SageMaker, but you can as well run it as an RPC server locally, using any model server. Here is the complete pipeline in Flink for this demo. This is how the inference pipeline looks. For this demo I'll be using Twitter as the input source, and I'm running it through a bunch of OpenNLP components for sentence detection for German, and I'm using an RPC client for running the inference.

So here is where I need some help from you: the German speakers, if you can start tweeting, I would really appreciate that. If not, I'll just pull the German tweet feed as is. [Music]

Ready? Okay. It's starting up a Flink cluster. This is a Flink job, I'm putting the job on the cluster, and it's going to be making an RPC call to a SageMaker REST API for getting the model inference. So let's wait for the tweets to come in. FOSDEM hashtag, German tweets please: FOSDEM, FOSDEM 2019 or FOSDEM 19, either one. If that doesn't work, that's fine; I can just pull the normal German tweet feed. While that's running, any questions that you'd like answered? Okay, I'm not having luck with the FOSDEM hashtag and German tweets, so let me just comment that out.
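
The order of the preprocessing steps listed before the demo can be sketched as a chain of plain Python generators. This is not the Flink job shown in the talk: every function below is a hypothetical stand-in (a stop-word heuristic instead of a language detector, regular expressions instead of OpenNLP's sentence detector and tokenizer, and an echo function instead of the remote model call), and only the ordering of the stages is taken from the talk.

```python
import re

# Toy heuristic only: a handful of very common German words.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht", "ein", "ich"}


def detect_language(text):
    """Stand-in for a real language detector: count common German words."""
    words = set(re.findall(r"\w+", text.lower()))
    return "de" if len(words & GERMAN_HINTS) >= 2 else "other"


def split_sentences(text):
    """Stand-in for a sentence detector: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Stand-in for a tokenizer: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def translation_pipeline(tweets, translate, segment=lambda tokens: tokens):
    """Lazily process a stream of tweets, yielding (sentence, translation) pairs."""
    for tweet in tweets:
        if detect_language(tweet) != "de":
            continue  # only German input goes to the German-English model
        for sentence in split_sentences(tweet):
            units = segment(tokenize(sentence))  # byte pair encoding would go here
            yield sentence, translate(units)


def fake_translate(units):
    """Placeholder for the remote model call; it only echoes its input."""
    return "<translation of: " + " ".join(units) + ">"


stream = ["Das Gebäude ist hoch. Das ist ein Hochhaus!", "Hello world."]
for source, target in translation_pipeline(stream, fake_translate):
    print(source, "->", target)
```

Because each stage handles one element at a time, the same structure maps naturally onto the operators of a streaming framework, with the model call as the final stage.
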
Okay, that's not a FOSDEM tweet, but anyway, that's a German tweet. The first line is the actual German tweet and the second one is the translation in English. What do you think about the translation, is it good? I think it's pretty good; compared to Google Translate, it's as good as Google Translate, if not better. "I'm critical of Poland when you start to shut down windows namely"... okay. Anyway, some of them are abusive and racist, so let's ignore those. Okay, any questions?

Here are some links to the attention paper from Google, "Attention Is All You Need", and the slides; we'll upload the slides. Questions, please.

Here we use byte pair encoding. With byte pair encoding you take successive bytes and replace the most common pair with another byte. The advantage of that is that your model size is much smaller; it's kind of like data compression, so it's a smart technique here.

Okay, the one thing I did not mention is that this model was trained on the Workshop on Machine Translation (WMT) corpus, the Europarl corpus, and it's all open source. I used only 30,000 German words in the actual training of the model. Nevertheless, it works fine for all kinds of tweets that come through.

For technical stuff, I would say it's the same: if you have trained it on Wikipedia or any technical text, you should be fine too. If you are using byte pair encoding, I think you pretty much cover most of it; you can generalize your model to cover technical text as well as tweets as well as news. Time's up? Yep. Thank you. [Applause] [Music]

I can answer this. So how do you translate long German words that you don't really have in your dictionary, is that the question? Okay. The way this works is that BPE will break the word into smaller units, and the model knows how to translate these little parts of the words. So it doesn't really look at the entire word but at the new tokens it gets from the BPE process, and that's why this works. Any more questions?

[Exchange about handing over the microphone.]
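
The answer above, that an unseen long word is broken into units the model already knows, can be illustrated by replaying a list of learned merges on a new word. The merge list below is invented for the example, and the "@@" marker follows the convention of the open-source subword-nmt tool; whether the demo model used exactly this convention is an assumption.

```python
def apply_bpe(word, merges):
    """Split `word` into characters, then replay the learned merges in order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def mark(symbols, separator="@@"):
    """Mark every unit except the last so that the word can be rebuilt later."""
    return [s + separator for s in symbols[:-1]] + symbols[-1:]


def unmark(units, separator="@@"):
    """Undo `mark`: glue marked units back into whole words."""
    return " ".join(units).replace(separator + " ", "")


# Hypothetical merge list; a real one is learned from the training corpus.
merges = [("h", "o"), ("ho", "c"), ("hoc", "h"), ("a", "u"), ("h", "au"), ("hau", "s")]

# "haushoch" is treated here as a word that never appeared during training.
units = mark(apply_bpe("haushoch", merges))
print(units)
print(unmark(units))
```

The model only ever sees the units, so a compound that never occurred in training is still covered as long as its parts did, and removing the markers restores the original word.
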

