“Alignment is All You Need”: Analyzing Cross-Lingual Document Similarity for Domain-Specific Applications

Video in TIB AV-Portal: “Alignment is All You Need”: Analyzing Cross-Lingual Document Similarity for Domain-Specific Applications

Formal Metadata

Title: “Alignment is All You Need”: Analyzing Cross-Lingual Document Similarity for Domain-Specific Applications
Title of Series
License: CC Attribution - NonCommercial - NoDerivatives 3.0 Germany: You are free to use, copy, distribute and transmit the work or content in unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
Production Year

Content Metadata

Subject Area
Cross-lingual text similarity provides an important measure to adjudge the contextual and semantic similarity between documents across different languages. Extraction of similar or aligned multilingual texts would enable efficient approaches for information retrieval and natural language processing applications. However, diversity of linguistic constructs coupled with domain specificity and low resources pose a significant challenge. In this paper, we present a study analyzing the performance of different existing approaches, and show that Word Mover’s Distance on aligned language embedding provides a reliable and cost-effective cross-lingual text similarity measure to tackle evolving domain information, even when compared to advanced machine learning models.
Keywords: cross-lingual document similarity, cross-lingual text alignment
Hello everybody. I'm [unclear], and we at [unclear] work on cross-lingual and multilingual language models and other domain-specific applications in [unclear] and other NLP tasks. Today I'm going to present something against that background: "Alignment is All You Need", analyzing cross-lingual text similarity. This has to do with domain-specific applications and low-resource languages, which I will come to later in the slides. As you know, natural language processing has applications across various domain-specific as well as general settings, like text classification, sentiment analysis, topic modeling, everything we have been talking about in this workshop, like NER and things like that. However, the multilingual aspect has taken on a very important role here, because we need to understand, say, global trends or aggregate information across various sources, and the sources in general will not come from one specific language. So we will need parallel corpus production as well, for machine translation and similar things. Incorporating multilingual documents and understanding their similarity across languages therefore provides a very vital application in general. What we try to understand is whether there is an efficient, unsupervised multilingual text alignment which would enable us to capture semantic similarity in different domain-specific scenarios. As background: in general, previous methods for text similarity used bag-of-words models and TF-IDF measures, and then moved on to the neural word2vec and fastText word embedding techniques. These all try to capture the contextual commonality or the semantic closeness of texts.
I'm not going into details because this is very familiar to the audience here. However, one interesting measure which I would like to focus on is the Word Mover's Distance, which basically works on the optimal transportation problem, or the Earth Mover's Distance to be specific. The idea is fairly simple. It works on the word embedding of each word in a document, across, let's say, documents one and two, and it tries to compute the effort required to transform the distribution of these word embeddings from one document to the other. How much effort is required to transform this distribution is then used as a proxy, or a measure, of the distance, that is, the semantic dissimilarity, between the two documents. As you can see in this small example, "Obama" is more related to "president" than to "band", so the effort needed to transform the representation of "Obama" into the representation of "president" is much smaller than into that of "band". When we solve the optimal transport problem and reach the global optimum, we see that document one is closer to document zero than document two is, and that's the overall idea of Word Mover's Distance. However, the major drawback here is that it works on a common embedding space; that is, it assumes documents one and two are embedded in the same space, which in general fails for multilingual embeddings. And then obviously the state of the art are the contextual embeddings like BERT, XLM-R and things like that, where every token is given a, let's say, dynamic encoding which changes according to the context in which the word is used.
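The Word Mover's Distance idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the talk: the 2-D vectors in `EMB` are made up, and the function name `wmd_uniform` is hypothetical. It assumes both documents have the same number of words with uniform weights; in that special case an optimal transport plan is a one-to-one matching of words, so a brute-force search over permutations gives the exact distance for tiny documents. Real documents need a proper optimal transport solver.

```python
import math
from itertools import permutations

# Hypothetical 2-D "embeddings", invented purely for illustration.
EMB = {
    "obama": (1.0, 0.0), "president": (0.9, 0.1),
    "speaks": (0.0, 1.0), "greets": (0.1, 0.9),
    "media": (0.5, 0.5), "press": (0.5, 0.6),
    "band": (-1.0, 0.0), "plays": (0.0, -1.0), "concert": (-0.5, -0.5),
}

def wmd_uniform(doc_a, doc_b, emb):
    """Exact WMD for two equal-length documents with uniform word weights."""
    if len(doc_a) != len(doc_b):
        raise ValueError("this brute-force sketch needs documents of equal length")
    n = len(doc_a)
    # Cost of moving word a onto word b = Euclidean distance of their vectors.
    cost = [[math.dist(emb[a], emb[b]) for b in doc_b] for a in doc_a]
    # Each word carries mass 1/n, so the cheapest one-to-one matching,
    # averaged over the n words, is the transport cost.
    best = min(sum(cost[i][p[i]] for i in range(n)) for p in permutations(range(n)))
    return best / n

D0 = ["obama", "speaks", "media"]
D1 = ["president", "greets", "press"]
D2 = ["band", "plays", "concert"]

print(wmd_uniform(D0, D1, EMB), wmd_uniform(D0, D2, EMB))
```

With these toy vectors the first distance is the smaller one, mirroring the slide example where the "president" document is closer to the "Obama" document than the "band" document is.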
That actually captures or encodes the sense in which the word has been used. On top of these language models we also have sentence embeddings like SBERT and Universal Sentence Encoder, which enable better semantic representation of short texts or sentences in general, and they come from various kinds of approaches such as distillation models or encoder-decoder techniques. However, the major problem that we see in these language models is that they are not domain-specific. If you take BERT or XLM-R or any such language model, it fails to capture domain-specific information because the training has not been performed on that domain. Hence we have domain-specific language models like BioBERT, SciBERT, etc. Obviously, for low-resource languages this requires language model training or fine-tuning, and the major issue here is that it is highly computationally expensive. If you want to pre-train such a contextual language model from scratch, you need huge resources and GPUs for it to perform well, right? So that is a definite drawback when you have to move into multilingual settings, further complicated by the domain specificity in which we want to use them. So what we studied or explored is how we can move into the space of aligned Word Mover's Distance. There are different techniques in the literature which show that you can align monolingual static embeddings across languages, bringing them together into a common space. What I mean is: let's say X is one language, English, and Y is another language, and they have the same words, but the embeddings are trained in monolingual settings. You see that the distributions of the words are different; they are not overlapping as they should be.
There are GAN architectures, other adversarial approaches, and mathematical transformations and loss functions which specifically address this problem and provide a close overlap or close approximation of these two spaces, such that if two words are similar across the languages, they will eventually overlap in the common shared space. Obviously this is one of the by-products which also comes out of, let's say, training multilingual language models, right? However, it's a chicken-and-egg problem: training the language model is basically the problem, because of the domain specificity and the amount of training required. Monolingual embedding spaces are easier to create. So if we can align the monolingual word embeddings using these different techniques like MUSE, etc., which are cheap and also work with limited data, then on top of that we can combine it with the Word Mover's Distance, as we said. Once two or multiple languages have been moved into the same common shared space, the Word Mover's Distance should work, and we wanted to test this hypothesis. In that regard, you see here documents in different languages. Since "Obama" and "Präsident", which is in German here, have a similar representation in the same space, the documents are aligned well enough that the Word Mover's Distance works on this as well, right? So this provides a, I would say, cheap and compute-efficient way to obtain textual similarity without going through the entire process of expensive training of new language models.
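To make the alignment step concrete, here is a toy sketch of one simple "mathematical transformation": a rotation learned from a small seed dictionary (a 2-D, rotation-only special case of orthogonal Procrustes). It is an assumption for illustration, not the alignment method used in the talk; the vectors, the seed pairs and the names `fit_rotation` and `rotate` are all hypothetical. The German toy space is just the English one rotated by 90 degrees, so a perfect rotation exists.

```python
import math

# Hypothetical monolingual 2-D embeddings, invented for illustration.
EN = {"president": (0.9, 0.1), "speaks": (0.0, 1.0), "press": (0.5, 0.6)}
DE = {"präsident": (-0.1, 0.9), "spricht": (-1.0, 0.0), "presse": (-0.6, 0.5)}

# Small seed dictionary of known translations (source word, target word).
SEED = [("präsident", "president"), ("spricht", "speaks")]

def fit_rotation(src, tgt, pairs):
    """Angle of the rotation that minimises the summed squared distance over the seed pairs."""
    a = sum(src[s][0] * tgt[t][0] + src[s][1] * tgt[t][1] for s, t in pairs)  # dot products
    b = sum(src[s][0] * tgt[t][1] - src[s][1] * tgt[t][0] for s, t in pairs)  # cross products
    return math.atan2(b, a)

def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

THETA = fit_rotation(DE, EN, SEED)
DE_ALIGNED = {word: rotate(vec, THETA) for word, vec in DE.items()}

# "presse" was not in the seed dictionary, yet it now lands next to "press".
print(math.dist(DE["presse"], EN["press"]), math.dist(DE_ALIGNED["presse"], EN["press"]))
```

Once the German vectors are mapped like this, a distance such as Word Mover's Distance can compare an English and a German document directly, because both now live in the same space.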
It also avoids having to fine-tune and incorporate them. Flexibility is there too, because you can easily adjust to the domain dependency of the application by quickly learning a small monolingual embedding right in the domain from a very limited amount of text, aligning them together, and then running the Word Mover's Distance. So this is a very quick and cheap way to do it, and we show that it is actually comparable to the other models. For the experimental results, we did a multi-domain textual similarity evaluation on parallel corpora: we took medical, judicial and religious texts in different languages. Here we consider English, German, Finnish and Romanian. We used fastText aligned embeddings, that is, where individual words across languages were aligned and projected to the same common space, and what we report is precision at 1 and precision at 5, that is, the exact sentence translation should be extracted. The interesting part is the baselines: this is Word Mover's Distance without alignment; this is with another metric, which is called [unclear] distance, which I didn't explain, it's probably for a longer discussion; this is multilingual BERT; this is the sentence transformer, SBERT; and this is what we propose, Word Mover's Distance with alignment. Now if you compare the baselines, SBERT or Sentence-BERT, with this Word Mover's Distance with alignment, it performs nearly equivalently, and sometimes even better under domain specificity or for, let's say, low-resource languages. We went ahead and tested a few more things: if you look at the textual similarity across even more distant languages like Russian, Hebrew and [unclear, possibly Xhosa], even then you see that Word Mover's Distance with alignment performs nearly equivalently to such language models.
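The precision at 1 and precision at 5 figures mentioned above can be computed as in the following sketch. It assumes a square distance matrix in which row i holds the distances from source sentence i to every candidate target sentence, and the correct translation of sentence i sits in column i. The function name `precision_at_k` and the numbers in `DIST` are hypothetical and not taken from the reported results.

```python
def precision_at_k(dist, k):
    """Fraction of source sentences whose true translation is among the k nearest candidates."""
    hits = 0
    for i, row in enumerate(dist):
        # Candidate indices ordered from smallest to largest distance.
        ranked = sorted(range(len(row)), key=lambda j: row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)

# Made-up distances between 3 source sentences (rows) and 3 target sentences (columns).
DIST = [
    [0.1, 0.5, 0.9],  # true translation (column 0) is ranked first
    [0.4, 0.3, 0.2],  # true translation (column 1) is ranked second
    [0.8, 0.2, 0.6],  # true translation (column 2) is ranked second
]

print(precision_at_k(DIST, 1), precision_at_k(DIST, 2))
```

In this toy matrix only one of three sentences retrieves its translation at rank 1, while all three do within the top 2.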
So this provides a very easy way not only to incorporate domain specificity and low-resource languages, but also to understand textual similarity without the bells and whistles of retraining and fine-tuning large language models. But one obvious drawback of this is the English-centered alignment. What I mean is that most word embedding alignment techniques in the literature are more or less English-centric; they tend to align all these spaces into the English space, and that is why, when we do German to Romanian or German to Finnish, there is a slight loss in precision compared to the state of the art. The future work is that if we can have some kind of alignment which is not geared towards English or dependent on English centricity, then probably we can have much better results. So, coming to the end of my talk: what we propose is that Word Mover's Distance on the aligned embedding space can enable very accurate and compute-efficient semantic similarity measures, and we show that it is generalizable across multiple domains as well as low-resource languages in general. So that's what I have: alignment is all you need for computing semantic measures in general. Thank you. I'm open to questions now.