Language Processing Pipelines Part 1

Video in TIB AV-Portal: Language Processing Pipelines Part 1

Formal Metadata

Title
Language Processing Pipelines Part 1
Alternative Title
Language Processing Pipelines for Knowledge Extraction in Multilingual Context
Title of Series
Part Number
1
Number of Parts
3
Author
License
No Open Access License:
German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Identifiers
Publisher
Release Date
2019
Language
English
Production Year
2019
Production Place
Dubrovnik, Croatia
OK. I'd like to start with a video clip. What is important in this video clip is the sound that you can hear.

[Video clip plays.]
I hope you know where this is taken from. Have you seen the movie? Which one is it? Stanley Kubrick, yes: 2001: A Space Odyssey. The movie is from 1968. You have heard a dialogue, a conversation between a man and a machine. Is such a dialogue possible today? We are in 2019. Let me try to understand and analyze what levels of processing are needed within this dialogue, what kind of processing and at which levels. Look at this: HAL is an artificial agent which is capable of applying very advanced NLP and AI techniques. So we are talking not only about speech and language production and understanding, hearing and speaking, and not just about speech but also about processing of the language which is behind this speech: recognition and generation, so both ways, then understanding of the messages, and information retrieval and extraction. Of course, behind all that is reasoning; here we go into AI. In visual processing you have the paralinguistic systems that people use when they talk face to face. Like Italians, I use my hands and arms very much when I talk; all people from the Mediterranean do that, while Germans would probably give talks, you know, like this, or maybe just like that. Of course, these are paralinguistic systems of signs that are used in the overall building of the whole meaning of utterances and of the communication that we are involved in. And these are not only single communication acts like one utterance or one paragraph or sequence. It is the whole complex speech act which is called a lecture, and you also have a general knowledge frame of how lectures usually look: you have someone who is giving the lecture and someone who is receiving the lecture, or the content of the lecture.
And then HAL is also able to process visual signals: lip reading, that is what he could do. Should we use the pronoun "he" or "it" in this case? Now we are at the threshold here, and I'm not advocating any of that, just to be sure. But I'm using this example as an illustration; I'm trying to illustrate where we are today with the capabilities to process such complex linguistic, or language, uses and all that is behind language: reasoning, knowledge retrieval and so on. So was the author, Clarke, too optimistic with 2001? Definitely. From '68, 2001 seemed far away; in '68 I was five, and I was thinking about what 2001 would be like. And, you know, HAL: what if you shift each of the letters one to the right? IBM, yes, right. And we will see about IBM later. I didn't get that point. So.
What do we need for such a dialogue today? Well, we have some parts of all these levels already sorted out. We have some parts that are very useful, applied and used on the ground, predominantly for English, but of course there are other languages in the world. Of course, people from data science would say, "There's no data like more data." Of course Google says more: just grab everything you can find. That's OK, but the big data revolution started only with the rise of what has been called, in information science and in computing, unstructured data. When computing guys tell you something about unstructured data, that means these data are not stored in a database format, in a database table, so there are no records and fields and so on. Everything else besides that, anything which doesn't have a predefined structure, is called unstructured data. So texts are considered unstructured data, and for us linguists this is horrible to hear, because for us linguists there is a lot of structure in text. Not just the text structure itself; I mean, texts as documents are structured: they have headings, they have paragraphs, they have a structure of parts, which can also be checked for whether it complies with some kind of document definition. You have markup languages and document type definitions, which are actually formal grammars of how a text structure should look, or what kind of structure a text should have in order to comply with a predefined document type. So a collection of poems is a different text type, a different document type, than a legal document, but they both have specific text structures, right? But there is also the structure of the language expressions and units used in a text, and this is where we actually try to [unclear] the whole story, because this is one of, I would say,
the most important inventions of mankind: to use an abstract system of signs to convey messages and knowledge. It is on the same level as inventing the wheel; as a matter of fact, that is what made us humans, this ability to use a completely abstract system. That actually developed all our thinking; we had to internalize and develop the whole system of concepts, and that is what made us as intelligent as we are. Or are we? Sometimes I ask myself whether the human race is intelligent enough, but so far we have managed to survive, although there were very many threats, from ourselves, of course. I'm not talking about asteroids or comets; we are not dinosaurs yet, but it may happen that we become like them. For instance, let me just give you an example pulled from document retrieval. You know the famous TF-IDF measure that has been used to measure a document's relevance to a given query within a document collection. Very soon, all the possible parameters and variations of these calculations actually hit a ceiling, and the guys doing document retrieval said, "All right, but why don't we try to use the additional structure that already exists in the documents, either their own structure or even the structure of the language which has been used to produce these documents, the texts?" Only then did document retrieval make a new leap, so it was then possible to get better results, better F-measures, at the end. And so this is just... I was told, when we were organizing this first learning week, that I had to give some kind of introductory lecture about language technologies and computational linguistics.
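As a hedged illustration of the TF-IDF measure mentioned above, here is a minimal sketch of one common textbook variant (relative term frequency multiplied by the logarithm of inverse document frequency). The function name `tf_idf_scores`, the whitespace tokenization and the toy documents are assumptions made for this example; they are not taken from the lecture.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue  # a term absent from the document contributes nothing
            tf = counts[term] / len(tokens)      # relative term frequency
            idf = math.log(n_docs / df[term])    # rarer terms weigh more
            score += tf * idf
        scores.append(score)
    return scores


# Toy collection, invented for illustration only.
docs = [
    "language technologies process language data",
    "nuclear technology boils water",
    "chemical technology makes products",
]
scores = tf_idf_scores("language data", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Note that this scoring sees each document only as a bag of words. That is exactly the limitation described above: nothing in it uses document structure or linguistic structure.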
Or NLP, whatever you want to call it, and I'll show you very quickly that there are no real differences, in effect. At that time we didn't even know who would get employed, who would become an ESR, so I don't exactly know your background. How many of you have attended such courses in NLP? One, two, three, four. So not all of you. That said, I will have to apologize a little bit to those who attended any such courses: you'll probably hear some things which are familiar to you. But I hope that I will be able to shed some new light on them, or approach them from a different direction. And of course, someone already asked me a question [unclear]. Let me just go from the start, from the basic concepts. Being a linguist and a computational linguist, I start with computational linguistics. You will hear a lot of Latin from me [unclear]. On the other hand, you have something that is called natural language processing. Please do not confuse that with neuro-linguistic programming. It's a completely different set of activities. Yes, I mean neuro-linguistic programming. Neuro-linguistic programming is being sold today to PR people, and they pay, you know, thousands of euros for courses in neuro-linguistic programming. What's interesting is probably the programming: they promise you that after the course you'll be able to manipulate the people with whom you're talking, so that by careful selection of lexical choices and phrases they will do whatever you tell them to do. It sounds scary, but you know what? We have had this for thousands of years already in our culture, and you know what it was called back in Greece? Rhetoric. It is the same thing, just in a new package, and you sell the package, of course.
So, not neuro-linguistic programming: natural language processing. And by the way, this is the older acronym. Then, on top of that, you have something called language technologies. What's that? Right, let's start with the first two. The difference between computational linguistics and natural language processing is whether you start from linguistics or from IT. Actually, it is the same field, but you simply approach that same field from two different directions, from two different sciences. If you start from linguistics, then you would call it computational linguistics, and you would be interested in how to use computers in linguistic description: how to make models. Your aim would be a better description of language facts. That's what linguists do: they try to describe language, not one but many languages, and if you can use computers for that, excellent, why don't we use them? And we do. But if you are an informatician or an IT person, then you would call this natural language processing, and that means that you use computers in processing language data. It is only one type of text processing, with a specific kind of data; language data have their own particular characteristics, let's put it this way. Your aim would be to process as much language data as efficiently as possible, with as few computational resources as possible. I mean, that's the general aim of any information processing, but then you have problems with this specific type of data, which is language. Right, so the ideal thing is to have both kinds of people in the research team, linguists and IT people, informaticians. But the real ideal would be to have that in one and the same person, provided that you don't end up with a split personality.
I have a colleague who will actually be a mentor to Diego. He is both: your mentor is a linguist and an information scientist; he graduated in both. And he is from Split, so he really is a split personality, a proper split personality. OK, so the next concept is language technologies. Why do we introduce that on top of these two things?
Well, I tell my students in linguistics, and you will hear it a lot of times, that linguistics is unique among the humanities. Yes, that's absolutely true, and why? Because in linguistics some of the research methods are exactly like the research methods in the natural sciences. So we can measure our object, we can, you know, weigh it, we can put it under statistics and everything, which you cannot really do elsewhere unless you come up with something very complicated and with very dubious conclusions. On the other hand, there is the use of scientific knowledge in making products: linguistics can use its knowledge to make products. There are a number of commercial products in language technologies that exist today; sometimes you are not even aware that you use them all the time. Do you know what T9 is? All of you use your mobiles, and when you type text messages you have a system that predicts the ending of the word. Yes, that's T9, and that's one of the products of language technologies: how your mobile phone knows what you want to write. In the beginning it was just, you know, plain statistics, probability, but now you have adaptive systems: they track your own idiolect, your own variety of the language that you use. The system is a network with weighted edges. I mean, just as we train our own brains, you have some connections which are facilitated and some connections which are suppressed, so in your mobile phone that system is adapting to your own usage. That's one of the cases. The plain old spelling checker is a prime product of language technologies, right? So how do we define what language technologies are? Take a plain definition of technology from a lexicon, and this is a Croatian lexicon from '96: technology is a set of methods and procedures for processing raw matter into products. And everything is clear when we talk about chemical technology.
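The move from "plain statistics" to a predictor that adapts to the user's own usage, described above, can be sketched roughly as follows. This is not the actual T9 algorithm (which also maps keypad digits to letters), and it replaces the weighted network with a simple table of word weights. The class name `AdaptiveCompleter`, its methods and the toy counts are all hypothetical, chosen only for illustration.

```python
from collections import Counter


class AdaptiveCompleter:
    """Prefix-based word completion whose weights adapt to observed usage."""

    def __init__(self, base_counts):
        # Start from general-language frequencies ("plain statistics").
        self.counts = Counter(base_counts)

    def complete(self, prefix):
        """Return the highest-weighted known word starting with prefix."""
        candidates = [w for w in self.counts if w.startswith(prefix)]
        if not candidates:
            return None
        # Highest weight wins; ties are broken alphabetically for determinism.
        return min(candidates, key=lambda w: (-self.counts[w], w))

    def observe(self, word, weight=1):
        """Strengthen a word each time this particular user types it."""
        self.counts[word] += weight


# Toy general-language counts, invented for illustration.
completer = AdaptiveCompleter({"language": 5, "lantern": 2, "lexicon": 3})
before = completer.complete("lan")   # prediction from general statistics

for _ in range(4):                   # this user keeps typing "lantern"
    completer.observe("lantern")
after = completer.complete("lan")    # prediction after adapting to the user

print(before, after)
```

Here frequently used words are "facilitated" simply by increasing their weight; a fuller model would also decay, or "suppress", words the user never chooses.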
You take sulfur dioxide, SO2, right, and you take water, and what do you do? You combine them. Right, you have the raw matter: you have H2O and SO2. This is seventh grade of elementary school. What do you get? Sulfuric acid: you get H2SO4, and you have a catalyst involved, which is platinum, if you remember that process from when you were, what, eleven, twelve, thirteen, roughly. So you know what the raw matter is: the gas and the water. You know the procedure: you have to combine the gas and the water, and you have the catalyst, platinum. You get sulfuric acid, you put it in glass bottles, put a label on them with the skull and crossbones, and sell it on the market. Fine. So you know what the raw matter is, you know what the product is, you know the procedure. That's what technology is, chemical technology, right? With nuclear technology it's a little bit more complicated, but you have the same thing: you have to mine for uranium, then you have to enrich it, then you put it in sticks or balls or whatever, and then you can use it to get destructive energy or constructive energy, let's put it this way. OK, which is not a really efficient way to use it; I mean, you use nuclear energy to boil water, actually, and then the steam runs the generators. OK, nevertheless, you more or less know the raw matter and the final product. But what is the raw matter in the case of language technologies, and what is the product in the case of language technologies? Do you think this is a rhetorical question, or do any of you have answers? The one thing I have to warn you about is that language technologies today completely depend on IT, on information technology, because today communication technology
is completely dependent on IT as well. It was a long time ago that, in the telephone centrals, you needed operators to switch the telephone lines manually. Now it's done by computer; actually, with a computer like that you can cover the fixed phone numbers of a whole smaller city. So, if language technologies depend on IT, the raw matter in this case is language data: data about language that are already in digital form. And products? Well, there is no simple way to define them, but products are systems that enable us to use language in a computational environment in a simple way, or a simpler way than before. And when I say computational environment, I'm not thinking only about computers; I'm thinking about the mobiles that we use every day, because these are computers and nothing else. Now we're waiting for a new operating system for the iPhone; if it has an operating system, then it's a computer, no doubt. And they screw up your iPhone whenever it gets a new version, but forget that. OK, so these are language technologies, and there are whole markets out there; people make money, they make a living out of language technologies. And so how do we divide
language technologies? What do they consist of? The first thing is, of course, language resources: this is where language data are stored, or where we get language data from. Language resources are either raw text or structured text. Of course, you have structure in raw text as well, but if you have structured text, that means that you have made explicit the structure that we as humans derive because we know the language. Yes? [Audience question.] Right, yes, that's true, and those are called speech technologies. At this moment I will not talk about speech technologies any more, because that is a separate line of research, which is very useful and much needed for many languages, in their spoken form of course. But I will now concentrate more on language. I'm also a phonetician, so I'm well aware of the problems of speech processing, either automatic speech recognition or text-to-speech systems that actually speak out the text that is written. But let's put it this way: speech technologies, at least in the theoretical sense, originated from the phonetics departments in humanities faculties. They tried to make a computational model of phonemes and so on, and they got stuck. Then people from computing came to the rescue by introducing a different processing methodology, and this made a burst, actually big advancements, in speech processing, because it was treated in a different way. This example very clearly shows that humanities people and computing people should collaborate, and only in this way do you get further in things.
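To make the contrast between raw text and text whose structure has been made explicit more concrete, here is a minimal sketch showing the same sentence as raw text, as inline markup, and as a fixed token-per-line column layout. The element and attribute names (`s`, `w`, `pos`, `lemma`) and the three-column layout are assumptions invented for this example; they do not reproduce any particular standard.

```python
import xml.etree.ElementTree as ET

# Raw text: the structure is only implicit, recoverable by someone who knows the language.
raw = "Languages have structure"

# Inline markup: explicit "beginning of item" / "end of item" tags.
# Element and attribute names are hypothetical.
marked_up = (
    "<s>"
    '<w pos="NOUN" lemma="language">Languages</w> '
    '<w pos="VERB" lemma="have">have</w> '
    '<w pos="NOUN" lemma="structure">structure</w>'
    "</s>"
)
sentence = ET.fromstring(marked_up)
from_markup = [(w.text, w.get("lemma"), w.get("pos")) for w in sentence.iter("w")]

# Predefined fixed format: one token per line, tab-separated columns
# (form, lemma, part of speech), again an invented layout.
columns = "Languages\tlanguage\tNOUN\nhave\thave\tVERB\nstructure\tstructure\tNOUN"
from_columns = [tuple(line.split("\t")) for line in columns.splitlines()]

# Both encodings carry the same explicit information ...
print(from_markup == from_columns)
# ... and the raw text can be recovered from the word forms alone.
print(" ".join(form for form, _, _ in from_columns) == raw)
```

Either encoding makes the same annotations explicit. The choice between markup and a fixed column format is mostly a matter of tooling and convention.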
Good. So a language resource is either raw text or structured. You can make the structure explicit with markup languages, where you explicitly say "this is the beginning of an item" and "this is the end of an item", which is the normal SGML, XML or HTML type of marking. Or you have a predefined fixed format; in a moment you will see the CoNLL format or something similar to that, a predefined fixed format. Language data come in two forms, mainly. Corpora: these are text and document collections, or streams, as Marko pointed out yesterday, so you don't have to collect the text and store it in your cloud or on your hard disk. If you can simply process the stream, then you don't need the original text, and that's the whole idea. It only depends on what you want to do with it. If you can process it fast enough that you can tolerate the speed of new text items arriving, then it's perfectly OK to process streams. Or you can have lexica: these are dictionaries, lexical or terminological collections, databases, in whatever form you store them. It is differently structured raw material, or language data. So the first part of language technologies are language resources. The next one is language tools: these are programs or systems that process language resources. Here you have the stratification into different language levels that is present in linguistic descriptions. So we talk about processing at the level of morphology, processing at the level of syntax, and at the level of semantics, and each level actually has a higher degree of complexity. The easiest thing to process is phonology; I didn't even put it here, because the number of phonemes in each language is fixed, and that holds for all the world's languages, which number, what, seven thousand languages, something like that,
Six and a half to seven thousand, because you have the issue of whether something is a dialect or a language. It is like in mathematics: you cannot define a set in mathematics, it's an axiom. In linguistics you cannot define what a language is from within linguistics.
You need to take it from somewhere else. There's the famous definition from a Yiddish linguist, it was Weinreich, who said that a language is a dialect with an army and a navy behind it. If you think a little bit about that, it's not too foolish to put it this way. But then I discussed that with one of the authors of the British National Corpus from Oxford, Lou Burnard, and he said that a language is a dialect with at least a hundred-million-word corpus behind it. That's a definition that corpus linguists could use, but it's also not a very foolish way to put it. If you have a linguistic community that is aware enough to invest, and it's a huge investment of manpower and money to build a hundred-million-word corpus, this linguistic community can function on its own, that's for sure. That holds particularly in this time, in the twenty-first century, when we talk about the digital divide and all the other stuff, so languages without language technologies built for them will be practically [unclear] languages. And then of course you can use more complex tools that combine several levels. So these are different types of tools. Then you have something we can call language services. It's just another way to access the language resources and language tools online. They are usually organized as web services, and they can be used by humans or by computers, so you can have APIs. Marko was showing some traces of that yesterday, but you'll go deeper into that in the hands-on sessions with Event Registry. And on top, at the end, the fourth part of language technologies are products. They may be commercial products, so you can sell them on the market, or they could be
handed out for free. Just remember all the types of checkers: a grammar checker, a style checker, a spelling checker. Or digital dictionaries: you can have a plethora of digital dictionaries online. The problem is of course that you don't exactly know at first sight whether they have been done by professional lexicographers or by amateurs, amateurs in the best possible sense, the ones who love the subject, not the ones who are simply not professionals. You don't have to be a professional lexicographer. Or you can have automatic indexing systems, systems that attach keywords, characteristic terms, descriptors and so on to a document, or at the sub-document level. Or summarization systems. Do you know that in Microsoft Word you have a very good summarization system for English? If you have never used it, please try to find it in the menus. You will see there is an automatic summarizing system in Microsoft Word which works very well for English. Surprising. And of course, as you said, there are text-to-speech and ASR systems, or machine translation systems, or machine-aided translation systems, which is not the same thing. And computer-aided language learning systems. These are much more complicated products, but you can go into a bookstore and buy a CD, Teach Yourself French, OK, and that's a computer-aided language learning system, which usually is very carefully designed and helps you learn the language. Forget the advertisements on your web pages, "learn the language in two weeks". Trust me, it doesn't work like that. You can be a genius linguist, but in two weeks, no. OK, so let's start with something simpler.
Let's start with language resources, and we will mostly concentrate on corpus annotation. Corpus linguistics introduced corpus annotation at different language levels, and an equivalent term in use is text annotation. So what do we gain by using annotation? Annotation means adding interpretations to existing language units. And as you know, every interpretation has errors, so there's no such thing as one hundred percent accurate annotation. Forget that. Not even people can agree, very often, whether this is an adverb or an adjective, not just in English but in quite a number of languages. But once you add this inherently linguistic information, you turn it into explicit data, and explicit data you can search for. A computer does not know that the word at this position in the corpus is actually accusative plural feminine unless you attach this information to the token and say: this is accusative plural feminine. Then you can search: I want all nouns that appear in the accusative plural feminine. The usability of a text grows with the amount of annotation added to it. So you have to be very careful how you add annotation. You can of course overgenerate things, and then you lose processing time and efficiency, so you have to be very careful about it.
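To make the point about explicit data concrete, here is a minimal sketch of such a search, assuming a tiny hand-annotated corpus. The `Token` class, the attribute names and the Croatian toy sentence are illustrative assumptions, not material from the lecture slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str    # word form as it appears in the text
    pos: str     # part of speech
    case: str    # grammatical case, "-" when not applicable
    number: str
    gender: str

# Hypothetical toy corpus: "Vidim lijepe knjige i stolove".
corpus = [
    Token("vidim", "verb", "-", "singular", "-"),
    Token("lijepe", "adjective", "accusative", "plural", "feminine"),
    Token("knjige", "noun", "accusative", "plural", "feminine"),
    Token("i", "conjunction", "-", "-", "-"),
    Token("stolove", "noun", "accusative", "plural", "masculine"),
]

def find(tokens, **wanted):
    """Return the tokens whose annotation matches every requested attribute."""
    return [t for t in tokens
            if all(getattr(t, key) == value for key, value in wanted.items())]

# "I want all nouns that appear in the accusative plural feminine."
hits = find(corpus, pos="noun", case="accusative", number="plural", gender="feminine")
print([t.form for t in hits])  # ['knjige']
```

Without the attached attributes the query could only match surface strings; with them it selects by grammatical description.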
So annotations can be stored in two different ways. They can be either embedded in the text, which means intermixed with it, or they can be the so-called stand-off annotation. This means that the annotation is separated from the text and linked to it either by location or by pointer; those are the two ways to link the stand-off annotation marks to the language units that you annotate. But of course you also very often need non-linguistic annotation. That means making elements of the text structure explicitly marked, things like titles, lists and so on. And why would we need that? Well, we know intuitively that the usage of words in titles is different from the usage of the same words in the body of the text. The titles are selling the text. So the very same word can be used metaphorically in the title, and quite a number of times you get attracted by the title, and then you read the article and say: well, nothing has been said inside this article. So how do these two combine? That's the whole idea. Yes, you have to buy something, and if not buy, then at least see an advertisement on the screen, and someone will earn money. But linguistic annotation takes language units into account, and linguistic annotation starts at the moment you take linguistic units into account. The first step is segmentation of the text stream into linguistic units, and these are sentences, then clauses, and then words, actually tokens. Once you have done the segmentation, you can start with annotation, or tagging. Tagging means adding a linguistic description to already segmented language units, and that depends on the language level at which you are doing the processing. So there will be different annotations at the level of morphology, different annotations at the syntactic level, different annotations at the semantic level, and so forth.
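The difference between the two storage options can be sketched in a few lines. This is only an illustration under stated assumptions: the sample sentence, the dictionary layout with `start`/`end` character offsets, and the `<w pos="...">` element used for the embedded variant are all invented for the example.

```python
text = "Language tools process language resources."

# Stand-off layer: the text itself stays untouched, and every annotation
# points into it by location (start and end character offsets).
standoff = [
    {"start": 0, "end": 8, "pos": "NOUN"},
    {"start": 9, "end": 14, "pos": "NOUN"},
    {"start": 15, "end": 22, "pos": "VERB"},
    {"start": 23, "end": 31, "pos": "NOUN"},
    {"start": 32, "end": 41, "pos": "NOUN"},
]

def resolve(text, annotations):
    """Follow each stand-off link back to the language unit it annotates."""
    return [(text[a["start"]:a["end"]], a["pos"]) for a in annotations]

def embed(text, annotations):
    """Produce the embedded variant: annotation intermixed with the text."""
    out, cursor = [], 0
    for a in sorted(annotations, key=lambda a: a["start"]):
        out.append(text[cursor:a["start"]])  # untouched text between units
        out.append(f'<w pos="{a["pos"]}">{text[a["start"]:a["end"]]}</w>')
        cursor = a["end"]
    out.append(text[cursor:])
    return "".join(out)

print(resolve(text, standoff))
print(embed(text, standoff))
```

One practical consequence of the stand-off design is that several annotation layers can point into the same unchanged text without interfering with each other.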
OK, now let's start with the lowest level, which we call morphology. One of the basic processing tasks is part-of-speech tagging, and this means adding information about the part of speech to each token in the corpus, in a text. For educated humans this is more or less trivial, but is it? Well, the problem with English, for instance, is that the same word form can actually be a noun, a verb or an adjective, and you really need to see it in the co-text. I'm clearly differentiating between co-text and context. Co-text is the linguistic surroundings and context is the extra-linguistic surroundings. So when we talk about context, we are talking about utterances, and we need all the paralinguistic information, even the background knowledge, that we need to understand the sentence. But at the moment I'm talking only about co-text. So you need to see, for instance, the word "cut" in its co-text, because it could be a noun, a verb or an adjective. And of course for a language you have to derive a tagset, which is the list of possible part-of-speech tags in a given language. The structure and cardinality of the tagset depend on the complexity of the language and on the tagset design, so it's up to you. You can say: OK, I need only ten basic categories, basic parts of speech, and I don't care about anything else. This is a very coarse tagset, and of course your precision will be very high with such a tagset. No, the other way around: your recall will be very high, but the precision may be shaky. On the other hand, you have languages with very rich morphology. Actually quite a number of Indo-European languages have kept this inflectional morphology.
English has lost it. I mean, what kind of nominal inflection do you have in English? You have singular, you have plural, you have the Saxon genitive singular and the Saxon genitive plural. That's it: four word forms for a noun. And if you take, for instance, Slovenian and Croatian, you have seven cases in the plural and seven cases in the singular of course, and in Slovenian you have an additional value for the category of number. In Slovenian you have singular, dual and plural, and the dual appears completely regularly in all inflectional forms. So if you want to say in Slovenian "I was going", "the two of us were going" and "we were going", for more than two, you have completely different word forms. So you see automatically that the number of possible tags gets multiplied by the number of categories that you want to mark with the tag, across the inflectional variety that a certain word can have. No, it's both. You have a different form for the singular, the dual and the plural, and you have different verb endings as well. So the value for dual is present in all the inflectional paradigms: in the noun paradigm, in the pronoun paradigm and in the verbal paradigm as well. No, it's not just different endings. There is singular and plural, but you have an additional value in the verbal paradigm too. And we think that with the seven cases we have we are very complicated. Now, can you imagine the Hungarians and the Finnish people? They have fifteen cases. Fifteen.
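The multiplication effect can be shown with simple arithmetic. This is a rough sketch, not a real tagset: the three-gender value is an assumption added for illustration, and the product is only an upper bound, since actual tagsets do not use every combination.

```python
from math import prod

# Number of values per category for a noun. The seven cases and the
# three-valued number (singular, dual, plural) follow the lecture;
# the three genders are an assumption made for this illustration.
with_dual = {"case": 7, "number": 3, "gender": 3}
without_dual = {"case": 7, "number": 2, "gender": 3}

def max_tags(categories):
    """Upper bound on distinct tags: the product of the category sizes."""
    return prod(categories.values())

print(max_tags(with_dual))     # 63
print(max_tags(without_dual))  # 42
```

Adding one extra value to one category raises the bound from 42 to 63 noun tags, which is the growth the lecture describes when the dual is marked.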
And then you have the gender problem. You have three genders in Slavic languages, all around. In English you forgot that, you don't have it any more. Elsewhere you have animate and inanimate, which is a whole different set of categories. The categories that you mark with inflectional endings, you have to take them into account when building the tagset. I've listened to Corbett, a famous English linguist working on morphology, who found the language that has the widest variety of possible inflectional forms. It's called Archi, and it is spoken in Dagestan. It can have, completely regularly, a million and a half different word forms for each verb. And the kids, you know, three-year-old kids, they master that. Because on the verb they mark not just gender and tense and mood and aspect, they also mark the gender of the possessor and the gender of what is possessed, things like that. This is what we mark by different lexical choices: we choose "yours" and "mine", different lexical items, and they do it in inflection. So it's just the way different languages convey information. So when you try to build a tagset, you have a number of problems. There are very nice initiatives for that: there is Google's universal part-of-speech tagset, and there is also another initiative for a universal tagset. So that is something you could go and investigate, whether it can be used, because these tagsets have been designed in order to cover as much as possible.
And there's always a trade-off, so you lose something and you gain something. Tagsets like these have been designed in order to facilitate further processing of under-resourced languages. This is the politically correct term for languages that do not have enough language technologies built yet. Taggers, of course, are programs that provide automatic tagging. For English today you have precision of almost, or a little bit over, ninety-nine percent; certain taggers are getting as high as that. This is an example of in-line tagging. You see that each token has an underscore and then the tag added, and the basis is from the Brown Corpus, and there is the legend.
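The in-line format, a token followed by an underscore and its tag, is easy to read back into pairs. The sketch below assumes whitespace-separated items; the sample sentence and its tag names are illustrative, written in the style of the NN1 and NN2 tags named in the legend, and are not the slide content.

```python
def parse_inline(tagged):
    """Split 'form_TAG form_TAG ...' into (form, tag) pairs.

    rpartition splits on the last underscore, so a word form that
    itself contains an underscore is still handled.
    """
    pairs = []
    for item in tagged.split():
        form, _, tag = item.rpartition("_")
        pairs.append((form, tag))
    return pairs

# Hypothetical sample line.
sample = "The_AT0 dogs_NN2 chased_VVD a_AT0 cat_NN1 ._PUN"

pairs = parse_inline(sample)
plural_nouns = [form for form, tag in pairs if tag == "NN2"]
print(pairs)
print(plural_nouns)  # ['dogs']
```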
NN2 means plural common noun, NN1 means singular common noun, and there is a tag for the genitive of a proper noun, and so on. So you need to have an explanation of the tags. For English this tagset has something like seventy tags, and that's it, so the cardinality of the tagset will be about seventy. For Slovenian you need something like a thousand six hundred different tags. For Croatian you need a little bit less, about a thousand four hundred, because we don't use the dual. In Russian it will also be somewhere above a thousand, something like that. So Slavic languages have pretty large complexity at the level of morphology, morphological processing, inflectional processing. There's another level in morphological processing that I didn't cover in my slides, and that's derivational processing. Derivational morphology is very common, as you know, not so much for Slavic languages, but particularly for German, because in German you can glue together five, six, seven lexical morphemes and you get very precise information. I saw it written on a ski lift. In English it's something like "put down your safety bar", in Italian it was something similar, in Croatian as well, and in French and the other languages it's like three or four words, but in German it was one single word, as long as a sausage. But it's very precise. This is why they say that the only language for doing philosophy is German, of course, nothing else. And every philosophy in French or Italian, can you think in these languages? You can do pillow talk in these languages, but not philosophy. OK, the next task is lemmatization. That's adding information about the lemma to each token in the corpus. What is a lemma?
So you have a text here, and you have an analysis of this text with all the lemmas selected. That means that each and every word can appear in a text in different word forms, and we want to collect all these word forms together, and this will be useful for statistics. By the way, in this way you avoid the data sparseness problem. So you have this lemma here, and it can appear in different word forms, as you see here, so you need to collect all of these together in order to get the overall frequency of the lemma, not just of the different types.
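Collecting word forms under their lemma can be sketched with two counters. The Croatian word forms and the lemma column below are a hypothetical toy sample; in practice the lemma would come from a lemmatizer rather than being written by hand.

```python
from collections import Counter

# (word form, lemma) pairs; the lemma column is assumed to be given.
tokens = [
    ("knjiga", "knjiga"),
    ("knjige", "knjiga"),
    ("knjigu", "knjiga"),
    ("knjige", "knjiga"),
    ("stol", "stol"),
    ("stolove", "stol"),
]

type_freq = Counter(form for form, _ in tokens)     # counts per word form
lemma_freq = Counter(lemma for _, lemma in tokens)  # counts per lemma

print(type_freq)   # the evidence for one word is spread over several types
print(lemma_freq)  # Counter({'knjiga': 4, 'stol': 2})
```

Counting by type gives five sparse entries with frequencies of 1 or 2, while counting by lemma concentrates the same six tokens into two entries.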
A lemma is what we call a basic word form. In most languages it's the same as the headword in a dictionary, but not always; it doesn't have to be. In Latin you take the first person singular of the present indicative active, "amo", and then the infinitive, "amare". That's the headword in Latin dictionaries, so "amo" is actually the lemma. In different languages you have different conventions. In Slavic languages it's common to take the infinitive as the lemma. In other cases you take the nominative singular, for nouns and so on. For adjectives it's a bit more complicated, depending on how many categories you mark on the adjective, but usually you take the nominative singular, positive, masculine, and whether it is the definite or the indefinite form depends on the traditions of the different languages. So lemmas are very important when you want to computationally process inflectionally rich languages, because there you have the problem I have just mentioned, data sparseness, and lemmas help you build much more condensed indices. For instance, you can index a text by using lemmas instead of types, and this automatically helps your query systems, so that people can search in a more natural way. The speakers of inflectionally rich languages have been taught, from the time they started going to school, that the nominative singular is the form you have to remember. And yet everyone else, or at least computers, forget that words can appear in different word forms, and Google does not know that. You, as a speaker of an inflectionally rich language, put in the lemma, and you expect from google.si for Slovenian or google.hr for Croatian that when you put in the nominative singular of a noun, it will be able to find it in all the other cases. But no, it doesn't know how to do it, and that's a problem.
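The idea of indexing by lemma instead of by type can be sketched as follows. Everything here is an assumption for illustration: the three toy documents, the small `LEMMAS` lookup table standing in for a real lemmatizer, and the function name `build_index`.

```python
from collections import defaultdict

# Toy lemma dictionary (hypothetical); unknown forms are kept as they are.
LEMMAS = {
    "knjiga": "knjiga", "knjige": "knjiga", "knjigu": "knjiga",
    "stol": "stol", "stolu": "stol",
}

docs = {
    1: "kupio sam knjigu",
    2: "knjige su na stolu",
    3: "stol je nov",
}

def build_index(docs, use_lemmas):
    """Inverted index: key -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for form in text.lower().split():
            key = LEMMAS.get(form, form) if use_lemmas else form
            index[key].add(doc_id)
    return index

form_index = build_index(docs, use_lemmas=False)
lemma_index = build_index(docs, use_lemmas=True)

# The user types the nominative singular, as speakers have been taught.
print(sorted(form_index.get("knjiga", set())))   # []
print(sorted(lemma_index.get("knjiga", set())))  # [1, 2]
```

The form-based index misses both documents because the nominative singular never occurs in them, while the lemma-based index finds the inflected forms.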
So precision is about ninety-eight point five today; the best systems provide this kind of result. So, as you see, there's no hundred percent annotation done automatically. You can do it manually, but trust me, don't go for manual annotation, except for very limited or small resources. There, yes, you have to do it manually, you have to check it, you have to correct it and so on, and then you use that as a training set. That's what to do. And in these procedures, part-of-speech tagging, lemmatization and, as you'll see later, morphosyntactic description, there's always the problem of disambiguation. You need to resolve the homography. The homography in some languages is more than fifty-five percent, which is absolutely crazy. You ask how in the world that works. This is above the natural redundancy of a natural language. How come we understand each other anyway, if the homography is more than fifty-five percent? Yes, but we have the co-text, that's the point. We have this hidden Markov model running in our heads all the time. It's not just trigrams, tetragrams and so on, but we have learned that, we have internalized that through our schooling and so on. Here is an example from Slovenian. The Slovenian guys are not here, but you can convey to Marko that I've been using a lot of Slovenian examples.
So you see, the problem is clear. You read this as a verticalized corpus, so each token is on a separate line. So this is a token, and it can actually belong to three different lemmas, and that's what homography is. In this case this is the proper one. And then you also have the tags here, which means it could be an adjective or a participle or [unclear]. The following task, which is a more complicated version of part-of-speech tagging, is called morphosyntactic description. That's adding morphosyntactic information to each token in the corpus. Previously we had only the part of speech, so we were just trying to recognize whether something is a noun, a verb, an adjective or something else. But this comes after that. If we say this is a noun, then we want additional grammatical categories recognized: this is a noun, but it's also in the nominative, it's also feminine in gender, it's also in the plural and so on. This additional information is extremely important if you want to do parsing later. You cannot do a proper syntactic analysis unless you have completed the inflectional analysis. Why? Because in inflectionally rich languages, syntactic relations are encoded by inflectional endings. Nominatives are usually the subjects, accusatives are usually direct objects, datives are usually indirect objects, instrumentals are usually instruments, of course, or some adverbial constructions. And, as you should see later in the hands-on, this can then be mapped to semantic roles as well. I've said it several times, as a kind of radical statement, that we could actually skip the syntactic processing. In some languages you can map directly from morphological, inflectional word forms to semantic roles and get rid of all the parsing problems, trees, forests of trees and so on.
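The default mapping from case to syntactic function mentioned here can be written down as a lookup table. This is a deliberately naive sketch: the lecture says "usually", so the table is a heuristic, and the Croatian example sentence and its annotation are assumptions made for the illustration.

```python
# Default mapping stated in the lecture; real texts have many exceptions.
CASE_TO_ROLE = {
    "nominative": "subject",
    "accusative": "direct object",
    "dative": "indirect object",
    "instrumental": "instrument",
}

# (word form, part of speech, case); the annotation is assumed to be
# already disambiguated. "Ivan piše Ani pismo olovkom" is a toy sentence
# meaning roughly "Ivan writes Ana a letter with a pencil".
sentence = [
    ("Ivan", "noun", "nominative"),
    ("piše", "verb", None),
    ("Ani", "noun", "dative"),
    ("pismo", "noun", "accusative"),
    ("olovkom", "noun", "instrumental"),
]

def guess_roles(tokens):
    """Assign the default role to every noun whose case is in the table."""
    return [(form, CASE_TO_ROLE[case])
            for form, pos, case in tokens
            if pos == "noun" and case in CASE_TO_ROLE]

for form, role in guess_roles(sentence):
    print(f"{form}: {role}")
```

The point of the sketch is only that the roles are read off the endings, without building any tree first.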
You simply go from inflectional endings directly to semantic roles, which is something Marko and his crew would be very happy with; they would like to do it. And of course the tagset here also depends, in its structure and cardinality, on the complexity of the language and on the design, because there are many more categories and many more values. So the best systems provide something like ninety-two percent. And now this is one of the Croatian examples, so here you have
tokens; this is the first possible lemma and then, not a part-of-speech tag, but actually a morphosyntactic description; this is another lemma and another morphosyntactic tag; and there's a whole list of these. Some word forms of Croatian adjectives can have as many as twenty-two different morphosyntactic descriptions for the same word form. Oh yes: this is a lemma, this is a morphosyntactic description, that's another lemma and another morphosyntactic tag. Because, for instance, let me check, here for instance this could be either a noun which means "gay", or it could be another noun, or it could be [unclear]. So the first letter is the part of speech, and this is very easy to read. This is the MULTEXT-East tagset framework, so this one is noun, common, masculine, singular, nominative. This tagset is freely available, the whole specification is there, and it's actually maintained at the Jožef Stefan Institute in Ljubljana. And when you disambiguate all that, you get the proper tokens, lemmas and the proper morphosyntactic description. But then you need this information about the nominative for the phrases, OK? So you have this nominative here, and when you analyze it at the syntactic level, it really functions as a subject. So in inflectionally rich languages it's very hard to do syntactic analysis, or practically impossible, if you don't complete the morphological processing before. OK, now we go to syntax. How much time do we have? We have a coffee break at eleven, yes? I started a bit later because of the technical problems. Should I go through syntax and then we go for the coffee break? It's about ten minutes, I think. So, syntactic analysis is today done manually only for training data sets, but when I say training data sets, it means several thousands of sentences.
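The positional reading of such tags, first letter for the part of speech and one letter per category after it, can be sketched for nouns. The position table below is a partial assumption written for this illustration (it covers only type, gender, number and case of nouns); the MULTEXT-East specification remains the authoritative source for the actual positions and codes.

```python
# Positions after the initial "N", each with its code-to-value table.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
]

def decode_noun_msd(tag):
    """Expand a positional noun tag such as 'Ncmsn' into named features."""
    if not tag.startswith("N"):
        raise ValueError("this sketch only handles noun tags")
    features = {"pos": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, tag[1:]):
        if code != "-":  # "-" marks a category that is not specified
            features[name] = values[code]
    return features

print(decode_noun_msd("Ncmsn"))
```

For `Ncmsn` this yields noun, common, masculine, singular, nominative, which is the reading given in the lecture.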
So if you have Mechanical Turk available, or you get the crowd, fine. We have people, which is very fine: I have about forty students each year in the corpus linguistics course, and they get tasks, of course, which is a very good way to really have your hands on proper language material. There you see where the real problems are, because grammar books, grammars, skip all the problematic spots; they skip all the hot spots. Then you try to see how something was described in the grammar, and no, it has not been described at all. And you come up with the numbers. I can tell you right now that for Croatian, for instance, we could produce about a dozen papers dealing with phenomena that have never been dealt with before in Croatian linguistics, only because we are working with proper, real text material. OK, so automatic syntactic analysis is done by programs, by parsers, and they can be shallow or deep. They can be robust parsers, which is very important, because if you don't have a robust parser, it goes either into an infinite loop or into a blue screen of death. But robust parsers that cannot parse a certain part of a sentence or clause put up a flag and go on, and later you check the flagged parts and see what's behind them. And they can be predominantly left-branching or right-branching. What's the usual order of adjective and noun? The usual order is adjective before noun. So in parsing the NP, the noun phrase, where is the head of the phrase? On the right. So you would prefer right-branching, because you would first recognize the head and then check whether the left part agrees with the head in gender, number, case, whatever.
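The agreement check described at the end of this passage, find the head on the right and then test the material to its left against it, can be sketched as below. The dictionary layout, the feature names and the two Croatian toy phrases are assumptions for illustration; a real parser would take these features from the morphosyntactic tags.

```python
def agrees(np_tokens, features=("gender", "number", "case")):
    """Check a candidate noun phrase whose head is the rightmost token.

    Every token to the left of the head must match the head in all
    of the listed features.
    """
    *modifiers, head = np_tokens
    return all(m[f] == head[f] for m in modifiers for f in features)

# Hypothetical annotated candidates.
good = [
    {"form": "lijepe", "gender": "feminine", "number": "plural", "case": "accusative"},
    {"form": "knjige", "gender": "feminine", "number": "plural", "case": "accusative"},
]
bad = [
    {"form": "lijepe", "gender": "feminine", "number": "plural", "case": "accusative"},
    {"form": "stol", "gender": "masculine", "number": "singular", "case": "nominative"},
]

print(agrees(good))  # True
print(agrees(bad))   # False
```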
But in French it's completely the other way around: in French the adjective comes behind the noun, so you need a different type of parser. Or take prepositional phrases. Prepositions are, in practically all Indo-European languages, predominantly in front of the noun, and that's the head of the phrase. But then you have languages where you have postpositions, like Hungarian, so you take it the other way around. Once you do the automatic parsing, the automatic syntactic analysis, and you return this information back into the corpus, you get what are called treebanks, that is, corpora with inserted syntactic annotation. This again becomes training material for different kinds of parsers, statistical, neural or whatever. And when you talk about parsing, in syntactic analysis you actually have two main formalisms. One of the formalisms is constituency parsing, the other one is dependency parsing.
Let me show you an example from the Wall Street Journal, a typical Wall Street Journal sentence: [unclear] rose three-eighths to twenty-two. And this is the constituency parse, so a parse tree in the constituency formalism. But what I would like to point out: look at the linear form, how it has been written in the linear form. Does this remind you of something? Yes, that's true, but there's another thing. If instead of these parentheses you put curly ones, then you get sets of sets. OK. In fact, this is practically unreadable to humans; we get lost in the number of parentheses. And now you think of one of the first languages, Lisp. Of course you know what the Lisp acronym means: Lots of Irritating Single Parentheses. So not list processing, but lots of irritating single parentheses. But this is one way to put it in linear form. So you have parts of sentences that are contained within parts at a higher level. The next formalism is dependency parsing. Here you don't mark what is above what, but what depends on what, so which node depends on which.
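The linear, parenthesized form of a constituency tree really can be read like a Lisp expression. The sketch below turns such a string into nested lists; the example sentence and its labels are a made-up toy, not the Wall Street Journal sentence from the slide, and the reader assumes balanced parentheses.

```python
def parse_brackets(s):
    """Read '(S (NP ...) (VP ...))' into nested lists: [label, child, ...]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # skip the closing parenthesis
            return node
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree

def leaves(tree):
    """Collect the words: a leaf is a plain string below a label."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the constituent label
        words.extend(leaves(child))
    return words

tree = parse_brackets("(S (NP (DT the) (NN cat)) (VP (VBD saw) (NP (DT a) (NN dog))))")
print(tree)
print(leaves(tree))  # ['the', 'cat', 'saw', 'a', 'dog']
```

The nesting of the lists mirrors the nesting of the constituents: each part of the sentence sits inside the part at the next higher level.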
And that's the other way to analyze the same sentence, of course. Now look at the linear form and tell me what this reminds you of. Again mathematics: predicate calculus, simple predicates and second-order functions. That's what I'm telling my students: grammar and mathematics are the same thing, and you have to come to this level. Add to that music and physical education, and those are the four basic disciplines that kids should be taught as early as three, from the age of three on. Good. So there is also a possible difference between so-called surface and deep parsing, and I'll give you an example from Prague, from the tectogrammatical approach. This approach was developed by the Czech linguists Petr Sgall, Eva Hajičová, [unclear] and Jan Hajič, and it was the theoretical basis for the Prague Dependency Treebank, which now already has version three available. They talk about two main syntactic layers, the so-called a-layer and the t-layer, which they call tectogrammatical. It looks like this: you have a layer of word forms, and you have a morphological layer, so you have tags here and you have lemmas, right? Then you have the surface syntactic structure in the dependency formalism, and you have a deeper structure, which is actually semantic roles. So this goes against the idea of Noam Chomsky, who said that deep structure is also syntactic and that the semantic interpretation comes from outside of syntax. There was a huge discussion back in the late sixties and early seventies between generative linguists and generative semanticists, because the generative
semanticists started with Chomsky and then departed and said that the deep structure is actually a semantic structure, the basic semantic information that we want to convey with our sentence, and that the syntactic surface structure is how we syntactically encode our semantic layer. So it's very nice if you have these toy sentences, as you call them, but once you get into real corpus material, it looks like this.
This is a normal sentence from a Croatian corpus. You see it's a coordinated sentence: you have one predicate here and another predicate here, and then you have subordinate sentences, like relative clauses, and there's another predicate here. And you have something which is preposterous in constituency parsing: branch crossing. Dependency parsing allows for branch crossing, and that's why dependency parsing is much more convenient and useful for free word order languages. Constituency parsing is very fine and very useful for fixed word order languages like English or Chinese, but for free word order languages, like any Slavic language, dependency parsing is much more useful. There is an example that I wanted to show you, an example in the CoNLL format. You see here the position, the nodes, the structure. You have the verticalized corpus here, so you have tokens, you have lemmas, parts of speech, named entities recognized, and then you have the dependency relations. That means this one depends on nothing, on zero, and this one depends on the item at position number five. So this one is the root, OK, and I'll show you how we project it into a tree. And these are the syntactic roles, so you have subject, predicate, object, attribute and so on. And this is a very important thing: there is a strong initiative, started a few years ago, called Universal Dependencies. It is syntactic annotation which is consistent across eighty-three languages so far, with sixteen more announced, sixteen more to come. This formalism was introduced by Joakim Nivre from Uppsala University, and it actually facilitates multilingual parser development, because you can have the same parser and then you are simply training it in a consistent and similar way.
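The vertical format described above, one token per line with the position of its head and 0 for the root, can be read into a tree with a few lines of code. The four-column layout and both toy inputs are simplifying assumptions for illustration; real CoNLL-style files have more columns. The crossing test is likewise a simplified check over pairs of arcs that ignores the arc from the artificial root.

```python
# Assumed columns: position, form, head position (0 = root), relation.
rows = """\
1 I 2 subject
2 saw 0 predicate
3 a 4 attribute
4 cat 2 object
"""

def read_rows(block):
    tokens = []
    for line in block.strip().splitlines():
        idx, form, head, rel = line.split()
        tokens.append({"id": int(idx), "form": form, "head": int(head), "rel": rel})
    return tokens

def children(tokens):
    """Project the head column into a tree: head position -> dependents."""
    tree = {}
    for t in tokens:
        tree.setdefault(t["head"], []).append(t["id"])
    return tree

def has_crossing(tokens):
    """True if two dependency arcs cross (the root arc is left out)."""
    arcs = [(min(t["id"], t["head"]), max(t["id"], t["head"]))
            for t in tokens if t["head"] != 0]
    return any(a < c < b < d for a, b in arcs for c, d in arcs)

tokens = read_rows(rows)
print(children(tokens))      # {2: [1, 4], 0: [2], 4: [3]}
print(has_crossing(tokens))  # False

# An artificial case in which the arcs 1-3 and 2-4 cross.
crossing = read_rows("1 w1 3 x\n2 w2 4 x\n3 w3 0 x\n4 w4 3 x\n")
print(has_crossing(crossing))  # True
```

A structure like the second one cannot be drawn as a constituency tree without crossing branches, while the dependency representation stores it without any special treatment.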
fuel from different languages. and of course it's it enables the languages across language retriever and different types of parsing researchers so. you can find a more information on universe of dependence is dot org and there's a number of papers use the set of data for training in processing for instance and will be cured by border shane and and his collaborators and and these guys are from bucharest from remaining academy of sciences from the air a eye institute late. by done to fish. they have done very serious. the e.u. introduction aldrin of neural networks into these traditional end of the tasks as i have already talked about so they have used intensively neural networks for limitation m. is the tanking and and similar things in and. got very interesting results of this is a fresh new from the memo november last year the.
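The CoNLL-style table described above (position, token, lemma, part of speech, head position, syntactic role) can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy English rows, the reduced set of six columns, and the names `Row`, `find_root`, `tree_lines` and `has_crossing` are all hypothetical and not taken from the slide, and real CoNLL-U files carry ten tab-separated columns. The crossing check looks only at arcs between real tokens and ignores the artificial arc from position zero to the root.

```python
from typing import NamedTuple


class Row(NamedTuple):
    pos: int      # position of the token in the sentence (1-based)
    token: str
    lemma: str
    upos: str     # part-of-speech tag
    head: int     # position of the governing token; 0 means "depends on nothing"
    role: str     # syntactic role (subject, predicate, object, ...)


# Hypothetical toy sentence, not the Croatian example from the lecture.
ROWS = [
    Row(1, "I", "I", "PRON", 2, "subject"),
    Row(2, "saw", "see", "VERB", 0, "predicate"),
    Row(3, "a", "a", "DET", 4, "determiner"),
    Row(4, "cat", "cat", "NOUN", 2, "object"),
]


def find_root(rows):
    """Return the position of the single token whose head is 0."""
    roots = [r.pos for r in rows if r.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def tree_lines(rows, head=0, depth=0):
    """Project the flat table into an indented tree, one line per token."""
    lines = []
    for r in rows:
        if r.head == head:
            lines.append("  " * depth + f"{r.token} ({r.role})")
            lines.extend(tree_lines(rows, r.pos, depth + 1))
    return lines


def has_crossing(rows):
    """True if two dependency arcs cross (the 'branch crossing' case)."""
    arcs = [(min(r.pos, r.head), max(r.pos, r.head)) for r in rows if r.head != 0]
    return any(a < c < b < d for a, b in arcs for c, d in arcs)


if __name__ == "__main__":
    print("root at position", find_root(ROWS))
    print("\n".join(tree_lines(ROWS)))
    print("crossing arcs:", has_crossing(ROWS))
```

A constituency parser cannot represent a sentence for which `has_crossing` is true without extra machinery, which is the reason given above for preferring dependency parsing in free word order languages.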
Yes, and now, to end with syntax, I will simply say that instead of parsing the whole sentence you can also concentrate on what is called shallow parsing, because parsing the whole tree is computationally very demanding, and it can produce more than one result, including ones which we as humans do not perceive as valid or possible. Take for instance the sentence "I saw a cat with a telescope in a park", or "I saw a cat in a park with a telescope". You have three possible, completely valid parse trees for the sentence "I saw a cat in a park with a telescope", depending on where you attach the telescope. Is it the park with a telescope, as opposed to a park that doesn't have a telescope, you know, the one you put a coin into? Or is it me who was using a telescope to see the cat in the park? Or was it the cat that was carrying the telescope under her paw, or whatever? You know, for computers these are three completely legal parse trees. Of course we as humans discard the third one, and the first two are disambiguated, or selected, on the basis of background knowledge.
So that's a problem. Sometimes you don't need to parse the whole sentence; you just want to concentrate on very isolated parts, islands in the syntactic structure, and this is shallow parsing. You can detect classes of words or phrases; these are called chunks. A chunk is a non-recursive part of a syntactic structure. Or you can, for instance, go for a particular type of expression, let's say temporal expressions: I want to cover only temporal expressions, I want to take out only time, or spatial locations. This is very useful for quick information extraction, text mining and so on, but it can also be used for what is such a pain in the neck for linguistic processing: multiword expressions.
In German you don't have that; in German you have compounds, written as one word. But on the other hand, in German you have what are called phrasal verbs. You have [German example unclear]. You can have the whole sentence in between these two parts, which are syntactically very far apart but actually form the same group. "Take" has nothing to do with "take off". It's a completely different word, so if you don't take "off" into account, your syntactic and then later semantic understanding of the sentence is impaired. "Carrier" versus "aircraft carrier": that's a different type of multiword expression. As you see, "carrier" versus "aircraft carrier", this one is more precise; "carrier" could be understood as a generic term [unclear], but this one is a different type of compound, actually of multiword expression. So shallow parsing can be used for delimiting multiword expressions, which is also important.
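To make the idea of a chunk as a non-recursive island more concrete, here is a minimal sketch of a noun-chunk detector over part-of-speech-tagged tokens. It is an assumption-laden toy, not the method used in the lecture: the pattern (optional determiner, any adjectives, one or more nouns), the tag names, and the function name `np_chunks` are all chosen here for illustration.

```python
def np_chunks(tagged):
    """Group maximal DET? ADJ* NOUN+ runs into flat (non-recursive) noun chunks.

    `tagged` is a list of (token, pos_tag) pairs. Nothing is nested: a chunk
    never contains another chunk, which is what makes this shallow parsing.
    """
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DET":                  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":     # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":    # one or more nouns
            k += 1
        if k > j:                                  # at least one noun was found
            chunks.append(" ".join(tok for tok, _ in tagged[i:k]))
            i = k
        else:                                      # no chunk starts here
            i += 1
    return chunks


if __name__ == "__main__":
    sentence = [
        ("I", "PRON"), ("saw", "VERB"), ("a", "DET"), ("cat", "NOUN"),
        ("in", "ADP"), ("a", "DET"), ("park", "NOUN"),
        ("with", "ADP"), ("a", "DET"), ("telescope", "NOUN"),
    ]
    print(np_chunks(sentence))
```

Note that the chunker finds "a cat", "a park" and "a telescope" without deciding where the telescope attaches, so the ambiguity of the full parse never has to be resolved. A noun-noun sequence such as "aircraft carrier" also comes out as one chunk, which hints at how the same island-based approach can help delimit multiword expressions.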
A special type of shallow parsing, or limited syntactic processing, is called named entity recognition and classification, and usually, when Mark was talking about it, he was referring to named entities. At the Message Understanding Conference, back in ninety-seven, seven basic types of named entities were defined: person, location, organization, date, time, value in money, and measurements in percentages. And then later more named entity categories were added, like geopolitical entities, such as the EU and NATO and so on, or protein, gene, drug, disease, species, brand names and so on. So if you say "a new Opel Insignia", is "Opel Insignia" one named entity, or are there two named entities, one being the brand and the other one the type, or whatever, because you have Opel Insignia, and of course Opel [unclear]?
And this is a good example in Croatian text, from two thousand and five. Actually that was the PhD by [name unclear]; that was his work. He made a system for named entity recognition and classification, in two thousand and five, for Croatian. And you can see here that it really colors the differences. So you have [unclear], both name and surname, and here you have only the surname, and [unclear]. And by the way, these are all inflected for case, so it carefully takes the case endings into account as well. For instance, you have here [three Croatian country names, unclear]; these are all three countries. You understand the first one means Belgium. The second one is the Netherlands; this is a calque, it means the low lands, so it's the Netherlands. And the third one [unclear]. These are all in the genitive, and they have been recognized completely.
And this system has a precision of ninety percent on newspaper text, which is quite acceptable.
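As a reminder of what a precision figure like ninety percent measures, here is a small sketch: precision is the share of entities marked by the system that are actually correct. The exact-match criterion (same span and same type), the toy spans, and the names `MUC7_TYPES` and `precision` are assumptions made for this illustration; the lecture does not say how the Croatian system was scored.

```python
# The seven MUC-7 named entity types mentioned above.
MUC7_TYPES = {"PERSON", "ORGANIZATION", "LOCATION", "DATE", "TIME", "MONEY", "PERCENT"}


def precision(predicted, gold):
    """Fraction of predicted entities that exactly match a gold entity.

    Each entity is a (start_token, end_token, type) triple. An entity counts
    as correct only if both its span and its type agree with the gold data.
    """
    predicted, gold = set(predicted), set(gold)
    for _, _, entity_type in predicted | gold:
        if entity_type not in MUC7_TYPES:
            raise ValueError(f"unknown entity type: {entity_type}")
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


if __name__ == "__main__":
    # Hypothetical annotations, invented for this example.
    gold = [(0, 2, "PERSON"), (5, 6, "LOCATION"), (8, 9, "ORGANIZATION")]
    predicted = [
        (0, 2, "PERSON"),         # correct
        (5, 6, "ORGANIZATION"),   # right span, wrong type
        (8, 9, "ORGANIZATION"),   # correct
        (11, 12, "DATE"),         # not in the gold data at all
    ]
    print(precision(predicted, gold))
```

Precision says nothing about entities the system missed; that would be measured by recall, which the figure quoted above does not cover.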
And once you insert this information, it looks like that. So you can classify: this is a person, this is an organization, a place or location, and so on. You insert this information.
OK, here we stop for coffee, or questions. We can leave questions for the end, or we can discuss during the coffee break, if possible, please.