
Language Processing Pipelines Part 2


Formal Metadata

Title
Language Processing Pipelines Part 2
Alternative Title
Corpus annotation: semantics
Title of Series
Part Number
2
Number of Parts
3
Author
License
No Open Access License:
German copyright law applies. This film may be used for your own use but it may not be distributed via the internet or passed on to external parties.
Identifiers
Publisher
Release Date
Language
Production Year
2019
Production Place
Dubrovnik, Croatia
Transcript: English (auto-generated)
Excellent. So we stopped here. I've sort of completed syntax. Of course, "completed" is a very mild term; syntax is never completed, even if your language is NP-complete, it's never completed. Right. So let's go to the next level, which is semantics. And there you have two main types of semantics.

One is what semantics is usually thought to be: lexical semantics. So we talk about the annotation of word senses, the actual meanings that are realized by a particular token in a corpus.

As you all know, words can have more than one meaning. And unless you are a poet and like to play with words, you usually use a word in one meaning within the same discourse. But if you cross the boundaries of a domain, for instance, you
can find the same word having completely, totally different meanings in different domains. Take the English noun "charge". In military terminology, this is when you start an assault against the enemy.

Or you can even use it as a verb, to charge the weapons. That's the military domain. But if you go into the legal domain, then it's an indictment. It has nothing to do with assault, unless on a much higher ontological level you understand being indicted as being assaulted: to be charged as being assaulted.

Or if you go to culinary, kitchen terminology, a charge could be what you put in a turkey or something like that, in some languages, for instance.

So in different domains the very same word has completely different meanings. And it's very easy for us to understand which of these meanings has been used in a given context, but for computers that's not so easy. So you have to disambiguate, or at least explicitly mark, which of the possible meanings has been realized by that particular token.
And, of course, you have to take care of single-word expressions and multi-word expressions, because the general meaning of "carrier" is not as specific as "aircraft carrier".

So it's, again, a different meaning. And then you can also include information about synonyms, hyponyms, hypernyms, meronyms, and so on. There is a whole number of different semantic relations between lexical items, between words. And this can be resolved by mapping these expressions or words to conceptual spaces, to different types of conceptual spaces.

I call these conceptual spaces because they don't have to be organized as networks.
They could be organized in a different way, but usually they are networks. And some of them were derived for completely different purposes. I mean, WordNet was developed by cognitive psychologists because they wanted to

have a dictionary that could be searched by meaning, not by headwords. So they wanted, you know, a normal dictionary, but turned upside down. That was the primary intention behind building WordNet, and it became a valuable language resource along the way.

And BabelNet is just a compilation of dictionaries, WordNets, different kinds of knowledge databases, and so on. And then, from the other side, you have the typical AI ontologies and conceptual spaces, like the Cyc ontology, which has been built manually for decades and which is, I would say, one of the incredible monuments to the human mind.
It's still being built, all the time. Yes?

Knowledge, yes. That's a conceptual space.

I would say that is the most generic possible term you can use for this type of organization. And of course Wikipedia, LOD, everything is there. And you need this to do one of the specific NLP tasks, which is called word sense disambiguation: attaching the right meaning to each token in a corpus.
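To make this concrete, here is a minimal word sense disambiguation sketch using the classic Lesk algorithm as shipped with NLTK. This is just one simple approach, not the specific tooling discussed here, and the example sentences are invented.

```python
# Minimal WSD sketch with NLTK's classic Lesk implementation.
# Requires: nltk.download('wordnet'); nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

for sentence in [
    "The cavalry led the charge against the enemy lines.",
    "The court read out the charge against the defendant.",
]:
    tokens = word_tokenize(sentence)
    sense = lesk(tokens, "charge", pos="n")  # picks a WordNet synset
    if sense:  # lesk can return None if nothing overlaps
        print(sense.name(), "-", sense.definition())
```

Lesk simply compares the context words against each sense's dictionary gloss, which is exactly the kind of explicit sense marking described above, only automated and therefore error-prone.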
The other type of semantics could be called sentential semantics. That's another task, semantic role labeling: you have to recognize which parts of a sentence actually encode which semantic roles.

And this is where we come a step closer to computer language understanding, or computer message understanding. Okay? That means you have to recognize what the agent is, what the patient is, who the beneficiary is, and what instrument the action has been done with.

So this is actually the key semantic information that we want to convey by using language. And you can do it in different ways in different languages.
I mean, different languages and their structures give you different means to do that. There are languages that put more emphasis on syntax, and other languages that put more emphasis on inflection and morphology. In Chinese you get no inflection at all whatsoever.

There are no endings in Chinese. There's only one marker for plural. Okay, you need to differentiate between singular and plural. But then they have a kind of compounding, where you add several ideograms together, so it could be understood like a German compound.

This they have. But then there's another problem in Chinese, as you know: they do not write spaces. So the problem of tokenization for Chinese, Japanese, and similar languages is terrible, because you have a string of ideograms which you can parse, or actually segment, in several different ways. So there you have a complexity which keeps growing and is problematic.
And the whole idea is that you convey those semantic roles. So if I say "John loves Mary", okay, we know exactly from the syntactic structure; the syntactic structure maps very easily to semantic roles.

We know that the subject is the agent. Which action is going on? Loving, okay, "loves". And who's the object, or patient? Who suffers? I mean, if love is suffering, okay, fine. Sometimes it is, of course.

And then, syntax in English has a fixed word order. If you change the order, "Mary loves John", it's not just that the syntactic order has changed; the mapping of semantic roles has changed too. Now the agent is not John anymore.

It's Mary. And the patient is not Mary; it's now John, okay. But in other languages, like Slovenian, you can shift it as much as you like; you can shuffle it all around.

You can say it in any order, and so on. You can have any kind of permutation there, and it's always the same. So syntactically it's a different order, but the semantic roles are always the same, because syntactic relations are not encoded by the order of words in the sentence, but by case endings, by inflectional endings.
So that's a free word order language. And as you would probably guess, it's easier to map between fixed word order and semantic roles than between free word order and semantic roles.

Because in a fixed word order language you know that the slots in a sentence are fixed and there's no way to shuffle them around, okay. And there are quite a number of languages which are free word order languages. So we have another problem of complexity there.
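A toy contrast makes the two mapping strategies explicit. This is a deliberately simplified sketch: the "case endings" below are invented markers, not real Slovenian morphology, and a real system would of course work on parsed, morphologically analyzed text.

```python
# Fixed word order: position alone decides the semantic role.
def roles_fixed_order(tokens):
    subject, verb, obj = tokens  # English-like SVO slots
    return {"agent": subject, "predicate": verb, "patient": obj}

# Free word order: roles come from (here: invented) case markers,
# so any permutation yields the same role assignment.
def roles_by_case(tokens):
    roles = {}
    for tok in tokens:
        if tok.endswith("-NOM"):
            roles["agent"] = tok
        elif tok.endswith("-ACC"):
            roles["patient"] = tok
        else:
            roles["predicate"] = tok
    return roles

print(roles_fixed_order(["John", "loves", "Mary"]))
print(roles_fixed_order(["Mary", "loves", "John"]))   # roles flip
print(roles_by_case(["Mary-ACC", "loves", "John-NOM"]))
print(roles_by_case(["John-NOM", "Mary-ACC", "loves"]))  # same roles
```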
So let me give you an example. Once you map this to a conceptual space, in this case it's WordNet, you take the Brown corpus, a sentence from the Brown corpus. And you can see how, let me see here, oh, sorry about that, okay.

So this is a verticalized corpus, and you can see: "analysis means the evaluation of sub-parts, the comparative ratings of parts, the comprehension of the meaning of isolated elements", okay. So, for instance, here you have a common noun.

The token is plural, but the lemma is singular. And in this position, at this token, word sense number one as listed in WordNet has appeared, and this is the code of the node in the WordNet.

In this case "isolated" means "separated", and this is lexical sense number five and WordNet sense number three, okay. So you can see that different senses appear at different positions. And of course each of these senses belongs to a different sense set.

WordNets are organized by sense sets, by synonymy sets, okay: sets of words that have more or less the same, approximately the same meaning. And I don't want to go deeper into lexical semantics and WordNets now, because that would need a whole lecture, I would say.
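For illustration, here is a minimal reader for such a verticalized, sense-annotated corpus. The column layout below (token, lemma, part-of-speech tag, WordNet sense number) is a hypothetical stand-in; the corpus on the slide uses its own annotation scheme.

```python
# Parse a hypothetical vertical format: one token per line,
# tab-separated columns, "_" where no sense was annotated.
sample = """\
the\tthe\tDT\t_
isolated\tisolated\tJJ\t5
elements\telement\tNN\t1"""

for line in sample.splitlines():
    token, lemma, pos, sense = line.split("\t")
    print(f"{token}: lemma={lemma}, pos={pos}, wn_sense={sense}")
```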
We can talk for hours about WordNets. But I find it very important to share these URLs with you. The first and most important is the Princeton WordNet; that's the first one that was made.

And I still find it probably the best dictionary of English available online for free. Try it and search it. I mean, this is incredible. You get not just the synonyms; you get antonyms, you get hypernyms and hyponyms.

You get meronyms of each and every word inside. This is fantastic. But there's also the Global WordNet Association, where you have a list of available WordNets. And many of these are open-source data, so you can freely download them and use them in your research and experiments.
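You can query the Princeton WordNet for all of these relations through NLTK; a minimal sketch, assuming the wordnet corpus has been downloaded:

```python
# Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

charge = wn.synsets("charge", pos=wn.NOUN)[0]  # first noun sense
print(charge.definition())
print("hypernyms:", charge.hypernyms())
print("hyponyms: ", charge.hyponyms())

# Meronyms and antonyms live on other entries, for example:
tree = wn.synset("tree.n.01")
print("meronyms: ", tree.part_meronyms())
good = wn.synset("good.a.01").lemmas()[0]
print("antonyms: ", good.antonyms())
```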
There's also the Open Multilingual WordNet, another source of different WordNets. And I think at the end you have the Universal WordNet. There was an idea to build a WordNet for 200 languages:

all WordNets merged together, WordNets for 200 languages merged together. And you have something like 1.5 million headwords and literals in this WordNet. So much for lexical semantics. But yes, BabelNet, this you should not forget.

A very important resource, and very useful, and it keeps growing all the time. And that's the whole story. Now, just to point out something about sentential semantics: here you have a snapshot of FrameNet, which describes the frame of Employing.
So you're all familiar, I think, I hope, from AI with the very old idea of Schank's scripts and frames. And this is actually the idea taken over by Chuck Fillmore, Charles Fillmore, the famous generative semanticist, who took the idea of frames and said:

OK, but frames are actually what each and every verb opens up semantically. Not syntactically; semantically. So to each verb in English you can attach a relevant frame, and then you can explain the meaning that this frame actually encompasses.

And this one is called Employing. I can't see it without my glasses, sorry about that. So if you read, if you follow this one:
An employer employs an employee, whose position entails that the employee performs certain tasks in exchange for compensation. OK?

So you see, each role is marked by color. The verb is always in black: "I employed him as chief gardener for ten years." And within these frames you have core roles, some of which are obligatory, some not.

And then you have non-core roles; some of these could be compensation, context, descriptor, duration, and so on. But the core roles, I mean, this is not predictable, so this information has to be stored in a dictionary.

It's not predictable which roles are core, which are obligatory. So that's a piece of information you have to take out of dictionaries. And it's done manually, unfortunately. But for a lot of other languages it has also been transferred from the English FrameNet and then adapted for those languages.
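If you want to inspect a frame yourself, the Berkeley FrameNet data ships with NLTK; a minimal sketch, assuming the framenet_v17 corpus has been downloaded:

```python
# Requires: nltk.download('framenet_v17')
from nltk.corpus import framenet as fn

frame = fn.frame("Employing")
print(frame.definition)
for name, fe in frame.FE.items():
    # coreType distinguishes 'Core' from 'Peripheral' and
    # 'Extra-Thematic' frame elements, as stored in the dictionary
    print(f"{name}: {fe.coreType}")
```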
And FrameNet is also one of these international enterprises, I would say. And now that we have passed through three different language levels, we can talk further.

But I'll not go into those details now. I should mention here, for instance, the level of linguistic pragmatics. Linguistic pragmatics is actually what also motivates, for instance, our lexical choices and syntactic constructions.
You remember, from the clip: "I'm sorry, Dave, I'm afraid I can't do that." This is the most polite way to say no. And it's a highly sophisticated way, even in the pronunciation, which is far from American,

although the movie is an American movie. So that's linguistic pragmatics: why the computer has chosen that exact wording and that type of construction.

This also includes, for instance, the usage of deixis. In some languages (not in English, but in some languages) you have different forms of address between the speakers and the people participating in a dialogue.

So you can use the personal pronouns in singular or in plural. And if you use the plural personal pronoun for a single person, this usually denotes honorific usage. That means you are expressing honor towards the people you are talking to.

In English it's always "you", so you have no such difference, apart from Middle English, where you had "thou" and "thy" and so on. Okay. But that also belongs to linguistic pragmatics.
But these are levels that come into play only when you have completed all of what was shown before. And there are some more complex, useful language technology tasks, like language identification. That's something you actually need, and Marco showed it yesterday; it works rather nicely.

Or sentiment analysis: you detect the attitude of the writer towards the topic he or she is writing about. It's very useful. I see the industrial use case, and I'm sure there are people who are doing that:

analyzing the comments or reviews on TripAdvisor, for instance. You can automatically analyze sentiment in TripAdvisor reviews, or in the reviews of certain products.

I mean: "my washing machine broke down for the fifth time, again, I hate this producer", or whatever. And you can do that automatically. And this is very useful for tracking, for politicians, for instance. Politicians would like to see what's happening in social media, how social media react to their foolish utterances.
I'm not referring to anyone here.
Well, I had a paper back in 2010 with two colleagues of mine.

We wrote a paper and it was published at LREC in 2010. It was Marrakech, I think. No? Okay, nevertheless. The paper was about sentiment analysis of financial texts in Croatian.

We took a number of blogs and financial portals and similar sources. And using simple categories, we tried to determine the sentiment

about the Zagreb Stock Exchange index. If the sentiment the day before was positive, the stock exchange index went up. If the sentiment was negative, the index went down. And guess what? We found not a relation but a correlation.
And you know that in statistics a correlation is a much stronger tie than a simple relation. And this can actually bring you to a profound philosophical question: are we creating reality with our texts?

Because if everyone is saying today, "well, you know, the index will fall tomorrow", then this really happens. So what are we doing? Or should I quote Qui-Gon Jinn from Star Wars?

Mind your mind. Mind your mind. Be careful what you think and what you say. But I don't want to go into philosophical questions now. Sentiment analysis, yes: we had a very nice, very limited financial domain; that was a nice use case.
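Just to show the shape of such an experiment, here is a minimal lexicon-based sketch. It is not the method from the 2010 paper; the word lists and the daily figures below are invented.

```python
# Score daily financial texts with a tiny sentiment lexicon, then
# check correlation with next-day index movement. Toy data only.
import numpy as np

POS = {"growth", "gain", "optimism", "rally"}
NEG = {"fall", "loss", "fear", "crash"}

def score(text):
    words = text.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

daily_texts = [
    "analysts expect growth and a rally",
    "fear of loss dominates the market",
    "optimism about gain continues",
]
sentiment = np.array([score(t) for t in daily_texts])
index_move = np.array([+1.2, -0.8, +0.5])  # next-day change, invented

print(np.corrcoef(sentiment, index_move)[0, 1])  # Pearson r
```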
But general text is a whole different story. Question answering systems, for instance, are another complex type of system, where you have to find answers to questions from a given collection of documents, facts, or ontologies. It could be organized in any way.
If you remember, that's why I said I would mention IBM again: IBM Watson winning the Jeopardy! quiz in 2011. Remember that? There was also a lot of noise about it, because years before that,

IBM's Deep Blue had won a chess match against the actual world champion, the human world chess champion. And that was Garry Kasparov, yeah, in 1997. And this happened in a completely different domain. Very demanding, computationally extremely demanding, because of the time limitation.
And these were the two best human competitors. Look at their faces. Look at the numbers. See the numbers? They did not believe what happened.

Huh. Good. And of course at the end you have machine translation. That's a whole different story, and we could really go into it, but we don't have enough time for that. Now I would like to go into more detail, and I will try to finish very soon.
So Marco mentioned the X-like project yesterday. Yes, please?

Yes. I have two possible answers to that. One happened at, I think, who was it, Marco, you can help me,

either Facebook or Google. They wanted to make a machine translation system. So they used neural networks and trained them to translate from one natural language to the other. But then the system invented an interlingua of its own.

Hmm. Yes. And then the engineers shut everything down. So that's one of the possible answers.
Are we ready for that? Yes. Yeah. But this is very interesting, because you have some kind of evolutionary computing there.

If you include evolutionary computing steps in that, then you must be very careful about stopping conditions. In a Turing machine you need a stopping condition. In any automaton you need a stopping condition.

Here it's very shaky whether you will reach the stopping condition, or whether the evolution can overcome the stopping condition and then never stop. And then who knows when to stop. It's like a car running all around your garage that doesn't stop until it has crashed into everything.
If we are talking about machines, yeah, okay. But the other possible answer I have from my own experience. We had been using trigram taggers for tagging Croatian, built by Thorsten Brants back in 2000, I think it was.

And then we applied it: we had a manually annotated corpus of 100,000 tokens of Croatian, we trained the tagger on that, and we got pretty decent results. But then we combined that with an inflectional lexicon. An inflectional lexicon is a list of 110,000 lemmas with all their possible word forms generated, which ended up as more than 6 million word forms.

And we combined that with the TnT tagger, and we got much better results.

So it looks like we need some kind of structured data somewhere, and it truly does help. So this is what I would call a hybrid system. And I don't limit myself to rule-based, statistical, or neural procedures.

If I can do the hybrid and get better results, why not? So far my experience is that you get the best results by combining different approaches, different types of data.

And when you have raw data and you train neural networks, I'm quite confident (I don't have the exact numbers now, but I'm quite confident) that some kind of structure, call it structurally supervised learning, can help and lift the final results.
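As a sketch of that hybrid idea, NLTK ships an implementation of Brants' TnT tagger that accepts a fallback tagger for unknown words, which is one simple way to plug in an inflectional lexicon. The toy Croatian word forms and tags below are invented placeholders, not the actual corpus or tagset.

```python
# Hybrid sketch: trigram (TnT) tagger with an inflectional-lexicon
# fallback for unknown words.
import nltk
from nltk.tag import tnt

# Toy stand-in for the 100,000-token manually annotated corpus.
train_sents = [
    [("vidim", "Vmr1s"), ("veliku", "Agfsa"), ("kucu", "Ncfsa")],
    [("kuca", "Ncfsn"), ("je", "Var3s"), ("velika", "Agfsn")],
]

# Full-form lexicon: word form -> tag; stands in for the
# 110,000-lemma / 6-million-form inflectional lexicon.
lexicon = {"kuce": "Ncfsg", "vidis": "Vmr2s"}
unk = nltk.UnigramTagger(model=lexicon, backoff=nltk.DefaultTagger("X"))

tagger = tnt.TnT(unk=unk, Trained=True)  # Trained=True: unk is ready
tagger.train(train_sents)
print(tagger.tag(["vidim", "kuce"]))  # "kuce" resolved by the lexicon
```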
Yes, please.
That's why they're called under-resourced languages. Right, that's true. But there are some ways you can try to deal with that. There are some methods in X-like, for instance; that's what I wanted to show.

We were actually developing linguistic processing pipelines, and I suggested we switch this terminology to linguistic processing chains, because we have our Cleopatra pipeline. In order to be precise, we can call these linguistic processing chains.

And they were developed for seven different languages: four major languages and three under-resourced languages. But then we also wanted to use machine translation, for instance.
So I think the Cebuano Wikipedia, and what was the other one, rank very high among the Wikipedias. As I said yesterday, I think they were translated by machine translation systems from the English Wikipedia. Yes, high up in the list, yeah.

But there are some serious experiments, you know, with the whole English Wikipedia being translated into other languages by machine translation systems. And in that way you can boost the number of tokens and resources and everything.

But then the problem is, of course: aren't we introducing more noise into our data? That's another problem. So this would have to be, I would say, a flawless machine translation system, and there's no such thing in the world, of course.
Yeah, but you can do bootstrapping. So you go incrementally; you can build it incrementally, you know, step by step. But the problem with that incremental approach is that with each new cycle you introduce new noise, and you can get a lot of problems.

And with machine translation you also have to be very careful about domains. For the very same reason I mentioned earlier, that the same word has different meanings in different domains, if you train the machine translation system over all domains, you get probabilities smeared over all domains.

And this doesn't work well in every domain, of course, okay? And that's what Google Translate had been doing until two years ago. Two years ago they introduced neural networks, NMT, neural machine translation, and the system now works much better than before.
Okay, so I would like to show you some examples of what we did in this X-like project. You can read about the project at that site. We combined approaches from several different areas, and we wanted to enable cross-lingual text "understanding" by machines.

Whenever I talk about machines "understanding", I always put understanding in quotes, because proper understanding requires a self-reflecting, self-aware agent.

And we are not sure the machines are there yet, okay? At least they are hiding it very successfully.

Because they lack one of the main features of human language, and that's embodiment. That's what machines do not have: they have no consciousness of their own existence and no consciousness of their physical appearance.

I said we are not there yet, but who knows what will happen, okay? So that's why I put quotes here; just explanatory quotes. And the goals: we wanted to develop tools to extract the entities and relations found in documents.

That's what actually ended up, at the very top level, in the Event Registry that Marco was showing today. But I'm now showing you what was behind the whole idea, at least the NLP part. So we wanted to do it for multiple languages, domains, and language registers.
Marco didn't show the analysis of tweets, because you have a completely different type of text when you want to run NLP on tweets. So there was a solid language technology foundation used throughout this project. The entry module was automatic language identification, and then we built

seven different pipelines, or language processing chains, one for each of the languages. These are the tasks that were used there, and they functioned as web services. Actually, each and every task is a separate web service.

And the languages covered were four major languages, English, Spanish, German, and Chinese, and three under-resourced languages: Catalan, Slovene, and Croatian. And if you look at the diagram, it looks like this. That means, for instance, that these three tasks were done in the same institution, and then these three were done in another institution.
And then the result of the third web service was handed over to another institution, where the dependency parsing was done.

So this was situated in Barcelona. This was in Open LMP. This was running in Zagreb. This was running in Zagreb. This was running in Ljubljana. This was running, I think, in Beijing. This was running in Barcelona, and this, again, in Ljubljana.

And FreeLing is, again, Barcelona. So for each and every web service, I mean, the following web service could be anywhere in the world, anywhere in the world, as long as you can reach it via REST or the SOAP protocol. So it really doesn't matter who maintains it. But once you have the links set up, that's very useful.

Actually, it allows you, when you're building your linguistic processing chain, to choose the best-performing module for the task you want to do, provided they all have the same interface, the same input and output; that's understood.
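In code, such a chain is just a sequence of HTTP calls, with language identification as the entry module. A minimal sketch; the endpoint URLs and JSON fields below are hypothetical, not the actual X-like interfaces.

```python
# Chain NLP web services: each module consumes the previous
# module's output, and modules may run anywhere in the world.
import requests
from langdetect import detect  # entry module: language ID

SERVICES = [
    "https://nlp.example.org/tokenize",      # hypothetical endpoints
    "https://nlp.example.org/lemmatize",
    "https://parsing.example.org/depparse",  # hosted elsewhere
]

def process(text):
    payload = {"lang": detect(text), "data": text}
    for url in SERVICES:
        payload = requests.post(url, json=payload, timeout=30).json()
    return payload

# print(process("John held a meeting in Brussels."))
```

Because every module shares the same interface, swapping in a better-performing parser means changing one URL in the list.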
And so if we start from this sentence, for instance, okay? At the end we want to arrive at this. So it's not just, see, this is dependency parsing at the top.

But below you actually have logical propositions, which capture that you have two events. So each verb, each predicate in a sentence, is actually a micro-event. That's what Marco asked yesterday: what is a complex event?

What's the upper boundary of an event, and what's the lower boundary of an event? What is actually going on?

So those are micro-events. And here you have a sentence with two clauses. And of course these two clauses actually convey two different micro-events: to hold something, and to devise. For "hold" you have an agent, you have a theme or object, you have a location, you have a time; and the same for the other.

So if we start with this sentence, we want to come up with that, okay? And we get there via syntactic and semantic analysis. Five types of target extraction elements were aimed at in X-like. We were going for tokens, lemmas, syntactic triples (that's subject, verb, object),

semantic triples (agent, action, object), and entity relations, named entity relations, which requires connecting entities to those conceptual spaces in order to disambiguate them. And we concentrated very much on named entities, because they are very fixed parts of a text.
In fact, in newspaper text, named entities represent about 10% of all tokens. And these are the places in the text where the non-linguistic environment is directly mapped into the text.

So names, personal names. I mean, there are huge philosophical and linguistic discussions, actually philosophy-of-language discussions (you have a conference about the philosophy of language here in room number four, happening on the same floor):

huge discussions about whether names are part of language at all, or whether they are simply labels, pointers to some other conceptual entities. So there's a huge discussion about that. Yes or no? Saul Kripke, Quine, and all the others.

I don't want to list all the people talking about that. So this is why named entity recognition and classification is a very important task whenever you do any kind of information extraction from text. Because it covers 10% of all newspaper text, and those are particularly informative parts of the text.

Run named entity recognition on a novel, Orwell's 1984, and you will get a list of what, 10 persons appearing, or 15? The most frequent one is Winston, and the other one is Big Brother, of course. And that's it, I think.
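As a quick sketch of that experiment with an off-the-shelf recognizer (spaCy here, which is not what X-like used; the file name is a hypothetical local copy of the novel):

```python
# Count person mentions in a novel with spaCy's stock NER model.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
with open("1984.txt", encoding="utf-8") as f:  # hypothetical file
    doc = nlp(f.read())

persons = Counter(ent.text for ent in doc.ents
                  if ent.label_ == "PERSON")
print(persons.most_common(10))  # expect Winston near the top
```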
But if we go for dependency-based extraction, then at the syntactic level you want subject-verb-object, and at the semantic level you want agent-predicate-theme. So the red ones here are syntactic relations and the blue ones are semantic relations.

So you have the same sentence, which can be analyzed like that. But apart from these basic relations, we might also be interested in detecting relations beyond subject-object relations. So this is location: where, in.
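A minimal sketch of such dependency-based triple extraction, using spaCy's English model on an invented sentence (the X-like pipelines used their own parsers and output formats):

```python
# Extract subject-verb-object triples plus a location relation
# from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John held a meeting in Brussels.")

for tok in doc:
    if tok.pos_ != "VERB":
        continue
    subj = [c.text for c in tok.children if c.dep_ == "nsubj"]
    obj = [c.text for c in tok.children if c.dep_ in ("dobj", "obj")]
    # prepositional modifiers give relations beyond subject-object,
    # e.g. the location of the micro-event
    locs = [(p.text, c.text)
            for p in tok.children if p.dep_ == "prep"
            for c in p.children if c.dep_ == "pobj"]
    print("triple:", subj, tok.lemma_, obj, "| location:", locs)
```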
And this information can actually be very good material for statistical or neural classification methods, and for extraction after that. But, as I said, you need manually or semi-manually (I wouldn't say

semi-automatically, I would say semi-manually) annotated material in order to train such systems. So, again, we come to crowdsourcing and Mechanical Turk and a lot of students helping to build this initial training data.

And I wanted to show you a demo. We don't have the original demo anymore; funny thing, it's not working anymore. But nevertheless, I can show you.