Pushing the frontiers of Information Extraction

Video in TIB AV-Portal: Pushing the frontiers of Information Extraction

Formal Metadata

Pushing the frontiers of Information Extraction
Title of Series
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
In many disciplines in the social sciences and humanities, research is increasingly dealing with large-scale text corpora that require the use of advanced NLP tools to process them and extract information. However, to date, most existing algorithms focus on the topical organization (e.g., bag-of-words based LDA) and relatively simple grammatical (e.g., subjects and objects of the discussion, adverbials of place, time, etc.) and semantic structures (e.g., sentiment) of the text, while more complex meanings remain difficult to extract. This talk introduces the strategy of the EU-funded research project INFOCORE for algorithmically detecting calls for action, and discusses the challenges involved, the available tools and further developments in the case of conflict discourse. At the same time, it sets out a more general approach to integrating existing algorithms to enable the detection of implicit, more complex semantic structures.
Are the microphones on? We will be recording this. So thanks, everybody, for watching this. What we are going to talk about today is a perspective on text extraction, on information extraction, which tries to get a little bit away from the technological and the linguistic perspective and look at it from the point of view of the social sciences: what are we doing with text extraction when we are processing large amounts of text?

Our starting point was that text has always been a key information source. There are very many things we are interested in for which we turn to text. If we just want to know what we know about things, what other people have already found out about things, we turn to text, and books are now digitized. If we want to know what's going on in the world, if we want to understand how people are discussing things, we can turn to newspapers, for instance, which report all this stuff, and all these newspapers are again now available digitally. We can be interested in learning processes, we can be interested in argumentation processes. Basically, the opinions that people have, the ideas of people, the preferences, even the behavior of people is nowadays recorded in a textual format. So almost all kinds of questions that we can have in the social sciences, and also beyond the social sciences very many questions that we have in real life, are things for which we turn to text. And that of course means that the processing of all those texts, text analysis and information extraction, has been growing very fast and has been a very important thing to develop.

The starting point for this development here is that there is a little bit of a gap between what we can normally find easily by using the tools that are available and what we would really like to find out. So for instance, if we look at the existing technology in text extraction and information extraction, one of the things that we're pretty good at is finding things. Obviously Google made this its business model and became very rich doing this. But in very many cases, also for more complex tasks, finding specific entities or finding specific locations in a text is something we're pretty good at. We can filter and aggregate a lot of things very efficiently. Sentiment analysis, for instance, has been developing quite a lot in the last years, and there are summarization techniques, so I can find specific topics and see what people are saying about those topics, and basically that is done in a more efficient form. Very interesting, very useful. I can find patterns, where we're not really interested in specific content anymore but more in the arrangement of content, in the patterning of things, and use them, for instance, to classify documents that show similar patterns. We can use this to check whether Shakespeare wrote what Shakespeare wrote. All these things are very fascinating, but they are not normally the kind of questions that we ask when we approach text from a social science perspective.

So for instance, when we want to find things, it does happen that we have a very specific question where we just want to find a specific article, and Google Scholar or something like it finds the article very well. But most of what we try to find are fuzzy entities: arguments, higher-order semantics, ways of talking about current issues. And we very often don't know exactly what we are looking for. We know what kind of thing we're looking for, and when we see it we can say, OK, that wasn't what I was looking for, or that was. But these are things that are defined by the structure of talk, the structure of text, and not so much by their content. These we're much worse at finding.

If we want to summarize things, to aggregate, we're normally not just interested in things like sentiment. Sentiment analysis, whether the mood on a forum or in the news or wherever is positive or negative, is a nice thing to have. The problem is that the mood is always negative, so that doesn't really give us terribly much. The interesting question is: what is it more negative about? Under which circumstances does it become more negative? Also, if we want to look at specific attributions: under which circumstances do which kinds of people attribute what kinds of qualities to what kinds of issues or objects? Those are the kinds of questions we're asking, about the connections between all the contents that we can find. And again, this is something where existing technology has a tendency to stop at a certain point; it becomes difficult.

And if we're looking for specific patterns, for regularities in text, we're not only interested in classifying documents; we're normally interested in classifying things within documents that are not bounded in a particularly clear way. So the document as a unit, or the paragraph as a unit, is not really what we're interested in. If we're looking for arguments, if we're looking for things like frames, as a contextualization of issues, and if we're looking for pragmatic contents, which is what we're going to do today, what language does, what is being expressed in text as a request, as a call for action, that is something that is not terribly easily recognized as a pattern within a given unit. So we need to have a slightly different approach.

So the starting point for this entire project was that there is a little bit of a gap between the existing tools that we have and the scientific inquiry trying to use those. Because, yes, we can operationalize actors, for instance, as named entities to some extent, but they're not really the same thing: there are a number of ways of referring to actors that are not named entities, and there are a number of things that are misclassified, though that is the smaller problem. Sentiment is not the same thing as evaluation: you can use positive sentiment words to express a negative evaluation. Irony would be the classic case, but there are very many ways of doing this. Topic models give us beautiful things that look a lot like topics but are not quite the same thing as topics. If you read a text and you have the topic model extracted from a body of text next to it, you see a difference; it's not exactly the same thing.

So what we are interested in is trying to get these two things closer to one another, and to find ways of bringing the technology that we have closer to the questions that we would want to have answered. We depart from natural language processing, which is, if you want, the intersection of computer science, linguistics and statistics, and you see there is no social science in there quite yet. What we're trying to do is to use the existing tools, these are some that we're using, and look at them from a perspective that takes into account the social science questions, and in this case communication research in particular.

What we're going to present to you today are questions that come out of a project whose logo you already see here: INFOCORE. INFOCORE stands for "(In)forming Conflict Prevention, Response and Resolution: The Role of Media in Violent Conflict", and it is basically a big political science, communication science and conflict studies research network that comprises all those universities that you can see here. We are processing very many texts with questions like: what kinds of views do social media texts present of the enemy? What kinds of solutions are being advocated in the newspapers? How are specific ideas for resolving a conflict discussed, and possibly dismissed, in a parliamentary debate? You see, these are questions that can be answered by turning to text, but they don't map quite so well onto the tools that we have.
What we do here is this: we're looking at conflicts in Kosovo, Macedonia, Israel and Palestine, Syria, Congo and Burundi, and we're looking at texts that are very heterogeneous in nature. We have strategic communication output, so PR, propaganda, all the stuff that governments, NGOs and so on churn out to try and make sense of conflicts, to tell us what it's all about and what should be done about it. We look at news coverage of those conflicts, so what is popularized as the mainstream interpretation or as community-specific interpretations of these conflicts. Then there is what's going on in social media: how do people respond to all these debates, how do they fuel their own debates, and what do they argue about? And political communication here is mostly the debate in parliaments and political forums. So the idea is: what happens with all the information that turns it into a policy, or into conflict prevention and management?

What we are particularly interested in here is calls for action, because you can easily imagine that in conflict research that is a very critical question: what kinds of solutions, what kinds of actions are being called for in a conflict? Are there newspapers that call for people to go take a knife and stab their neighbors? Are there other situations where there is a widespread consensus in the news that we should all calm down, not escalate the conflict, and start finding a peaceful solution? Are there propositions in the debate that might turn into radicalized violent actions, which may then fuel conflict? These are all things that we're interested in: what are people saying needs to be done about the conflict?

This is basically something that, from a social science perspective, we call a call for action. Here is our definition: calls for action are an expression of a request or desire for a specific course of action, with the aim of changing a current, undesirable state (someone is unhappy with what is currently the case: what needs to be done to get to a state that is fine?) or of maintaining the current, desirable state against threatening deterioration (now we are fine, but if we don't do a certain something, then this might actually stop, and we will have a problem, we will have conflict). So what do we need to do in order either to achieve a desirable state or to prevent an undesirable state?

These calls for action consist of a definition of what needs to be acted upon, what it is that we are concerned about; something that we propose to do, a specific course of action; and a motivational force. Some of these elements are more easily mappable onto linguistic theories than others. That is basically the starting point, and I'll turn over here to Katya, who will explain to you what exactly we have been doing and how it works.
All right, good afternoon, everyone. So, just to recap some of what Christian just said: we're interested in those sentences in texts that express calls for action, and we're aiming at developing an automatic tool that will extract them from text automatically. We use natural language processing and machine learning, and this can be formulated as a classification task, which is performed in two steps. In the first step we classify sentences as ones calling for action or not calling for action, and then, in the second step, those that do call for action we classify based on what exactly is being called for. I will walk you through all the steps, but first of all let me give you a good sense of what exactly we mean by calls for action, what exactly we're looking for in texts.

Calls for action can be expressed explicitly, in a straightforward way, or they can be hidden, and we're interested in both. So let's first look at the explicit ones. There are a number of ways in natural language to express this sort of information, that there is a call for action. First of all, by means of specific words like "command", "request", "demand", nouns or verbs like these that say very straightforwardly that something needs to be done. As an example here: "Chad has called for the humanitarian community to support the government in dealing with the influx of refugees." We have the expression "has called for", the verb "to call for" plus an infinitive, which is pretty much straightforward: it says that something is being called for.

Another way to express this information would be to use modal verbs that oblige someone to do something. For English those would be "must", "need to", "have to", "ought to", "should". It can also be a modal verb combined with a negation; in this case it would be a call for not doing something. We're still interested in that: it doesn't mean that there is no call for action. It is still there, it is just a call for not doing specific things.

Another example is when calls for action are still expressed explicitly, but a bit differently, because there are no specific words here; it is in the grammar. For English, this is the imperative mood. Most languages, to the best of my knowledge all languages, have specific grammatical ways to express it. In English it would be omitting the subject and placing the verb in first position, like "Fight them!" Or another example, which is quite interesting: "Fire!", which is ambiguous, and here we start dealing with the fact that natural language is very ambiguous. It can be a command to shoot, so a call for shooting at someone. I can also say "Fire!" meaning that there is a fire here, so we need to run away, it's dangerous, which would be an announcement. And this information is actually available only from the context. Right now it's really hard to say what I meant, so we really need to know when the sentence was uttered, in reply to what, what the context is.

[Brief technical interruption concerning the slides.] Sorry, I tried; it's not going to work. The presentation will be available [...]. I'm so sorry about this; you will really have to listen to me.
The next thing is that we are trying to catch not only those calls for action that are straightforward in the text, but also those that are implicit, that are just hinted at. This may be common in parliamentary debates, for example: politicians do not always like to say "go and kill someone", but they like to say something like "well, probably, let's do something bad for somebody", hiding the meaning so as to leave room for different interpretations. We still want to know about all of those calls for action; this is interesting for us.

I will also show you where the fine line is between what is still considered to be a call for action and what is not. Because, again, if you look at different linguistic theories, such as speech act theory, basically every sentence can call for something. So we really need to draw a fine line, for us and for computers, to understand what we're interested in and what we're not interested in.

So first of all, we consider those sentences that communicate dissatisfaction with the current situation. The idea is that if someone is not happy with the current state of affairs, then most likely that person will want to do something, will ask someone to do something, or will perform some action to get to a better situation, a better future, to get a desirable outcome. This can be expressed with specific words, for example "condemn" or "blame", or with set expressions; idiomatic expressions can also be used, like "we will not sit with folded hands". And it can actually be shown that these sentences are indeed calls for action: we can rephrase them with the help of modal verbs, like "this state of affairs should not be happening" or "someone should not do something", where we get into a more classical way of expressing this sort of information.

There are also expressions that still use the same lexical tools, like modal verbs, for example, but that don't call for a specific action. They just say that something should be done, some steps need to be taken, some action should be taken. This is also very interesting for us. Probably, again, in political communication, for example, no specific course of action may be announced, but the idea that some action is called for is still there, and we are also interested in capturing these.
This information can also be expressed with the help of questions, like rhetorical questions, as here: "Can we accept such treatment?" By asking "can we accept", we say that something should be done so that this treatment does not happen. So we basically call either for those people who perform the unbearable action to change it, to end that treatment, or for ourselves to do something to change it, to get to a bearable situation.

And the last example on this slide is propositions about desirable states, for example "Peace is the only answer." This is, in a way, a call for peace: either we have war and want to change to the state of peace, or peace is already established and let's just maintain it.
Here we also have idiomatic expressions that signal this sort of information, like something being "the only answer", "the only way out" or "the solution to the problem". So hopefully you more or less get what we are trying to catch in the text, and you know what we're looking for. The next thing I want to do is to introduce the data which we are working on and the tools that we are using.

First of all, we use this nice open-source tool, AmCAT, which stands for Amsterdam Content Analysis Toolkit. It was developed in Amsterdam, obviously, by Wouter van Atteveldt and his team, and we adjusted it to our needs. Our version in Jerusalem is called JAmCAT, and we store all our data there. It is open source, so everyone can get the code.

This is the overview of the corpus that we're using. I should mention that we address four communication spheres, namely political communication, strategic communication, news media and social media. What is presented here is a list of news outlets, newspapers basically; then we have social media, strategic communication and parliamentary debates. So this is the corpus that we are working with, which we want to analyze and from which we want to extract our calls for action.

Now, as I said, we want to use machine learning, so we want to train a statistical model, a statistical algorithm, to extract them automatically. The algorithm needs something to learn from: it needs labeled data, a training corpus in which every sentence is annotated as a call for action or not. So we basically queried the same four sorts of documents over the time range January to March 2015, with the search terms "violence" and "conflict". We got all the matching texts, split them into sentences, and each sentence was labeled manually as either calling for action or not. Those that are calling for action were also classified as to what they call for, which will be the second part of the talk.

We currently have 5,000 labeled sentences. The distribution is more or less the same as in the overall corpus. The richest in terms of calls for action is strategic communication, then news media and political communication, and social media, surprisingly, express a smaller number of calls for action. On average, calls for action comprise 30 per cent of all texts, so 30 per cent of our corpus is basically calling for something.

So this is what the algorithm will learn from. But those are just sentences, right? And computers are not very good at understanding words; they need numbers. So those sentences have to be processed in a specific way to become readable for computers. In natural language processing terms, it means that we need to extract specific features, which will be used as learning material for the algorithm to label unseen data. I will walk you through some steps which are, let's say, state of the art in natural language processing in connection with text classification; some of them we did perform, some of them we did not.

The first one is called n-gram extraction: we want to extract n-grams from our text. Probably the best way to explain what n-grams are is to give an example. Say we have an example sentence, "FrOSCon meets science", and we want to extract unigrams, where n equals 1, which we can also call a bag of words. We will basically have three tokens, "FrOSCon", "meets" and "science", so single words will be the unit of analysis. If n equals 2, we will be dealing with bigrams and will have as our units "FrOSCon meets" and "meets science". If n equals 3, then it will be "FrOSCon meets science", so the whole sentence will be treated as one unit of analysis.
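The n-gram extraction just described can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and takes the three-word example sentence to be "FrOSCon meets science"; the function name `extract_ngrams` is hypothetical and is not taken from the project's tooling.

```python
def extract_ngrams(sentence, n):
    """Return all word n-grams of a sentence as strings.

    Assumption: tokens are obtained by simple whitespace splitting,
    which ignores punctuation handling that a real pipeline would need.
    """
    tokens = sentence.split()
    # Slide a window of n tokens over the sentence, one position at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "FrOSCon meets science"
for n in (1, 2, 3):
    print(n, extract_ngrams(sentence, n))
```

With n = 1 this yields the three single words, with n = 2 the two overlapping word pairs, and with n = 3 the whole sentence as a single unit.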
this seemed to be like just just worse right like proper temperature won't get if it's like 1 word or 2 words so we still need to get some squares and so we normally get TF-IDF scores which stand for terms frequency and inverse document frequencies of you probably those released in the formula here but it's like very easy and it's like it will be due on the normal of the on words I want territory was many of them so in 1 and 2 kind of discourse the upon the consists of 2 parts so the left part in there is a lot we need to get are the number of occurrences of each word and divided by the total number of words in the sentence and that'll multiplied to the 2nd part is the number of sentences divided by the number of sentences that that term occurs would take logarithm that we modify the those 2 parts and again this we get the score so all those units of analysis from the previous slides all our n-grams they will be represented as a vector of of course and ideas course why don't deal with just the current since well there are good reasons for that because like somewhat someone informative some words are less informative some top words occur more often in the sentence and they might be less informative sentences can be of different lengths so all those things they are accounted for in this for and 3 common uh memorization steps in nature language processing arts that words removal stemming and lemmatisation oscilloscope autumn all um means that we basically get at all for all x small words that occur very often in text but that don't bear any a semantic information and probably they don't contribute to precipitation those select words as like of particles and their preposition particles of connectors that war and some of the birds so it's usually when the dealing with a large text classification i it's really increase the performance of the 2nd step is uh standing up when we in all the words in our texts to stem so as an example here uh terrorism terrorist and terrorize 
will be reduced to 'terror'. So we get rid of the affixes and keep only the root, which also makes sense, because the core meaning is there in the root, and when we want to decide whether a text belongs to the topic war or the topic sports, that should suffice. The third very common step is lemmatization, which is a softer version of stemming: we don't chop off the affixes, but we bring all the terms to their canonical, normal form. For example, we get rid of tense: 'am', 'being', 'were', all the variations of the verb will be brought to 'be'. For nouns we disregard whether they are singular or plural and just use the singular, so 'document' and 'documents' will both be lemmatized to 'document'.

Again, when we deal with a large-text classification problem this really helps; there are lots of papers and experiments showing that it holds. In our case, as I said, we are classifying sentences, which are very short pieces of text. And if you recall the slides where I was giving you examples of calls for action, you might have noticed that the call-for-action information is sometimes concentrated in words like 'request', but very often it is concentrated in the grammar. So we need this granularity, we need the connections between words. In our case we therefore did not perform any of these steps; when we tried, it degraded the performance. Now I will try to give you a sense of why it was important not to perform them, and why we even want to give more importance to those small, supposedly uninformative words.

Then there are the linguistic features, which we use together with all the previous ones. We were using n-grams, in our case with n between 1 and 4, with TF-IDF scores for all of them. That is our first set of features. On top of that we developed 32 linguistic features. I won't just list them; I will also give you a couple of examples to show why we developed them and why including them actually improved the performance of the classifier. There are a lot of examples here, but you will probably be able to read them somehow. So the examples are: She needs to help the refugees, and: The needs of older refugees
cannot be met. By now you are also almost experts in calls for action, and you will agree with me that the first sentence, 'She needs to help the refugees', is a call for action and the second one is not. But we have the word 'needs' in both cases, and it is very likely that a classifier judging only on the basis of this word will assign the same label to both sentences. Especially if we get rid of those small uninformative words such as articles, it is very likely that they will be classified in the same way. So how can we deal with this problem? The answer is that we need part-of-speech information, so we use a part-of-speech tagger. We use Stanford CoreNLP, which performs not only tagging but also extracts other information, such as constituency and dependency parses. In the sentence that is a call for action, 'needs' is a verb, while in the second sentence it is a noun, and we use this information to disambiguate the two sentences.

Another example, basically the same: 'They request a ceasefire' versus 'The user's request is being processed'. Again we have 'request' in both sentences. In the first one it is a verb and the sentence is a call for action; in the second one it is a noun and the sentence does not call for action. Most likely, having this information, our classifier will get the labels right.

Now let's have a look at a more tricky example: 'Our tanks must be withdrawn' versus 'It must be cold'. We have 'must be' on the left-hand side and on the right-hand side, so even if we use bigrams we will still be dealing with the same entities in both of them, and part of speech won't help. What to do? The answer is that we have to look at the grammatical relationship with the neighbouring words: we have to look at which words our key verbs are connected to. In the first case it is connected to a past participle, 'must be withdrawn'; in the second case it is connected to an adjective. So the features we developed basically capture whether the key verb is connected to a past participle
or whether the key verb is connected to an adjective; so we check the part of speech of the dependent word, basically.

A similar example: 'We must not lower our guard at any time, Prime Minister Manuel Valls told parliament, adding that serious and very high risks remain' versus 'This certainly must not feel so bad'. In this example we have 'must not' in one sentence and 'must not' in the other sentence. In order to disambiguate these two we also have to look at the words themselves, specifically at which words they are connected to. So we have to resort to category-specific lists of words which will signal calls for action.

Another example: 'to call for peace', 'a call for his sister' and 'to call on the phone'. By now you already know what the answer will be. In the first case we are dealing with a verb, in the second case with a noun, even though we have the same combination of word and preposition. So we really need to know the part of speech: if it is a verb, the sentence will be calling for something. In 'to call on the phone', 'call' is still a verb, but it is connected to a different preposition. So again, information on the neighbouring words and the relationships they have with our main word will solve the problem.

'This is the only answer' versus 'The only answer I have is that I simply didn't know'. Here we have three words that are the same in both sentences, so even trigrams won't help us. That is a hard case, and probably not every classifier will be able to assign the labels correctly. The features that are supposed to capture this case again look at the position in the sentence and at whether it is an object or a subject; and if it is an object, is it an object of the verb 'to be', which is called a copula verb, or is it an object of a different verb? All that information
comes in handy, and hopefully our classifier, having all that knowledge, will be able to make the right decisions. But this is already a pretty hard case. Last but not least on this slide: 'The time to stop the killing is now' versus 'It takes a long time to stop killing'. 'Time to' and 'time to': the same expression, different meanings. Again we look at the words that are connected to it. For example, when 'time' is modified by adjectives such as 'long' or 'short', then it is not calling for something [inaudible]; or if 'time' is an object of verbs like 'take' or 'require', then again it is not calling for anything. But
if we have just 'time to', and especially if we have a verb as its complement, or a hint like 'now', then we will very likely classify it as a call for action. So hopefully you are convinced that for our task it is a good idea to keep those small uninformative words, and maybe even give a bit more weight to them.

The last thing I want to say about the features we use: we also used the data origin, that is, where the text comes from, from which communication sphere, as a feature. The intuition behind it is that the language on Twitter, for example, the language in newspapers and the language in parliamentary debates are very different. On Twitter we have a limited length, and people use lots of misspellings, acronyms and abbreviations, whereas in, for example, texts from British parliamentary debates the language is very polite [inaudible], with sentences starting with 'my honourable friend' and so on and so forth. So hopefully our classifier will be able to capture the differences and make use of this information to make decisions.

So we have our set of features; now let's move on to the algorithms. There are numerous of them in machine learning, I don't even know how many, and some of them perform better for specific tasks. We used three. If you ask me why only three: the literature recommended these algorithms as performing best for text classification tasks, and we ran a couple of experiments that supported the papers' results. So we picked three algorithms, naive Bayes, k-nearest neighbours and support vector machines, and I will walk you through the logic of all of them.

Naive Bayes is very simple. Let's imagine that our task is to classify a text as belonging to cars, to sports, or as a detective story. What naive Bayes does is this: it first places a new, unseen piece of text into the class with the most probable label, in this case cars. Then it looks at
the next word, and the next word is 'dark'. OK, if we have 'dark', this feature is very likely to occur in the class of detective stories, so the text probably doesn't belong to cars but to detective stories. Then it looks at the next word in the text, and that is 'football'. OK, probably the text is not about cars or detective stories: with the word 'football' it is very likely that many, many texts about sports will contain this word. So it means that the label sports is the most probable, and eventually the new text will be assigned that label because, given its features, it was the most probable one. That is how naive Bayes makes decisions.

The next algorithm is k-nearest neighbours. It works in the following way: it computes the distance from a new, unseen item to its k nearest neighbours and then decides that this item belongs to the class to which the majority of those neighbours belong. So if the majority belongs to cars, this new one also belongs to cars. The good thing about this classifier is that it can make non-linear decisions. As you might notice in the picture here, when the data is not linearly separable, it can handle those cases well.

Now, support vector machines: this is a linear classifier, so it works very well when our data can be separated by a straight line. It is also called a large-margin classifier, and the intuition behind it is this. If we have data for two classes, you will probably get it from this picture here: we can draw the line separating the data in multiple ways, but it will try to draw it in such a way that the distance between the line and the nearest data items, the margin, is maximized, so that when we have new, unseen data it is very likely that it will fall under the right label, into the right category. Even though the algorithm is defined as linear, somehow it also performs rather well for non-linear problems, and actually this is the
classifier that showed the best performance in our case; it is the one that we are using for our task. Here you can see the scores for the three classifiers with different sets of features: here you have naive Bayes, support vector machines and k-nearest neighbours. The top line is with three groups of features, the middle one with two groups of features, and the last one with only the TF-IDF scores. The best one is the SVM; at its best it reaches an accuracy of around 0.8 in most cases [inaudible]. And it actually performs rather well even without our linguistic features. So one might ask: why develop those complicated features? The answer is here. I am running out of time, so I will just briefly mention it. If you draw the learning curves of the classifier, in this one, where we use the three sets of features, we have a good situation [inaudible]. In this one the two lines are not parallel and the gap between them is very large, which signals high variance: the model cannot generalize very well to new data. Whereas here, when we see the curves going like this, it is a good signal; it means our model does not have that problem. So including the linguistic features actually makes sense.

So we have learned to extract calls for action from text. But as I said, it is also very interesting to know what exactly is called for. It is very important to know whether we want to start a war or whether we want to maintain peace. The classification we developed includes cooperative treatment, which has two subcategories: calling for peace or de-escalation, and calling for all kinds of support:
help, support [partly inaudible]. Then there is restrictive treatment, which has three subcategories: calling for violent solutions, for escalation of conflict; calling for punishment, which can be legal acts like sentencing, sending someone to court or to jail; or exclusion, when we say, OK, let's just not pay any attention to them. Then we have calls for
not doing something [inaudible]. Recall the earlier example: 'We must not lower our guard at any time', Prime Minister Manuel Valls told parliament. So we must not lower, we must not do something. This information can also be expressed with the help of words such as 'condemn'. Then there are general calls for just doing something, or rhetorical questions: non-specific ones, where no particular kind of action is mentioned, just the idea that we need to do something. And in the 'other' category are sentences that call for multiple actions. Normally those are complex sentences with many clauses where each clause calls for something different: one clause calls for de-escalation, another for support, another one for sentencing, and another says, OK, let's just do something. And then there are cases where I wouldn't know which category the sentence belongs to, which is essentially about ambiguity, as in this example: 'The militants who massacre schoolchildren, behead soldiers and attack defence installations have surely committed war crimes and must be dealt with as such.' So it is mentioned that something should be done, but it is very hard to say what exactly, even for a human being. We don't know exactly what is meant. Obviously those people should not be glorified, because they did bad things, but should they be killed, should they be sent to jail, or just something? We don't know.

We also perform classification for this, but here we are not doing that great yet, unfortunately. Here you can see the results for the fine-grained classification [inaudible]. I'll just say that the scores overall are rather low: they are less than 50%. The reasons are that it is natural language, it is ambiguous, and even for human coders it is hard to assign correct labels. And we also don't have enough documents; sometimes there is just a dozen examples. But
even when we merge categories and deal with only four of them, the scores get higher: we have about 0.6 accuracy on average. We are improving, and hopefully we will get good results one day. One of the things we are aiming at is to get more data, which hopefully will help us in this classification task.

So, to wrap up. Christian started with the words that there is a gap between what we can do from the technological point of view and what we need, what questions we have to answer. In many instances, many things can already be done. The problem is that many of the things we are interested in are hidden in the language; they are very ambiguous, and [inaudible] can help us there. That is what we tried to do here, and we reached a certain set of results. We explained how we extracted calls for action, which is very pragmatic, semantic, deep language information. So there are certain achievements, but there are obviously many more things to do and room to improve.

Can calls for action be used in other areas, not only in communication science? Sure. For example, we can identify user requests; it is the same thing, the same formulations, the same words. We can generate automatic to-do lists. Imagine you come back from vacation, you have hundreds of e-mails, you don't have time to read all of them, and you are afraid to miss important tasks or meetings. A to-do list is generated and you already know where to go and what time you have meetings. We can use it for review analysis: for example, if we have a review of a hotel and the guest suggested improving something, the managers can easily see this information and act upon it. Moreover, we can use it in, for example, medical documents: when doctors have a large
number of documents to process, they can just parse them and extract what treatment was recommended for the same symptoms [inaudible].

What else can be done to answer our questions in communication science? We can combine different NLP tools. For example, if we combine semantic role labelling with lexical resources such as FrameNet or WordNet, we can find evidential claims, claims about the truth about a specific object or entity, and we can identify interpretative frames. Once we have our calls for action, evidential claims and frames, we could go on to find causal chains: again using all the above-mentioned tools, we could find what caused what and build those causal chains of things. Of course that is not everything, just a couple of ideas. You may have your own, which could also be very interesting and valuable and which we should think about. So that was it. Thank you very much for your attention.
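As a recap of the feature representation described in the talk, the TF-IDF weighting can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: the function names `ngrams` and `tfidf_vectors` are made up for this sketch, tokenization is a plain whitespace split, and the term frequency of an n-gram is divided by the number of n-grams in the sentence (the talk states the formula for single words). Library implementations typically use smoothed or normalized variants of the same idea. The two example sentences are the ones quoted in the talk.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=4):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_vectors(sentences, n_min=1, n_max=4):
    """Score every n-gram of every sentence with tf * idf."""
    # No stop word removal, stemming or lemmatization: small words are kept.
    docs = [ngrams(s.lower().split(), n_min, n_max) for s in sentences]
    n_docs = len(docs)
    # Number of sentences in which each n-gram occurs at least once.
    doc_freq = Counter(g for doc in docs for g in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            # tf = occurrences / items in the sentence; idf = log(N / df)
            g: (c / total) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        })
    return vectors


sentences = [
    "she needs to help the refugees",
    "the needs of older refugees cannot be met",
]
vectors = tfidf_vectors(sentences)
print(vectors[0]["needs to"])  # bigram found only in the first sentence: score > 0
print(vectors[0]["needs"])     # found in both sentences: idf = log(2/2), score 0.0
```

The unigram 'needs' gets a score of zero because it appears in both sentences, while the bigram 'needs to' separates them, which is one way to see why n-grams and small function words were kept.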
[Audience question, largely inaudible.]

Yes, so I'll repeat the question. The question is whether this technology, or a similar technology, can support real-life applications, and the example was research run at one of the medical universities in Germany, where there are lots of texts that are currently analysed and annotated manually, which is very expensive and time-consuming; so, whether we can use this technology there. Yes, I think it is very possible, especially since, as you said, there are already lots of texts that are manually annotated and analysed. Basically, the algorithms I presented can be used out of the box. The main problem, the case when they cannot be used, is when we don't have labelled data, but that is not the case here. Of course you have to spend some time on training the algorithms and adjusting them, but in principle it is very possible, and I think we can get very good results.

I think the trick probably is that the technology can take you part of the way. For instance, extracting passages that we know contain relevant information on what needs to be done, for instance medical recommendations, is something that we can do with a certain, fairly high level of confidence. We will miss some, but we will catch most, and we will
reduce the time needed to find those instances quite a lot. Then it is another question whether you also want to use the technology for the second step, that is, reading those fragments and deciding what exactly needs to be done, because there might be a point in having the additional nuance and precision that you can get with a human reader. But, you know, we are working on that, and basically it works the better the fewer ambiguous cases of language use there are in distinguishing different kinds of recommendations. The thing with war and conflict discourse is that things are awfully ambiguous, because people strategically try to hide what exactly they are calling for. So if you have texts that don't try to do that, performance might be quite a bit better.

[Audience question, inaudible.]

The question was where the linguistic feature extraction is implemented. As I mentioned, the big environment for all of our development is the JAmCAT server, which is Python-based. For the machine learning, specifically for the training, we are using a Python library, scikit-learn. The linguistic features are a separate module, which was developed by us, using Stanford CoreNLP to extract all those features. So we have them and we just add them to our feature matrix, basically at the stage when we extract the TF-IDF features as well.

[Audience question, inaudible.]

So the question is whether a training corpus of five thousand sentences is big enough for the algorithm to learn. Well, not really; of course, the more data we have, the better. For the task of just disambiguating between call for action or not, it is fine: we have around 80% accuracy, which is a good number. For the more fine-grained classification, as you probably saw, it is not enough. Obviously, an increase of data
can improve it, because we started with a corpus of about two thousand sentences and the scores were lower; now we have five thousand and the scores got higher, and hopefully soon we will have an even bigger corpus, as we have some human coders working on annotating it. So yes, I mean, it is enough, but the more data we get, the better results we will get.
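The answer that follows reads precision, recall and F1 off a slide that is not reproduced in this transcript. As a reminder of how the three measures relate, here is a minimal Python sketch. The labels and counts are made-up toy data, not the project's results, and the function name `precision_recall_f1` is hypothetical. The toy classifier is deliberately cautious: it labels only one very clear sentence as a call for action.

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up toy data: 4 calls for action ("call") and 6 other sentences ("other").
gold = ["call"] * 4 + ["other"] * 6
# A cautious classifier that labels only the first, very clear case as a call.
predicted = ["call"] + ["other"] * 9

call_scores = precision_recall_f1(gold, predicted, "call")
other_scores = precision_recall_f1(gold, predicted, "other")
print("call: ", call_scores)
print("other:", other_scores)
```

In this toy case the 'call' class has perfect precision but poor recall, because the single predicted call is right while three real calls are missed. That is the pattern discussed in the answer below: a high precision score on its own can hide how many instances were never caught.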
Right, so the request was to bring back this slide, and you are interested, basically, in the test and training split. So the question is how many sentences, which part of the corpus, was used for training and which part for testing. For this experiment it is 20% for testing and 80% for training.

[Audience question, inaudible.]

OK, so apparently the question is to elaborate more on the graphs. What you can see is that we have three bars: the blue one is the score for the calls-for-action classification, the red one is for not calls for action, and then the average. Then we have precision, recall and F1 score, which are pretty standard measurements for this. Let's just look at the top one, the one on the top right. You see the scores: the precision for calls for action is actually 1, the precision for not calls for action is somewhere around 0.7, and the average is close to 0.8, while the recall for calls for action is [inaudible], the recall for not calls for action is 1, and the average is [inaudible]. So these are just the scores that show how well or how badly the classifier performs.

[Audience question, inaudible.]

So the question is whether we can decide in favour of different classifiers depending on which score is more important for us. In principle, yes. If for some specific task we are only interested in calls for action and we really want to have high precision on those, then of course we could use the classifier that performs better for this specific task. If we don't care about calls for action and are only interested in recall for not calls for action, naive Bayes would be the best. But in our case it was interesting to get the highest scores for all the tasks, for all the categories, so that is why for our experiments we basically chose the SVM.

I guess the challenge here is that if you rely only on the precision score: here the precision is high because you didn't catch the remaining ones, and you can see that the recall is bad for the calls for action. That means that there were a few calls for action that were so clear that naive Bayes did catch them, and they were right, but it didn't catch many. So maybe it is not the best strategy. The SVM performed clearly better on the whole.

[Audience question, inaudible.]

So the question is whether I can give some examples that are hard to classify. Well, one would think that those ambiguous examples would be hard for the classifier. That is a good question, because when I was preparing the presentation I looked into some misclassified examples, and in most of them the misclassification was caused by imprecisions in the linguistic features. So when Stanford CoreNLP assigned the part-of-speech tag wrongly, or some of the dependency relations were not
caught, very often those sentences were misclassified. And when I went manually into my files and changed the values for those features, the misclassification was gone. But for many of the examples I really could not understand for what reason the classifier decided in favour of the wrong label. For example, 'must' is obviously there, and the sentence is obviously calling for something, but the decision was made that it was not calling for action. So this is something that I will definitely dig deeper into.

[Audience question, inaudible.]

The question is whether the tool so far works only in English, considering that we have texts in quite a bunch of languages. That is correct. Basically the entire project, INFOCORE, works in eight languages, and we have a huge dictionary that tries to extract different kinds of concepts that can be mentioned, in eight languages, which is a lot of fun. This
tool in particular has been developed in English so far. Most of the things that it uses are in principle available for other languages too. If we have proper toolkits for assigning part of speech and for extracting grammar information, there is no particular reason why we can't run this in Arabic too. We have not done that yet, but in principle the technology is one that should be transferable, to the extent that the languages you are dealing with have these features, the structures that we are looking at. There is always an adjustment needed, obviously. If you have a language that, for instance, doesn't use separate words for connectors or for the definite article, as for instance Hebrew, where you have prefixation as a solution for many things, then you need to adjust the way this is done. But in principle there is no particular reason why this needs to be restricted to English.

[Audience question, largely inaudible.]

So the question is how the technology deals with ambiguity, things like referentiality, where you basically externalize parts of the information to something people are supposed to know. Two things. I guess it depends partly on how exactly this is done. If you have, in the training corpus, a lot of cases that work roughly like that, then the classifier should have a pretty good chance of finding it. The other thing refers to another project that we are currently trying, which is not yet in a presentable state, where the idea is to look at the history of the same discourse. Basically the idea is, you know, to take for instance the example that those people are war
criminals and need to be treated as such. If you have, in the history of the discourse, texts that said war criminals need to be treated in whatever is seen as the locally appropriate way, hanging, shooting, imprisoning, or other, different ways, pardoning, and you have this kind of information from the history of the discourse, then you can fill this in. But this is obviously a much more complex procedure that goes far beyond trigrams. This is something that we have on the agenda; there is work in that area, but it is far from being at a stage where we can present that this actually works.

Just very briefly, to add to that: we deal with the sentence as the unit of analysis, so basically we don't look beyond the sentence, which is a downside, because we miss the context. In principle there are tools in natural language processing, like anaphora resolution for example, which can help us identify what a pronoun refers to. Currently we are not using them, but of course it would be cool to add them in the next step.

Then of course note that every additional tool that you plug in multiplies errors. You have the errors of Stanford CoreNLP, then you have the errors of the anaphora resolution, and if you multiply the precision rates of all of these individual tools, you quickly get to an overall precision rate of 0.5, and then it starts being useless. So there are some things we can do by combining these tools, but the price we pay for that is that we depend on the joint precision. The tools are not perfect, and adding one has a price.

[Audience comment, partly inaudible.]

So it was a comment that there is a lot of work on this in the digital humanities [inaudible], both philosophical and [inaudible], developing the perspective of how one can try and find all these things in texts:
dealing with irony, dealing with implicature, dealing with figurative speech, and there are also quite a lot of tools, so there are lots of them.

There seem to be no more acute questions, so I guess we thank you very much for being here and for the discussion; it has been a very big pleasure. If you have any further questions later, this is all online, INFOCORE is online, we have e-mail addresses, and we are happy about questions, ideas, suggestions and so on. So thanks a lot.
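One point from the discussion, that every tool added to a pipeline multiplies the errors, can be made concrete with a little arithmetic. The Python sketch below assumes that the tools' errors are independent, which is a simplification, and it uses made-up precision values chosen only for illustration; none of these numbers comes from the project. The function name `chained_precision` is hypothetical.

```python
from math import prod


def chained_precision(precisions):
    """Overall precision of a pipeline whose steps err independently."""
    return prod(precisions)


# Hypothetical per-tool precision values, for illustration only.
pipeline = {
    "part-of-speech tagging": 0.90,
    "dependency parsing": 0.85,
    "anaphora resolution": 0.70,
    "classification": 0.80,
}

overall = chained_precision(pipeline.values())
print(f"overall precision: {overall:.2f}")
```

Even though every single step in this toy pipeline looks acceptable on its own, the product falls below 0.5, which is the range described in the discussion as the point where the result starts being useless.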