
Split-Correctness in Information Extraction

Video in TIB AV-Portal: Split-Correctness in Information Extraction

Formal Metadata

Title: Split-Correctness in Information Extraction
Title of Series
License: CC Attribution 3.0 Germany
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date

Content Metadata

Subject Area
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our preliminary analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.
So, this talk is joint work with Benny Kimelfeld and Yoav Nahshon from the Technion, Wim Martens from Bayreuth, and Frank Neven from Hasselt University, and it is about split-correctness in information extraction.

So what is the setting? Assume you have a huge text document, for instance coming from server logs or something like that, and the task is to convert it into a structured, relational format. To do this, you could naively run an information extractor on top of the whole document. However, if you have many servers available but just a few documents to analyze, this is really inefficient. So what you could do instead is first split the document into smaller sub-documents; this could be, for instance, splitting a text document into paragraphs or sentences, or, if you have an error log, splitting it into single error messages. Then you run the information extractor on each of the sub-documents, and in the end you basically join all the results together to get the overall result. By joining I really mean that you take the union. The question we ask ourselves is: in which cases are these two evaluations equal, so in which cases can you do this?
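The split-then-extract-then-union pipeline just described can be sketched in a few lines of Python. This is a toy illustration, not the paper's formalism: the splitter, the extractor, and the sample log are all invented for the example.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def split_sentences(doc):
    # Hypothetical toy splitter: cut the document at '.' boundaries.
    return [s.strip() + "." for s in doc.split(".") if s.strip()]

def extract_error_codes(doc):
    # Hypothetical toy extractor: a unary relation of error codes like E101.
    return {(code,) for code in re.findall(r"E\d{3}", doc)}

def naive_eval(extractor, doc):
    # Global (naive) evaluation: run the extractor on the whole document.
    return extractor(doc)

def parallel_eval(extractor, splitter, doc):
    # Split, run the extractor on every sub-document in parallel,
    # then combine the per-split relations by taking the union.
    with ThreadPoolExecutor() as pool:
        per_split = pool.map(extractor, splitter(doc))
    result = set()
    for rel in per_split:
        result |= rel
    return result

log = "Server restarted after E101. Then E202 occurred. All fine now."
assert naive_eval(extract_error_codes, log) == \
       parallel_eval(extract_error_codes, split_sentences, log)
```

Here the two evaluations agree because an error code never crosses a sentence boundary; the talk's question is exactly when such agreement is guaranteed for every input document.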
More formally, an information extractor here is abstracted as a function which is given a document and returns a relation, and a splitter is basically a function which takes a document as input and splits it into smaller sub-documents. Evaluating an extractor on a splitter then means first splitting the document into smaller sub-documents, then running the extractor in parallel on those sub-documents, and in the end combining the results by taking the union to get the result table.

As I mentioned, the central question we want to answer is whether this naive extraction is the same as the parallel extraction, and here we have multiple parameters. The first one is what we call the global extractor, which is the function running in the naive extraction. The second one is the local extractor, which is the function running in the parallel environment. Notice that in general we allow these to be different; in some cases they may be the same, but they can also differ. The next parameter is the splitter, which takes on the task of splitting up the document, as I said before, into n-grams, sentences, or something like that. Last but not least, there is the combining operator, but this is basically just the union operator. So the parameters of our decision problems are always the local extractor, the global extractor, and the splitter.

To formalize this, we define a few different decision problems. The first one is so-called split-correctness. Here we are given all the parameters, that is, the global extractor, the local extractor, and the splitter, and the question is whether for every input document you get the same result if you run the global extractor on the whole document compared with running the local extractor on the splits of the document. The second problem is so-called self-splittability; here the local extractor and the global extractor are the same. The third problem, which is probably the most interesting one for practitioners, is the so-called splittability problem. Here we are only given the global extractor and the splitter, and the question is whether there exists some local extractor which does the correct computation on the chunks of the document. The motivation behind this problem is a setting where the user does not even know that the back end might have partitioning and parallelized evaluation implemented: the user just defines the extractor he wants, and in the back end there are some ways of splitting the document, which might already be materialized or not, and you ask whether you can use one of these splitters and compute a local extractor which then parallelizes the whole extraction. Before I come to more details about the results, I first have to say what we consider in our analysis.
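To make the split-correctness equation concrete, here is a toy Python check on a finite set of sample documents. Note the hedge built into the code: a mismatch refutes split-correctness, but agreement on samples proves nothing, since the real problem quantifies over all documents and the paper decides it on the automata instead. All names and the sample document are invented for illustration.

```python
import re

def sentence_splitter(doc):
    # Toy disjoint splitter: split at '.' boundaries.
    return [s.strip() for s in doc.split(".") if s.strip()]

def word_extractor(doc):
    # Unary relation of all words in the document.
    return {(w,) for w in re.findall(r"\w+", doc)}

def word_pair_extractor(doc):
    # Binary relation of consecutive word pairs; a pair can cross a
    # sentence boundary, so this extractor is not sentence-splittable.
    ws = re.findall(r"\w+", doc)
    return {(a, b) for a, b in zip(ws, ws[1:])}

def refute_split_correctness(docs, global_ext, local_ext, splitter):
    # Test the equation "global on whole doc == union of local on splits"
    # on each sample document; return a witness document on mismatch.
    for doc in docs:
        parallel = set()
        for sub in splitter(doc):
            parallel |= local_ext(sub)
        if global_ext(doc) != parallel:
            return doc  # witness: not split-correct
    return None  # consistent with split-correctness on these samples

samples = ["One two. Three four."]
```

On these samples, `word_extractor` passes the check, while `word_pair_extractor` is refuted because the pair ("two", "Three") crosses the sentence boundary.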
In our analysis we consider regular document spanners. A regular document spanner, for those of you who are not into the document spanner framework, is basically a finite-state automaton with variable operations. For instance, in the sentence extractor here you can see the normal edges, the black ones, which are the edges of an ordinary finite-state automaton, and then you also have these variable edges. An alternative but equivalent definition is that they are regular expressions with capture variables, closed under union, projection, and join. A splitter, on the other hand, is a unary regular document spanner, so each split of a document is one part which is selected by such a splitter through its variable.

One important property of splitters is so-called disjointness: a splitter is disjoint if every part of the document is in at most one output document. Typical disjoint splitters are things like sentence segmentation, so splitting the whole document into single sentences, or splitting into paragraphs; as you can see here, each part of the input document ends up in at most one of the splits. A non-disjoint splitter is something like n-grams, or pairs of consecutive sentences: here it can happen that a single part of the document is captured by multiple parts of the split.

Coming to the main results. Self-splittability and split-correctness are the two problems for which all the parameters are given: in the case of self-splittability we use the same extractor for the local and the global extraction, and in the case of split-correctness we have two different extractors, one running in the naive global extraction and one in the parallel local extraction. Here we showed that both self-splittability and split-correctness for regular document spanners are PSPACE-complete. However, if we only allow disjoint splitters and the document spanners are in a normal form, then we have a polynomial-time algorithm; for those of you who are into document spanners, the normal form basically demands that the automata are functional. One important takeaway is that deciding both of these problems boils down to checking equivalence of vset-automata. For the splittability problem, the one where you are only given the global extractor and the splitter and ask whether there exists some local extractor which does exactly what we want, this is also PSPACE-complete, and here the important part is to show the upper bound. For the PSPACE-completeness we already need disjoint splitters; if the splitters are not disjoint, our methods do not carry over and the PSPACE-completeness result does not hold.

However, those of you who are into information extraction also know that in practice we do not only use regular document spanners; sometimes you also use natural language processing or something like that. So we extended our framework to so-called split-constrained black boxes, where the information extractors are given as black boxes. For instance, you could use a coreference resolver, a sentiment extractor, or something like that, and from the implementation or the design of these you might already know that, say, the coreference resolver is splittable by paragraphs, or that the sentiment extractor never looks at more than one single sentence and so must by definition be splittable by sentences. So we defined black-box split-correctness: there you are given, simplified for this talk, two extractors. Note that we do not even know how these are implemented; they are black boxes to us.
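The disjointness property just described can be illustrated with span-based splitters in Python. This is a sketch under invented names: splits are represented as character spans (start, end), and disjointness is checked on one concrete document, not decided in general.

```python
def sentence_spans(doc):
    # Disjoint splitter: one character span per '.'-terminated sentence.
    spans, start = [], 0
    for i, ch in enumerate(doc):
        if ch == ".":
            spans.append((start, i + 1))
            start = i + 1
    return spans

def sentence_bigram_spans(doc):
    # Non-disjoint splitter: each span covers two consecutive sentences,
    # so adjacent output documents overlap.
    single = sentence_spans(doc)
    return [(a, d) for (a, _), (_, d) in zip(single, single[1:])]

def is_disjoint_on(doc, splitter):
    # Disjointness, checked on one document: every position of the
    # document lies in at most one output span.
    covered = [0] * len(doc)
    for start, end in splitter(doc):
        for pos in range(start, end):
            covered[pos] += 1
    return all(c <= 1 for c in covered)

doc = "A b. C d. E f."
```

On this document, the sentence splitter is disjoint while the sentence-bigram splitter is not, matching the n-gram example from the talk.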
In addition to the two black-box extractors, we are given a set of split constraints. These split constraints are statements like: the first extractor is splittable by sentences, or the second one is splittable by paragraphs. The question now is whether the join of the two is splittable by a given splitter under these constraints. We only have some preliminary results here. One of them is that there are cases in which the join of both extractors is splittable by sentences even though the first one on its own is not. The insight here is that if, say, the first extractor is not splittable by sentences, you might lose information if you just run it on the smaller chunks; however, all the tuples you lose are not used for the join, they are projected away, so the join can still be splittable even though one of the extractors is not. However, we also have a negative result: there are extractors which are both splittable by the same splitter, but whose join is not splittable. The idea here is that you can have the case where the two extractors need different parts of the document, so if these parts do not end up in the same split, you might not have all the information you need to compute the join.

So, for those of you who only joined later, the main question: why should you care about this? First of all, the main motivation of the whole framework was to enable parallelization of information extraction on huge documents. Say you have just a few huge documents and a huge number of servers: if you analyzed them naively, you would have, say, 5 documents and 5 servers running, and the rest of the servers could not do anything, so you want to parallelize; that is the main motivation. However, we also had some other motivations.
First of all, you could use this in a setting where you have a huge number of documents, for instance when analyzing Amazon reviews or Twitter data or something like that. For Twitter this is not really the case, but for, say, Amazon data and other documents it sometimes happens that some of the documents are really small while others are huge; by parallelizing and splitting the documents with our framework, you can even this out and provide more similarly sized tasks. The second motivation is debugging, but I am not going to go into detail here for time reasons; if anyone is interested, I can tell you offline about that. The third is handling updates: for instance, if you analyze, say, Wikipedia and there is a small change in, say, the second section of an article, you only have to recompute the extraction on the splits which contain these parts; you do not have to recompute the whole extraction, just the local results.

So, last but not least, future work. One point is the complexity of splittability for non-disjoint splitters; here we do not really have any result yet, so that is a really interesting question. The second is to better understand splittability of split-constrained black boxes, and the third is an empirical analysis of the framework, to better see which cases occur in practice and where there might be room for optimization. That said, thanks for your attention.
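The incremental-update application mentioned in the talk can be sketched as a per-split cache. This is an illustrative sketch, not the paper's algorithm: the paragraph splitter, the link extractor, and the class are all invented, and the approach is sound exactly when the extractor is split-correct with respect to the splitter.

```python
import re

def paragraph_splitter(doc):
    # Toy disjoint splitter: blank-line-separated paragraphs.
    return [p for p in doc.split("\n\n") if p]

def link_extractor(paragraph):
    # Toy extractor: wiki-style [[links]] inside a single paragraph.
    return {(m,) for m in re.findall(r"\[\[(.*?)\]\]", paragraph)}

class IncrementalExtraction:
    # Cache the result relation of each split; after an edit, only splits
    # whose text changed are re-extracted, and the union is rebuilt.
    def __init__(self, splitter, extractor):
        self.splitter, self.extractor = splitter, extractor
        self.cache = {}

    def evaluate(self, doc):
        fresh = {}
        for split in self.splitter(doc):
            if split in self.cache:
                fresh[split] = self.cache[split]       # unchanged: reuse
            else:
                fresh[split] = self.extractor(split)   # changed: recompute
        self.cache = fresh
        result = set()
        for rel in self.cache.values():
            result |= rel
        return result
```

On a second evaluation after a small edit, only the edited paragraph is passed to the extractor again; all untouched paragraphs hit the cache.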