Split-Correctness in Information Extraction
Formal Metadata
Title 
Split-Correctness in Information Extraction

License 
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Release Date 
2019

Language 
English

Content Metadata
Abstract 
Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose a new formal framework for split-correctness within the formalism of document spanners. Our preliminary analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.

00:04
so this talk is joint work
00:07
together with Benny Kimelfeld and Yoav Nahshon
00:09
from the Technion, Wim Martens from Bayreuth, and Frank Neven from Hasselt University, and it is on split-correctness
00:17
in information extraction. So what is the setting? Assume you
00:22
have a huge text document, for instance coming from server logs or something like that, and the task is to
00:30
convert it into a structured relational format. To do this, you
00:38
could naively run an information extractor on top of the whole document. However, if you have many servers available but only a few documents to analyze, this is really inefficient. So what you could do instead is first split the document into smaller sub-documents. This could be, for instance, splitting a text document into paragraphs or sentences, or, if you have an error log, splitting it into single error messages or something like that. Then you run an information extractor on each of the sub-documents, and in the end you basically join all the results and combine them into one overall result. By joining we really mean that you basically take the union. And the question we ask ourselves is: in which cases is this correct, so in which cases can you do this in general?
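The split-and-union evaluation just described can be sketched as follows. The log format, the `extract` function, and the line-based splitter are hypothetical stand-ins, and the map over the splits is sequential here, whereas a real system would distribute it across servers.

```python
import re

# Hypothetical extractor: maps a document to a relation (a set of tuples).
# Here it pulls (severity, message) pairs out of a server log.
def extract(doc):
    return set(re.findall(r"(ERROR|WARN): ([^\n]+)", doc))

# Hypothetical splitter: maps a document to a list of sub-documents
# (here, one sub-document per log line).
def split_into_lines(doc):
    return doc.split("\n")

def naive_extraction(doc):
    # Run the extractor once on the whole document.
    return extract(doc)

def split_extraction(doc):
    # Run the extractor on every split and take the union of the results.
    # In a real system, this map would run in parallel on many servers.
    return set().union(set(), *(extract(sub) for sub in split_into_lines(doc)))
```

For this particular extractor the two evaluations agree, because a single (severity, message) tuple never crosses a line boundary; split-correctness asks exactly when such agreement holds for every input document.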
01:46
More formally, an information extractor here is abstracted as a function which is given a document and returns a relation, and a splitter is basically a function which takes a document as input and splits it into smaller sub-documents. Evaluating an extractor on a splitter then means first splitting the document into smaller sub-documents, then running the extractor in parallel on the sub-documents, and in the end combining the results by taking the union to get the result table.

As I mentioned, the central question we want to answer is whether this naive extraction is the same as the parallel extraction, and here we have multiple parameters. The first one is what we call the global extractor, which is the function running in the naive extraction. The second one is the local extractor, which is the function running in the parallel evaluation; notice that in general we allow these to be different. They might in some cases be the same, but they can also be different. The next parameter is the splitter, which takes on the task of splitting up the document, like I said before, into n-grams, into sentences, or something like that. And last but not least, there is the combining operator, but this is basically just the union operator. So the parameters of our decision problems are always the local extractor, the global extractor, and the splitter.

To formalize this, we defined a few different decision problems. The first one is the so-called split-correctness problem. Here we are given all the parameters, that is, the global extractor, the local extractor, and the splitter, and the question is whether for every input document you get the same result if you run the global extractor on the whole document compared with running the local extractor on the splits of the document. The second problem is so-called self-splittability; here the local extractor and the global extractor are the same. And the third problem, which is probably the most interesting one for practitioners, is the so-called splittability problem. Here we are just given the global extractor, and the question is whether there exists some local extractor which performs the correct computation on the chunks of the document. The motivation behind this problem is a setting where the user doesn't even know that the back end might have parallelized evaluation implemented. Basically, the user just defines the extractor he wants to define, and in the back end there are some ways of splitting the document, which might already be materialized or not, and you ask whether you can use one of these splitters and compute a local extractor which can then parallelize the whole extraction.
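On a concrete document, the equation behind these decision problems can be tested directly. This is only a per-document check, not a decision procedure, since the actual problems quantify over all documents; the extractors and splitter below are illustrative stand-ins.

```python
def split_correct_on(doc, global_ex, local_ex, splitter):
    # Split-correctness on one document: does running the global extractor
    # on the whole document give the same relation as the union of the
    # local extractor's results over the splits?
    whole = global_ex(doc)
    parts = set().union(set(), *(local_ex(sub) for sub in splitter(doc)))
    return whole == parts

def self_splittable_on(doc, extractor, splitter):
    # Self-splittability: local and global extractor coincide.
    return split_correct_on(doc, extractor, extractor, splitter)

# Illustrative extractors over whitespace-tokenized text.
words = lambda doc: {(w,) for w in doc.split()}               # one tuple per word
bigrams = lambda doc: set(zip(doc.split(), doc.split()[1:]))  # adjacent word pairs
tokens = lambda doc: doc.split()                              # word splitter
```

Here `words` passes the check for the word splitter `tokens`, while `bigrams` fails it, because every bigram crosses a split boundary.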
05:28
Before I come to more details about the results, we have to talk about what we considered in our analysis, and these are so-called regular document spanners. For the people who are not into the whole spanner framework: a regular document spanner is basically a finite-state automaton with variable operations. For instance, in the sentence extractor here you can see the normal edges, the black ones, which are the edges of an ordinary finite-state automaton, and then you also have these variable edges. An alternative but equivalent definition is that they are basically regular expressions with capture variables, closed under union, projection, and join. A splitter, on the other hand, is basically a unary regular document spanner, so each split of a document is one part which is selected by such a splitter via the use of a variable.

One important property of splitters is so-called disjointness: a splitter is disjoint if every part of the document is in at most one output document. A disjoint splitter is, for instance, something like sentence segmentation, so basically splitting the whole document into single sentences, or splitting it into paragraphs or something like that; as you can see, each part of the input document is in at most one of the splits. A non-disjoint splitter is something like n-grams, or pairs of consecutive sentences; here it can happen that a single part of the document is captured by multiple parts of the split.

Coming to the main results for self-splittability and split-correctness: these are the two problems where all the parameters are given. In the case of self-splittability we use the same extractor for the local and the global extraction, and in the case of split-correctness we have two different extractors running in the naive global extraction and in the parallel local extraction. Here we showed that both self-splittability and split-correctness for regular document spanners are PSPACE-complete. However, if we only allow disjoint splitters and the document spanners are in normal form, then we have a polynomial-time algorithm. For those of you who are into document spanners: the normal form basically demands that the automata are functional. One important takeaway here is that deciding both of these problems basically boils down to checking equivalence of vset-automata. Then there is the splittability problem; this is the one where you are just given the global extractor and the question is whether there exists some local extractor which does exactly what we want. This is also PSPACE-complete, and here the important part is to show the upper bound. For the PSPACE-completeness we already need disjoint splitters; if the splitters are not disjoint, our methods don't deliver, and the PSPACE-completeness result does not hold.

However, those of you who are into information extraction also know that in practice we don't use regular document spanners all that often; sometimes we do, but sometimes you also use natural language processing or something like that. So we also extended our framework to so-called split-constrained black boxes, where we use information extractors which are given as black boxes. For instance, you could use a coreference resolver, a sentiment extractor, or something like that, and from the implementation or the design of these you might already know that, say, the coreference resolver is splittable by paragraphs, or that the sentiment extractor never looks at more than one single sentence, so it must by definition be splittable by sentences. So we defined black-box split-correctness. There you are given (I simplified this for the talk) two extractors, so two black boxes; note that we don't even know how these are implemented, they are black boxes for us.
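The distinction between disjoint and non-disjoint splitters can be illustrated like this. The sentence regex and the length-based sanity check are simplifications: a faithful disjointness check would compare the actual character spans the splitter selects.

```python
import re

def sentence_splitter(doc):
    # Disjoint: each character of the document ends up in at most one split.
    return [s for s in re.split(r"(?<=[.!?])\s+", doc) if s]

def sentence_pair_splitter(doc):
    # Non-disjoint (like n-grams): pairs of consecutive sentences, so every
    # inner sentence appears in two output documents.
    sents = sentence_splitter(doc)
    return [a + " " + b for a, b in zip(sents, sents[1:])] or sents

def looks_disjoint_on(doc, splitter):
    # Crude necessary condition: if no position is covered twice, the splits
    # cannot be longer in total than the document itself.
    return sum(len(s) for s in splitter(doc)) <= len(doc)
```

On a three-sentence document, the sentence splitter passes this check while the consecutive-pairs splitter fails it, since the middle sentence is emitted twice.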
In addition, we are given a set of split constraints. These split constraints are basically just statements like: the first extractor is splittable by sentences, or the second one is splittable by paragraphs. The question now is whether the join of the two is splittable by a given splitter under these constraints. We only have some preliminary results here. One of them is that there are cases in which the join of both is splittable by sentences even though the first extractor on its own is not. The intuition here is that if, say, the first extractor is not splittable by sentences, you might lose tuples if you just run it on the smaller chunks; however, if all the tuples you lose are not used for the join, so they are projected away in the join, then the join can still be splittable even though the extractor itself is not. However, we also have a negative result, which is that there are extractors which are both splittable by the same splitter, but whose join is not splittable. The idea here is that you can have the case where the two extractors need different parts of the document; if these parts overlap, a single split might not contain all the information you need to compute both extractors.

So, for those of you who maybe just came in later, the main question: why should you care about this? First of all, the main motivation of the whole framework is to enable parallelization of information extraction on huge documents. In the setting where you have just a few huge documents and a huge number of servers, if you analyze them naively you would have, say, five documents and five servers running, and the rest of the servers can't do anything; so you want to parallelize. That's the main motivation. However, we also have some other motivations. First of all, you could use this in a setting where you have a huge number of documents, for instance when analyzing Amazon reviews or Twitter data. For Twitter this is not really the case, but for, say, Amazon data and other document collections it sometimes happens that some of these documents are really small while others are huge; by splitting the documents with our framework and parallelizing, you can reduce this skew by providing more, but similarly sized, tasks. The second motivation is debugging, but I'm not going to go into detail here for time reasons; if someone is interested, I can tell you about it offline. And the third is handling updates: for instance, if you analyze, say, a Wikipedia page and there is a small change in the second section, you only have to recompute the extraction on the splits which contain the changed parts, so you don't have to redo the whole extraction but can just recompute the local parts.

Last but not least, future work. One point is the complexity of splittability for non-disjoint splitters; here we don't have any results yet, so that would be really interesting. The second is to better understand the splittability of split-constrained black boxes. And the third one is an empirical analysis of the framework, to see which cases occur in practice and where there might be room for optimization. With that said, thanks for your attention.
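The update scenario mentioned at the end can be sketched as a per-split cache: assuming the pipeline is split-correct for the chosen splitter, only the edited splits need to be re-extracted. The extractor, splitter, and call counter below are illustrative.

```python
class IncrementalExtraction:
    def __init__(self, extractor, splitter):
        self.extractor = extractor
        self.splitter = splitter
        self.cache = {}          # sub-document text -> extracted relation
        self.extractor_calls = 0

    def _extract(self, sub):
        # Re-run the extractor only for splits we have not seen before.
        if sub not in self.cache:
            self.extractor_calls += 1
            self.cache[sub] = self.extractor(sub)
        return self.cache[sub]

    def run(self, doc):
        # Union of per-split results; equals the global extraction whenever
        # split-correctness holds for this extractor and splitter.
        return set().union(set(), *(self._extract(s) for s in self.splitter(doc)))
```

After an edit that changes a single line, only that line triggers a new extractor call; all unchanged splits are served from the cache.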