Automatically Generating Interesting Facts from Wikipedia Tables

Video in TIB AV-Portal: Automatically Generating Interesting Facts from Wikipedia Tables

Formal Metadata

Title
Automatically Generating Interesting Facts from Wikipedia Tables
Title of Series
Author
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2019
Language
English

Content Metadata

Subject Area
Abstract
Modern search engines provide contextual information surrounding query entities beyond the ten blue links in the form of information cards. Among the various attributes displayed about entities, there has been recent interest in providing fun facts. Obtaining such trivia at a large scale is, however, non-trivial: hiring professional content creators is expensive, and extracting statements from the Web is prone to yielding uninteresting, out-of-context, and/or unreliable facts. In this paper we show how fun facts can be mined from superlative tables in Wikipedia, whose rows are ranked according to some statistic, to provide a large volume of reliable and interesting content. We employ a template-based approach to semi-automatically generate natural language statements as fun facts. We show how to bootstrap and streamline the process for faster and cheaper task completion. However, the content contained in these tables is dynamic. Therefore, we address the problem of automatically maintaining the pairing of templates to tables as the tables are updated over time. Fun facts produced by our work are now part of Google's production search results.
So, as mentioned, this is a project done with a product team; it launched about two years ago and is now triggered on roughly one percent of the relevant search traffic that Google serves. The idea, as most of you know, is that Google Search, and Bing as well, is moving beyond the ten blue links: we want to make the search results page richer and more interesting for users, and this work is part of that effort. Within the knowledge panel we are also exploring more engaging content, such as interesting tidbits about the entities you are looking for. The product team reached out to us because they had a fun-facts project that was generating facts that were somewhat interesting but not really engaging, and it did not have enough coverage for the feature they were planning. They asked whether we could come up with some ideas, and since the problem is genuinely interesting, we thought we might be able to do something for them. So what is the problem?
A lot of this is really challenging. The problem is basically this: we have entities, and we are trying to identify how users can be engaged with them, in terms of providing interesting tidbits about the entities in question. There are several views we consider for this fun-fact generation problem. The first is what we call the single view: identify what is most interesting about this particular entity, for example "Frozen is the highest-grossing animated film of all time"; a generalization of that is the top-k view. The second is the categorical view, where we put the fact into context: the Shanghai Tower may not be the tallest building in the world, but it is the tallest building in China. The last one, and the most interesting, is the so-called distributional view: you are interesting because you are part of a group that is interesting. For example, of the world's top-10 best-selling fiction authors, several are British, including J. K. Rowling, who wrote the Harry Potter books. The challenges are threefold, and two of them are consistent with the previous talks. First, interestingness: you have to define it. Second, reliability: in our case a fact also has to be self-contained, able to stand on its own, because we are putting it in a knowledge panel. Third, freshness: the data sources we ingest come from across the web, and the web changes all the time, so we want to make sure only a small amount of incremental work is needed to maintain the system. There is a good amount of related work, much of it coming from the database and Semantic Web fields.
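The three fact views just described (single/top-k, categorical, distributional) can be sketched as string templates. This is a minimal illustration; the template wording, slot names, and example values are my own assumptions, not the production system's actual templates:

```python
# Sketch of the three fact views as string templates.
# The view names come from the talk; the template wording, slot
# names, and example values below are illustrative assumptions.
VIEWS = {
    # Single / top-k view: rank within the whole table.
    "single": "{entity} is the {superlative} {cls}.",
    # Categorical view: rank restricted to one category value.
    "categorical": "{entity} is the {superlative} {cls} in {category}.",
    # Distributional view: a property shared by a whole top-k group.
    "distributional": "{count} of the top {k} {cls}s are {category}.",
}

def render(view, **slots):
    """Instantiate one of the view templates with slot values."""
    return VIEWS[view].format(**slots)

print(render("single", entity="Frozen",
             superlative="highest-grossing", cls="animated film"))
print(render("categorical", entity="Shanghai Tower",
             superlative="tallest", cls="building", category="China"))
```

In the real pipeline the slots are filled from the table (subject column, metric, categories) and the wording is learned or rater-polished rather than hand-written.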
It is actually quite interesting that nobody had looked at this problem in this way before. Our solution turns out to be a cross between databases, the Semantic Web, and a lot of natural language processing, as you will see, and we compared against all of these prior approaches in a realistic user study, which was required for the product launch. The solution we came up with builds on our group's long history of working with tables on the web. Even though the problem is specified as looking for interesting things about entities, it turns out there are a lot of interesting tables on the web that talk about those entities. So we shift the view away from being entity-centric to being table-centric, and generate a lot of very interesting statements about entities from those tables. The idea is to look at so-called superlative tables on the web, identify those tables, pick out the entities, and translate the rows of those superlative tables into the interesting sentences you have seen. This turns out to work really well: "the highest-grossing animated film" is the Frozen example, which in the simplest case gets matched to a template generated from such a table.
(I had some trouble with the clicker here, so let me be quick about the architecture.) There are basically two major parts. The first is template generation: how we take a table and generate the interesting templates that produce those sentences. The second is dynamic maintenance. Templates are pretty expensive to generate, as I will show you, and the machine learning model involved does not have 100 percent precision, so you want to leverage all the existing templates, including the human-polished ones, as much as you can. The idea is to track all those tables and make sure that, from snapshot to snapshot, we identify the right table along the way, so templates can be reapplied without human intervention. So, template generation.
This is our busiest slide, so bear with me a little. On the right you have a web table; this particular one comes from the Wikipedia superlative list of the tallest buildings in the world. In prior work our group already did the job of identifying the subject column, in this case the building names, as well as the metric and categorical columns. Our current work, as I mentioned, is to identify, for the single and top-k views, the components of the sentences from the table itself. The entities are pretty easy: once you know the subject column, each cell in it gives you an entity, and looking at how popular those entities are tells you which ones are interesting. The rank is more interesting: sometimes it can be derived from the table's own ordering, so the first row is the tallest, the second row is the second tallest, and so on; other times you have to re-rank the table based on the metric you are interested in. The superlative word, "tallest," is the natural language component. There are two ways to learn it: you can extract it from the title, which sometimes works and sometimes does not, and if the title does not work, you have to figure it out from the column names, in this case "Height." From a machine learning perspective, using pre-trained embeddings, the model should be able to say: if the table is ranked according to a height column, then the word is "tallest" or "highest," depending on the context. That is a component you have to learn.
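The rank-derivation step just described (reading rank from row order, or re-sorting by a metric column) can be sketched roughly as follows. The toy table and the column-name-to-superlative lexicon are my assumptions; the lexicon stands in for the embedding-based superlative prediction from the talk:

```python
# Sketch: derive per-row superlative phrases by sorting on a metric
# column. The tiny height -> "tallest" lexicon below is a stand-in
# for the embedding-based superlative prediction described in the talk.
rows = [
    {"name": "Burj Khalifa", "height_m": 828},
    {"name": "Shanghai Tower", "height_m": 632},
    {"name": "Merdeka 118", "height_m": 679},
]
SUPERLATIVE = {"height_m": "tallest"}  # toy lexicon (assumption)

def ordinal(n):
    # 1 -> "1st", 2 -> "2nd", 3 -> "3rd", 11/12/13 -> "11th" etc.
    if 10 <= n % 100 <= 13:
        return f"{n}th"
    return f"{n}" + {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")

def rank_phrases(rows, metric, cls):
    # Re-rank by the metric, then verbalize each row's rank.
    ranked = sorted(rows, key=lambda r: r[metric], reverse=True)
    word = SUPERLATIVE[metric]
    phrases = []
    for i, row in enumerate(ranked, start=1):
        prefix = "" if i == 1 else ordinal(i) + " "
        phrases.append(f"{row['name']} is the {prefix}{word} {cls}.")
    return phrases

for sentence in rank_phrases(rows, "height_m", "building"):
    print(sentence)
```

Note how the table arrives unsorted; re-ranking by the metric column recovers the ordering the sentence needs, which is exactly the case where row order alone is not enough.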
Lastly, the class, "building," comes from the title as well as other parts of the table's context. That covers the single and top-k views. For the categorical view, the idea is that you not only need "the tallest building," you also need to connect it to a categorical attribute value, in this case the country. It turns out this is a very interesting natural language problem: to say "the Shanghai Tower is the tallest building in China," you have to identify the right phrasal connector, "in China" rather than "of China." For the distributional view, you have to understand the category, say China, count how many tall buildings there are and how many of them are Chinese, and decide whether that is worth showing. So those are the challenges and interesting problems in this space, and exactly the problem we are trying to solve. We take a learning-based approach: based on examples, and I will show you how we generate the training data, we learn to produce sentences that a human can easily understand, that read naturally, and that are interesting to read. For the training data, we collected around 50 of the most popular superlative Wikipedia tables, ran our subject-column, metric, and categorical detection, and then generated very crude templates based on the table title, the entity, and the closest-matching column names. Those crude templates were then sent to our expert raters, who polished them into very readable ones. Sometimes this takes half a minute and sometimes ten minutes, because some templates are complicated and the rater really has to study them. We spent about 80 rater-hours in total, which yielded about a thousand training examples. That is relatively small, but remember that when we apply the machine learning model we already use a lot of pre-trained embeddings, so much of the semantics is encoded in those dimensions. There are two components to the machine learning task. One is predicting the superlative and the class: for a table of the fourteeners of the United States, the model infers that the table is about mountains and that the word should be "highest"; for the cities and towns in Iceland, it turns out the table is ordered by the population column, so the ranking attribute is "most populous." These are really good results coming from the model. We do similar things for the phrasal connector, which, in the interest of time, I am going to skip.
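The crude-template bootstrap described here (table title plus entity slot plus close-matching column names, later polished by raters) might look roughly like this. The exact concatenation scheme is my guess, not the paper's:

```python
# Sketch: build a crude template from table metadata for raters to
# polish. The slot wording and concatenation are assumptions.
def crude_template(table_title, subject_col, metric_col):
    # Entity and rank are left as slots; a rater rewrites the rest.
    return (f"{{entity}} is ranked {{rank}} in '{table_title}' "
            f"by {metric_col} (subject: {subject_col}).")

t = crude_template("List of tallest buildings", "Name", "Height (m)")
print(t)
# A rater might polish this into:
# "{entity} is the {rank} tallest building in the world."
```

The point of starting from such a crude draft is that raters only have to fix wording, not invent the template structure, which is what keeps per-template rating time down to minutes.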
On to maintenance. The problem here is that the machine learning component and the human evaluation are both pretty costly, and you do not want to redo them every time a table changes slightly; that would not be efficient, and it is not tenable. What we do instead is track, from snapshot to snapshot, which table in the current snapshot is the semantic continuation of a previous table for which we have already generated fun facts. We want to handle small content changes, including single-row insertions and deletions, schema updates, and the occasional human error. Here is one example where a table was moved and converted into a different form, and we still want to be able to recognize that it is the same table we saw before. In the interest of time I will not go into this part; the details of the algorithm are in the paper. Essentially, we use the subject column as the anchor and then map the other columns for matching purposes, and we also prune the predictions.
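The snapshot-matching idea (subject column as the anchor, then mapping the remaining columns) can be sketched with a simple value-overlap score. Jaccard similarity and the 0.5 threshold here are illustrative stand-ins for the actual algorithm in the paper:

```python
# Sketch: find which table in a new snapshot is the continuation of a
# known table, using subject-column value overlap as the anchor signal.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def match_table(known_subject, candidates, threshold=0.5):
    """candidates maps table id -> list of subject-column values."""
    best_id, best_score = None, 0.0
    for tid, subject in candidates.items():
        score = jaccard(known_subject, subject)
        if score > best_score:
            best_id, best_score = tid, score
    return best_id if best_score >= threshold else None

known = ["Burj Khalifa", "Shanghai Tower", "Merdeka 118"]
snapshot = {
    "t1": ["Burj Khalifa", "Merdeka 118", "Shanghai Tower",
           "Lotte World Tower"],  # same table, one inserted row
    "t2": ["Mount Everest", "K2"],
}
print(match_table(known, snapshot))  # -> t1, despite the inserted row
```

A match above the threshold means the previously learned templates can be reapplied to the new snapshot without rerunning the model or the raters, which is the whole point of the maintenance stage.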
Let me show you some of the evaluation. This is the comparison we did in order to get the launch approval. The raters were crowd workers; we have no idea who they are, and a single rater can do at most three tasks, so it is fairly representative of normal users' perception of the results. We compared with a number of earlier approaches, and in terms of interestingness as judged by the users, we win significantly, by a large margin. For one particular trivia-quiz baseline we do not have a number because the authors did not release their code, so we could not reproduce their results, and their released data overlaps with ours only by a very small percentage, so we could not get a reliable number. But basically our performance is many times better than prior approaches. For the table tracking component, we also achieve accuracy well above the threshold we set, so it performs very well. One thing I want to point out: this has been launched for more than nine months, and in the first year only 2 percent of the tables required a change involving human intervention, so there is really minimal maintenance burden on us. Now the conclusion.
It turns out we took a very different view of the fun-fact generation problem. Previous approaches were entity-centric: they looked at sentences and tried to summarize them, asking, for this entity, what can I find? Our view is very different: here is a very interesting table; what can we say from this table? That is one of the fundamental changes, and it is what led to much better coverage and much higher interestingness compared with the previous approaches in the literature, inside a real production system. We also have a very nice table tracking algorithm that makes sure the engineers do not have to spend too much time modifying templates and the like. Users love it, and that is why it launched. In the future we are also looking at general Wikipedia tables beyond superlative tables, but that is more challenging: Wikipedia tables are only a small part of the whole web table corpus, and they are the highest-quality table corpus you can get, which makes our job a lot easier; once you go beyond those, it becomes trickier. We are looking at that as well.