
Interactively Discovering Implicational Knowledge in Wikidata

Video in TIB AV-Portal: Interactively Discovering Implicational Knowledge in Wikidata

Formal Metadata

Title
Interactively Discovering Implicational Knowledge in Wikidata
Subtitle
(The Exploration Game)
Title of Series
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2019
Language
English

Content Metadata

Subject Area
Abstract
The ever-growing Wikidata contains a vast amount of factual knowledge. More complex knowledge, however, lies hidden beneath the surface: it can only be discovered by combining the factual statements of multiple items. Some of this knowledge may not even be stated explicitly, but rather hold simply by virtue of having no counterexamples present on Wikidata. Such implicit knowledge is not readily discoverable by humans, as the sheer size of Wikidata makes it impossible to verify the absence of counterexamples. We set out to identify a form of implicit knowledge that is succinctly representable, yet still comprehensible to humans: implications between properties of some set of items. Using techniques from Formal Concept Analysis, we show how to compute such implications, which can then be used to enhance the quality of Wikidata itself: absence of an expected rule points to counterexamples in the data set; unexpected rules indicate incomplete data. We propose an interactive exploration process that guides editors to identify false counterexamples and provide missing data. This procedure forms the basis of [The Exploration Game](https://tools.wmflabs.org/teg/), a game in which players can explore the implicational knowledge of a set of Wikidata items of their choosing. We hope that the discovered knowledge may be useful not only for the insights gained, but also as a basis from which to create entity schemata. The talk will introduce the notion of implicational knowledge, describe how Formal Concept Analysis may be employed to extract implications, and showcase the interactive exploration process.
Keywords WikiPakaWG Science 2019 36c3-wikipakawg
Tom and Max here have a talk with a complicated title that I don't quite understand yet. It's called "Interactively Discovering Implicational Knowledge in Wikidata", and they told me that by the end of the talk I will understand what it means, and I hope you will too. So thank you very much, and let's have some applause.
Thank you very much. Hello everybody, and welcome to our talk about interactively discovering implicational knowledge in Wikidata. It's more or less a fun project we started for finding rules that are implicit in Wikidata, entailed just by the data that people have inserted into the Wikidata database so far. We will start off with the explicit knowledge that exists in Wikidata, with Max.
Right. So what is Wikidata? Maybe you have heard about Wikidata, and that's all fine. Maybe you haven't, but then you surely have heard of Wikipedia. Wikipedia is run by the Wikimedia Foundation, and the Wikimedia Foundation has several other projects. One of those is Wikidata. Wikidata is basically a large graph that encodes machine-readable knowledge in the form of statements. A statement basically consists of some entities that are connected by some property, and these statements can even have annotations on them. For example, we have Donna Strickland here, and we encode that she received the Nobel Prize in Physics last year by this property "award received", and the statement has a qualifier "point in time: 2018" and also "for: chirped pulse amplification". All in all, we have some 890 million statements on Wikidata that connect 71 million items using 7,000 properties so far.
But there's also a bit more. We also know that Donna Strickland has field of work optics and also field of work lasers, so we can use the same property to connect an entity with different other entities. And we don't even need knowledge that connects two entities: we can also have a date of birth, which is in 1959, and this is then just plain data, not an entity.
Now, coming from the explicit knowledge, we have some more. We have that Donna Strickland has received a Nobel Prize in Physics, and Marie Curie has also received a Nobel Prize in Physics. We also know that Marie Curie has a Nobel Prize ID that starts with "phys", then 1903 and some somewhat random numbers that basically make up this ID. Marie Curie has also received a Nobel Prize in Chemistry, in 1911, so she has another Nobel Prize ID that starts with "chem" and has 1911 in it. And then there's also Frances Arnold, who received the Nobel Prize in Chemistry last year, so she has a Nobel Prize ID that starts with "chem" and has 2018 in it. Now one could assume that everybody who was awarded a Nobel Prize should also have a Nobel Prize ID, and we could write that as an implication: "awarded Nobel Prize" implies "Nobel Prize ID". But if you look sharply at this picture, there is an arrow conspicuously missing: Donna Strickland doesn't have a Nobel Prize ID. And indeed, there are 25 people currently on Wikidata that are missing Nobel Prize IDs, and Donna Strickland is one of them.
We call the people that don't satisfy this implication counterexamples. If you look at Wikidata on the scale of these 890 million statements, you won't find any counterexamples by hand, because it's just too big. So we need some way to do that automatically. The idea is that if we know that some implications are not satisfied, then this may encode missing information or wrong information, and we want to represent that in a way that is easy to understand.
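The check described here can be sketched on toy data. The snippet below is only an illustration under stated assumptions: the dictionary `items` is a hand-written stand-in for the three laureates mentioned in the talk (not a Wikidata export), and the helper name `counterexamples` is made up for this example.

```python
# Toy data: each item maps to the set of properties it has statements for.
# The entries mirror the example from the talk; this is not real Wikidata output.
items = {
    "Donna Strickland": {"award received: Nobel Prize"},
    "Marie Curie": {"award received: Nobel Prize", "Nobel Prize ID"},
    "Frances Arnold": {"award received: Nobel Prize", "Nobel Prize ID"},
}


def counterexamples(items, premise, conclusion):
    """Return items that have every premise property but lack some conclusion property."""
    return sorted(
        name
        for name, props in items.items()
        if premise <= props and not conclusion <= props
    )


# "awarded Nobel Prize" implies "Nobel Prize ID": who violates it?
missing = counterexamples(items, {"award received: Nobel Prize"}, {"Nobel Prize ID"})
print(missing)
```

An empty result would mean the implication holds in the data at hand; a non-empty result lists the items to inspect for missing or wrong statements.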
It should also be succinct, so that it doesn't take long to write down; it should have a short representation. That rules out anything involving complex syntax or logical quantifiers, so no SPARQL queries as a description of that implicit knowledge, and no description logics, if you're fond of those. We also want something that we can actually compute on actual hardware in a reasonable time frame.
So our approach is to use Formal Concept Analysis, which is a technique that has been developed over the past several years to extract what are called propositional implications: just logical formulas of propositional logic that are implications, in the form of this "awarded Nobel Prize" implies "Nobel Prize ID". So what exactly is Formal Concept Analysis? Over to Tom.
Thank you. So what is Formal Concept Analysis? It was developed in the 1980s by Rudolf Wille and Bernhard Ganter, and they were restructuring lattice theory. "Lattice" is an ambiguous name in math; it has two meanings. One meaning is that you have a grid, and there is a lattice there. The other is to speak about orders, order relations. So I like steaks, I like pudding, I like steaks more than pudding, and I like rice more than steaks: that's an order, right? Lattices are particular orders which can be used to represent propositional logic, so easy rules like "when it rains, the street gets wet".
The data representation those guys used back then they called a formal context, which is basically just a set of objects (they call them objects, it's just a name), a set of attributes, and some incidence, which basically means which object has which attributes. For example, my laptop has the colour black, so this object has some property, right? There's a small example on the right for such a formal context. The objects there are some animals: a platypus, that's the fun animal from Australia, a mammal which also lays eggs and which is also venomous.
Likewise we have the spider, the duck and the cat. So we see, OK, the platypus has all the properties: being venomous, laying eggs and being a mammal. We have the duck, which is not a mammal but lays eggs, and so on and so on. It is very easy to extract some implications from here. You can find, for example, that whenever you encounter a mammal that is venomous, it has to lay eggs. So this is a rule that falls out of this binary data table.
Our main problem at this point is that we do not have such a data table for Wikidata, right? We have the graph, which is way more expressive than binary data, and we cannot even store Wikidata as a binary table. Even if you tried to, we would have no chance to compute such rules from that. For this, the people from Formal Concept Analysis have proposed an algorithm to extract implicit knowledge from an expert. Our expert here could be Wikidata: it's an expert, and you can ask Wikidata questions, right? Using the SPARQL interface you can ask: is there an example for this, is there a counterexample for something else?
So the algorithm is quite easy. There is the algorithm and some expert, in our case Wikidata, and the algorithm keeps notes of counterexamples and keeps notes of valid implications. In the beginning we do not have any valid implications, so this list on the right is empty, and in the beginning we do not have any counterexamples, so the list on the left, the formal context to build up, is also empty. All the algorithm does now is ask: "Is this implication, Y follows from X, or X implies Y, true?" Is it true, for example, that an animal that is a mammal and is venomous lays eggs? Now the expert, which in our case is Wikidata, can answer that; we showed in our paper how to query that. So we query it, and if the Wikidata expert does not find any counterexamples, it will say: OK, that may be true, the answer is yes. Or, if it is not a true implication in Wikidata, it can say no.
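To make the animal example concrete, here is a minimal sketch of a formal context and of checking an implication against it. It rests on assumptions: only the platypus and duck rows are spelled out in the talk, so the spider and cat rows are filled in for illustration, and the function names `extent`, `intent` and `holds` are chosen for this example.

```python
# A formal context: objects mapped to the attributes they are incident with.
# The spider and cat rows are assumptions made for illustration.
context = {
    "platypus": {"venomous", "lays eggs", "mammal"},
    "spider": {"venomous", "lays eggs"},
    "duck": {"lays eggs"},
    "cat": {"mammal"},
}
ATTRIBUTES = frozenset().union(*context.values())


def extent(attrs):
    """All objects that have every attribute in attrs."""
    return {obj for obj, has in context.items() if attrs <= has}


def intent(objs):
    """All attributes shared by every object in objs."""
    shared = set(ATTRIBUTES)
    for obj in objs:
        shared &= context[obj]
    return shared


def holds(premise, conclusion):
    """An implication holds if every object with the premise also has the conclusion."""
    return conclusion <= intent(extent(premise))


print(holds({"mammal", "venomous"}, {"lays eggs"}))
print(holds({"lays eggs"}, {"mammal"}))
```

The first check corresponds to the rule from the talk (a venomous mammal lays eggs); the second is refuted by the duck, which lays eggs without being a mammal.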
When the answer is no, the expert says: "It's not true, and here is a counterexample." So this is something you contradict by example: you say this rule cannot be true. For example, that the street is wet does not mean it has rained, right? It could be the cleaning service car or something else.
Our idea now was to use Wikidata as an expert, but also to include a human in this loop. We do not just want to ask Wikidata, we want to ask a human expert as well. So in our tool we first ask the Wikidata expert about some rule, but after that we also inquire of the human expert. They can say "yeah, that's true, I know that", or "no, Wikidata is not aware of this counterexample, but I know one", or, in the other case, "Wikidata says this is true, but I am aware of a counterexample", and so on and so on. You can represent this more or less as just some mathematical picture. It's not very important, but you can see on the left an exploration going on with just Wikidata and the algorithm, and on the right an exploration with a human expert versus Wikidata, which can answer all the queries. We combined those two into one small tool, which is still under development. So, back to Max.
OK. So for that to work, we basically need to have a way of viewing Wikidata, or at least parts of Wikidata, as a formal context. And this formal context, well, that was a binary table. So what do we do? Do we just take all the items in Wikidata as objects and all the properties as attributes of our context, and then have an incidence relation that says "this entity has this property", so it is incident there? Then we end up with a context that has 71 million rows and 7,000 columns. Well, that might actually be a slight problem, because we want to have something that we can run on actual hardware and not on a supercomputer. So let's maybe not do that, and focus on a smaller set of properties that are actually related to one another through some kind of common domain.
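The question-and-answer loop with two experts can be sketched as follows. This is a deliberately naive version that enumerates candidate premises by size; it is not claimed to be the procedure the tool implements. The experts are hypothetical callables that return either `None` (accept) or a counterexample as a set of attributes, `DATA` is a tiny stand-in for Wikidata, and the spider and cat rows are assumed for illustration.

```python
from itertools import combinations


def close(attrs, implications):
    """Smallest superset of attrs that respects every accepted implication."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in implications:
            if premise <= closed and not conclusion <= closed:
                closed |= conclusion
                changed = True
    return frozenset(closed)


def shared(premise, examples, attributes):
    """Attributes common to all known counterexamples that contain the premise."""
    result = set(attributes)
    for example in examples:
        if premise <= example:
            result &= example
    return frozenset(result)


def explore(attributes, experts):
    """Ask the experts about candidate rules; return accepted rules and counterexamples."""
    implications, examples = [], []
    for size in range(len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            premise = frozenset(combo)
            if close(premise, implications) != premise:
                continue  # already covered by an accepted rule
            while True:
                conclusion = shared(premise, examples, attributes)
                if conclusion == premise:
                    break  # nothing new to ask about this premise
                answer = None
                for expert in experts:  # data expert first, human expert second
                    candidate = expert(premise, conclusion)
                    # Only a genuine counterexample refutes the rule.
                    if (candidate is not None and premise <= candidate
                            and not conclusion <= candidate):
                        answer = frozenset(candidate)
                        break
                if answer is None:
                    implications.append((premise, conclusion))
                    break
                examples.append(answer)
    return implications, examples


# Hypothetical experts: a toy data set and a human who accepts every question.
DATA = [
    {"venomous", "lays eggs", "mammal"},  # platypus
    {"venomous", "lays eggs"},            # spider (assumed row)
    {"lays eggs"},                        # duck
    {"mammal"},                           # cat (assumed row)
]


def data_expert(premise, conclusion):
    for item in DATA:
        if premise <= item and not conclusion <= item:
            return item
    return None


def human_expert(premise, conclusion):
    return None


rules, counterexamples = explore(
    {"venomous", "lays eggs", "mammal"}, [data_expert, human_expert]
)
for premise, conclusion in rules:
    print(sorted(premise), "->", sorted(conclusion - premise))
```

On this toy data the loop also accepts a rule that every egg-laying mammal is venomous, simply because the platypus is the only such object. That is exactly the kind of question where a human expert could answer with a counterexample the data does not contain.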
It doesn't make any sense to have a property that relates to spacecraft and then a property that relates to books; it's probably not a good idea to try to find implicit knowledge between those two. But two different properties about spacecraft, that sounds good, right? Then the interesting question is just: how do we define the incidence for our set of properties? That actually depends very much on which properties we choose, because for some properties it makes sense to account for the direction of the statement. So there's a property called "parent" (actually no, it's "child", and then there are "father" and "mother"), and you don't want to turn those around: "A is a child of B" should be something different than "B is a child of A".
Then there are the qualifiers, which might be important for some properties. Receiving an award for something might be something different than receiving an award for something else. But receiving an award in 2018 and receiving one in 2017, that's probably more or less the same thing, so we don't necessarily need to differentiate that. There's also a thing called subclasses, and they form a hierarchy on Wikidata. You might want to take that into account as well, because winning something that is a Nobel Prize also means winning an award itself, and winning the Nobel Peace Prize means winning a peace prize. So there are also implications going on there that you want to respect.
To see how we actually do that, let's look at an example. We have here Donna Strickland and Ashkin, one of the people that won the Nobel Prize in Physics with her last year, and also Mourou, that's the other one, and they all got the Nobel Prize in Physics last year.
So we have all these statements here, and these two have a qualifier that says "together with". I don't think the qualifiers on the actual statements look like that, but it doesn't actually matter. What we've done here is put all the entities in the small graph as rows in the table, so we have Strickland, Mourou and Ashkin, and also Arnold and Curie, who are not in the picture, but you probably remember them. Then here we have "award received", and we scale that by the instances of the different Nobel Prizes that people have won: that's the Physics Nobel in the first column, the Chemistry Nobel Prize in the second column, and just general Nobel Prizes in the third column. Then there's "award received" scaled by this "together with" qualifier, so "award received together with Mourou". And then there's "field of work", and we have lasers here and radioactivity, so we scale by the actual field of work that people have.
If we then look at what kind of incidence we get: Donna Strickland has a Nobel Prize in Physics, and that is also a Nobel Prize, and she has that together with Mourou, and she has field of work lasers, but not radioactivity. Then Mourou himself: just a Nobel Prize in Physics, and that is a Nobel Prize, but none of the others. Ashkin gets the Nobel Prize in Physics, and that is still a Nobel Prize, and he gets that together with Mourou, and he also works on lasers, but not on radioactivity. Frances Arnold has a Nobel Prize in Chemistry, and that is a Nobel Prize. Marie Curie has a Nobel Prize in Physics and one in Chemistry, and they're both a Nobel Prize, and she also works on radioactivity; but lasers didn't exist back then, so she doesn't get field of work lasers. And basically this table here is a representation of our formal context.
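The scaling just described (by value, by subclass hierarchy, and by qualifier) can be sketched as a small function that turns statements into rows of a binary table. Everything here is an illustrative assumption: the tuple layout of `statements`, the `SUBCLASS_OF` map and the attribute naming are invented for this example, and the data follows the simplified picture from the talk rather than the actual Wikidata statements.

```python
# Statements as (item, property, value, qualifiers), following the talk's example.
statements = [
    ("Strickland", "award received", "Nobel Prize in Physics", {"together with": "Mourou"}),
    ("Mourou", "award received", "Nobel Prize in Physics", {}),
    ("Ashkin", "award received", "Nobel Prize in Physics", {"together with": "Mourou"}),
    ("Arnold", "award received", "Nobel Prize in Chemistry", {}),
    ("Curie", "award received", "Nobel Prize in Physics", {}),
    ("Curie", "award received", "Nobel Prize in Chemistry", {}),
    ("Strickland", "field of work", "lasers", {}),
    ("Ashkin", "field of work", "lasers", {}),
    ("Curie", "field of work", "radioactivity", {}),
]

# Hypothetical excerpt of the subclass hierarchy.
SUBCLASS_OF = {
    "Nobel Prize in Physics": "Nobel Prize",
    "Nobel Prize in Chemistry": "Nobel Prize",
}


def scale(statements):
    """Build a formal context: one row per item, one attribute per scaled column."""
    context = {}
    for item, prop, value, qualifiers in statements:
        attrs = context.setdefault(item, set())
        attrs.add(f"{prop}: {value}")
        # Respect the subclass hierarchy: a Physics Nobel is also a Nobel Prize.
        parent = SUBCLASS_OF.get(value)
        while parent is not None:
            attrs.add(f"{prop}: {parent}")
            parent = SUBCLASS_OF.get(parent)
        # Scale by qualifier values where they matter.
        for qualifier, qualifier_value in qualifiers.items():
            attrs.add(f"{prop} {qualifier}: {qualifier_value}")
    return context


nobel_context = scale(statements)
for item, attrs in nobel_context.items():
    print(item, sorted(attrs))
```

Which qualifiers and hierarchy levels become columns is a per-domain design choice; this sketch scales by every qualifier it sees, whereas the talk suggests that some qualifiers (such as the year of an award) are better ignored.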
We've actually gone ahead and started building a tool where you can interactively do all these things, and it will take care of building the context for you: you just put in the properties. Tom will show you how that works.
So you see some first screenshots of this tool. Please do not comment on the graphic design; we have no idea about that, we'll have to ask someone about it. On the left you see the initial state of the game. There you have like five boxes, with names like credit cards, use of energy and space launches, which are just predefined sets you can explore. For example, in the case of the credit cards, you can explore the properties from Wikidata which are called card network, operator and fee. So you can just choose one of them. On the right, under custom properties, you can put in the properties you're interested in from Wikidata, whichever of these 7,000 you like.
I chose the credit card thingy, and now I want to show you what happens if you explore these properties. The first step in the game is that the game, I mean the exploration process, will ask: is it true that every entity in Wikidata will have these three properties? So are they common among all entities in Wikidata? That is most probably not true, right? I mean, not everything in Wikidata has a fee, at least I hope. So what I will do now is click the "reject this implication" button, since the implication "nothing implies everything" is not true.
In the second step, the algorithm now tries to find the minimal number of questions to obtain the domain knowledge, to obtain all the valid rules in this domain. So the next question is: is it true that everything in Wikidata that has a card network property also has a fee and an operator property?
Down here you can see Wikidata says: OK, there are 26 items which are counterexamples. So there are 26 items in Wikidata which have the card network property but do not have the other two. 26 is not a big number; this could mean there's an error, that 26 statements are missing. Or maybe in reality that's the true case, and that's also OK. But you can now choose what you think is right. You can say "oh, I would say it should be true", or you can say "no, I think that's OK, one of these counterexamples seems valid". Let's reject it; in this case I rejected it.
The next question it asks: is it true that everything that has an operator also has a fee and a card network? This is possibly not true. There are also more than 1,000 counterexamples, one being, I think, a telecommunications operator or something. So we can reject this as well.
Next question: everything that has an operator and a card network (card network means things like Visa, MasterCard, all this stuff), is it true that it has to have a fee? Wikidata says no: it has 23 items that contradict it. But one of the items, for example, is the American Express Gold Card. I suppose the American Express Gold Card has some fee. So this indicates that there's missing data in Wikidata; there is something that Wikidata does not know but should know to reason correctly. So we can now say: yeah, let's not reject that, let's accept it, because we think it should be true, but Wikidata thinks otherwise.
And we go on. The last question: is it true that everything that has a fee and a card network should have an operator? And you see: no counterexamples. This means Wikidata says this is true, because there are no counterexamples; if you ask Wikidata, it says this is a valid implication in the data set so far. This could also indicate that something is missing. I'm not aware whether this is possible or not, but OK, for me it sounds reasonable: everything that has a fee and a card network should also have an operator, which means a bank or something like that.
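A question such as "does everything with a card network also have a fee and an operator?" can be turned into a counterexample query for the SPARQL interface mentioned earlier. The following is a sketch under assumptions, not the query the tool actually sends: `counterexample_query` is a name invented here, the property identifiers `P_CARD_NETWORK`, `P_FEE` and `P_OPERATOR` are placeholders that would have to be replaced with the real Wikidata property IDs, and the `wdt:` prefix is assumed to be predefined by the query endpoint. The code only builds the query string and performs no network request.

```python
def counterexample_query(premise, conclusion, limit=10):
    """Build a SPARQL query for items that have every premise property
    but lack at least one conclusion property."""
    has = "\n".join(f"  ?item wdt:{prop} [] ." for prop in premise)
    lacks = " ||\n".join(
        f"    NOT EXISTS {{ ?item wdt:{prop} [] }}" for prop in conclusion
    )
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        f"{has}\n"
        "  FILTER(\n"
        f"{lacks}\n"
        "  )\n"
        f"}} LIMIT {limit}"
    )


# Placeholder property IDs; substitute the real ones before running the query.
query = counterexample_query(["P_CARD_NETWORK"], ["P_FEE", "P_OPERATOR"])
print(query)
```

An empty result for such a query corresponds to the "no counterexamples" answer in the last question of the walkthrough; a non-empty result gives the items a player would inspect before accepting or rejecting the rule.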
So I accept this implication, and then, yeah, you have won the exploration game, which essentially means you have won some knowledge. Thank you. Now you know which implications in Wikidata are true, or should be true from your point of view. This is more or less the state of the game so far, as we programmed it in October. The next stage will be to show you how much your opinion of the world differs from the opinion that is now reflected in the data. So is what you think about the data true, or close to what is true in Wikidata? Or maybe Wikidata has wrong information; you can find that out. But Max will tell you more about that.
OK, so let me just quickly come back to what we have actually done. We have a procedure that allows you to explore properties in Wikidata and the implicational knowledge that holds between these properties. The key idea here is that when you look at the implications you get, there might be some that you don't actually want, because they shouldn't be true, but there might also be ones that you don't get although you would expect to get them, because they should hold. These unwanted or missing implications point to missing statements and items in Wikidata, so they show you where the opportunities to improve the knowledge in Wikidata are. Sometimes you also get to learn something about the world, and in most cases it's that the world is more complicated than you thought, and that's just how life is. But in general, implications can guide you on your way of improving Wikidata and the state of knowledge therein.
So what's next? What we currently don't offer in the exploration game, and what we definitely will focus on next, is having configurable counterexamples and also filtering of the counterexamples. Right now you just get a list of a random number of counterexamples, and you might want to search through this list for something you recognize, and you might also want to explicitly say "this one should be a counterexample". That's definitely coming next. Then, domain-specific scaling of properties: there is still much work to be done. Currently we only have some very basic support for that, so you can have properties, but you can't do the fancy things where you say "everything that is an award should be considered as one instance of this property". That's also coming. And then, what Tom mentioned already: comparing the knowledge that you have explored through this process against the knowledge that is currently on Wikidata, as a way of seeing where you stand, what is missing in Wikidata, and how you can improve Wikidata. If you have any more suggestions for features, then just tell us: there's a GitHub link on the exploration game page, and here's the link to the tool again. So yeah, just let us know, open an issue, and have fun. If you have any questions, then I guess now would be the time to ask. Thank you.
Thank you very much, Tom and Max. We will switch microphones now, so that I can hand this microphone to you if any of you have a question for the speakers. Are there any questions? Yes.
Hi, thanks for the nice talk. I wanted to ask, as a first question: what's the most interesting implication that you found?
The most interesting implication so far? The most basic thing you would expect: anything launched into space by humans that landed from space, that has a landing date, also has a start date. So nothing landed on Earth which was not started here.
Right now the game only helps you find implications. Are you also planning to have something where I can add data? For example, let's say I have 25 Nobel laureates who don't have a Nobel laureate ID. Are there plans where you could give me a simple interface to look that up and add the ID? Because it would make the process of adding new data to Wikidata itself much simpler.
Yes, that's partly hidden behind the configurable and filterable counterexamples thing. We will probably not have an explicit interface for adding stuff, but will most likely interface with some other tool built around Wikidata, so probably something along the lines of QuickStatements or something like that. But yes, adding data is definitely on the roadmap.
Any more questions?
Wouldn't it be nice to do this in other languages too?
Actually it's language independent. We use Wikidata, and as far as we know Wikidata has no language itself. It has just items and properties, so Qs and Ps, and whatever language you use, it should be translated into that language if there is a label for the property or for the item that you have. Oh yes, and of course the tool itself still needs to be translated.
Thanks for the talk. I have a question: right now you can find missing data, or surplus data. Do you think you will be able to find wrong information with a similar approach?
Well, we actually do. I mean, if Wikidata has a counterexample to something we would expect to be true, this could point to wrong data, right? If the counterexample is a wrong counterexample, if there's a missing property on an item. Yes.
OK, I get to ask a second question. It is about the horizontal axis in the incidence matrix.
You said it spans 7,000 columns, right? Yes, because there are 7,000 properties in Wikidata. But it's actually way more columns, right? Because you multiply the properties by the arguments? Yes, so if you do any scaling then of course you get multiple entries. That's what I mean by scaling, basically.
But already here, 7,000 is way too big to actually compute that. How many would it be if you multiplied by all the arguments? I have no idea. Probably a few million.
Have you thought about the fact that counterexamples may be proven wrong by other counterexamples? Sorry, I didn't get that. Can a counterexample be wrong because of another counterexample? Maybe some people give a counterexample about a cat, and then another example might say that this is not a cat, so the property of being a cat is missing.
OK. No, we have not considered deeper reasoning so far. This is propositional logic: all you can do is contradict by counterexample, but you can never prove that a rule is not true otherwise. You and I may have an opinion, but that is not in the logic. So we would have to think about that, about deeper reasoning.
Sorry, a quick question: since you're not considering all the 7,000 properties for each of the entities, what's your current process of filtering which properties are relevant? Sorry, I didn't get that. We basically have picked those by hand. So you have this input field here where you can go ahead and select your own properties. We also have some predefined sets, and there are also some classes for groups of properties that are related, which you could use if you want bigger sets, for example space or family. How big are the others? It depends on the size of the class. For space it's not that much; it will take you some time, but you can do it, because there are 15 properties or something like that. For family I think it's too much, that's like 40 or 50 properties, so you need a lot of patience.
I don't see any more hands. Maybe someone who has not asked a question yet has another one? We would take that. Otherwise we are perfectly on time. Maybe you can tell us where you will be for deeper discussions, where people can find you?
Probably at the couches, or just strolling around somewhere. Those are our DECT numbers on the slide: 6284 for Tom and 6279 for me, so just call us.
Well, with that, thank you again, and let's have a round of applause. Thank you.