Mathematical World Knowledge Contained in the Multilingual Wikipedia Project

2020

English

Abstract 
The purpose of this project is to test and evaluate an approach for Formula Concept Discovery (FCD). FCD aims at retrieving a formula concept (in the form of a Wikidata item) together with its defining formula within documents, in this case 100 English Wikipedia articles. To correctly identify the defining formula of a Wikipedia article, this approach searches for shared formulae across Wikipedia articles available in different languages. The formula shared in the most languages is then assumed to be the defining formula. The results show that neither this approach alone nor a combination with an existing approach that considers the order of the formulae inside an article leads to satisfying results. It is thus concluded that the number of times a formula is shared across a Wikipedia article in different languages is not a good indicator to determine the defining formula with the current approach. Consequently, several ideas for further research are proposed which could improve the results.

Keywords: Formula Concept Discovery, Wikidata, Wikipedia
00:00
Hello everybody, my name is Dennis Halbach, and I'm here to present to you the results of a project I was doing as a student, supervised by Moritz Schubotz of the research group of Bela Gipp. This project aims at extracting the defining formula of mathematical Wikipedia articles, that means determining the formula that describes the mathematical concept of a Wikipedia article, in case one exists. This project was motivated by work done by Schubotz et al., who aimed at enriching Wikidata with mathematical knowledge in the form of formulae, namely the defining formula of a Wikipedia article. What they achieved is an automated extraction of the first formula of an article, suggesting it to Wikidata editors, who then checked if it was indeed the defining formula and afterwards added it to Wikidata. This made it possible to use Wikidata to develop the mathematical question answering system called MathQA. MathQA allows you to look up formulae just by asking natural-language questions and to perform calculations based on input values provided by the user.
01:07
Now let's talk about Wikidata. Since Wikipedia is not designed to be machine readable, the automatic retrieval of properties such as the defining formula of an article is a nontrivial task.
01:21
Thus the Wikidata project was established in 2012 to assist in this task.
01:28
Wikidata stores data as triples, linking a data item, such as a particular theorem, which is identified by a unique ID, to properties like the defining formula of that theorem and its corresponding value.
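To make the triple structure concrete, here is a minimal sketch of how such a statement could be represented. The class name `Statement` and the item ID `Q_EXAMPLE` are hypothetical placeholders and not taken from the talk; `P2534` is the Wikidata property ID for "defining formula", and the formula value is only an illustration.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One Wikidata-style triple: item, property, value."""
    item: str   # unique item ID (placeholder below, not a real Q-ID)
    prop: str   # property ID; P2534 is "defining formula"
    value: str  # here: the formula as a LaTeX string


# Illustrative statement; the item ID and the formula are assumptions.
stmt = Statement(item="Q_EXAMPLE", prop="P2534", value=r"a^2 + b^2 = c^2")
print(stmt.item, stmt.prop, stmt.value)
```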
01:48
So the aim of this project was to improve the automated extraction of defining formulae over the existing approach from Schubotz et al., who simply extracted the first formula of 100 English Wikipedia articles, since a manual investigation indicated that the defining formula is often included in the introductory part of an article. The idea behind our new approach is that Wikipedia consists not only of English articles but of many more languages. This gives us additional information which we should use to our advantage to improve the results of the already existing approach of extracting the first formula. What we want to utilise is the fact that the defining formula typically occurs across the most articles across all languages, albeit not in the same mathematical form, as you can see here.
02:41
Some languages use a different LaTeX command for the function, or x1 and x2 instead of x and y, or sometimes a function argument is omitted entirely. Just keep in mind that this selection is arbitrary, just so that you can see different representations of the same mathematical concept. But in general, as indicated by a manual investigation, there is more consistency in the mathematical notation. Assuming that the defining formula is an integral part of an article, it should ideally be included in all language versions of an article, albeit not always in the same mathematical notation, as we have seen.
03:22
We first extract all formulae from an article in all its language versions, and second determine which formula occurs in the most languages, which is then assumed to be the defining formula. The first step is done using Wikipedia's dump files, which contain a collection of all Wikipedia articles of in total 309 Wikipedia languages, and they can be filtered to extract the 100 English articles as well as their translations, which can be obtained with the help of
03:54
the MediaWiki API. Afterwards these articles are searched for formulae. To make sure that, for example, variables aren't falsely recognized as formulae, we define a mathematical expression as a formula in case it includes at least one formula indicator, as seen here. Furthermore it needs to be enclosed in a wikitext tag like the math, ce, or chem tag. Since slightly different LaTeX notations of a formula can generate visually matching mathematical expressions, it becomes necessary in the second step to be able to recognize these LaTeX
04:33
expressions as being the same. This necessity becomes even more apparent when you consider that in about every second case the most common formula occurred at most once with the exact same LaTeX code, or not even a single time. Thus, without checking for similarity, we couldn't really be choosing the most common formula in the first place. We define two formulae as similar if they only differ due to white spaces, optional brackets around a sub- or superscript, or redundant characters at the end, like commas, that are part of the sentence surrounding the formula in the Wikipedia article. Note that this list is far from being extensive enough to recognize all visually matching formulae as being similar. Nevertheless, these factors were found to be the reason for the majority of different despite visually equivalent formulae in a small manual investigation. As an example, these two formulae, which are visually equivalent, count as the same formula. Note that this computationally fast but simple method unfortunately doesn't allow us to recognize different notations like f and f(x), nor mathematically equivalent formulae like a = b and b = a, as similar. Instead it just allows us to recognize LaTeX notations that generate visually matching formulae as being essentially the same.
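The similarity rules just described can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used in the project: the function names `normalize_latex` and `are_similar` are hypothetical, and the exact set of trailing characters that gets stripped is a guess. The result is a comparison key only and is not meant to be valid LaTeX.

```python
import re


def normalize_latex(formula: str) -> str:
    """Reduce a LaTeX string to a comparison key (not valid LaTeX)."""
    s = formula.strip()
    # 1. Drop trailing punctuation and thin-space commands that belong to
    #    the surrounding sentence (assumed set of characters).
    s = re.sub(r"(?:\\[,;:!]|[\s.,;])+$", "", s)
    # 2. Remove all white space.
    s = re.sub(r"\s+", "", s)
    # 3. Remove optional braces around a single-token sub-/superscript,
    #    e.g. x^{2} -> x^2 and a_{\alpha} -> a_\alpha.
    s = re.sub(r"([_^])\{(\\[A-Za-z]+|[^\\{}\s])\}", r"\1\2", s)
    return s


def are_similar(a: str, b: str) -> bool:
    """Two formulae count as similar if their comparison keys match."""
    return normalize_latex(a) == normalize_latex(b)


print(are_similar("E = mc^{2},", "E=mc^2"))
print(are_similar("a = b", "b = a"))
```

As in the talk, mathematically equivalent formulae with swapped sides are not recognized by this kind of check; only differences in white space, optional braces, and trailing characters are absorbed.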
06:08
Sometimes it happens that two or more formulae have the exact same amount of occurrences. Now which one is more likely to be the correct defining formula? One approach would be to pick one randomly, but there is probably a better choice. The results of Schubotz et al. have already shown that the first formula has a high chance of being the defining formula, so the order of the formulae in an article is probably a good indicator for the defining formula: the earlier a formula occurs in an article, the more likely it is to be the defining formula. Thus the order of the formulae in the English article will be taken into account. Theoretically one could also use another language, of course, but first of all, the articles used all exist in the English Wikipedia but not necessarily in other languages. And secondly, the English Wikipedia has the highest amount of Wikipedia editors, thus the quality of the articles should be better compared to other languages, and thus the order of the formulae should ideally be more helpful to determine the defining formula. Of course it can happen that not all most common formulae occur in the English article. For example, in case all but one most common formula do not exist in the English article, that one will automatically be chosen as the defining formula. Only if all formulae with the most occurrences are not included in the English article, the order in the German article is decisive, since it is the next biggest Wikipedia, at least if you exclude bot-generated Wikipedia articles when you determine the size of a Wikipedia language. If once again the formulae are not included in the German article, the order in another big Wikipedia language is taken into account, and so on. Afterwards the chosen most common formula is compared to a gold standard derived from Schubotz et al., which contains the correct defining formulae, to determine if it is a true positive or a false positive.
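The selection and tie-breaking procedure described above could look roughly like the following sketch. It assumes the formulae have already been extracted and reduced to comparable strings; the function name `pick_defining_formula`, the language list `LANGUAGE_PRIORITY`, and the final alphabetical fallback are hypothetical and not taken from the talk.

```python
from collections import Counter

# Assumed tie-breaking order: English first, then German, then further
# large Wikipedias (the entries after "de" are placeholders).
LANGUAGE_PRIORITY = ["en", "de", "fr"]


def pick_defining_formula(formulae_by_lang, priority=LANGUAGE_PRIORITY):
    """Return the formula occurring in the most languages.

    formulae_by_lang maps a language code to the list of formulae of that
    language version, in the order in which they appear in the article.
    """
    counts = Counter()
    for formulae in formulae_by_lang.values():
        counts.update(set(formulae))  # each language votes once per formula
    if not counts:
        return None
    best = max(counts.values())
    candidates = {f for f, c in counts.items() if c == best}
    if len(candidates) == 1:
        return next(iter(candidates))
    # Tie: the earliest candidate in the highest-priority language wins.
    for lang in priority:
        for formula in formulae_by_lang.get(lang, []):
            if formula in candidates:
                return formula
    # Fallback for ties outside all priority languages (an assumption).
    return sorted(candidates)[0]


example = {"en": ["a", "b", "c"], "de": ["b", "c"], "fr": ["c", "b"]}
print(pick_defining_formula(example))
```

With only the English version as input, every formula is counted once, so the tie-break always selects the first formula of the English article, which corresponds to the approach of extracting the first formula.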
08:19
Since we want to improve upon the results from Schubotz et al. and also make sure that our results are comparable to theirs, it would be nice if it was possible to reproduce the results of Schubotz et al. Incidentally, that is easily possible if we limit the Wikipedia languages to only one language, English. Now every formula is counted at most once, thus we essentially disregard the number of occurrences of the formula. Instead, only the order of the formulae will be taken into account. While the results show some differences, like for example fewer true positives, those can be attributed to two factors. One: contrary to Schubotz et al., we automated the comparison of the extracted formulae with the correct defining formulae.
09:05
For this comparison we use the similarity method, which did produce incorrect results for some comparisons, as the formulae did not exactly match our definition of similarity even though they were visually equivalent. The second reason is that Wikipedia articles have changed since the publication date of Schubotz et al. Thus some formulae were changed slightly to a mathematically equivalent notation and thus did not match the defining formulae of our gold standard. For example, both sides of the equation were exchanged in one case. This showed that an automated classification of the results is not perfect with the computationally inexpensive current approach. Consequently, more sophisticated methods to determine if two formulae are similar should be used in the future. Next we want to find out which approach is better: using the formula that occurs in the most languages, or using the order of the formulae. It is important to note, however, that we cannot compare both methods directly. For the approach of counting the occurrences of a formula, we always need to additionally choose some measure in case of two or more most common formulae, which influences the results. But this additional measure is not needed in the approach that considers only the order of the formulae, since only one formula can be the first. So in conclusion, a comparison of only the number of occurrences with the order is not possible. Theoretically we could still compare both approaches directly using an arbitrary dataset such that in no case more than one most common formula exists. But such a dataset would most probably be biased: the number of most common formulae might correlate with the number of formulae in the article, which seems to correlate with the length and possibly also the quality of the article. Thus it could impact the results. Although we cannot directly compare both approaches, we can do so indirectly, as you will see soon.
Earlier we used only one language. Now we will take a look at what happens if we use more than one. But we won't be checking if two formulae are similar when we determine the most common formula just yet; that will be done in the next step. For now we will only check if the LaTeX codes match exactly.
11:49
We can see here the number of true positives depending on the number of Wikipedia languages used. We first investigate the one biggest language, English, and then the five and twenty biggest languages, which by the way exclude Cebuano and Waray, since both almost only include bot-generated articles. At last, all 309 languages are used. Remember that with one language only the order matters. Now when the number of languages increases, the occurrence of the most common formula increases, and thus we have fewer cases with more than one most common formula. Thus in fewer cases the order is decisive, so the influence of the order of the formulae on the results gets weaker, and consequently the influence of the occurrences of formulae gets bigger, and the results get worse. This indicates that the order is noticeably more important than the number of occurrences. Thus it seems like the approach of using the order is better than the approach of using the formula occurring in the most languages. As the next step we now check for similar formulae to determine the most common formula. The results are similar to before: the number of true positives negatively correlates with the number of languages. But the results are worse than before. The reason seems to be that the number of times each most common formula occurs increased compared to before due to finding many similar formulae, thus there are more cases where there is only a single most common formula. Consequently, the order is less important than before. This indicates once again that as the impact of the occurrences of the formulae on the results gets bigger, the results get worse. In conclusion, our results showed that the number of times a formula is shared across a Wikipedia article in different languages
14:04
is a bad indicator for the defining formula under our current approach when compared to the order of the formulae. Nevertheless, it cannot be said with certainty that the number of occurrences is an inherently bad indicator for the defining formula. Knowing in how many articles a formula occurs is still believed to be valuable information which might be useful to improve the results of Schubotz et al. Admittedly, the findings indicate otherwise, but it might theoretically be possible that a more sophisticated method is needed in order to use the number of occurrences correctly to improve the results.
14:48
For future work there are a few proposals to be made. Currently only the number of occurrences and the order are taken into consideration. The Wikipedia articles include a lot more information, for example whether the formula is visually highlighted by placing it in a separate line, which means that it was deemed by the Wikipedia editor to be more important, or whether the formula is inside a section named "Definition", or which formula indicators it contains, or the quality of the article. Thus a score-based approach is proposed which uses all of this information and weights it such that some information, like the order, is deemed more important and other information, like the number of occurrences for example, has less influence on the results. In order to determine this weighting, a much bigger dataset is needed than the defining formulae of just 100 articles. For that, Wikidata should be used, since it contains the defining formula for about 4,300 Wikipedia articles.
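As an illustration of what such a score-based approach might look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the feature set `FormulaFeatures`, the way each feature is scaled, and in particular the numbers in `WEIGHTS`, which in the proposed future work would have to be determined from a larger dataset rather than set by hand.

```python
from dataclasses import dataclass


@dataclass
class FormulaFeatures:
    position: int                # 0 = first formula in the article
    language_count: int          # language versions containing the formula
    total_languages: int         # language versions of the article
    on_separate_line: bool       # visually highlighted by the editor
    in_definition_section: bool  # inside a section named "Definition"


# Hand-picked placeholder weights: order counts most, occurrences least.
WEIGHTS = {"order": 0.5, "display": 0.2, "definition": 0.2, "occurrences": 0.1}


def score(f: FormulaFeatures, weights=WEIGHTS) -> float:
    """Weighted sum of features, each scaled to the range [0, 1]."""
    order_term = 1.0 / (1 + f.position)  # earlier formulae score higher
    if f.total_languages:
        occurrence_term = f.language_count / f.total_languages
    else:
        occurrence_term = 0.0
    return (
        weights["order"] * order_term
        + weights["display"] * float(f.on_separate_line)
        + weights["definition"] * float(f.in_definition_section)
        + weights["occurrences"] * occurrence_term
    )


first = FormulaFeatures(0, 2, 10, True, False)
later = FormulaFeatures(3, 10, 10, False, False)
print(score(first) > score(later))
```

With these placeholder weights, a highlighted first formula outranks a later formula that is shared across all language versions, mirroring the finding that the order is the stronger indicator.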
15:55
So I hope you enjoyed this presentation. Thank you for your attention, and of course feel free to ask me any questions you might have, either per email or in the Q&A session on Thursday.