How we found a million style and grammar errors in the English Wikipedia


Formal Metadata

Title
How we found a million style and grammar errors in the English Wikipedia
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
2014
Language
English

Content Metadata

Abstract
LanguageTool is an Open Source proofreading tool developed to detect errors that a common spell checker cannot find, including grammar and style issues. The talk shows how we run LanguageTool on Wikipedia texts, finding many errors (as well as a lot of false alarms). Errors are detected by searching for error patterns that can be specified in XML, making LanguageTool easily extensible. LanguageTool has existed since 2003, and it now contains almost 1,000 patterns to detect errors in English texts. These patterns are a lot like regular expressions, except that they can, for example, also refer to the words' part of speech. The fact that all patterns are independent of each other makes adding more patterns easy. I'll explain the XML syntax of the rules and how more complicated errors, for which the XML syntax is not powerful enough, can be detected by writing Java code. Running LanguageTool on a random 20,000-article subset of the English Wikipedia led to 37,000 errors being detected. However, many of these errors are false alarms, either because of problems with the Wikipedia syntax or because the LanguageTool error patterns are too strict. So we manually looked at 200 of the errors, finding that 29 of the 200 were real errors. Projected to the whole Wikipedia (currently at 4.3 million articles), that's about 1.1 million real errors, and that does not even count simple typos that could be detected by a spell checker. If you want fewer errors in your Wikipedia: LanguageTool offers a web-based tool to send corrections directly to Wikipedia with just a few clicks. And while these numbers refer to the English Wikipedia, LanguageTool also supports German, French, Polish, and many other languages. This talk will contain lots of examples of errors that can be detected automatically, and others that can't. I'll also explain that LanguageTool itself is just a core written in Java (and available on Maven Central), but that it also comes with several front-ends: a stand-alone user interface, add-ons for LibreOffice/OpenOffice and Firefox, and an embedded HTTP server.

Transcript
If you want your presentation slides to be error-free, you use a spell checker, and that's what I did here. Still, this slide has some problems; maybe you can help me find them. Do you see an error up there? Yes, right, that one is not correct. And there are other problems in it, too.

So you probably already knew that spell checking won't find all the errors. LanguageTool works on a solution for exactly that area: you can go to languagetool.org, type these two sentences into the form there and check them, and it will find both of these errors and also offer proper corrections. And that's what this talk is about: what comes after spell checking.
Here's the outline of the talk. First I'll explain how we used LanguageTool to find a million errors in Wikipedia. Then I'll explain how LanguageTool works internally: what approach we're using, what other approaches could be useful for style and grammar checking, and why we're not using those other approaches. Then, of course, I'll make a suggestion for how we can start fixing these million errors we found. Finally, I'll talk about the future work we are planning for LanguageTool.
First, a small survey: how many of you have heard about LanguageTool? OK, not so many. And how many have actually used it? Also not so many. OK, that's why we're here.
So how do you find a million errors in Wikipedia? Basically by running this command: LanguageTool, started as a Java program, on the Wikipedia dump. The command is actually not part of the standard LanguageTool distribution, because it's very specific; checking Wikipedia is not what the average user does all day. So you need to download the nightly build to get it. It then takes three parameters: the check command for Wikipedia data, the path to the XML dump, which is several gigabytes in size, and 'en', which is just the language code for English.
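Reconstructed as a sketch, that invocation looks something like the following; the jar name and the exact subcommand are assumptions based on the talk (the audio suggests something like 'check-data'), and only the three parameters are as described:

    # Hypothetical invocation of the nightly Wikipedia checker described in the
    # talk; jar name and subcommand are assumptions, the three parameters
    # (check command, dump path, language code) are as described above.
    java -jar languagetool-wikipedia.jar check-data /path/to/enwiki-pages-articles.xml en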
When you run it, it will give you results like this. It prints the title of the article where it thinks it found an error, the location of the error, and a message, with the error itself underlined somehow. In this case you can see the text says 'will designated as', but it should be 'will be designated as'. So LanguageTool actually found the problem, but the message is a bit off and the suggestion is not quite correct; at least it found the error, though. If you run this command you get a lot of results like these.
Checking takes only about 10 milliseconds per sentence, but because the English Wikipedia is so very large, running through the entire thing would take about a week on my computer. So what we actually did was run LanguageTool on 20,000 articles, which led to about 37,000 potential errors. Now, what's an error? For the sake of simplicity I'll just say 'errors' when I mean errors and also style suggestions; it does not include the simple spelling mistakes that a common spell checker finds. If you project this number of 37,000 potential errors onto the whole Wikipedia, which has more than 4 million articles, you get 8 million potential errors. Of those we selected 200 randomly and checked them manually, to see how many of the projected 8 million errors would actually be useful. The result is that if we really ran LanguageTool over the whole Wikipedia, you'd get about 1 million errors that are actually useful. And as I said, that does not count simple spelling errors, because we have the spell checker turned off: Wikipedia has just too many proper names, so it generates even more false alarms. Still, with 8 million potential errors and only 1 million of them useful, that's an awful lot of false alarms.
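To make the projection explicit, here is the arithmetic behind those numbers (the 29-of-200 sample result is the one given in the abstract above):

    37,000 potential errors / 20,000 articles ≈ 1.85 potential errors per article
    1.85 × 4,300,000 articles ≈ 8,000,000 potential errors
    29 real errors / 200 sampled ≈ 14.5% precision
    8,000,000 × 0.145 ≈ 1,160,000, i.e. about 1.1 million real errors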
There's a reason for that. First, it is surprisingly difficult to extract plain text from the English Wikipedia. It uses the MediaWiki syntax, and one problem we have with that is that we currently cannot expand templates. So you have text like 'an elevation of about 150 meters'; that's what you see when you read Wikipedia, but in the wiki syntax the number is inside a template, and as we currently cannot expand it, all we get is 'an elevation of about' followed by a gap. In several instances that will of course confuse a grammar checker. Then, as an encyclopedia, Wikipedia obviously has many place names, movie titles, and all kinds of other names in its text, including non-English place names, and those are also difficult to handle for any kind of proofreading tool. You also have cases like the articles about math, where you have something like 'the value of n for a given a is called ...', and 'a' is used in two different meanings: the first is the determiner, as you would expect, and the other is a math symbol. As LanguageTool was not optimized for articles about math, that tends to confuse it. And finally, if you use a proofreading tool on articles that have already been checked quite well, you will obviously get a higher share of false alarms, and most articles in the English Wikipedia have already been checked quite well.
Here are some examples of bad matches, that is, not-so-useful matches that you get. In one text, LanguageTool makes a suggestion for '68000 assembler' because it does not know that the 68000 assembler is actually a kind of product name. You have cases like 'score voting and majority judgment allow these voters ...', where LanguageTool incorrectly suggests to use the third person singular. The reason is that the detection of the noun phrase doesn't work properly here: it only detects 'majority judgment' as the noun phrase, and not 'score voting and majority judgment', and that leads to a false alarm. And finally there's an example where it suggests to use 'an' instead of 'a' because the next word starts with a vowel; usually that is the rule, but of course here it's a false alarm.
Now some useful matches: 'in a vote of 27 journalists from 22 gaming magazine', where LanguageTool suggests to use 'magazines', the plural form, which is correct. The next one is easy, that's just 'flows through through the body'; it suggests to remove the duplicated word. And in 'sending back their work to the teachers computer' it properly detected that the apostrophe in 'teacher's' is missing.

Here is an example of the style hints you might get: if you write something like 'there are many different variations', LanguageTool suggests to use just 'many variations', because 'variations' kind of implies 'different' already. Whether you agree with these suggestions, and whether you consider them useful, is of course up to you.
Now some examples of errors which we cannot detect; these are not from Wikipedia, I made them up. We cannot detect semantic problems, of course, like 'Barack Obama, the president of France'. If you write something like 'I made a concerted effort', your English teacher will tell you that 'concerted' implies that more than one person was involved, so it's not correct; we don't detect that. And if you write a sentence like 'tomorrow I go shopping', which should be 'tomorrow I will go shopping', that's also a case we don't detect yet. Now, you could maybe write rules for these cases, but they would be very specific. I'll talk in a few minutes about how to write rules, and then you will probably see that one could write rules for these cases, but it is doubtful whether that would actually be useful.
Here's a short overview of LanguageTool. The basic idea of LanguageTool has always been to be the next step after spell checking. We don't replace spell checking, but we kind of run after it; nowadays we actually have one component inside LanguageTool that does the traditional spell checking. The project was started in 2003, it's licensed under the LGPL, we have about 10 regular committers now, and we release on a time-based schedule, with a new release every few months. Everything is implemented in Java and XML; you'll see in a few minutes where we use XML and for what. So, as a user, how do you use LanguageTool?
You can use it as a command-line application, or, as most people do, through one of our several extensions: for LibreOffice and OpenOffice, for vim and Emacs, for Firefox, and a few more. If you're from the Java world, you can directly use our Java API. And if you're using some other language, you might use the HTTP server: we ship an HTTP server that returns some very simple XML, which you can then use to underline the errors in a text.
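For the Java API route, here is a minimal sketch of a check; the class and method names follow the publicly documented LanguageTool API (JLanguageTool, RuleMatch), though details may differ from the 2014 release shown in the talk:

    import java.util.List;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.AmericanEnglish;
    import org.languagetool.rules.RuleMatch;

    public class CheckDemo {
      public static void main(String[] args) throws Exception {
        // Create a checker for one language; this loads its rules and tagger.
        JLanguageTool lt = new JLanguageTool(new AmericanEnglish());
        List<RuleMatch> matches = lt.check("Sorry for my bad english.");
        for (RuleMatch match : matches) {
          // Each match carries a text position, a message, and suggestions.
          System.out.println(match.getFromPos() + "-" + match.getToPos()
              + ": " + match.getMessage()
              + " Suggestions: " + match.getSuggestedReplacements());
        }
      }
    }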
Now let's look at how this works internally. LanguageTool takes plain text as input, then it finds the sentences in this text, and within the sentences it finds the words. For each word it then analyzes, for example, the base form: for 'walks' you would get the base form 'walk'. It also finds the part-of-speech tags for each word, and those can be ambiguous: 'walks', for example, can be a plural noun or the third person singular form of the verb. Once we have this analyzed text, we run some Java rules over it, but most importantly we run the error detection patterns over it; they are kind of the core of LanguageTool, and I'll now explain how they work.
The basic idea behind these error detection patterns is, first, to be simple, so that you don't have to be a software developer to contribute new rules. The other idea behind them is that they are all independent of each other, so even if you add a new rule, you cannot break any of the existing rules. That's quite unlike normal software development, where you change something and something else breaks. Here is a slightly simplified example of a rule; I should first say that internally we call these error detection patterns 'rules'. A rule always consists of two parts: the pattern itself, and the message that is displayed to the user when that pattern is found in a sentence. The pattern, in the simplest case, is just a sequence of words. In this example you have the token 'bad' ('token' is just the technical term for a word), followed by the regular expression 'english|attitude'. So this pattern will match the example from the first slide, 'bad english', and also 'bad attitude'. In the message you can then use a reference to the matched token, so the user gets to see a message like 'Did you mean bad English?'.
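As a sketch, the simplified rule just described looks roughly like this in LanguageTool's XML rule format (the element names are LanguageTool's own; the concrete rule is the illustrative one from the slide):

    <rule id="BAD_ENGLISH_DEMO" name="bad english / bad attitude">
      <pattern>
        <token>bad</token>
        <!-- a token can also be a regular expression -->
        <token regexp="yes">english|attitude</token>
      </pattern>
      <!-- the message can refer back to a matched token by its number -->
      <message>Did you mean bad <match no="2"/>?</message>
    </rule>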
That was a very simple example. There are other things you can do in your rules: you can use logical AND and OR, you can use negation, and you can do skipping, that is, match a word, skip over a few words, and then match another word. You can use inflection, which means you don't just match the word 'walk' but all of its forms, without having to list them; for 'walk' it would also match 'walking', 'walks', and 'walked'. And you can match part-of-speech patterns, for example match all verbs, or match only the third person singular forms. All of this is described in detail in our documentation, because it is kind of the core of LanguageTool. Let's look at one more example.
Take the error in 'Always I'm happy'; that's actually a mistake a non-native English speaker might make. As you can see, I'm taking these example sentences from the bottom of the rule, because the two examples are part of the rule itself: one example needs to be incorrect, which means it needs to match the pattern above, and the other example needs to be correct, so it must not match the pattern above. We use these examples inside the rule as unit tests, and they also make it easier to understand what a rule actually does. The pattern here has a token SENT_START, which just means it matches only at the start of a sentence, then a regular expression again, and then a token with an exception, which means: match all tokens except those with certain part-of-speech tags, such as MD or JJ, where MD means modal and JJ means adjective. That may sound confusing, but these tag names are standard in computational linguistics: it's the Penn Treebank tag set, and that's what we use here.
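A rough sketch of how such a rule could be written, with its embedded example sentences; the exact tokens and the example markup are reconstructed loosely from the talk (attribute spellings have varied across LanguageTool versions), so treat this as illustrative rather than as the shipped rule:

    <rule id="ALWAYS_SENT_START_DEMO" name="Always I'm ...">
      <pattern>
        <!-- SENT_START is a pseudo part-of-speech tag: sentence start only -->
        <token postag="SENT_START"/>
        <token regexp="yes">always|often</token>
        <!-- exception: match any token except modals (MD) and adjectives (JJ) -->
        <token><exception postag="MD|JJ" postag_regexp="yes"/></token>
      </pattern>
      <message>The adverb usually goes after the subject: 'I'm always ...'.</message>
      <!-- the incorrect example must match the pattern, the correct one must not;
           both serve as unit tests for the rule -->
      <example type="incorrect">Always I'm happy.</example>
      <example type="correct">I'm always happy.</example>
    </rule>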
We now have these kinds of rules for 29 languages, which technically means we support 29 languages, but we support them to very different degrees. The languages with the largest number of rules are French, German, Catalan, and Polish, and then comes English; there are other languages, like Greek and Japanese, that have fewer than 100 rules. The rule count is only a very rough indication of how well a language is supported, but it does mean that if you switch LanguageTool to Japanese or Greek, it will not work that well, in the sense that it will not find many errors. So there is a lot of work to do for the languages with fewer rules to get better coverage.

What we do is basically pattern matching, and you might wonder: language is kind of complicated, is pattern matching really enough? Or, to ask it in a different way: why don't we use a more powerful approach? Let's first ask what grammar actually means: it's a set of rules that describe what words, sentences, and texts look like, and syntax is the part of grammar that is a formal description of what a sentence looks like. And what is a parser? A parser is something that takes a sequence as input and generates an output structure, a syntax tree for example. You know this from software development, where we do this successfully for source code. So you might ask: why don't we just do the same for English? Why can't we just write a parser for English, the way we write a parser for Python or whatever?

It turns out we kind of could, but it's not the approach we're taking in LanguageTool, because it's just so very difficult. There is no formal description of the English grammar. If you look at a comprehensive grammar book, and you consider it to be some kind of specification, you'll find it has about 1,700 pages. And even if you say, OK, English is just especially complicated, and you look at a constructed language like Esperanto, which should be much easier: its specification, if you want to call it that, is about 700 pages. Also, if you write such a parser, you will more or less end up with a parser that's specific to your language, and we want to support more than one language in LanguageTool. And there are even more reasons why it's difficult to use a parser to check English: having a parser does not automatically mean that you have good error messages. Beyond the basic parser, you need to go one step further and optimize it to give useful error messages, because otherwise you get messages like 'cannot parse this sentence', which are not useful to the user. And even when you have done all that, you're still not finished, because if you look at a sentence like 'sorry for my bad english' from the first slide, it turns out that it actually parses fine: 'bad english' could be a noun phrase, so the sentence is 'sorry for my' plus some noun phrase, and technically 'sorry for my bad english' is a grammatically correct phrase. There are checkers that actually work like this, for example Link Grammar, which is open source and is used in an open-source word processor. So you can go down this road and try to write a parser, but it's difficult, and because, as mentioned, we want to support a lot of languages, it's not the approach we're taking. I'm not saying one approach is better than the other; it's just difficult.
You might also wonder why we don't use machine learning and statistics. First, we do use Apache OpenNLP for finding the chunks, which is another name for phrases: Apache OpenNLP gets us the noun phrases and verb phrases, and it's based on a statistical approach that has been trained; we use it as a kind of black box to find those phrases. However, if you want to use a statistical approach, or machine learning or whatever, to actually find the errors in a text, you need a large corpus where all the errors have been annotated, and you probably need another large corpus that is guaranteed to be free of errors. Then you could maybe come up with some kind of training to get a model, and use that. But it's not so easy to find such annotated errors; someone would have to annotate a lot of them. And while the situation for English is quite good, you can get a lot of corpora and the English Wikipedia, for example, is huge, we also want to support languages that don't have that many resources. So it's difficult to use machine learning; but again, I'm not saying that one approach is better than the other. If you have an idea for how to use machine learning to proofread text automatically, feel free to do it, and you can even plug it into LanguageTool, by writing your own rule in Java. That looks like this: you simply implement one single method from the Rule class, the match method. It gets as input an AnalyzedSentence, which is, as the class name suggests, a sentence together with its tokens, and the tokens have their analyses: their base forms and their part-of-speech tags. You can then do with it whatever you want and run any logic you like; you can even ignore our analysis of the base forms and part-of-speech tags and just look at the original text. If you think you found a match, you just return that. So it's quite easy to integrate into LanguageTool, and if you do this, you get all the stuff we provide for free: the graphical user interface, the command-line interface, and the extensions. It should be better to plug into this mechanism than to start writing your own grammar checker, where you would have to write all that plumbing from scratch.
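Here is a minimal sketch of such a Java rule; it follows the current public API (Rule, AnalyzedSentence, AnalyzedTokenReadings, RuleMatch), which may differ slightly from the 2014 version, and the rule itself, flagging 'irregardless', is a made-up illustration:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.AnalyzedTokenReadings;
    import org.languagetool.rules.Rule;
    import org.languagetool.rules.RuleMatch;

    // A minimal custom rule that flags the non-standard word "irregardless".
    public class IrregardlessRule extends Rule {

      @Override
      public String getId() { return "IRREGARDLESS_EXAMPLE"; }

      @Override
      public String getDescription() { return "Flags 'irregardless'"; }

      @Override
      public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
        List<RuleMatch> matches = new ArrayList<>();
        // Each AnalyzedTokenReadings carries the token text, its position,
        // and its (possibly ambiguous) base forms and POS tags.
        for (AnalyzedTokenReadings token : sentence.getTokensWithoutWhitespace()) {
          if ("irregardless".equalsIgnoreCase(token.getToken())) {
            RuleMatch match = new RuleMatch(this, sentence,
                token.getStartPos(), token.getEndPos(),
                "Did you mean 'regardless'?");
            match.setSuggestedReplacement("regardless");
            matches.add(match);
          }
        }
        return matches.toArray(new RuleMatch[0]);
      }
    }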
So now I've given an overview of how LanguageTool works, but we still have that million errors in Wikipedia; how do we fix them? Let me first say why we don't really have one million errors in our database: first, because we only ran over a small subset, and second, because I think having one million errors sitting in your database is kind of overwhelming; it would probably not motivate anyone to start fixing them. I mean, who wants to work on a to-do list with one million items? What you can do is have a look at community.languagetool.org, which lists a few thousand errors. If one of them is a false alarm, you can log in and mark it as a false alarm, and if it's not a false alarm, you can click on the link and fix the problem within minutes. What you see there comes from checking the XML dumps from Wikipedia, so it's not really live; the dumps are generated every two weeks or so. And as we are all lazy, that probably won't get us very far, so what I suggest is actually something else.
As a first step, we now have a new feature where we check the recent changes from Wikipedia. We fetch the atom feed of recent changes twice a minute and then run LanguageTool over it; not over the complete articles, but only over those parts that were changed. That way we detect when someone has made an edit that has introduced an error, and we usually also detect when someone has made an edit that has fixed an error. What we end up with is a database of freshly introduced errors, and it looks like this: you have the error message, and the blue square underlines the error. If it's a false alarm, you can mark it as such or just ignore it. But now assume, as in this first example, that it's actually an error. What you can do then is click on the 'check page' link, and at that moment we go to Wikipedia, fetch the current version of the article via the API, and run the complete article through LanguageTool. We will then show you a page where, if everything works well, without typing anything, you can just click on the corrections made by LanguageTool, if they are useful and not false alarms, and then submit the page, and we will send your edit directly to Wikipedia. So it's as if you had made the edit on Wikipedia yourself; you may want to check one final time that we didn't break anything. We also conveniently set the edit summary and tick the 'this is a minor edit' checkbox, because usually this is just some kind of grammar or spelling fix. So this is what I would suggest as a first step: instead of working on those one million errors, which are just sitting somewhere, let's start by trying to make sure that not so many new errors get introduced in the future.
This is not yet activated for all languages, but if you are keen to actually use it and to make the effort, in the sense that you improve the rules and disable the rules that are not useful, then let me know which language you want to have activated and I'll try to do that. For now it's activated, I think, for German, English, and French. So finally, a few words about the future work on LanguageTool, without walking you through our entire to-do list.
What I would basically like to see, of course, is spelling, style, and grammar checking becoming ubiquitous. Today, almost anywhere you type text, it can be spell-checked, and I think we shouldn't stop there: we now have more powerful tools, we can do style and grammar checking, so we should use them. Where do we stand? We have a Java API, also available on Maven Central, which is useful for Java developers. We have an HTTP and HTTPS API for a more loosely coupled integration, with support for many languages, and a license that I think should be liberal enough for almost all use cases. And we are written in Java, which is absolutely fine if your software is also written in Java; if your software is not written in Java, you might not like that. So how can we, despite being written in Java, become ubiquitous? For example, could we ever run in the browser, despite being implemented in Java? My idea was to just compile the Java code to JavaScript. OK, I tried that, and it failed, and so far nobody has replied to my StackOverflow question; that's why I'm asking you for help. You can have a look at that question and answer it, or even better, come talk to me after the talk and explain to me how compiling a complex Java application to JavaScript is done. Of course, we also need help from people who want to add support for another language to LanguageTool. It's not even that difficult: you can usually start from another language and then write one rule after the other. As I mentioned, we have a lot of languages that are not actively maintained, or not actively maintained enough, so there are languages that are really in need of a maintainer, and if you want to maintain one of these languages, we would be very happy to welcome you. Maintaining a language basically means writing new rules and making sure that the existing rules get improved and don't create too many false alarms; you don't have to be a programmer for that. Of course, we also welcome developers.
So in summary, I'd like to say: I think today we shouldn't settle for simple spell checking that is totally without context; we have better tools available today, and I suggest you use them, as users and as developers. I hope that our style and grammar checking of Wikipedia is kind of a proof that this technology can be useful despite the number of false alarms. And of course, contributions are welcome; you can talk to me about this after the talk. And that's it from me.
Are there any questions?

[Question, inaudible.] So the question is: how do you find errors in Japanese without any rules? We do have rules for Japanese, just not so many. You see zero in the database? Then perhaps something is off there.

Maybe in the meantime, some other questions; we have a question here: can LanguageTool cope with text that contains markup, like LaTeX, or will that be possible in the near future? Yes and no. We kind of push this task to the software that integrates LanguageTool: what we demand of software that uses LanguageTool is that it tells us where the markup is, the positions of the markup, and then we will just ignore those parts. So if there's software that knows the markup and feeds us the text including the markup positions, then yes, we can handle such text. But we cannot handle it in the sense of knowing that some markup is, say, a headline, and then applying some special headline model; that we cannot do yet. More questions?

[Question about the Java rules.] Most rules for LanguageTool are XML rules; the Java rules are just for some special cases. But if you prefer, for some reason, to write your rules in Java, maybe because the XML-based approach that I showed is not powerful enough, then you can always write your rules in Java. OK, any other questions? OK, you can come talk to me later, and thank you.