How we found a million style and grammar errors in the English Wikipedia
Formal Metadata

Title: How we found a million style and grammar errors in the English Wikipedia
License: CC Attribution 2.0 Belgium: You are free to use, adapt, copy, distribute, and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date: 2014
Language: English

Content Metadata

Abstract: LanguageTool is an open-source proofreading tool that detects errors a common spell checker cannot find, including grammar and style issues. The talk shows how we ran LanguageTool on Wikipedia texts, finding many errors (as well as a lot of false alarms). Errors are detected by searching for error patterns that can be specified in XML, making LanguageTool easily extensible. LanguageTool has existed since 2003, and it now contains almost 1,000 patterns for detecting errors in English texts. These patterns are much like regular expressions, except that they can also refer to, for example, a word's part of speech. The fact that all patterns are independent of each other makes adding more patterns easy. I'll explain the XML syntax of the rules and how more complicated errors, for which the XML syntax is not powerful enough, can be detected by writing Java code. Running LanguageTool on a random 20,000-article subset of the English Wikipedia led to 37,000 errors being detected. However, many of these errors are false alarms, either because of problems with the Wikipedia syntax or because the LanguageTool error patterns are too strict. So we manually looked at 200 of the errors, finding that 29 of the 200 were real errors. Projected to the whole Wikipedia (currently at 4.3 million articles), that's about 1.1 million real errors - and that does not even count simple typos that could be detected by a spell checker. If you want fewer errors in your Wikipedia, LanguageTool offers a web-based tool to send corrections directly to Wikipedia with just a few clicks. And while these numbers refer to the English Wikipedia, LanguageTool also supports German, French, Polish, and many other languages. This talk contains lots of examples of errors that can be detected automatically, and of others that can't. I'll also explain that LanguageTool itself is just a core written in Java (and available on Maven Central), but that it also comes with several front-ends: a stand-alone user interface, add-ons for LibreOffice/OpenOffice and Firefox, and an embedded HTTP server.

Transcript
00:00
Check this out: if you want your presentation slides to be error-free, you use a spell checker, and that's what I did here. Still, there seem to be some complaints; maybe you can help me. Do you see an error? Yes, right, that one is not correct. Any other problems?
00:28
Yes, and this one here as well; that's also wrong. Yeah, the spell checker happily accepts things like that.
00:42
Alright. So you probably knew that
00:52
spell checking won't find all the errors. But you can work on a solution for that: you can try languagetool.org, and when you enter these two sentences into the form there and check them, it will find both of these errors and will also make proper correction suggestions. And that's what this talk is about: what comes after spell checking.
01:20
Here's the outline of the talk. First, I'll explain how we used LanguageTool to find a million errors in Wikipedia. Then I'll explain how LanguageTool works internally: what approach we're using for LanguageTool, what other approaches could be useful for style and grammar checking, and why we're not using those other approaches. Then, of course, I'll make a suggestion for how we can start fixing this million errors we found. Finally, I'll talk about the future work we are planning for LanguageTool.
01:59
First, a small survey: how many people here have heard of LanguageTool? OK, not
02:06
so many. And how many have actually used it? Also not so many. OK, that's why we're here.
02:16
So how do you find a million errors in Wikipedia? Basically, by running one command: LanguageTool over the Wikipedia dump. It's actually not part of the standard LanguageTool distribution, because it's very specific; checking Wikipedia is not what the average user does all day. So it's separate, and you need to download the nightly build to get this command. It takes three parameters: check-data, which means check the XML dump from Wikipedia; the path to that XML dump, which is several gigabytes in size; and en, which is just the language code for English. And then the check will start.
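The exact command from the slide isn't preserved in the transcript, so here is only a rough sketch of such an invocation; the jar name and argument order are assumptions based on the description above, not the verbatim 2014 command:

    # Hypothetical invocation of the Wikipedia checker from a nightly build;
    # "check-data", the dump path, and "en" are the three parameters named
    # in the talk, while the jar name is a guess:
    java -jar languagetool-wikipedia.jar check-data \
        /path/to/enwiki-pages-articles.xml \
        en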
02:58
It will give you results like this: it will print the title of the article where it found an error, the location of the error, and a message, with the error itself somehow underlined. In this case you can see the text says 'will designated as', but it should be 'will be designated as'. So LanguageTool actually found the problem, although the message is a bit off and the suggestion is not quite correct; but at least it found the error. If you run this command, you get a lot of these, all over Wikipedia.
03:48
Checking Wikipedia takes about 10 milliseconds per sentence, but because the English Wikipedia is so very large, it would take about one week or so on my computer to run through the entire thing. So what we actually did was run LanguageTool on 20,000 articles, which led to about 37,000 potential errors. Now, what's an error? For the sake of simplicity, I'll just say 'errors' when I mean errors and also style suggestions; it does not include the simple spelling mistakes that a common spell checker finds. If you project this number of 37,000 potential errors to the whole Wikipedia, which has more than 4 million articles, you get 8 million potential errors. Of those, we selected 200 randomly and checked them manually, to see how many of these projected 8 million errors are actual, useful errors. And the result is that if we ran LanguageTool over the whole Wikipedia, you'd get about 1 million errors that are actually useful. And as I said, that does not count simple
05:05
spelling errors, because we turned spell checking off; there are just too many proper names in Wikipedia, so it would produce even more false alarms. So, still: 8 million potential errors and only 1 million useful ones. That's a lot of false alarms.
05:29
And there's a reason for that. First, it's surprisingly difficult to extract plain text from the English Wikipedia. It uses the MediaWiki syntax, and one problem we have with that is that we currently cannot expand templates. So you have text like 'an elevation of about 150 meters'; that's what you see when you read Wikipedia, but in the wiki syntax there's a template inside, and as we currently cannot expand it, we only get 'an elevation of about' and then a space. That will, of course, in several instances confuse a grammar checker. As an encyclopedia, you also have many place names, movie titles, all kinds of names in your text, including non-English place names, and that's also difficult for any kind of proofreading tool to handle. You also have cases like articles about math, where you have something like 'the value of A for a given A is called ...', where 'A' is used in two different meanings: the first one is the determinant, as you would expect, and the other one is some kind of matrix symbol. As LanguageTool was not optimized for articles about math, that will confuse it. Also, if you use LanguageTool, or any proofreading tool, on articles that have already been checked quite well, you will obviously get relatively more false alarms, and most articles in the English Wikipedia have already been checked quite well, so we find fewer real errors there.
07:23
So here are some examples of bad matches, that is, not-so-useful matches that you get. The text says '68000 assembler', and LanguageTool suggests using the plural form, because it does not know that '68000 assembler' is actually a kind of product name here. You have cases like 'score voting and majority judgment allow these voters ...', where LanguageTool incorrectly suggests using the third person singular. The reason is that the detection of the noun phrase doesn't work properly: in this case, LanguageTool only detects 'majority judgment' as the noun phrase, and not 'score voting and majority judgment', and that leads to a false alarm. And finally, there's this example from another article, where it suggests using 'an' because the next word starts with a vowel sound; usually that is the case, that you use 'an' instead of 'a', but of course here it's a false alarm. Now, some
08:38
useful matches. In 'a vote of 27 journalists from 22 gaming magazine', LanguageTool suggests using 'magazines', the plural form. OK, the next one is easy: the text says 'flows through through the body', and it suggests removing the duplication. And in 'sending back their work to the teachers computer', it properly detected the missing apostrophe in 'teacher's'.
09:11
Here's an example of a style suggestion you might get: if you write something like 'there are many different variations', LanguageTool suggests using just 'many variations', because 'variations' kind of implies 'different' already. Whether you agree with these suggestions, whether you consider them useful, is of course up to you. Now some
09:39
examples of errors which we cannot detect; these are not from Wikipedia, I made them up. We cannot detect semantic problems, of course, like 'Barack Obama, the president of France'. If you write something like 'I made a concerted effort', your English teacher will tell you that 'concerted' implies more than one person was involved, so that's not correct; but we won't detect that for now. And if you write a sentence like 'tomorrow I go shopping', which should be 'tomorrow I will go shopping', that's also a case we don't detect yet. Now, you could maybe write rules for these, but they would be very specific. I'll talk in a few minutes about how to write rules, and then you will probably see that you could write rules for these cases, but it's doubtful whether that would actually be useful.
10:43
So here's a short overview of LanguageTool. The basic idea of LanguageTool was always to be the next step after spell checking. We don't replace spell checking, but we kind of run after it; nowadays we actually have one component inside LanguageTool that does the traditional spell checking. The project was started in 2003, it's licensed under the LGPL, we have about 10 regular committers now, and we release on a time-based schedule, with a new release every three months. Everything is implemented in Java and XML; you'll see in a few minutes where we use XML. So, as a user, how do you use
11:36
LanguageTool? You can use it as a command-line application or as a stand-alone application. We also have several extensions: for LibreOffice and OpenOffice, for vim and Emacs, for Firefox, and probably a few more. If you're from the Java world, you can directly use our Java API. If you're using some other programming language, you might use the embedded HTTP server: we have an HTTP API that returns some very simple XML, which you can then parse to see the errors in your text.
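As a rough illustration of that HTTP API, here is a hedged sketch; port 8081 and the 'language'/'text' parameters match the embedded server as documented around that time, but verify the details for your version:

    # Send a text to a locally running LanguageTool server:
    curl 'http://localhost:8081/?language=en-US&text=Sorry+for+my+bad+english'
    # The response is simple XML, one <error> element per match, with the
    # offset, the message, and the suggested replacements as attributes.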
12:20
And now let's look at how LanguageTool works
12:23
internally. LanguageTool takes plain text as input; it finds the sentences in this text, and in the sentences it finds the words. Each word is then analyzed: for example, we find its base form; for 'walks' you'd get the base form 'walk'. We also find the part-of-speech tags for each word, which can be ambiguous: in the case of 'walks', you would get that it can be a plural noun, or it can also be the third person singular of the verb. Then, over this analyzed text, we run some Java rules, but most importantly we run the error detection patterns, which are kind of the core of LanguageTool, and I'll now explain how these work.
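To make that plain-text-in, matches-out pipeline concrete before we get to the patterns, here is a minimal sketch of driving it through the public Java API; the class names follow current LanguageTool releases and may differ slightly from the 2014-era API:

    import java.util.List;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.BritishEnglish;
    import org.languagetool.rules.RuleMatch;

    public class CheckDemo {
        public static void main(String[] args) throws Exception {
            // Creating the checker loads the sentence splitter, tokenizer,
            // part-of-speech tagger, and all rules for one language.
            JLanguageTool langTool = new JLanguageTool(new BritishEnglish());
            // check() runs the whole pipeline described above on plain text.
            List<RuleMatch> matches = langTool.check("Sorry for my bad english.");
            for (RuleMatch match : matches) {
                System.out.println("Potential error at characters "
                        + match.getFromPos() + "-" + match.getToPos()
                        + ": " + match.getMessage());
                System.out.println("  Suggestions: " + match.getSuggestedReplacements());
            }
        }
    }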
13:23
The basic idea behind these error detection patterns is, first, to be simple, so you don't have to be a software developer to contribute new rules. The other idea is that they are all independent of each other, so even if you add a new rule, you cannot break any of the existing rules; that's quite unlike software development, where you change something and something else breaks. Here's a slightly simplified example of a rule. First I should say that internally we call these error detection patterns 'rules'. A rule always consists of two parts: the pattern itself, and the message that is displayed to the user when the pattern is found in a sentence. The pattern, in the simplest case, is just a sequence of words. In this example you have the token 'bad' ('token' is just the technical term for a word), followed by the regular expression 'english|attitude'. So this pattern will match the example from the first slide, 'bad english', and also 'bad attitude'. And in the message you can see the back-reference: it will be replaced with 'english' or 'attitude', so the user gets to see a message like 'Did you mean bad English?'.
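In the XML rule format, a rule along these lines looks roughly like the sketch below; the rule id and the message wording are invented for illustration, while token, regexp="yes", and the back-reference syntax are the documented building blocks:

    <rule id="DEMO_BAD_X" name="Demo: bad english/attitude">
      <pattern>
        <token>bad</token>
        <!-- regexp="yes" makes this token a regular expression -->
        <token regexp="yes">english|attitude</token>
      </pattern>
      <!-- \2 is a back-reference to the second matched token, so the user
           sees the actual word from their text inside the message -->
      <message>Did you mean bad \2?</message>
    </rule>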
15:02
That was a very simple example. Other things you can do in your rules: you can use logical AND and OR, you can use negation, and you can do skipping, that is, match a word, skip over a number of words,
15:13
and then match another word. You can use inflection, which means you don't just match the word 'walk' but also all of its forms, without naming them all; for 'walk' it would also match 'walking', 'walks', and 'walked'. And you can match part-of-speech patterns: match all verbs, for example, or match only the third person singular of verbs. All of that is documented in detail, because this is kind of the core of LanguageTool.
15:45
One more example: this rule detects the error in
15:50
'Always I'm happy', which is actually a mistake that a non-native English speaker might make. I'm taking these example sentences, as you can see, from the bottom of the rule, because these two examples are part of the rule: one example needs to be incorrect, which means it must match the pattern above, and the other example needs to be correct, so it must not match the pattern. We use these examples inside the rules for our unit tests, and they also make it easier to understand what a rule actually does. So this rule has a token SENT_START, which just means it matches only at the start of a sentence; then a regular expression, matching 'always' among other words; and then a token with an exception, which means it matches all tokens except those listed, and those are MD, JJ, and VB: MD means modal verb and JJ means adjective. That may sound confusing, but these tag names are actually standard in computational linguistics; it's the Penn Treebank tag set, and that's what we use here.
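Here is a hedged reconstruction of what such a rule could look like in the XML format; the token values, the tag list, and the message are pieced together from the talk rather than copied from the real rule:

    <rule id="DEMO_ALWAYS_I" name="Demo: 'Always I'm happy'">
      <pattern>
        <!-- SENT_START is a pseudo tag that anchors the match to the
             beginning of the sentence -->
        <token postag="SENT_START"/>
        <token regexp="yes">always|often</token>
        <token>
          <!-- match any token EXCEPT modals (MD), adjectives (JJ), and
               base-form verbs (VB), using Penn Treebank tags -->
          <exception postag="MD|JJ|VB" postag_regexp="yes"/>
        </token>
      </pattern>
      <message>After 'always' at the start of a sentence, the subject usually comes first.</message>
      <example type="incorrect">Always I'm happy.</example>
      <example type="correct">I'm always happy.</example>
    </rule>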
17:24
We now have these kinds of rules for 29 languages, which technically means we support 29 languages, but we support them to very different degrees. The languages with the largest number of rules are French, German, Catalan, and Polish, and then comes English; French, for example, has more than 2,000 rules, while other languages, like Greek and Japanese, have fewer than 100 rules. This is only a very rough indication of how well a language is supported, but it does mean that if you switch LanguageTool to Japanese or Greek, it will not work that well, in the sense that it will not find that many errors. So there's still a lot of work to do for the languages with fewer rules to get better coverage. Now, what we do is basically pattern matching, and you might wonder: language is kind of complicated, so is pattern matching really enough? Or, to ask it in a different way, why don't we use a more powerful approach? Let's take a step back and ask: what is a grammar, actually? It's a set of rules that describe how words, sentences, and texts look. And syntax is the part of grammar that is, more or less, a formal description of how valid sentences look. You can also ask: what is a parser? A parser is something that takes a sequence as input and generates some output structure, a tree for example. And you know this from
19:20
software development; we do this successfully all the time. So you might ask: why don't we just do the same for English? Why can't we just write a parser for English, like the parsers we have for Java or Python or whatever other language? Well, it turns out
19:44
that's not the approach we're taking in LanguageTool, because it's just so very difficult. There is no formal specification of the English grammar. If you look at a comprehensive grammar of English and consider it to be some kind of specification, you'll find it has 1,700 pages. And even if you say, OK, that's just English, English is especially complicated or whatever, and you look at a constructed language like Esperanto, which should be much simpler, its grammar description, if you want to call it that, is still about 700 pages. So it's that complicated. Also, if you write such a parser, you will more or less end up with a parser that's specific to your language, and we want to support more than one language in LanguageTool. And there are even more reasons why it's difficult to use a parser for English: having a parser does not automatically mean that you have good error
20:57
messages. With a parser, you need to go one step further and optimize it to give useful error messages, because otherwise you get messages like 'cannot parse this sentence', which is not useful to the user. And even when you have done all that, you're still not finished, because if you look at a sentence like 'sorry for my bad english' from the first slide, it turns out it actually parses fine: lowercase 'english' could also be some noun phrase, 'sorry for my' plus a noun phrase, so technically 'sorry for my bad english' is a grammatically correct phrase. There are parsers that actually work like this, for example Link Grammar, which is open source and is used in AbiWord, the open-source word-processing software. So you can go down this path and try to write a parser, but it's difficult, and because, as I mentioned, we want to support a lot of languages, it's not the approach we're taking. I'm not saying one approach is better than the other; it's just difficult. You might also
22:27
wonder: why don't we use machine learning and statistics? First of all, we do use Apache OpenNLP for finding the chunks ('chunks' is another name for phrases). Apache OpenNLP gets us the noun phrases and verb phrases, and that's based on a statistical approach that has been trained; we use it as a kind of black box to find those phrases. However, if you want to use a statistical approach, machine learning or whatever, to actually find the errors in a text, you need some large corpus where all the errors have been annotated, and you probably need another large corpus where everything is guaranteed to be free of errors. Then you could maybe come up with some kind of training to get a model, and then you could use that. But it's not so easy to find such annotated data; you would have to annotate a lot of it yourself. And while the situation for English is quite good (you can get a lot of corpora, and the English Wikipedia, for example, is huge), we also want to support languages that don't have so many resources. So it's difficult to use machine learning; but again, I'm not saying our approach is the better one. If you have some idea of how to use machine learning to proofread text automatically, feel free to do that, and even then you can plug it into LanguageTool, by writing your own rule in Java. That looks like this: you simply implement one single method from the Rule class, the match method. It gets as its input an AnalyzedSentence, which, as the class name suggests, is a sentence with its tokens, and the tokens have their
24:24
analyses, that is, their base forms and their part-of-speech tags. And then you can do with that whatever you want: you can run any logic you want, and you can even ignore our analysis of the base forms and part-of-speech tags and just look at the original text. If you think you found a match, you create a rule match and simply return it. So it's quite easy to integrate into LanguageTool, and if you do this, you get all the stuff we provide for free: the graphical user interface, the command-line interface, and the extensions. So it should be better to plug into this mechanism than to start writing your own grammar checker, where you'd have to write all the proofreading infrastructure from scratch.
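A minimal sketch of such a Java rule, using the Rule and match() API described here; the rule's logic is made up for illustration, and details like the exact RuleMatch constructor vary between LanguageTool versions:

    import java.util.ArrayList;
    import java.util.List;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.AnalyzedTokenReadings;
    import org.languagetool.rules.Rule;
    import org.languagetool.rules.RuleMatch;

    public class MyDemoRule extends Rule {

        @Override
        public String getId() { return "MY_DEMO_RULE"; }

        @Override
        public String getDescription() { return "Flags the (made-up) word 'foo'"; }

        @Override
        public RuleMatch[] match(AnalyzedSentence sentence) {
            List<RuleMatch> matches = new ArrayList<>();
            for (AnalyzedTokenReadings token : sentence.getTokens()) {
                // You could inspect base forms and POS tags here; this demo
                // just looks at the original text of each token.
                if ("foo".equals(token.getToken())) {
                    matches.add(new RuleMatch(this, token.getStartPos(),
                            token.getStartPos() + token.getToken().length(),
                            "Did you really mean 'foo'?"));
                }
            }
            return matches.toArray(new RuleMatch[0]);
        }

        // Older versions of the Rule base class also require reset().
        public void reset() {}
    }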
25:25
So, I've given you some overview of how LanguageTool works. We still have that million errors in Wikipedia; how do we fix them? Well, I have to admit that we don't really have one
25:44
million errors sitting in a database, because, first, we only ran the check on a small subset. And the second reason is: I think having 1 million errors actually sitting in your database is kind of overwhelming, and it would probably not motivate anyone to start fixing them; I mean, who wants to work on a to-do list with 1 million items? What you can do is have a look at community.languagetool.org, which lists a few thousand errors. If one of them is a false alarm, you can log in and mark it as a false alarm; and if it's not a false alarm, you can click on the link, go to Wikipedia, and fix the problem. What you see there comes from checking the XML dumps from Wikipedia, so it's not really live; as far as I understand, the dumps are generated every two weeks or so. OK, but that alone probably won't get us very far. So what I suggest is actually
26:47
something else as a first step. We have a new feature where we check the recent
26:53
changes from Wikipedia. We fetch the Atom feed of recent changes twice a minute, and then we run LanguageTool over it; not over the complete articles, but only over those parts that were affected by the change. That way we detect when someone has made an edit that has introduced an error, and we can usually also detect when someone has made an edit that has fixed an error. So what we end up with is a database of freshly introduced errors.
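As a loose sketch of that polling loop (the URL is Wikipedia's real Special:RecentChanges Atom feed; the diff parsing and the actual check are elided, and none of this is LanguageTool's own code):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;

    public class RecentChangesPoller {
        public static void main(String[] args) throws Exception {
            URL feed = new URL("https://en.wikipedia.org/w/index.php"
                    + "?title=Special:RecentChanges&feed=atom");
            while (true) {
                // Fetch the Atom feed of recent edits as a string...
                try (InputStream in = feed.openStream();
                     Scanner scanner = new Scanner(in, "UTF-8")) {
                    String atomXml = scanner.useDelimiter("\\A").next();
                    // ...then parse its <entry> elements, extract only the
                    // text changed by each edit, and run LanguageTool on
                    // those changed parts (omitted here).
                }
                Thread.sleep(30 * 1000);  // twice a minute, as in the talk
            }
        }
    }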
27:37
It looks like this: you have the error message, and
27:45
a blue square that marks the error. Now, in many cases it is a false alarm; then you can click 'this is a false alarm', or you can just ignore it. But let's assume, as in this first example, that it's actually an error. What you can do then is click on 'check page online', and at that moment we go to Wikipedia, fetch the current version of the article via the API, run the complete article through LanguageTool, and we will then
28:22
show you this page, where, if everything works well, you can, without typing anything, just click on the corrections suggested by LanguageTool, if they're useful and not false alarms, and then submit the page, and we will send it directly to
28:42
Wikipedia. So it's as if you had made the edit on Wikipedia yourself, and you end up in the diff view, where you can have one last look to make sure we didn't break anything. We also conveniently set the edit summary and tick the 'this is a minor edit' checkbox, because usually this is just some kind of grammar or spelling fix. So this is what I would suggest as a first step: instead of working on those 1 million existing errors, let's start by trying to make sure that not so many new errors get introduced in the future. For
29:28
now, this is not yet activated for all languages, but if you're keen to actually use it and to help me, in the sense that you improve the rules and disable the rules that are not useful, then let me know which language you want activated and I'll try to do that. For now it's activated, I think, for German, English, and French. So, finally, some words
30:06
about the future work on LanguageTool, without walking you through our entire to-do list.
30:13
What I would basically like to see, of course, is for style and grammar checking to become as ubiquitous as spell checking is today. Basically anything you type today can be spell-checked anywhere, and I think we shouldn't stop there: we now have more powerful tools, we can do style and grammar checking, so we should use them. And what's the current state? Well, we do have the Java API, which is also available on Maven Central; that's useful for Java developers. We have an HTTP and HTTPS API for more loosely coupled integration, with support for many languages, and with a license that I think should be liberal enough for almost all use cases. And we're written in Java, which is absolutely fine if your software is also written in Java; if your software is not written in Java, you might not like that. So how can we, despite being written in Java, become ubiquitous? For example, could we ever run in the browser, despite being implemented in Java? My idea was to just compile Java to JavaScript with GWT. OK, I tried that, and it failed, and so far nobody has replied to my StackOverflow question. That's why I'm asking you for help: you can have a look at that question and answer it, or, even better, you can come talk to me after the talk and explain to me how this is done, compiling a complex Java application to JavaScript. Of course, we'd also like to get help from people who want to add support for another language to LanguageTool. It's not even that difficult: you can usually start from an existing language and then write one rule after the other. As I mentioned, we have a lot of languages that are not actively maintained, or not actively maintained enough, so the users of those languages are really in need of maintainers. If you want to maintain one of these languages, we would be very happy to welcome you. Maintaining a language basically means writing new rules and making sure that the existing rules get improved and don't create too many false alarms. You don't have to be a programmer for that. And of course we also welcome developers.
33:04
So, in summary, I'd
33:15
like to say: I think today we shouldn't stick to simple spell checking, which totally ignores the context; we have more powerful tools available today, and I suggest you use them, as users or as developers. And I hope that our style and grammar checking of Wikipedia is a kind of proof that this technology can be useful, despite the number of false alarms. Of course, contributions are welcome; you can talk to me about it after the talk. And that's it from
33:49
me. Thank you, it's a nice conference. Are there any questions? So the question is: how do you find errors in Japanese without having any rules? Well, we do have rules for Japanese, just not so many. What was that? You see zero in the database? Hmm, then maybe that's a bug on our side.
35:10
OK, maybe in the meantime some other questions; we have a question here: can LanguageTool cope with text with some markup, like LaTeX, or will that be possible in the near future? Let's say: yes and no. We kind of push this task to the software that integrates LanguageTool. What we demand of software that uses LanguageTool is that it tells us where the markup is, the positions of the markup, and then we will just ignore it. So if the software knows its markup and feeds us the text including the markup positions, then yes, we can handle the text. But we cannot handle it in the sense of knowing, OK, this markup is a headline, and then applying some special headline model; that's not something we can do yet. More questions? The question is about the Java rules: we have only a few Java rules for LanguageTool, but we do have a lot of XML rules; the Java ones are just some special cases. If you prefer, for some reason, to write your rules in Java, maybe because the XML-based approach that I showed is not powerful enough, then you can always write your rules in Java. OK, any other questions? OK, you can also come talk to me later. Thank you.
