How we found a million style and grammar errors in the English Wikipedia
Formal Metadata

Title: How we found a million style and grammar errors in the English Wikipedia
License: CC Attribution 2.0 Belgium: You are free to use, adapt, copy, distribute, and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date: 2014
Language: English

Content Metadata

Abstract: LanguageTool is an open-source proofreading tool that detects errors a common spell checker cannot find, including grammar and style issues. The talk shows how we ran LanguageTool on Wikipedia texts, finding many errors (as well as a lot of false alarms). Errors are detected by searching for error patterns that can be specified in XML, making LanguageTool easily extensible. LanguageTool has existed since 2003, and it now contains almost 1,000 patterns for detecting errors in English texts. These patterns are much like regular expressions, except that they can also refer to, for example, a word's part of speech. The fact that all patterns are independent of each other makes adding more patterns easy. I'll explain the XML syntax of the rules and how more complicated errors, for which the XML syntax is not powerful enough, can be detected by writing Java code. Running LanguageTool on a random 20,000-article subset of the English Wikipedia led to 37,000 errors being detected. However, many of these errors are false alarms, either because of problems with the Wikipedia syntax or because the LanguageTool error patterns are too strict. So we manually looked at 200 of the errors, finding that 29 of the 200 were real errors. Projected to the whole Wikipedia (currently at 4.3 million articles), that's about 1.1 million real errors - and that does not even count simple typos that could be detected by a spell checker. If you want fewer errors in your Wikipedia, LanguageTool offers a web-based tool to send corrections directly to Wikipedia with just a few clicks. And while these numbers refer to the English Wikipedia, LanguageTool also supports German, French, Polish, and many other languages. This talk contains lots of examples of errors that can be detected automatically, and of others that can't. I'll also explain that LanguageTool itself is just a core written in Java (and available on Maven Central), but that it also comes with several front-ends: a stand-alone user interface, add-ons for LibreOffice/OpenOffice and Firefox, and an embedded HTTP server.

Transcript
00:00
Check this out: if you want your presentation slides to be error-free, you use a spell checker, and that's what I did here. Still, there seem to be some complaints; maybe you can help me. Do you see an error? Yes, right, that one is not correct. Any other problems?
00:28
Yes, and this one here as well; that's also wrong. Yeah, the spell checker happily accepts things like that.
00:42
Alright. So you probably knew that
00:52
spell checking won't find all the errors. But you can work on a solution for that: you can try languagetool.org, and when you enter these two sentences into the form there and check them, it will find both of these errors and will also make proper correction suggestions. And that's what this talk is about: what comes after spell checking.
01:20
Here's the outline of the talk. First, I'll explain how we used LanguageTool to find a million errors in Wikipedia. Then I'll explain how LanguageTool works internally: what approach we're using for LanguageTool, what other approaches could be useful for style and grammar checking, and why we're not using those other approaches. Then, of course, I'll make a suggestion for how we can start fixing this million errors we found. Finally, I'll talk about the future work we are planning for LanguageTool.
01:59
First, a small survey: how many people here have heard of LanguageTool? OK, not
02:06
so many. And how many have actually used it? Also not so many. OK, that's why we're here.
02:16
So how do you find a million errors in Wikipedia? Basically, by running one command: LanguageTool over the Wikipedia dump. It's actually not part of the standard LanguageTool distribution, because it's very specific; checking Wikipedia is not what the average user does all day. So it's separate, and you need to download the nightly build to get this command. It takes three parameters: check-data, which means check the XML dump from Wikipedia; the path to that XML dump, which is several gigabytes in size; and en, which is just the language code for English. And then the check will start.
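The exact command from the slide isn't preserved in the transcript, so here is only a rough sketch of such an invocation; the jar name and argument order are assumptions based on the description above, not the verbatim 2014 command:

    # Hypothetical invocation of the Wikipedia checker from a nightly build;
    # "check-data", the dump path, and "en" are the three parameters named
    # in the talk, while the jar name is a guess:
    java -jar languagetool-wikipedia.jar check-data \
        /path/to/enwiki-pages-articles.xml \
        en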
02:58
It will give you results like this: it will print the title of the article where it found an error, the location of the error, and a message, with the error itself somehow underlined. In this case you can see the text says 'will designated as', but it should be 'will be designated as'. So LanguageTool actually found the problem, although the message is a bit off and the suggestion is not quite correct; but at least it found the error. If you run this command, you get a lot of these, all over Wikipedia.
03:48
Checking Wikipedia takes about 10 milliseconds per sentence, but because the English Wikipedia is so very large, it would take about one week or so on my computer to run through the entire thing. So what we actually did was run LanguageTool on 20,000 articles, which led to about 37,000 potential errors. Now, what's an error? For the sake of simplicity, I'll just say 'errors' when I mean errors and also style suggestions; it does not include the simple spelling mistakes that a common spell checker finds. If you project this number of 37,000 potential errors to the whole Wikipedia, which has more than 4 million articles, you get 8 million potential errors. Of those, we selected 200 randomly and checked them manually, to see how many of these projected 8 million errors are actual, useful errors. And the result is that if we ran LanguageTool over the whole Wikipedia, you'd get about 1 million errors that are actually useful. And as I said, that does not count simple
05:05
spelling errors, because we turned spell checking off; there are just too many proper names in Wikipedia, so it would produce even more false alarms. So, still: 8 million potential errors and only 1 million useful ones. That's a lot of false alarms.
05:29
And there's a reason for that. First, it's surprisingly difficult to extract plain text from the English Wikipedia. It uses the MediaWiki syntax, and one problem we have with that is that we currently cannot expand templates. So you have text like 'an elevation of about 150 meters'; that's what you see when you read Wikipedia, but in the wiki syntax there's a template inside, and as we currently cannot expand it, we only get 'an elevation of about' and then a space. That will, of course, in several instances confuse a grammar checker. As an encyclopedia, you also have many place names, movie titles, all kinds of names in your text, including non-English place names, and that's also difficult for any kind of proofreading tool to handle. You also have cases like articles about math, where you have something like 'the value of A for a given A is called ...', where 'A' is used in two different meanings: the first one is the determinant, as you would expect, and the other one is some kind of matrix symbol. As LanguageTool was not optimized for articles about math, that will confuse it. Also, if you use LanguageTool, or any proofreading tool, on articles that have already been checked quite well, you will obviously get relatively more false alarms, and most articles in the English Wikipedia have already been checked quite well, so we find fewer real errors there.
07:23
So here are some examples of bad matches, that is, not-so-useful matches that you get. The text says '68000 assembler', and LanguageTool suggests using the plural form, because it does not know that '68000 assembler' is actually a kind of product name here. You have cases like 'score voting and majority judgment allow these voters ...', where LanguageTool incorrectly suggests using the third person singular. The reason is that the detection of the noun phrase doesn't work properly: in this case, LanguageTool only detects 'majority judgment' as the noun phrase, and not 'score voting and majority judgment', and that leads to a false alarm. And finally, there's this example from another article, where it suggests using 'an' because the next word starts with a vowel sound; usually that is the case, that you use 'an' instead of 'a', but of course here it's a false alarm. Now, some
08:38
useful matches. In 'a vote of 27 journalists from 22 gaming magazine', LanguageTool suggests using 'magazines', the plural form. OK, the next one is easy: the text says 'flows through through the body', and it suggests removing the duplication. And in 'sending back their work to the teachers computer', it properly detected the missing apostrophe in 'teacher's'.
09:11
Here's an example of a style suggestion you might get: if you write something like 'there are many different variations', LanguageTool suggests using just 'many variations', because 'variations' kind of implies 'different' already. Whether you agree with these suggestions, whether you consider them useful, is of course up to you. Now some
09:39
examples of errors which we cannot detect; these are not from Wikipedia, I made them up. We cannot detect semantic problems, of course, like 'Barack Obama, the president of France'. If you write something like 'I made a concerted effort', your English teacher will tell you that 'concerted' implies more than one person was involved, so that's not correct; but we won't detect that for now. And if you write a sentence like 'tomorrow I go shopping', which should be 'tomorrow I will go shopping', that's also a case we don't detect yet. Now, you could maybe write rules for these, but they would be very specific. I'll talk in a few minutes about how to write rules, and then you will probably see that you could write rules for these cases, but it's doubtful whether that would actually be useful.
10:43
So here's a short overview of LanguageTool. The basic idea of LanguageTool was always to be the next step after spell checking. We don't replace spell checking, but we kind of run after it; nowadays we actually have one component inside LanguageTool that does the traditional spell checking. The project was started in 2003, it's licensed under the LGPL, we have about 10 regular committers now, and we release on a time-based schedule, with a new release every three months. Everything is implemented in Java and XML; you'll see in a few minutes where we use XML. So, as a user, how do you use
11:36
LanguageTool? You can use it as a command-line application or as a stand-alone application. We also have several extensions: for LibreOffice and OpenOffice, for vim and Emacs, for Firefox, and probably a few more. If you're from the Java world, you can directly use our Java API. If you're using some other programming language, you might use the embedded HTTP server: we have an HTTP API that returns some very simple XML, which you can then parse to see the errors in your text.
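As a rough illustration of that HTTP API, here is a hedged sketch; port 8081 and the 'language'/'text' parameters match the embedded server as documented around that time, but verify the details for your version:

    # Send a text to a locally running LanguageTool server:
    curl 'http://localhost:8081/?language=en-US&text=Sorry+for+my+bad+english'
    # The response is simple XML, one <error> element per match, with the
    # offset, the message, and the suggested replacements as attributes.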
12:20
And now let's look at how LanguageTool works
12:23
internally. LanguageTool takes plain text as input; it finds the sentences in this text, and in the sentences it finds the words. Each word is then analyzed: for example, we find its base form; for 'walks' you'd get the base form 'walk'. We also find the part-of-speech tags for each word, which can be ambiguous: in the case of 'walks', you would get that it can be a plural noun, or it can also be the third person singular of the verb. Then, over this analyzed text, we run some Java rules, but most importantly we run the error detection patterns, which are kind of the core of LanguageTool, and I'll now explain how these work.
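To make that plain-text-in, matches-out pipeline concrete before we get to the patterns, here is a minimal sketch of driving it through the public Java API; the class names follow current LanguageTool releases and may differ slightly from the 2014-era API:

    import java.util.List;
    import org.languagetool.JLanguageTool;
    import org.languagetool.language.BritishEnglish;
    import org.languagetool.rules.RuleMatch;

    public class CheckDemo {
        public static void main(String[] args) throws Exception {
            // Creating the checker loads the sentence splitter, tokenizer,
            // part-of-speech tagger, and all rules for one language.
            JLanguageTool langTool = new JLanguageTool(new BritishEnglish());
            // check() runs the whole pipeline described above on plain text.
            List<RuleMatch> matches = langTool.check("Sorry for my bad english.");
            for (RuleMatch match : matches) {
                System.out.println("Potential error at characters "
                        + match.getFromPos() + "-" + match.getToPos()
                        + ": " + match.getMessage());
                System.out.println("  Suggestions: " + match.getSuggestedReplacements());
            }
        }
    }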
13:23
The basic idea behind these error detection patterns is, first, to be simple, so you don't have to be a software developer to contribute new rules. The other idea is that they are all independent of each other, so even if you add a new rule, you cannot break any of the existing rules; that's quite unlike software development, where you change something and something else breaks. Here's a slightly simplified example of a rule. First I should say that internally we call these error detection patterns 'rules'. A rule always consists of two parts: the pattern itself, and the message that is displayed to the user when the pattern is found in a sentence. The pattern, in the simplest case, is just a sequence of words. In this example you have the token 'bad' ('token' is just the technical term for a word), followed by the regular expression 'english|attitude'. So this pattern will match the example from the first slide, 'bad english', and also 'bad attitude'. And in the message you can see the back-reference: it will be replaced with 'english' or 'attitude', so the user gets to see a message like 'Did you mean bad English?'.
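In the XML rule format, a rule along these lines looks roughly like the sketch below; the rule id and the message wording are invented for illustration, while token, regexp="yes", and the back-reference syntax are the documented building blocks:

    <rule id="DEMO_BAD_X" name="Demo: bad english/attitude">
      <pattern>
        <token>bad</token>
        <!-- regexp="yes" makes this token a regular expression -->
        <token regexp="yes">english|attitude</token>
      </pattern>
      <!-- \2 is a back-reference to the second matched token, so the user
           sees the actual word from their text inside the message -->
      <message>Did you mean bad \2?</message>
    </rule>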
15:02
That was a very simple example. Other things you can do in your rules: you can use logical AND and OR, you can use negation, and you can do skipping, that is, match a word, skip over a number of words,
15:13
and then match another word. You can use inflection, which means you don't just match the word 'walk' but also all of its forms, without naming them all; for 'walk' it would also match 'walking', 'walks', and 'walked'. And you can match part-of-speech patterns: match all verbs, for example, or match only the third person singular of verbs. All of that is documented in detail, because this is kind of the core of LanguageTool.
15:45
One more example: this rule detects the error in
15:50
'Always I'm happy', which is actually a mistake that a non-native English speaker might make. I'm taking these example sentences, as you can see, from the bottom of the rule, because these two examples are part of the rule: one example needs to be incorrect, which means it must match the pattern above, and the other example needs to be correct, so it must not match the pattern. We use these examples inside the rules for our unit tests, and they also make it easier to understand what a rule actually does. So this rule has a token SENT_START, which just means it matches only at the start of a sentence; then a regular expression, matching 'always' among other words; and then a token with an exception, which means it matches all tokens except those listed, and those are MD, JJ, and VB: MD means modal verb and JJ means adjective. That may sound confusing, but these tag names are actually standard in computational linguistics; it's the Penn Treebank tag set, and that's what we use here.
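Here is a hedged reconstruction of what such a rule could look like in the XML format; the token values, the tag list, and the message are pieced together from the talk rather than copied from the real rule:

    <rule id="DEMO_ALWAYS_I" name="Demo: 'Always I'm happy'">
      <pattern>
        <!-- SENT_START is a pseudo tag that anchors the match to the
             beginning of the sentence -->
        <token postag="SENT_START"/>
        <token regexp="yes">always|often</token>
        <token>
          <!-- match any token EXCEPT modals (MD), adjectives (JJ), and
               base-form verbs (VB), using Penn Treebank tags -->
          <exception postag="MD|JJ|VB" postag_regexp="yes"/>
        </token>
      </pattern>
      <message>After 'always' at the start of a sentence, the subject usually comes first.</message>
      <example type="incorrect">Always I'm happy.</example>
      <example type="correct">I'm always happy.</example>
    </rule>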
17:24
We now have these kinds of rules for 29 languages, which technically means we support 29 languages, but we support them to very different degrees. The languages with the largest number of rules are French, German, Catalan, and Polish, and then comes English; French, for example, has more than 2,000 rules, while other languages, like Greek and Japanese, have fewer than 100 rules. This is only a very rough indication of how well a language is supported, but it does mean that if you switch LanguageTool to Japanese or Greek, it will not work that well, in the sense that it will not find that many errors. So there's still a lot of work to do for the languages with fewer rules to get better coverage. Now, what we do is basically pattern matching, and you might wonder: language is kind of complicated, so is pattern matching really enough? Or, to ask it in a different way, why don't we use a more powerful approach? Let's take a step back and ask: what is a grammar, actually? It's a set of rules that describe how words, sentences, and texts look. And syntax is the part of grammar that is, more or less, a formal description of how valid sentences look. You can also ask: what is a parser? A parser is something that takes a sequence as input and generates some output structure, a tree for example. And you know this from
19:20
software development; we do this successfully all the time. So you might ask: why don't we just do the same for English? Why can't we just write a parser for English, like the parsers we have for Java or Python or whatever other language? Well, it turns out
19:44
that's not the approach we're taking in LanguageTool, because it's just so very difficult. There is no formal specification of the English grammar. If you look at a comprehensive grammar of English and consider it to be some kind of specification, you'll find it has 1,700 pages. And even if you say, OK, that's just English, English is especially complicated or whatever, and you look at a constructed language like Esperanto, which should be much simpler, its grammar description, if you want to call it that, is still about 700 pages. So it's that complicated. Also, if you write such a parser, you will more or less end up with a parser that's specific to your language, and we want to support more than one language in LanguageTool. And there are even more reasons why it's difficult to use a parser for English: having a parser does not automatically mean that you have good error
20:57
messages. With a parser, you need to go one step further and optimize it to give useful error messages, because otherwise you get messages like 'cannot parse this sentence', which is not useful to the user. And even when you have done all that, you're still not finished, because if you look at a sentence like 'sorry for my bad english' from the first slide, it turns out it actually parses fine: lowercase 'english' could also be some noun phrase, 'sorry for my' plus a noun phrase, so technically 'sorry for my bad english' is a grammatically correct phrase. There are parsers that actually work like this, for example Link Grammar, which is open source and is used in AbiWord, the open-source word-processing software. So you can go down this path and try to write a parser, but it's difficult, and because, as I mentioned, we want to support a lot of languages, it's not the approach we're taking. I'm not saying one approach is better than the other; it's just difficult. You might also
22:27
wonder: why don't we use machine learning and statistics? First of all, we do use Apache OpenNLP for finding the chunks ('chunks' is another name for phrases). Apache OpenNLP gets us the noun phrases and verb phrases, and that's based on a statistical approach that has been trained; we use it as a kind of black box to find those phrases. However, if you want to use a statistical approach, machine learning or whatever, to actually find the errors in a text, you need some large corpus where all the errors have been annotated, and you probably need another large corpus where everything is guaranteed to be free of errors. Then you could maybe come up with some kind of training to get a model, and then you could use that. But it's not so easy to find such annotated data; you would have to annotate a lot of it yourself. And while the situation for English is quite good (you can get a lot of corpora, and the English Wikipedia, for example, is huge), we also want to support languages that don't have so many resources. So it's difficult to use machine learning; but again, I'm not saying our approach is the better one. If you have some idea of how to use machine learning to proofread text automatically, feel free to do that, and even then you can plug it into LanguageTool, by writing your own rule in Java. That looks like this: you simply implement one single method from the Rule class, the match method. It gets as its input an AnalyzedSentence, which, as the class name suggests, is a sentence with its tokens, and the tokens have their
24:24
analyses, that is, their base forms and their part-of-speech tags. And then you can do with that whatever you want: you can run any logic you want, and you can even ignore our analysis of the base forms and part-of-speech tags and just look at the original text. If you think you found a match, you create a rule match and simply return it. So it's quite easy to integrate into LanguageTool, and if you do this, you get all the stuff we provide for free: the graphical user interface, the command-line interface, and the extensions. So it should be better to plug into this mechanism than to start writing your own grammar checker, where you'd have to write all the proofreading infrastructure from scratch.
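A minimal sketch of such a Java rule, using the Rule and match() API described here; the rule's logic is made up for illustration, and details like the exact RuleMatch constructor vary between LanguageTool versions:

    import java.util.ArrayList;
    import java.util.List;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.AnalyzedTokenReadings;
    import org.languagetool.rules.Rule;
    import org.languagetool.rules.RuleMatch;

    public class MyDemoRule extends Rule {

        @Override
        public String getId() { return "MY_DEMO_RULE"; }

        @Override
        public String getDescription() { return "Flags the (made-up) word 'foo'"; }

        @Override
        public RuleMatch[] match(AnalyzedSentence sentence) {
            List<RuleMatch> matches = new ArrayList<>();
            for (AnalyzedTokenReadings token : sentence.getTokens()) {
                // You could inspect base forms and POS tags here; this demo
                // just looks at the original text of each token.
                if ("foo".equals(token.getToken())) {
                    matches.add(new RuleMatch(this, token.getStartPos(),
                            token.getStartPos() + token.getToken().length(),
                            "Did you really mean 'foo'?"));
                }
            }
            return matches.toArray(new RuleMatch[0]);
        }

        // Older versions of the Rule base class also require reset().
        public void reset() {}
    }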
25:25
So, I've given you some overview of how LanguageTool works. We still have that million errors in Wikipedia; how do we fix them? Well, I have to admit that we don't really have one
25:44
million errors sitting in a database, because, first, we only ran the check on a small subset. And the second reason is: I think having 1 million errors actually sitting in your database is kind of overwhelming, and it would probably not motivate anyone to start fixing them; I mean, who wants to work on a to-do list with 1 million items? What you can do is have a look at community.languagetool.org, which lists a few thousand errors. If one of them is a false alarm, you can log in and mark it as a false alarm; and if it's not a false alarm, you can click on the link, go to Wikipedia, and fix the problem. What you see there comes from checking the XML dumps from Wikipedia, so it's not really live; as far as I understand, the dumps are generated every two weeks or so. OK, but that alone probably won't get us very far. So what I suggest is actually
26:47
something else as a first step. We have a new feature where we check the recent
26:53
changes from Wikipedia. We fetch the Atom feed of recent changes twice a minute, and then we run LanguageTool over it; not over the complete articles, but only over those parts that were affected by the change. That way we detect when someone has made an edit that has introduced an error, and we can usually also detect when someone has made an edit that has fixed an error. So what we end up with is a database of freshly introduced errors.
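As a loose sketch of that polling loop (the URL is Wikipedia's real Special:RecentChanges Atom feed; the diff parsing and the actual check are elided, and none of this is LanguageTool's own code):

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;

    public class RecentChangesPoller {
        public static void main(String[] args) throws Exception {
            URL feed = new URL("https://en.wikipedia.org/w/index.php"
                    + "?title=Special:RecentChanges&feed=atom");
            while (true) {
                // Fetch the Atom feed of recent edits as a string...
                try (InputStream in = feed.openStream();
                     Scanner scanner = new Scanner(in, "UTF-8")) {
                    String atomXml = scanner.useDelimiter("\\A").next();
                    // ...then parse its <entry> elements, extract only the
                    // text changed by each edit, and run LanguageTool on
                    // those changed parts (omitted here).
                }
                Thread.sleep(30 * 1000);  // twice a minute, as in the talk
            }
        }
    }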
27:37
It looks like this: you have the error message, and
27:45
a blue square that marks the error. Now, in many cases it is a false alarm; then you can click 'this is a false alarm', or you can just ignore it. But let's assume, as in this first example, that it's actually an error. What you can do then is click on 'check page online', and at that moment we go to Wikipedia, fetch the current version of the article via the API, run the complete article through LanguageTool, and we will then
28:22
show you this page, where, if everything works well, you can, without typing anything, just click on the corrections suggested by LanguageTool, if they're useful and not false alarms, and then submit the page, and we will send it directly to
28:42
Wikipedia. So it's as if you had made the edit on Wikipedia yourself, and you end up in the diff view, where you can have one last look to make sure we didn't break anything. We also conveniently set the edit summary and tick the 'this is a minor edit' checkbox, because usually this is just some kind of grammar or spelling fix. So this is what I would suggest as a first step: instead of working on those 1 million existing errors, let's start by trying to make sure that not so many new errors get introduced in the future. For
29:28
now, this is not yet activated for all languages, but if you're keen to actually use it and to help me, in the sense that you improve the rules and disable the rules that are not useful, then let me know which language you want activated and I'll try to do that. For now it's activated, I think, for German, English, and French. So, finally, some words
30:06
about the future work on LanguageTool, without walking you through our entire to-do list.
30:13
What I would basically like to see, of course, is for style and grammar checking to become as ubiquitous as spell checking is today. Basically anything you type today can be spell-checked anywhere, and I think we shouldn't stop there: we now have more powerful tools, we can do style and grammar checking, so we should use them. And what's the current state? Well, we do have the Java API, which is also available on Maven Central; that's useful for Java developers. We have an HTTP and HTTPS API for more loosely coupled integration, with support for many languages, and with a license that I think should be liberal enough for almost all use cases. And we're written in Java, which is absolutely fine if your software is also written in Java; if your software is not written in Java, you might not like that. So how can we, despite being written in Java, become ubiquitous? For example, could we ever run in the browser, despite being implemented in Java? My idea was to just compile Java to JavaScript with GWT. OK, I tried that, and it failed, and so far nobody has replied to my StackOverflow question. That's why I'm asking you for help: you can have a look at that question and answer it, or, even better, you can come talk to me after the talk and explain to me how this is done, compiling a complex Java application to JavaScript. Of course, we'd also like to get help from people who want to add support for another language to LanguageTool. It's not even that difficult: you can usually start from an existing language and then write one rule after the other. As I mentioned, we have a lot of languages that are not actively maintained, or not actively maintained enough, so the users of those languages are really in need of maintainers. If you want to maintain one of these languages, we would be very happy to welcome you. Maintaining a language basically means writing new rules and making sure that the existing rules get improved and don't create too many false alarms. You don't have to be a programmer for that. And of course we also welcome developers.
33:04
So, in summary, I'd
33:15
like to say: I think today we shouldn't stick to simple spell checking, which totally ignores the context; we have more powerful tools available today, and I suggest you use them, as users or as developers. And I hope that our style and grammar checking of Wikipedia is a kind of proof that this technology can be useful, despite the number of false alarms. Of course, contributions are welcome; you can talk to me about it after the talk. And that's it from
33:49
me. Thank you, it's a nice conference. Are there any questions? So the question is: how do you find errors in Japanese without having any rules? Well, we do have rules for Japanese, just not so many. What was that? You see zero in the database? Hmm, then maybe that's a bug on our side.
35:10
OK, maybe in the meantime some other questions; we have a question here: can LanguageTool cope with text with some markup, like LaTeX, or will that be possible in the near future? Let's say: yes and no. We kind of push this task to the software that integrates LanguageTool. What we demand of software that uses LanguageTool is that it tells us where the markup is, the positions of the markup, and then we will just ignore it. So if the software knows its markup and feeds us the text including the markup positions, then yes, we can handle the text. But we cannot handle it in the sense of knowing, OK, this markup is a headline, and then applying some special headline model; that's not something we can do yet. More questions? The question is about the Java rules: we have only a few Java rules for LanguageTool, but we do have a lot of XML rules; the Java ones are just some special cases. If you prefer, for some reason, to write your rules in Java, maybe because the XML-based approach that I showed is not powerful enough, then you can always write your rules in Java. OK, any other questions? OK, you can also come talk to me later. Thank you.
