
Research at the service of free knowledge: Building open tools to support research on Wikimedia projects


Formal Metadata

Title
Research at the service of free knowledge: Building open tools to support research on Wikimedia projects
Number of Parts
542
Author
Martin Gerlach (Wikimedia Foundation)
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English

Content Metadata

Abstract
With roughly 20 billion monthly pageviews, 15 million monthly edits, and almost 55 million articles across 300+ languages, Wikipedia and its sister projects are an essential part of the free knowledge ecosystem. These projects are created and maintained by a vast network of volunteers. The Wikimedia Foundation, the non-profit organization operating Wikipedia, has a Research team of scientists, engineers, and community developers. They use data and scientific methods to support the needs and advance the understanding of the Wikimedia projects, their readers, and their contributors. To expand the team's capacity and breadth of expertise, one focus area of the team is improving the social and technical infrastructure that helps the broader Wikimedia Movement and research community tackle complex research challenges. In this talk, members of the Wikimedia Foundation’s Research team will give an overview of their recent efforts to support the community of researchers working on Wikimedia projects. Specifically, they will discuss i) the generation of open data resources; ii) tools for working with open Wikimedia data; and iii) building and releasing machine-learning models to support Wikimedia projects. The goal of this talk is to demonstrate to open tool developers and researchers how to leverage these resources and contribute to the Wikimedia Research community.
Transcript: English (auto-generated)
Hello, everyone. I'm Martin Gerlach. I'm a senior research scientist in the Research team at the Wikimedia Foundation. First of all, I want to thank the organizers for the opportunity to present here today. I'm very excited to share some of our recent work
on building open tools to support research around Wikimedia projects. Before going into the details, I want to provide some background on what Wikimedia is and what its Research team does. I want to start with something that most of you are probably familiar with: Wikipedia,
which is by now the largest encyclopedia in the history of humankind. Wikipedia, together with its sister projects like Wikimedia Commons or Wiktionary, is operated by the Wikimedia Foundation. The Wikimedia Foundation is a non-profit organization
with a staff of around 600 employees. It supports the communities and the projects in different ways, but it is important to know that it does not create or modify the content, and it does not define or enforce policies on the projects.
One of the teams at the Wikimedia Foundation is the Research team. We are a small team of eight scientists, engineers, and community officers, and we work with collaborators from different universities to do research around Wikimedia projects.
These activities can be grouped into roughly three main areas. The first is addressing knowledge gaps: what content is missing or underrepresented? One example of this is the gender gap. The second is improving knowledge integrity,
that is, making sure the content on the projects is accurate; think of vandalism, misinformation, or disinformation. The third is growing the research community, that is, empowering others to do research around the projects.
Today, I want to focus on the activities in this last area. Specifically, I want to present how we have been contributing towards this goal in three areas: datasets, tools for data processing, and machine-learning APIs.
Finally, I want to conclude with how developers or interested researchers can contribute to these three areas. So let's go. The Wikimedia Foundation already provides many different datasets, most notably the Wikimedia dumps, which contain the content of the projects
as well as information about edits and pageviews of articles. These are public and openly available, and they are used by many researchers as well as developers to build dashboards or tools for editors. However, working with this data
can still prove very challenging for people who do not identify as Wikimedia researchers, or for anyone lacking expertise about the database schemas, which data lives where, and how to filter it. Therefore, we try to release clean and pre-processed datasets to facilitate that.
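To give a sense of the low-level work the raw dumps require before any analysis can start, here is a minimal sketch of reading one hourly pageviews dump file. The local filename is hypothetical; files can be downloaded from dumps.wikimedia.org, and the exact field layout should be checked against the dumps documentation before relying on it.

```python
# Minimal sketch: count pageviews per article for English Wikipedia from one
# hourly pageviews dump file. The local filename is hypothetical; files can be
# downloaded from https://dumps.wikimedia.org/other/pageviews/ and the exact
# whitespace-separated layout should be verified against the documentation.
import gzip
from collections import Counter

views = Counter()
with gzip.open("pageviews-20230204-120000.gz", mode="rt",
               encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != 4:
            continue  # skip malformed lines
        domain_code, page_title, view_count, _ = parts
        if domain_code == "en":  # desktop English Wikipedia; mobile is "en.m"
            views[page_title] += int(view_count)

print(views.most_common(10))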
One such example is the Wikipedia image caption dataset. This is a cleaned and processed dataset with millions of examples of images from Wikimedia Commons
together with their captions, extracted from more than 100 language versions of Wikipedia. The background is that many articles on Wikipedia still lack visual content, which we know is crucial for learning, and adding text to these images improves accessibility
and enables better search. So with the release of this data, we hope to enable other researchers to build better machine-learning models that assist editors in writing image captions. In this case, we did not just release the data
but provided it in a more structured form as part of a competition with a very specific task. The idea was also to attract new contributors through this structure, so that researchers could find examples of the types of tools
that could be useful for the community. Experienced researchers outside of Wikimedia could easily contribute their expertise, and for new researchers it is an easy way to become familiar with Wikimedia data.
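As a small sketch of what using such a cleaned release can look like (the file name and column names below are hypothetical placeholders rather than the dataset's actual schema), one might filter the caption pairs for a single language:

```python
# Hypothetical sketch: the file name and column names are placeholders,
# not the actual schema of the released image-caption data.
import pandas as pd

pairs = pd.read_csv("image_captions.tsv", sep="\t")  # e.g. image_url, language, caption
german = pairs[pairs["language"] == "de"]
print(len(german), "German image-caption pairs")
print(german[["image_url", "caption"]].head())
```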
The outcome of this was a Kaggle competition with more than 100 participants and many open-source solutions for approaching this problem. This was just one example of the datasets we release. I also want to highlight other cleaned and pre-processed datasets
we are releasing, around the quality scores of Wikipedia articles and their readability, as well as upcoming releases that use differential privacy to describe the geography of readers.
In the next part, I want to present how to work with all of this data. We always aim to make as much of the data as possible publicly available. However, that does not necessarily mean it is accessible, because it might still require a lot of technical expertise
to work with effectively. Therefore, we try to build tools that lower the technical barriers. Here I want to present one such example related to the HTML dumps. What is this? This is a new dump dataset, available since October 2021,
that is now published and updated at regular intervals and contains the HTML version of all articles of Wikipedia. Why is this so exciting? In the traditional dumps,
the content of the articles is only available in wikitext markup. This is what you see when you edit the source of an article. However, what you see as a reader when browsing is not the wikitext markup; the wikitext gets parsed into HTML.
The problem is that the wikitext does not explicitly contain all the elements that are visible in the HTML; these come mainly from the parsing of templates or infoboxes. This becomes an issue for researchers studying the content of articles,
because they will miss many of these elements when looking only at the wikitext. One example is looking for hyperlinks in articles: a recent study by Mitrevski et al. counted the number of links in articles and found that the wikitext contains less than half of the links
that are visible in the HTML version seen by readers. So we can conclude that researchers should use the HTML dumps, because they capture the content of the articles more accurately.
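You can see this gap for yourself with a small sketch that compares link counts for a single article, using the public MediaWiki action API together with the mwparserfromhell and BeautifulSoup libraries. This is only an illustration of the wikitext/HTML gap, not the methodology of the study cited above.

```python
# Sketch: compare the number of links in an article's wikitext with the links
# in its parsed HTML, using the public MediaWiki action API.
import requests
import mwparserfromhell
from bs4 import BeautifulSoup

params = {
    "action": "parse",
    "page": "Belgium",        # any article title
    "prop": "wikitext|text",  # request both the wikitext and the rendered HTML
    "format": "json",
    "formatversion": 2,
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

wikitext = data["parse"]["wikitext"]
html = data["parse"]["text"]

wikitext_links = mwparserfromhell.parse(wikitext).filter_wikilinks()
html_links = [a for a in BeautifulSoup(html, "html.parser").find_all("a")
              if a.get("href", "").startswith("/wiki/")]

print(len(wikitext_links), "wikilinks in wikitext vs.", len(html_links), "in HTML")
```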
However, the challenge is how to parse the articles in the HTML dumps. This is not just about knowing HTML; it also requires very specific knowledge about how the MediaWiki software translates the different wiki elements and how they appear in the HTML.
Packages exist for parsing wikitext, but not for this HTML, so there is a very high barrier for practitioners to switch their existing pipelines to this new dataset. Our solution was to build a Python library
to make working with these dumps easy. We called it mwparserfromhtml: it parses the HTML and extracts elements of an article such as links, references, templates, or the plain text, without the user having to know anything about HTML
or the way wiki elements appear in it. We recently released the first version. This is work in progress and there are tons of open issues, so if you are interested, contributions from anyone are very welcome to improve it in the future. Check out the repo on GitLab for more information.
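A minimal sketch of the intended workflow is shown below. The class and method names are illustrative guesses based on the description above, not a verified API, so consult the GitLab repository for the actual interface and current examples.

```python
# Illustrative sketch only: the class and method names below are guesses
# based on the talk's description of mwparserfromhtml, not its documented
# API. See the project's GitLab repository for the real interface.
from mwparserfromhtml import HTMLDump  # assumed entry point

# Hypothetical local path to one of the HTML dump tarballs.
dump = HTMLDump("enwiki-NS0-20230220-ENTERPRISE-HTML.json.tar.gz")

for article in dump:
    # Hypothetical accessors for the elements mentioned in the talk:
    links = article.get_links()
    templates = article.get_templates()
    plaintext = article.get_plaintext()
    print(article.title, len(links), len(templates), len(plaintext))
    break  # only the first article, for the sketch
```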
As a third step, I want to present how we use these datasets in practice, with one example in the context of knowledge integrity. In order to ensure the quality of articles in Wikipedia,
many editors review the edits that are made to articles, checking whether each edit was okay or not and whether it should be reverted. The problem is that there are a lot of edits happening: in English Wikipedia alone,
there are around 100,000 edits per day to work through. So the aim is: can we build a tool that supports editors in dealing with this large volume of edits and helps them identify the very bad edits more easily?
This is what we do with the so-called revert risk model. What is this? We look at an edit by comparing the old version of an article with its new version, and we would like to predict whether the change is good or whether it is a very bad edit
that should be reverted. How we do this: we extract different features from the two versions, such as which text was changed, whether links were removed, whether images were removed, and so on. We then build a model by looking into the history
of all Wikipedia edits, extracting those edits which have been reverted by editors, and using them as ground truth for bad edits. The resulting output is that, for each edit, we can calculate
a so-called revert risk: a very bad edit will have a very high probability, a very high risk, of being reverted, and this is what our model outputs. Our model performs fairly well,
with an accuracy between 70% and 80%, and I want to mention that we consider this okay; it does not need to be perfect. The way our model is used is that we surface these scores to editors to help them identify which edits they should take a closer look at.
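As a rough illustration of the pipeline described above (the file, feature names, and classifier below are simplified stand-ins, not the production revert risk model), one could train a binary classifier on historical edits labeled by whether they were reverted:

```python
# Illustrative sketch only: a simplified stand-in for the revert-risk idea
# described in the talk, NOT the production model. It assumes you have
# already built a table of historical edits with diff features and a label
# saying whether each edit was reverted (hypothetical file and columns).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

edits = pd.read_csv("edit_features.csv")  # hypothetical pre-computed features
features = ["chars_added", "chars_removed", "links_removed",
            "images_removed", "references_removed"]
X, y = edits[features], edits["was_reverted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = GradientBoostingClassifier().fit(X_train, y_train)

# The model's output is a "revert risk": the predicted probability that an
# edit will be reverted. High-risk edits can be surfaced to editors for review.
revert_risk = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, revert_risk))
```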
We have been developing similar models for annotating the content of articles. In addition to knowledge integrity, which I just presented,
we have been trying to build models for finding similar articles, for automatically identifying the topic of an article, for assessing its readability or geography, for identifying related images, et cetera. I only want to briefly highlight
that the development of these models is rooted in some core principles to which we are committed, and this can create additional challenges in developing the models. In this context, I want to highlight the multilingual aspect: we always try to prefer language-agnostic approaches
in order to support as many as possible of the 300+ language versions of Wikipedia. Finally, I want to conclude with potential ways to contribute in any of the three areas
that I mentioned previously. Generally, one can contribute as a developer to MediaWiki or other parts of the Wikimedia ecosystem, and the place to get started there is the so-called Developer Portal,
which is a centralized entry point for finding technical documentation and community resources. I will not go into more detail here, but I want to give a shout-out and refer to the talk by my colleague Slavina Stefanova from the Developer Advocacy team.
But specifically in the area of research, I want to highlight a few entry points depending on your interest. In case you would like to build a specific tool, there is the Wikimedia Foundation's Toolforge infrastructure, a hosting environment that allows you to run bots or APIs
if you would like to provide that tool to the public. If you want to work with us on improving tools or algorithms, you can check out the different packages that we have been releasing in the past months.
These are all work in progress and there are many open issues, and we are happy about any contributions, whether improving things, fixing existing issues, or even finding new bugs. So please check out our repositories, too.
If you are interested in getting funding, there are different opportunities. There is an existing program to fund research around Wikimedia projects. It covers many different disciplines, humanities, social science, computer science,
education, law, et cetera, and targets work that has the potential for direct positive impact on local communities. In addition, I want to mention that there are plans for a similar future program
to improve Wikimedia's technology and tools. If you want to learn about the projects we are working on, we publish a research report, a summary of our ongoing research projects, every six months, where you can find more details
about some of the projects that I have mentioned. Finally, if you would like to engage with the research community, you can join us at WikiWorkshop. This is the primary meeting venue of the Wikimedia research community. This year will be the 10th edition of WikiWorkshop,
and it is expected to be held in May. You can submit your work there; I invite you to make a submission. We highly encourage even ongoing or preliminary work, submitted as extended abstracts. In this edition, there will also be a new track
for Wikimedia developers. If you are a developer of a tool or a system or an algorithm that could be of interest to research on Wikimedia, please check it out and make a submission. Even if you do not plan to make a submission, you are welcome to participate.
As done in the last three editions, WikiWorkshop will be fully virtual, and the attendance will be free. And with this, I want to conclude. I want to thank you very much for your attention. I am looking forward to your questions in the Q&A.
And if you want to stay in touch, feel free to reach out to me personally via email or any of the other channels that I'm listing here: office hours, the mailing list, IRC, et cetera. And with this, thank you very much.