Research at the service of free knowledge: Building open tools to support research on Wikimedia projects
Formal Metadata

Title: Research at the service of free knowledge: Building open tools to support research on Wikimedia projects
Number of Parts: 542
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61453 (DOI)
Transcript: English (auto-generated)
00:06
Hello, everyone. I'm Martin Gerlach. I'm a senior research scientist in the research team at Wikimedia Foundation. First of all, I want to thank the organizers for the opportunity to present here today. I'm very excited to share some of our recent work
00:23
around building open tools to support research on Wikimedia projects. Before going into the details, I want to provide some background on what Wikimedia is and on its research team. I want to start with something that most of you are probably familiar with, Wikipedia,
00:43
which is by now the largest encyclopedia in the history of humankind. Wikipedia, together with its sister projects like Wikimedia Commons or Wiktionary, is operated by the Wikimedia Foundation. The Wikimedia Foundation is a nonprofit organization
01:02
and has a staff of around 600 employees. It provides support to the communities and the projects in different ways, but it's important to know that it does not create or modify the content and it does not define or enforce policies on the projects.
01:24
One of the teams at the Wikimedia Foundation is the research team and we are a small team of eight scientists, engineers and community officers and we work with collaborators from different universities to do research around Wikimedia projects.
01:43
These activities can be grouped into roughly three main areas. The first one is to address knowledge gaps. So what content is missing or underrepresented? One example of this is the gender gap. The second is to improve knowledge integrity,
02:02
that is, making sure the content on the projects is accurate; think of vandalism, misinformation or disinformation. The third aspect is growing the research community, that is, empowering others to do research around the projects.
02:22
Today, I want to focus on the activities in this last area. Specifically, I want to present three facets in which we have been contributing towards this goal: datasets, tools for data processing, and building machine learning APIs.
02:43
Finally, I want to conclude with how developers or interested researchers can contribute to these three areas. So let's go. The Wikimedia Foundation already provides many, many different datasets, most notably the Wikimedia Dumps around the content
03:03
but also containing information about edits and page views of articles. This is public and openly available and it's used by many researchers as well as developers to build dashboards or tools for editors. However, when working with this data,
03:22
this might still prove very challenging for people who might not identify as Wikimedia researchers, or for someone lacking expertise about the database schemas, about which data is where, or about how to filter it. Therefore, we try to release clean and pre-processed datasets
03:47
to facilitate that. And one such example is the Wikipedia image caption dataset. This is a cleaned and processed dataset of millions of examples of images from Wikimedia Commons
04:00
with their captions extracted from more than 100 language versions of Wikipedia. The background is that many articles on Wikipedia are still lacking visual content, which we know is crucial for learning. Adding text to these images increases accessibility
04:23
and enables better search. So with the release of this data, we hope to enable other researchers to build better machine learning models to assist editors in writing image captions. In this case, we did not just release the data
04:41
but provided it in a more structured form as part of a competition with a very specific task. The idea was to also attract new contributors through this structure so that researchers could find examples of the types of tools
05:04
that could be useful for the community. Experienced researchers outside of Wikimedia could easily contribute their expertise, and for new researchers it is an easy way to become familiar with Wikimedia data.
05:22
The outcome of this was a Kaggle competition with more than 100 participants and many, many open source solutions for how to approach this problem. This was just one example of the datasets we release, and I just want to highlight that there are other cleaned and processed datasets
05:42
we are releasing, for example around the quality scores of Wikipedia articles and around the readability of Wikipedia articles, and there are also upcoming releases using differential privacy around the geography of readers.
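To give a flavor of how openly available some of this data already is, here is a minimal sketch of querying the public Wikimedia Pageviews REST API for the daily page views of a single article; the article title, date range and contact address below are placeholders, not values from the talk.

    import requests

    # Wikimedia Pageviews REST API: daily page views for one article.
    # The article title and date range are placeholders.
    URL = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "en.wikipedia.org/all-access/user/Wikipedia/daily/20240101/20240131"
    )

    # Wikimedia's API policy asks for a descriptive User-Agent.
    headers = {"User-Agent": "pageview-demo/0.1 (contact: you@example.org)"}

    response = requests.get(URL, headers=headers)
    response.raise_for_status()

    # Each item holds one day of view counts for the requested article.
    for item in response.json()["items"]:
        print(item["timestamp"], item["views"])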
06:01
In the next part, I want to present how to work with all this data. We always aim to make as much of the data as possible publicly available. However, that doesn't necessarily mean it is accessible, because it might still require a lot of technical expertise
06:21
to effectively work with this data. Therefore, we try to build tools to lower the technical barriers. And here I want to present one such example related to the HTML dump dataset. What is this? This is a new dump dataset available since October 2021
06:44
and is now published and updated at regular intervals, and it contains the HTML version of all Wikipedia articles. Why is this so exciting? When we are using the traditional dumps,
07:02
the content of the articles is only available in the Wikitext markup. This is what you see when you edit the source of an article. However, what you see as a reader when browsing is not the Wikitext markup; instead, the Wikitext gets parsed into HTML.
07:21
The problem is that the Wikitext does not explicitly contain all the elements that are visible in the HTML. These come mainly from the parsing of templates or infoboxes. This becomes an issue for researchers studying the content of articles
07:40
because they will miss many of the elements when only looking at the Wikitext. One example of this is looking for hyperlinks in articles. A recent study by Mitrevski counted the number of links in articles and found that the Wikitext contains less than half of the links
08:03
that are visible in the HTML version seen by the reader. So we can conclude that researchers should use the HTML dumps because they capture the content of the article more accurately. However, the challenge is how to parse the HTML dumps
08:22
or rather the articles in the HTML dump version. And this is not just about knowing HTML; it also requires very specific knowledge about how the MediaWiki software translates the different wiki elements and how they will appear in the HTML version.
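To make the gap concrete, here is a small sketch using the mwparserfromhell library (which parses Wikitext, not HTML): a naive link extractor built this way only sees explicit [[...]] links and never sees the links that templates such as {{Main|...}} produce once they are expanded into HTML; the article text is made up for illustration.

    import mwparserfromhell

    # Toy Wikitext: one explicit link and one template that will also
    # produce a link once the page is rendered to HTML.
    wikitext = "France is in [[Europe]].\n{{Main|History of France}}"

    wikicode = mwparserfromhell.parse(wikitext)

    # Only the explicit [[Europe]] link shows up here; the link to
    # "History of France" generated by the {{Main}} template is invisible
    # at the Wikitext level and only exists in the parsed HTML.
    print([str(link.title) for link in wikicode.filter_wikilinks()])  # ['Europe']
    print([str(tpl.name) for tpl in wikicode.filter_templates()])     # ['Main']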
08:41
Parsing packages exist for Wikitext but not for HTML. Therefore, there is a very high barrier for practitioners to switch their existing pipelines over to this new dataset. And our solution was to build a Python library
09:01
to make working with these dumps very easy. We called it mwparserfromhtml, and it parses the HTML and extracts elements of an article, such as links, references, templates or the plain text, without the user having to know anything about HTML
09:21
and the way wiki elements appear in it. We recently released the first version of this. This is work in progress and there are tons of open issues, so if you're interested, contributions from anyone are very, very welcome to improve this in the future. Check out the repo on GitLab for more information.
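To give a rough idea of the kind of MediaWiki-specific knowledge such a library has to encapsulate (this is not its actual API, just an illustrative sketch with BeautifulSoup): in the Parsoid-style HTML used by the dumps, internal wiki links carry a rel="mw:WikiLink" attribute, so extracting them by hand looks roughly like this.

    from bs4 import BeautifulSoup

    # A tiny, made-up excerpt of Parsoid-style article HTML, where internal
    # wiki links are marked with rel="mw:WikiLink".
    html = """
    <p>France is in <a rel="mw:WikiLink" href="./Europe">Europe</a>.
    See also <a rel="mw:WikiLink" href="./History_of_France">History of France</a>
    and an <a rel="mw:ExtLink" href="https://example.org">external link</a>.</p>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Keep only internal wiki links; telling these apart from external links,
    # references, etc. is exactly the detail the library hides from its users.
    wikilinks = [
        a["href"]
        for a in soup.find_all("a")
        if "mw:WikiLink" in (a.get("rel") or [])
    ]
    print(wikilinks)  # ['./Europe', './History_of_France']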
09:44
As a third step, I want to present how we use these datasets in practice. And I want to show one example in the context of knowledge integrity. In order to ensure the quality of articles in Wikipedia,
10:03
there are many, many editors who try to review the edits that are made to articles in Wikipedia and try to check whether these edits were okay or not and what should be reverted. The problem is that there are a lot of edits happening. So just in English Wikipedia,
10:21
there are around 100,000 edits per day to work through. And the aim is: can we build a tool to support editors in dealing with the large volume of edits? Can we help them identify the very bad edits more easily?
10:40
And this is what we do with the so-called revert risk model. What is this? We look at an edit by comparing the old version of an article with its new version. And we would like to make a prediction of whether the change is good or whether it is a very bad edit
11:00
that should be reverted. How we do this is by extracting different features from the two versions: which text was changed, were links removed, were images removed, and so on. And then we build a model by looking into the history
11:22
of all Wikipedia edits, extract those edits which have been reverted by editors, and use that as ground truth of bad edits for our model. The resulting output is that, for each of these edits, we can calculate
11:43
a so-called revert risk: a very bad edit will have a very high probability, a very high risk, of being reverted. And this is what our model will output. And our model performs fairly well.
12:01
It has an accuracy between 70% and 80%. And I want to mention that we consider this okay; it does not need to be perfect. The way our model is used is that we surface these scores to editors to help them identify which edits they should take a closer look at.
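As a toy illustration of the general idea (not the actual production model, its features, or its training data), one could compare the old and new revision with a few crude features and train a classifier on past edits labeled by whether they were reverted; the revert risk of a new edit is then the predicted probability of the reverted class.

    import re
    from sklearn.linear_model import LogisticRegression

    def count_links(text):
        return len(re.findall(r"\[\[.*?\]\]", text))

    def edit_features(old_text, new_text):
        # A few crude, purely illustrative features comparing two revisions.
        return [
            len(new_text) - len(old_text),                          # change in length
            count_links(new_text) - count_links(old_text),          # links added/removed
            new_text.count("[[File:") - old_text.count("[[File:"),  # images added/removed
        ]

    # Hypothetical training data: (old revision, new revision, was_reverted)
    # triples that would be extracted from the edit history; placeholders here.
    history = [
        ("France is in [[Europe]].", "France is in [[Europe]]. Population: 68M.", 0),
        ("France is in [[Europe]].", "France is AWESOME!!!", 1),
        ("[[File:Map.png]] France is in [[Europe]].", "France is in [[Europe]].", 1),
        ("Paris is the capital.", "Paris is the [[capital city|capital]].", 0),
    ]

    X = [edit_features(old, new) for old, new, _ in history]
    y = [label for _, _, label in history]

    model = LogisticRegression().fit(X, y)

    # Probability of the reverted class for a new, unseen edit; editors can
    # then review the high-risk edits first.
    risk = model.predict_proba([edit_features("Some sourced text.", "")])[0][1]
    print(round(risk, 2))

The real model is of course trained on far richer features and on the full history of edits, but the score it outputs is used in exactly this way.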
12:26
Similar models for annotating the content of articles exist, and we have been developing these types of models. In addition to knowledge integrity, which I just presented,
12:41
we have been trying to build models for easily finding similar articles, for automatically identifying the topic of an article, for assessing its readability or geography, or for identifying related images, et cetera. I only want to briefly highlight
13:02
that the development of these models is rooted in some core principles to which we are committed. And this can create additional challenges in developing these models. Specifically, in this context, I want to highlight the multilingual aspect: we always try to prefer language-agnostic approaches
13:24
in order to support as many as possible of the 300 different language versions of Wikipedia. Finally, I want to conclude with potential ways in which you can contribute to any of these three areas
13:43
that I mentioned previously. Generally, one can contribute as a developer to MediaWiki or other aspects of the Wikimedia ecosystem. And there the place to get started is the so-called Developer Portal,
14:01
which is a centralized entry point for finding technical documentation and community resources. Without going into more detail here, I want to give a shout-out and refer to the talk by my colleague, Slavina Stefanova from the Developer Advocacy team.
14:21
But specifically in the area of research, I want to highlight a few entry points depending on your interest. In case you would like to build a specific tool, there is the Wikimedia Foundation's Toolforge infrastructure, which is a hosting environment that allows you to run bots or different APIs
14:43
in case you would like to provide that tool to the public. If you want to work with us on improving tools or algorithms, you can check out the different packages that we have been releasing in the past months.
15:03
These are all work in progress. There are many open issues, and we are happy about any contributions, whether improving things, fixing existing issues, or even finding new bugs. So please check out our repositories, too.
15:24
If you are interested in getting funding, there are different opportunities. There is an existing program to fund research around Wikimedia projects. This covers many different disciplines, humanities, social science, computer science,
15:42
education, law, et cetera, and is around work that has potential for direct positive impact on the local communities. In addition, I want to mention that coming in the future, there are plans for a similar program
16:00
to improve Wikimedia's technology and tools. If you want to learn about the projects we are working on, I want to mention that we publish a research report, a summary of our ongoing research projects every six months, and you can find more details
16:21
about some of the projects that I have mentioned. Finally, if you would like to engage with the research community, you can join us at WikiWorkshop. This is the primary meeting venue of the Wikimedia research community. This year will be the 10th edition of WikiWorkshop,
16:43
and it is expected to be held in May. You can submit your work there, and I invite you to do so. We highly encourage even ongoing or preliminary work, submitted as extended abstracts. In this edition, there will also be a new track
17:03
for Wikimedia developers. If you are a developer of a tool or a system or an algorithm that could be of interest to research on Wikimedia, please check it out and make a submission. Even if you do not plan to make a submission, you are welcome to participate.
17:23
As done in the last three editions, WikiWorkshop will be fully virtual, and the attendance will be free. And with this, I want to conclude. I want to thank you very much for your attention. I am looking forward to your questions in the Q&A.
17:43
And if you want to stay in touch, feel free to reach out to me personally via email or any of the other channels that I'm listing here: office hours, the mailing list, IRC, et cetera. And with this, thank you very much.