
Collecting Sentences for Common Voice

Formal Metadata

Title
Collecting Sentences for Common Voice
Subtitle
Collecting Sentences through different means to allow others to record voices for them
Title of Series
Number of Parts
287
Author
Contributors
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Common Voice is a project to help make voice recognition open and accessible to everyone. To create this data set, Common Voice allows volunteers to record defined sentences to contribute their voice. A good data set needs a lot of recordings, and therefore we need a lot of sentences to be read out loud. In this talk Michael will introduce the audience to several ways we are collecting these sentences and go into more technical detail on these mechanisms. This talk will also feature an intro to Common Voice at the beginning.
Transcript: English (auto-generated)
Hello everyone and welcome to my FOSDEM 2022 talk about Common Voice.
I'm Michael. I have been contributing to Mozilla for around the past 13 years. I have contributed to several different projects, such as Firefox, and I have been involved in community building through the Mozilla web program. For the last few years I have been focusing on the Common Voice project.
Before we fully jump in, I want to give a quick outline on what we're going to talk about today. First, I'm going to give you an introduction on what Common Voice actually is. I'm going to also show you a quick demo so you understand the full context.
And then we will jump into how we collect sentences for Common Voice. That part will be split in three parts. There are three ways to contribute sentences and I will go into all of them. And then at the end, I will also show you how you can contribute to that as well.
We are always looking for new contributors. So if you have any questions regarding contributing, feel free to drop them into the matrix chat. And then in the Q&A section, we can dive deeper into those questions. The same goes for any other platform questions. If something is not clear in my talk, feel free to ask the question and we can answer that hopefully.
Now, let's get into what Common Voice is. Common Voice is a project led by the Mozilla Foundation. Its goal is to help make voice recognition open and accessible for all.
There are several voice recognition data sets out there. However, the big ones, for example, from Google or from Amazon are not open at all. They have the data. However, you won't ever be able to actually access those data sets to create your own machine learning models or even use the existing models from them.
Therefore, it's crucial that there is an effort to create open data sets that everyone can use. And not only for languages that have millions of speakers, for example, English or Spanish, but also for languages that are not that well represented as it is right now.
The data set of all the recordings, together with the sentences, is published under a Creative Commons Zero (CC0) license. So anyone who wants to use it can actually use it.
As of mid-January, there are 159 languages on Common Voice. 88 of these are already in progress, meaning people can actively contribute their recordings.
And the remaining languages are either in the website translation step or still gathering the initial set of sentences so that contribution can start. If you go to commonvoice.mozilla.org, you can see the website.
This is the language subsection, just to quickly show you how many languages there are. Some have been actively contributed to in quite large numbers. However, there are also a lot of smaller languages that are slowly but steadily gaining traction and getting their recordings in.
If you want to contribute, feel free to check out the language subsection page to find the language you speak and see whether it's in progress or already fully open for contribution. You can also find the data set in the dataset subsection, where you can see the different data sets, how many hours are already validated and how many voices there are, and you can download it and use it for your needs. There are two sections to contribute to. The first one is recording your voice.
You get shown a sentence, and the goal is to click on the recording button in the middle of the screen and then speak that sentence out loud. After you have done five sentences, those get submitted. Equally important is the validation.
You can click on listen on the top and then you can listen to other people's recordings and say whether what you hear actually matches what is shown on the screen.
If there are enough validations, they end up in the data set. In addition to that, you can also report a sentence if it has mistakes in it, if there is a typo, for example, or even if it's an inappropriate sentence.
We try to make sure that that doesn't happen, but I don't think we can catch all of them. So please report if you see something. You can also contribute without logging in.
However, your profile when you log in also has sections to fill out your accent, for example, which gives more data to the data set. So if you are willing to log in, please do so. Eventually in the data set, there is no email or anything attached to you.
So from a privacy perspective, that is OK. However, it's totally fine to also do it anonymously by not logging in. And in the end, it's up to you. But if you can, please do so. Now, that was a very quick overview of the Common Voice website.
As I have already said, the data set is available on the website. You can see the statistics, you can download it, and there is a new release every six months.
Now, for this to work, for these sentences here to show up, we need to actually have those sentences. How do we gather them? That's the topic of this talk. And there are several ways on how we can do that.
The first approach is to use the sentence collector, as you see here.
It is a website where you can either collect sentences or review sentences. Once again, we take the approach of contributing by creating or contributing by reviewing. You can go to your profile. You can select which language you want to contribute to.
In my case, for now, that's English. And then you can go to the Add section. In case you have multiple languages in your profile, you can select which language you want to contribute here. And then we can add a public domain sentence.
Important: it needs to be public domain, because otherwise we're not allowed to use it in the data set. So, for example, here we can say: this is an example sentence for my talk. And we can add a second one, for example: this is for FOSDEM.
We also require the source. In this case, I have written it myself, so I can say written myself. And I confirm that these are OK for public domain. When I submit, I get an overview. I can already review the sentences.
That is mostly meant for cases where I have a data set, for example from an old book that is copyright-free by now. I can already click on Review, review the different sentences, and make sure that non-useful sentences are getting rejected already at this step.
In that case, I can click confirm. And now we can see one sentence failed. When we scroll down, we can see which validations failed. In this case, "This is for FOSDEM" failed. This is because we do not allow any abbreviations. In this case, we know how to pronounce FOSDEM.
I doubt that everyone on planet Earth actually knows that. In this case, you could also pronounce it F-O-S-D-E-M. And the same goes for any other acronyms that are not universally clear.
Therefore, we avoid abbreviations and acronyms altogether. What you can do in this step is you can copy the sentence, put it back into the sentence box, adjust it to not include an abbreviation and go from there. The other sentence was submitted. When we go to review, we see an interesting sentence.
Generally, it is not favorable to include single words as sentences. We could simply drop in a full dictionary; however, recording single words on the Common Voice platform is simply not interesting.
So in this case here, it says help. I am just going to reject that for now. And then we can see my sentence here. This is an example for my talk. In that case, I would approve it. And we're done with the review.
English is heavily contributed to, so you won't find a lot of sentences to review, but I think that was a good example. You can also find the review criteria here, in case you're not sure whether to approve something or not. If you're not sure, you can also click on the skip button and leave that decision to somebody else.
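To make the kind of checks shown in the demo a bit more concrete, here is a minimal Python sketch of such validations. This is not the actual Sentence Collector code (the real validations live in its JavaScript codebase and are more extensive); the rules and thresholds below are simplified illustrations of "no abbreviations or acronyms" and "no single-word sentences".

```python
import re

# Illustrative sketch only: the real Sentence Collector validations are
# JavaScript and more extensive; rule names and thresholds here are made up.

def validate_sentence(sentence: str) -> list[str]:
    """Return a list of reasons why a sentence would be rejected (empty = OK)."""
    problems = []
    words = sentence.split()

    # Single words are not interesting to record, so require a real sentence.
    if len(words) < 2:
        problems.append("too short: single words are rejected")

    # Reject all-caps tokens such as "FOSDEM": readers may spell them out
    # letter by letter, which makes recordings inconsistent.
    if any(re.fullmatch(r"[A-Z]{2,}", w.strip(".,!?")) for w in words):
        problems.append("contains an acronym")

    # Abbreviations ending in a period (e.g. "Mr.") should be spelled out.
    if any(re.fullmatch(r"[A-Za-z]{1,3}\.", w) for w in words):
        problems.append("contains an abbreviation; spell it out instead")

    return problems


if __name__ == "__main__":
    for s in ["This is an example sentence for my talk.",
              "This is for FOSDEM.",
              "help"]:
        print(s, "->", validate_sentence(s) or "OK")
```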
You can also see if there are any rejected sentences. Sentences you have previously submitted but somebody else rejected. Those were test sentences from me to test something in production.
And therefore, I already rejected those, because they were English sentences submitted for languages that are not English. Therefore, rejected. You can also see which sentences you have submitted in general. As you can see, that's actually not too many. I'm mostly involved with the technical side of this platform and let others come up with their own sentences.
If you need to, you can also select them and click on delete selected sentences. That could be, for example, if you figure out that the source was not appropriate and you can delete those sentences yourself.
And then finally, we also have some statistics. For example, for English, we have more than 50,000 sentences in the sentence collector. 53 of those are already validated. That means they had been exported to Common Voice and therefore are part of the dataset.
In total, we have 4.7 million sentences in 134 languages, which is quite an extensive number. But given how many sentences we actually need to create hours and hours of recordings, we need to keep going.
As I said, the sentences need to be copyright free, preferably CC0. And now I want to go into a bit more detail on how the sentence collector works.
The sentence collector is a React frontend using Redux for state management, and everything is powered by a Node.js Express server.
That is also connected to the database, so that part is more or less straightforward. The interesting part is how we actually export to Common Voice once the sentences are approved. The sentence collector is currently living in a separate repository.
That means that we somehow need to get those sentences into the Common Voice repository. As it works right now, there is a subfolder in the Common Voice repository that has text files in it. And these then get deployed into the Common Voice database on website deployments.
For that to work, we have a GitHub Action which runs every week on Wednesday, in the early morning, UTC time zone. It runs an export script which fetches all the approved sentences for all the enabled languages from the backend server.
It then creates the text file, sentencecollector.txt, and pushes it to the Common Voice repository. On the next deployment, that will be put into the Common Voice database.
To give you a quick overview of how that works: it's a GitHub Action with quite a few steps. First, we need to clone the Common Voice repository, and then the export itself is the interesting part.
We go through every language, and we also print some stats for debugging purposes on how many sentences we got.
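As a rough illustration of that weekly export flow (fetch the approved sentences for every enabled language from the backend, write one text file per language into a checkout of the Common Voice repository, then commit), here is a hedged Python sketch. The real export is a Node.js script driven by the GitHub Action; the API endpoints and file paths used below are assumptions for illustration, not the actual ones.

```python
import json
import subprocess
from pathlib import Path
from urllib.request import urlopen

# Rough sketch of the weekly export flow described above. The real export is a
# Node.js script in the Sentence Collector repository; the API endpoints and the
# file layout used here are hypothetical.

API = "https://sentence-collector.example.org/api"   # hypothetical backend URL
CV_CHECKOUT = Path("common-voice")                    # clone of the Common Voice repo

def fetch_json(url: str):
    with urlopen(url) as response:
        return json.load(response)

def export_all():
    languages = fetch_json(f"{API}/languages")        # enabled languages
    for lang in languages:
        sentences = fetch_json(f"{API}/sentences/{lang}?approved=true")
        print(f"{lang}: exporting {len(sentences)} approved sentences")  # debug stats

        # One text file per language inside the Common Voice checkout
        # (the actual path in the repository may differ).
        target = CV_CHECKOUT / "server" / "data" / lang / "sentencecollector.txt"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("\n".join(sentences) + "\n", encoding="utf-8")

    # Commit so the next website deployment picks the sentences up.
    subprocess.run(["git", "-C", str(CV_CHECKOUT), "add", "."], check=True)
    subprocess.run(["git", "-C", str(CV_CHECKOUT), "commit",
                    "-m", "Sentence Collector export"], check=True)

if __name__ == "__main__":
    export_all()
```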
Next, let's look at the sentence extractor. We have figured out that using the sentence collector alone is not very scalable: we need a lot of contributors who create their own sentences or find sources that they can use. So what we have instead is an automated approach where we can extract three sentences per article from Wikipedia, which scales a lot better than individual contributions.
We also have some other sources available, such as Wikisource, and these extractions we can automate. That's what the sentence extractor is for. It is based on language-specific rule files, which I can show you now.
It's a .toml file with pre-specified keys. For example, here we have the replacements. As mentioned previously, we don't want any abbreviations. That also includes things like Mr. and Miss.
While those might be easy to pronounce, we decided to just replace them with the spelled-out words, and then we go from there. There are also other configurations, for example the minimum and maximum number of words, and whether a sentence needs to start with an uppercase letter or not. Then we have a regex which defines which symbols and letters are allowed. We also have configurations for which symbols need to match each other.
For example, on line 31 we have the lower (opening) quote symbol and the upper (closing) quote symbol. Whenever we encounter an opening quote, we also expect a closing quote, and if that is not the case, the sentence is discarded.
Then we also have the different abbreviation patterns. For English, that is fairly straightforward; in other languages, there are more elaborate rules for detecting those.
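The real extractor is written in Rust and reads these language-specific .toml rule files; every available key is documented in its README. Purely as an illustration of what such rules do (replacements, word-count limits, uppercase start, allowed symbols, matching quote pairs, acronym detection), here is a small Python sketch. The rule names below are simplified stand-ins, not the extractor's actual configuration keys.

```python
import re

# Simplified stand-in for a language rule file. These keys only mirror the ideas
# described above; they are NOT the actual keys of the sentence extractor's TOML
# format (those are documented in the extractor's README).
RULES = {
    "replacements": [("Mr.", "Mister"), ("Ms.", "Miss")],
    "min_words": 3,
    "max_words": 14,
    "needs_uppercase_start": True,
    "allowed_chars": re.compile("^[A-Za-z0-9 ,.!?'\"„“-]+$"),
    "matching_symbols": [("„", "“")],                # opening/closing quote pair
    "acronym_pattern": re.compile(r"\b[A-Z]{2,}\b"),  # e.g. FOSDEM
}

def passes_rules(sentence: str, rules: dict = RULES) -> bool:
    # Apply replacements first, e.g. spelling out titles as full words.
    for old, new in rules["replacements"]:
        sentence = sentence.replace(old, new)

    words = sentence.split()
    if not rules["min_words"] <= len(words) <= rules["max_words"]:
        return False

    # The first letter has to be uppercase (if the rule is enabled).
    first_letter = next((c for c in sentence if c.isalpha()), "")
    if rules["needs_uppercase_start"] and not first_letter.isupper():
        return False

    # Only allowed symbols and letters may appear.
    if not rules["allowed_chars"].match(sentence):
        return False

    # Paired symbols such as quotes have to match up.
    for opener, closer in rules["matching_symbols"]:
        if sentence.count(opener) != sentence.count(closer):
            return False

    # No acronyms that readers might spell out letter by letter.
    return not rules["acronym_pattern"].search(sentence)

print(passes_rules("This is a clean example sentence."))     # True
print(passes_rules("„An unbalanced quote, never closed."))   # False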
Once such a rules file is done, a pull request can be created. To show you an example: we have yet another GitHub Action that runs on every push to a pull request, and with it we have created an easy way to validate your rules.
When you open the check, you can see an extraction file uploaded as an artifact. You can download that and you will get a few thousand sentences from the Wikipedia extract to verify that your rules apply correctly. That allows you to fix mistakes in the config quite easily.
You can push your changes back up to GitHub into the pull request and look at the extraction again to see if it's fixed. All the config files are also documented in the README. Every single key you can use in a config file is documented.
If there are any questions while coming up with a pull request, feel free to open an issue. We want to keep that as extensive as possible. One interesting thing is we are using Rust for this tool. However, the Rust ecosystem is not fully perfect in terms of libraries you can use for natural language processing.
We started out using rust-punkt to split the sentences: we basically get the full text and then we split out the different sentences. That didn't work out perfectly, so there is now also a way to use
a small Python script together with, for example, NLTK, which splits sentences much better. That was a very interesting project I worked on. It's, as of right now, quite hacky to do inline Python. However, that enables a lot of other things that could be done.
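For reference, this is roughly what NLTK-based sentence splitting looks like from Python, assuming the pretrained Punkt model is available via nltk.download. This is just a minimal illustration, not the extractor's actual helper script.

```python
# Minimal example of NLTK's Punkt-based sentence splitting; this is not the
# extractor's actual helper script, just an illustration of what it relies on.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # fetch the pretrained Punkt model once

text = (
    "Common Voice needs many sentences. They are read out loud by volunteers, "
    "who record their voices! Splitting raw text into sentences is the first step."
)

for sentence in sent_tokenize(text, language="english"):
    print(sentence)
```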
In theory, now in hindsight, it probably was not the best idea to start with Rust for this one. I think if we had started with Python from the start for this tool, we could have done a few things way easier.
That being said, I don't think we need to rewrite it right now, but in a few years, depending on what else we want to implement in this tool, it might be beneficial. The third method to get sentences into Common Voice is bulk uploads.
If you find a source that has a lot of sentences to upload (we are talking several thousand, 50k or even more), we also provide the possibility to create a direct pull request against the Common Voice repository.
That basically makes sure that for very large data sets we don't need to push them through the sentence collector, because reviewing 60, 70, 80 thousand sentences in the sentence collector is not really feasible.
This works mostly because they're coming from the same source: we can have multiple reviewers review a statistically significant sample, and if the error rate is below the acceptable margin, we are good to go and the pull request can be merged.
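To illustrate the idea of reviewing only a sample of a bulk submission, here is a small Python sketch. The sample size and the acceptable error rate below are placeholder values; the actual requirements are described in the community playbook mentioned next.

```python
import random

# Illustration only: sample size and error threshold are placeholder values;
# the actual requirements are described in the Common Voice community playbook.
SAMPLE_SIZE = 400
MAX_ERROR_RATE = 0.05  # 5 percent

def review_bulk_submission(sentences, is_valid, sample_size=SAMPLE_SIZE):
    """Review a random sample of a large bulk submission.

    `is_valid` stands in for the human reviewers' verdict on each sentence.
    Returns True if the observed error rate stays below the threshold.
    """
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    errors = sum(1 for sentence in sample if not is_valid(sentence))
    error_rate = errors / len(sample)
    print(f"reviewed {len(sample)} sentences, error rate {error_rate:.1%}")
    return error_rate <= MAX_ERROR_RATE

# Example with a fake data set where roughly 3% of sentences get flagged.
fake_sentences = [f"Sentence number {i}." for i in range(80_000)]
print(review_bulk_submission(fake_sentences, is_valid=lambda s: random.random() > 0.03))
```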
For more info on that, you can look at the Common Voice community playbook linked here. The slides are all attached to the talk and there are also more examples and more information in the playbook for the full contributions that are possible to Common Voice.
Those were the three currently available contributions for sentences. I want to quickly go into details on how you can contribute generally.
If you speak a language that is not yet covered by the sentence extractor, we would appreciate it if you could take the time to create the rules definition for that language. Then we can make sure that we can also extract from Wikipedia for it. Another opportunity is contributing to the sentence collector itself.
Feel free to create pull requests. There are some React parts that I previously refactored but didn't go fully through with it. There are still some components, for example, that do more than they should be doing. Have a look at the code, create an issue if you find something that could be improved and we can take the discussion there.
Any help there is appreciated. There are also some open issues that can be tackled right away. Another contribution that would be super valuable is to help us find good public domain sentences that could be datasets that cover multiple languages or could also be for one specific language.
And last but not least, contributing your voice to record sentences would also be greatly appreciated. That being said, please don't forget to also listen to other people's clips and validate them.
We have a lot of recordings but all of these recordings also need to be validated by other people as well. If you need any help, if you want to coordinate a new language or coordinate existing communities, head over to our discourse forum linked here on this slide.
And create a new topic for anything that might come up. For today, I'm also hanging out in the Matrix Q&A later after this talk. And feel free to drop any questions there as well and I will try to help you out.
I also hang out on the Common Voice Matrix channel on the Mozilla server, so feel free to drop by there as well. Ask your questions. If you have any, give us your input. We greatly appreciate that.
And now, thank you for your attention. I'm looking forward to the Q&A. Hope to see a lot of questions. Thank you.
And we are back live. Congratulations, Michael. Such a wonderful and insightful presentation. I did learn a lot. So personally, I was not aware of some of the things that are behind Common Voice and I'm really glad that you brought this topic in for us.
How are you feeling today? I'm feeling well. All good? I am a little bit sorry that the audio was a little bit low. It's fine, we all turned our sound up a bit louder and I think it was decent.
Let's get some questions, because we do have some already popping in. If not, do enter the Mozilla Dev Room and add your question, and when it gets upvoted, it will appear here too. I'll start with the last one, but it's up at the top now: the CC0 requirement is a very strict one, especially for languages with a small speaker base in their starting phase. How can this be addressed?
Yes, that is indeed a very strict requirement. However, I forgot to mention that in the talk, there is a process to ask other people if they are willing to actually release it under CC0. There is basically a legal document that contributors can send to, for example, news organizations.
And if they're willing to sign that document, then we can use those. I will later on post a link to that process in the Dev Room as well as in this channel here. That would be cool.
Maybe even tweet it out for those who joined and then left the call, or who are not in the Dev Room. Okay, let's take one about the datasets: in which format are the datasets stored? To be honest, I am not really working on the dataset itself; I'm mostly on the topics that I discussed in the talk.
The metadata is stored in text files. Those are, as far as I know, separate data. And for the audio, I would honestly need to double-check. I think people can also ask questions about that in the project.
You might have mentioned it at the end, but I was already in this room preparing. Where is the place for the team to be contacted? Should they just use GitHub and log an issue there, or is there a Matrix channel?
There is certainly the opportunity to create an issue on GitHub. However, we have a Discourse forum at discourse.mozilla.org with a Common Voice section; that's probably the best way to get in contact with the community. There is also a Matrix channel on the Mozilla Matrix server.
It's called Common Voice, so it should be easily findable as well. Otherwise, that Matrix channel is also linked in the documentation in the GitHub repository. Super cool. Let's get a question about the users, because this project has been going on for a while now, a few years.
And Büllen was asking: are there any public projects that have successfully used Common Voice datasets? The one that comes to mind is Coqui, which has a lot of language models that are also based on the Common Voice dataset.
I will also tweet that link out. That would be cool. Thanks so much for this.