Teaching machines to handle bugs and test Firefox more efficiently.

Formal Metadata

Title
Teaching machines to handle bugs and test Firefox more efficiently.
Number of Parts
542
Author
Marco Castelluccio
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
How Mozilla uses machine learning to streamline its development process: automating various aspects of bug management (such as accurately assigning components, detecting types, and identifying spam), trying to predict potential regressions and selecting relevant tests for specific patches. In addition, an overview of future directions for privacy-respecting machine learning usage in Firefox, with the support of the community.
Transcript: English (auto-generated)
Hello, everyone. I'm Marco Castelluccio. Thank you for being here to listen to my talk. I'm an engineering manager at Mozilla, and I've been at Mozilla for almost 10 years now: I started as a contributor, then I was an intern, and then I was hired.
I started working on some funny projects like writing a Java VM in JavaScript, and then more recently I started focusing on using machine learning and data mining techniques to improve developer efficiency,
which has also been the subject of my PhD. During this talk, I will show you how we will all be out of a job in a few years. Joking. I will just walk you through our journey of how we incrementally built
features based on machine learning for improving software engineering, one on top of the other. I'm the father of two dogs, Luna and Nika, and
before we start the presentation, I wanted to explain a little why we need to do all these complex machine learning things on top of bugs, CI, patches, etc., etc.
Firefox is a very complex piece of software. It's a browser. We have hundreds of bug reports and feature requests opened per day. We have 1.8 million bug reports at this time, which is almost the price of a one-bedroom apartment in London. We release every four weeks with thousands of changes, and
during 2022 we had 13 major releases and 45 minor releases. As you can see, we even sometimes party when we reach a certain number of bugs.
As I said, Firefox is one of the biggest software projects in the world. We have a lot of legacy: Netscape was open-sourced 25 years ago, and a few days ago we celebrated its 25th birthday. Over time, we had 800,000 commits made by 9,000 unique contributors,
representing 25 million lines of code. We had 37,000 commits only last year by 1,000 unique contributors. Not all of them are paid. Many of them are volunteers.
And this is a list of the languages that we use. As you can see, we use many of them. We have C, C++, and Rust for low-level things. Rust is gaining ground and is probably going to overtake C soon. We use JavaScript for the front-end and for tests, and we use Python for CI and the build system, but we have many more.
So if anybody is interested in contributing, you have many options to choose from. But let's see. So as I said, the complexity is really large. We have
thousands and thousands of bugs, and we need some way to control quality, to increase the visibility into the quality of the software. And we cannot do that if the bugs are left uncontrolled. One of the first problems that we had was that there was no way to differentiate between
defects and feature requests. We call them bugs on Bugzilla, but they are actually just reports: many of them are defects, many of them are actually just feature requests. And so at the time, we had no way to measure quality. We had no way to say: in this release we have 100 bugs, in this other release
we had 50, so this release is better than the previous one. We needed a way to make this differentiation in order to measure quality. And it was also hard to improve workflows if we had no way to differentiate between them. So we thought of introducing a new type field. This might seem simple:
it's just a choice between defect, enhancement, and task. But in practice, when you have 9,000 unique contributors, some of them not paid, it's not easy to enforce a change like this.
And we also had another problem. We have almost two million existing bugs. If we just introduce this type, it's not going to help us at all until we reach a critical mass of bugs with the type set. So if we just introduced it now, it would only start to be useful six months from now. So we thought, how do we
set the field for existing bugs so that this actually becomes useful from day one? And we thought of using machine learning. So we collected a data set (I'm not sure it can be considered large nowadays)
with 2,000 manually labeled bugs, which a few of us labeled independently, sharing the labeling so that we were consistent, plus 9,000 bugs labeled with some heuristics based on fields that were already present in Bugzilla. Then,
using the fields from Bugzilla and the title and comments fed through an NLP pipeline, we trained an XGBoost model, and we achieved an accuracy that we deemed good enough to be used in production. And this is how the
bugbug project started. It was just a way to differentiate between defects and non-defects on Bugzilla. We saw it worked, and then we thought: what if we extend this to something else? What is the next big problem that we have on Bugzilla?
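As a rough sketch of what a pipeline like the one just described can look like, assuming scikit-learn and XGBoost (the column names, feature choices, and parameters here are illustrative, not bugbug's actual schema):

```python
# Minimal sketch: Bugzilla fields plus free text (title, first comment)
# feeding an XGBoost classifier. Column names are hypothetical.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("features", ColumnTransformer([
        # NLP side: turn the title and first comment into sparse TF-IDF features.
        ("title", TfidfVectorizer(max_features=10_000), "title"),
        ("comment", TfidfVectorizer(max_features=10_000), "first_comment"),
        # Structured side: one-hot encode categorical Bugzilla fields.
        ("fields", OneHotEncoder(handle_unknown="ignore"),
         ["product", "severity", "platform"]),
    ])),
    ("classifier", XGBClassifier(n_estimators=200)),
])

# X is a pandas DataFrame with one row per bug; y is 1 for defect, 0 otherwise.
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
```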
And it was assigning components. Again, we have lots of bugs, hundreds of thousands of them. We need a way to split them into groups so that the right team sees them, so that the right people see them. And
the faster we do it, the faster we can fix them. At the time, it was done manually by volunteers and developers. You can see a screenshot here: product and component, PDF Viewer. In this case, we didn't need to manually create a data set, because all of the 1 million bugs were already
manually split into groups by volunteers and developers in the past. So we had, in this case, a very large data set: two decades' worth of bugs. The problem here was that we had
to roll back each bug to its initial state, because otherwise, by training the model on the final state of the bug, we would have used future data to predict the past, which of course would not be valid. So we rolled back the history of the bug to the beginning.
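Conceptually, the rollback works like this (a toy sketch; Bugzilla's real history format is richer than this):

```python
# Reconstruct the state of a bug at filing time by undoing its recorded field
# changes, newest first. The history format here is simplified.
def rollback(bug, history):
    """Return the bug as it looked when it was filed."""
    initial = dict(bug)  # start from the current (final) field values
    for change in reversed(history):  # undo the newest change first
        for field_change in change["changes"]:
            # Put back the value that this change removed.
            initial[field_change["field_name"]] = field_change["removed"]
    return initial

bug = {"component": "PDF Viewer", "priority": "P1"}
history = [
    {"changes": [{"field_name": "component", "removed": "Untriaged", "added": "PDF Viewer"}]},
    {"changes": [{"field_name": "priority", "removed": "--", "added": "P1"}]},
]
print(rollback(bug, history))  # {'component': 'Untriaged', 'priority': '--'}
```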
We also reduced the number of components, because again, at the Firefox scale, we have a very large number of components. Many of them are no longer actually maintained and no longer relevant, so we reduced them to a smaller subset. And again, we had the same kind of architecture to train the model.
There was a small tweak: we didn't have perfect accuracy, and so we needed a way to trade off precision and recall, that is, pay the price of lower quality but catch more bugs, or catch fewer bugs but be more precise. We can control this easily with a confidence level that is output by the model,
which allows us to sometimes be more aggressive and sometimes less aggressive, but at least we can have a minimum level of quality that we enforce. The average time to assign a bug went from one week to a few seconds.
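The gating logic is simple; here is a sketch, with an illustrative threshold rather than the production one:

```python
# Only act on the model's suggestion when its probability clears a threshold;
# below it, the bug is left for human triage. Threshold is illustrative.
import numpy as np

CONFIDENCE_THRESHOLD = 0.8

def autoclassify(model, X_bugs, bug_ids):
    probs = model.predict_proba(X_bugs)  # shape: (n_bugs, n_components)
    for bug_id, p in zip(bug_ids, probs):
        best = int(np.argmax(p))
        if p[best] >= CONFIDENCE_THRESHOLD:
            yield bug_id, best, float(p[best])  # confident: auto-assign
        # otherwise: leave the bug for manual triage
```

Raising the threshold makes the bot more precise but lets it touch fewer bugs; lowering it catches more bugs at the price of more mistakes, which is exactly the knob described above.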
Over time, we auto-classified 20,000 bugs, and since it worked, we also extended it to webcompat.com, which is yet another bug reporting system that we have at Mozilla; if you find web compatibility bugs,
please go there and file them, because it's pretty important. And you can see here an action of the bot moving a bug to, again, the Firefox PDF Viewer component. Maybe I should have used another example, just for fun. Now, we had something working, and it was starting to become promising, but we needed to
make it better. We needed to have a better architecture for the machine learning side of things. We needed to retrain the models. We needed to collect new data. We needed to make sure that whenever a new component comes in, we retrain the model with the new components. If a component stops being used, we need to remove it from the data set and things like that.
So we built over time a very complex architecture. I won't go into too many details because it will take too long, but maybe if somebody has questions later, we can go into that.
And with the new architecture, it was easier to build new models. We even had contributors building models all by themselves. In particular,
there was a contributor, Ayush, who helped us build a model to root out spam from Bugzilla. It seems weird, but yes, we do have spam on Bugzilla as well. People are trying to get links to their websites into Bugzilla because they think search engines will index them. It's not actually the case.
We tell them all the time, but they keep doing it anyway. We also have university students. Bugzilla is probably the most studied bug tracking system in research, and we have many
university students from many countries that use Bugzilla as a playing field. Many times, we even contact the universities and professors, asking if we can help them give more relevant topics to
students, etc., but they keep filing bugs. And this contributor, maybe he was from one of these schools, was tired of it and helped us build a model. And the results were pretty good. I'll show you a few examples of bugs that were caught by the
model. So for this one, if you look just at the first comment of the bug, it looks like a legit bug, but then the person created a second comment with a link to their website, and it was pretty clear that it was spam.
This one is another example. This is actually a legit bug. It's not spam. Maybe it's not so usable as a bug report, but it was not spam. And then somebody else, a spammer, took exactly the same contents, created a new bug,
injecting the link to their website into the bug report. And somehow, I don't know how, the model was able to detect that it was spam. It's funny because, when you file a bug on Bugzilla, Bugzilla will automatically insert the user agent so that we have as much information as possible to fix the bug.
But in this case, he was filing the bug by copying the contents of the other bug. So we have two user agents, and they're even on different platforms: one on Mac and one on, well, he was using Chrome, actually.
Okay, so we were done with bugs. Well, we are not done with the bugs. We will have plenty of things to do in the future forever. But we were happy enough with bugs, and we thought, what can we improve next?
One of the topics that we were focusing on at the time was testing and the cost associated with testing. We were experimenting with code coverage, trying to collect coverage to select relevant tests to run on a given patch.
But it was pretty complex for various reasons, so we thought maybe we can apply machine learning here as well. But before we go into that, let me explain a bit about our CI, because it's a little complex. So we have three branches, three repositories, which all kind of share the same code, Firefox.
We have Try, which is on-demand CI. We have Autoland, which is the repository where patches land after they've been reviewed and approved. And we have Mozilla Central, which is the actual repository where Firefox source code lives and where
from which we build Firefox Nightly. On Try, we run whatever the user wants. On Autoland, we run a subset of tests. At the time, it was kind of random, what we decided to run. And on Mozilla Central, we run everything.
To give you an idea, on Try we will have hundreds of pushes per day. On Autoland, the same. And on Mozilla Central, we have only three or four, and it's restricted only to certain people that have the necessary
permissions, since you can build Firefox Nightly from there, and it's going to be shipped to everyone. The scale here is similar to the bug case. We have 100,000 unique test files. We have around 150 unique test configurations: combinations of
operating systems; high-level Firefox configurations, so old style engine versus new style engine, one graphics engine versus another graphics engine, etc., etc. We have debug builds versus
optimized builds. We have ASan, TSan, code coverage, etc., etc. Of course, the matrix is huge, and you get to 150 configurations. We have more than 300 pushes per day by developers, and the average push takes 1,500 hours of machine time if you were to run everything one job after the other. It adds up to 300 machine years per month to run these tests.
If you were to run all of the tests in all of the configurations, you would need to run around
2.3 billion test files per day, which is of course unfeasible. And this is a view of Treeherder, which is the user interface for Mozilla's test results. You can see that it is almost unreadable.
Luckily, the green stuff is good. The orange stuff is probably not good. You can see that we have lots of tests and we spend a lot of money to run them. So what we wanted to do was reduce the machine time spent running the tests.
We wanted to reduce the end-to-end time so that developers, when they push, quickly get a result: yes or no, your patch is good or not. And we also wanted to reduce the cognitive overload for developers. Looking at a page like this,
what is it? It's impossible to understand. Also, to give you an obvious example: if you're changing the Linux version of Firefox, say you're touching GTK,
you don't need to run Windows tests. At the time, we were doing that: if you touched GTK code, we were running Android, Windows, and Mac tests. That was totally useless. And the traditional way of running tests on browsers doesn't really work; you cannot run everything on all of the pushes.
Otherwise, you will have a huge bill from the cloud provider. And we couldn't use coverage because of some technical reasons. So we thought: what if we use machine learning? What if we extend bugbug to also learn from patches and tests?
So the first part was to use machines to try to parse this information and try to understand what exactly failed. It might seem like an easy task if you have 100 tests or 10 tests, but when you have 2 billion tests,
you have lots of intermittently failing tests. These tests fail randomly; they are not always the same. Every week, we see 150 new intermittent tests coming in. It's not easy to automatically say whether a failure is actually a failure or whether it is an intermittent.
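To give an idea of the kind of heuristic involved, here is a toy classifier based only on retriggers of the same test on the same push; real sheriffing and our heuristics use far more signal than this:

```python
# Toy heuristic: if a test both passed and failed on retriggers of the same
# push (i.e. on identical code), the failure is probably intermittent.
def classify_failure(retrigger_outcomes):
    """retrigger_outcomes: list of 'pass'/'fail' for one test on one push."""
    if "pass" in retrigger_outcomes and "fail" in retrigger_outcomes:
        return "intermittent"  # flaky: both outcomes on the same code
    if all(outcome == "fail" for outcome in retrigger_outcomes):
        return "likely-regression"
    return "pass"

print(classify_failure(["fail", "pass", "pass"]))  # intermittent
```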
Not even developers are able to do that sometimes. Also, not all of the tests are run on all of the pushes. So if I push my patch and a test doesn't run, but it runs later on another patch
and fails, I don't know if it was my fault or somebody else's fault. And so we have sheriffs, people whose main focus is watching the CI, and they are pretty experienced at doing that,
probably better than most developers. But human errors still exist. Even with their annotations, it's pretty hard to be sure about the results. You can see a meme that some sheriff created.
Flaky tests are the infamous intermittently failing tests. So the second step, after we implemented some heuristics to try to understand the failures due to a given patch, was to analyze patches,
and we didn't have readily available tools, at least not fast enough for the amount of data that we are talking about. We just use Mercurial for authorship info. So who's the author of the push? Who's the reviewer?
When was it pushed? Etc., etc. And we created a couple of projects written in Rust to parse patches efficiently and to analyze source code. The second one was actually a research partnership with the Politecnico di Torino. And
the machine learning model itself is not a multi-label model, as one might think, where each test is a label; it would be too large with the number of tests that we have. The model is simplified: the input is the
tuple of test and patch, and the label is just fail / not fail. So the features actually come from the test, the patch, and the link between the test and the patch. For example: the past failures when the same files were touched,
the distance from the source files to the test files in the tree, and how often source files were modified together with test files. Of course, if they are modified together, they are probably somehow linked. Maybe you need to fix the test, and so when you push your patch, you also fix the test. This is a clear link.
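A sketch of that featurization, with a placeholder `history` object standing in for the real queries against VCS and CI history (only the tree-distance helper is fully implemented here):

```python
# Featurize a (test, patch) pair as described above. `history` is hypothetical.
def tree_distance(a, b):
    """Number of directory steps between two files' locations in the tree."""
    pa, pb = a.split("/")[:-1], b.split("/")[:-1]
    common = 0
    for x, y in zip(pa, pb):  # count the shared directory prefix
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def features(test_path, patch_paths, history):
    return {
        # How often this test failed in the past when the same files were touched.
        "past_failures": history.failures(test_path, patch_paths),
        # Minimum tree distance between the test file and the touched files.
        "min_distance": min(tree_distance(test_path, p) for p in patch_paths),
        # How often the test file was modified together with the touched files.
        "co_modifications": history.co_modifications(test_path, patch_paths),
    }

print(tree_distance("dom/tests/test_foo.html", "dom/base/Document.cpp"))  # 2
```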
But even then, we have lots of test redundancy. So we use frequent itemset mining to try to understand which tests are redundant and remove them from the set of tests that are selected to run. And this was pretty successful as well.
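The redundancy idea can be sketched with pairwise association rules over past failures; the thresholds here are illustrative:

```python
# If, over past pushes, test B failed (almost) every time test A failed, the
# pair is a frequent itemset and B adds little signal once A is selected.
from collections import Counter
from itertools import combinations

def redundant_pairs(failure_sets, min_support=10, min_confidence=0.95):
    single, pair = Counter(), Counter()
    for failed in failure_sets:  # each push contributes its set of failing tests
        single.update(failed)
        pair.update(combinations(sorted(failed), 2))
    for (a, b), n_ab in pair.items():
        if n_ab < min_support:
            continue
        if n_ab / single[a] >= min_confidence:
            yield b, a  # b is redundant given a
        if n_ab / single[b] >= min_confidence:
            yield a, b  # a is redundant given b

pushes = [{"test_a", "test_b"}, {"test_a", "test_b"}, {"test_c"}] * 10
print(list(redundant_pairs(pushes)))  # each of the pair is redundant given the other
```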
So now we had an architecture to train models on bugs and to train models on patches and tests. The next step was to reuse what we built for patches
to also try to predict defects. This is actually still in an experimental phase; it's kind of a research project. So if anybody is interested in collaborating with us on this topic, we would be happy to do so.
I will just show you a few things that we have done in this space for now. So the goals are: to reduce regressions by detecting the patches that reviewers should focus on more than others; to reduce the time spent by reviewers on
less risky patches; and, when we detect that a patch is risky, to trigger some risk control operations, for example, running fuzzing tests more comprehensively on these patches, and things like this. Of course, the model is just an evaluation of the risk.
It's not actually going to tell us if there is a bug or not, and it will never replace a real reviewer, who can actually review the patch more precisely. The first step was, again, to build the data set. It is not easy to know which patches cause regressions;
it's actually impossible at this time. There are some algorithms that are used in research, the most famous one being SZZ, but we had some hints that it was not so good. So we started, here again, by introducing a change in the process that we have. We introduced a new field, which is called Regressed By,
so that developers and QA users can specify what caused a given regression. So when they file a bug, if they know what caused it, they can
specify it here. If they don't know what caused it, we have a few tools that we built over time to automatically download builds from our CI that I showed earlier, builds from the past, and run a bisection to try to find what the cause is for the given bug.
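The bisection those tools automate is essentially a binary search over past builds; here is a sketch, with `reproduces` standing in for actually downloading and testing a build:

```python
# Binary-search a sorted range of past nightly builds for the first build
# where the bug reproduces. Assumes the bug is absent in the first build
# and present in the last one.
def bisect(build_dates, reproduces):
    """build_dates: sorted list; reproduces(date) -> bool."""
    lo, hi = 0, len(build_dates) - 1  # bug absent at lo, present at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(build_dates[mid]):
            hi = mid  # the regression happened at mid or earlier
        else:
            lo = mid  # the regression happened after mid
    # The pushlog between build_dates[lo] and build_dates[hi]
    # names the candidate regressing commits.
    return build_dates[hi]
```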
With this, we managed to build a pretty large data set: 5,000 links between bug-introducing and bug-fixing commits (actually commit sets), which amounts to 24,000 commits. And
with this data set, we were able to evaluate the current algorithms that are presented in the literature. And as we suspected, they are not working well at all. So this is one of the areas of improvement for research. One of the improvements that we tried to apply
to SZZ was to improve the blame algorithm (annotate, if you're more familiar with Mercurial): instead of looking at lines,
splitting changes by words and tokens, so you can see past changes by token instead of by line. This is a visualization from the Linux kernel. This is going to give you a much more precise view of what changed in the past.
For example, it will skip over tab-only changes, whitespace-only changes, and things like that. If you add an if, your code will be indented more, but you're not actually changing everything inside; you're changing only the if.
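A toy version of the token-level idea, using Python's difflib (real blame additionally tracks which commit introduced each token):

```python
# Diff by tokens instead of lines, dropping whitespace, so a re-indented line
# only "blames" the tokens that actually changed.
import difflib
import re

def tokens(text):
    # Split into words/identifiers and single punctuation marks; whitespace
    # (including indentation) produces no tokens at all.
    return re.findall(r"\w+|[^\w\s]", text)

def token_diff(old, new):
    matcher = difflib.SequenceMatcher(a=tokens(old), b=tokens(new))
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

old = "if x:\n    do_thing()"
new = "if x and y:\n        do_thing()"  # extra condition plus re-indentation
print(token_diff(old, new))  # only the inserted 'and y' tokens show up
```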
This actually improved the results, but it was not enough to get to an acceptable level of accuracy. But it's nice, and we can actually use it in the IDE. We're not doing it yet, but we will, to give more information to users, because developers use annotate and git blame a lot.
This is a UI, a work in progress, for analyzing the risk of a patch. This is a screenshot from our code review tool. We are showing the result of the algorithm along with its confidence. In this case, it was a risky patch with
79% confidence, and we give a few explanations to the developers. This is one of the most important things: developers, like any other users, do not always trust results from machine learning, and so you need to give them an explanation.
And this is another part of the output of our tool, again in our code review tool. We're showing, on the functions that are being changed by the patch, whether the function is risky or not, and which bugs in the past were involved
in this function. So developers can try to see if the patch is reintroducing a previously fixed bug, and they can also know what kind of side effects there are when you make changes to a given area of the code.
Now, we did a lot of stuff for developers. We trained models for bugs, we trained models for patches, we trained models for tests, we trained models to predict defects.
Now I'm going to move to a slightly different topic, even though it's connected: privacy-friendly translations. So we're working on introducing translations in Firefox. The subtitle was actually translated automatically using Firefox Translations, which you can use nowadays.
The idea is that translation models have improved a lot in recent times, but current cloud-based services do not offer the privacy guarantees that we like to offer in Firefox.
They are closed source. They are not privacy-preserving. So we started a project, funded by the European Union, to investigate client-side private translation capabilities in Firefox itself. It is currently available as an add-on that you can install in Firefox.
We support many European languages, and we are working on supporting even more. We're going to also work on supporting non-European languages like Chinese, Korean, Japanese, etc. And
in this case, we use machine learning on the client side to perform the translation. So your data never leaves your Firefox. The models are downloaded from our servers, but they run locally on your machine. So the contents of the web page that you're looking at will never go to Google, Bing, or whatever.
They will be translated locally on your machine. We use a few open data sets. Luckily, we have lots of them from past research. Not all of them have good quality, but many of them do, and we are looking for more. So if you have
suggestions for data sets that we can use, please let us know. On the data sets, we perform some basic data cleaning, and we use machine learning-based techniques to remove sentence pairs that we believe are bad. Of course,
the data sets that I mentioned before are open, but sometimes they are just crawled from the web, so they contain all sorts of bad sentences, and also HTML tags and stuff like that, which we need to clean up. Otherwise, the
models will learn to translate HTML tags. And we use some techniques to increase the size of the data sets automatically, like back translation: translating sentences from one language to the other and then back, in order to increase the size of the data sets.
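Back translation itself is simple; a sketch, where `reverse_model` is a placeholder for a model trained in the opposite direction:

```python
# Data augmentation by back translation: take monolingual target-side text,
# translate it to the source language with a reverse model, and pair the
# synthetic source with the original target as a new training example.
def back_translate(target_sentences, reverse_model):
    augmented = []
    for target in target_sentences:
        synthetic_source = reverse_model.translate(target)  # e.g. French -> English
        augmented.append((synthetic_source, target))        # new (source, target) pair
    return augmented
```

The target side of each synthetic pair is real, human-written text, which is what matters most for the fluency of the trained model.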
So we train a large model on our cloud machines, and it is pretty large: you can see it's around 800 megabytes. So for every language pair, you would need to download 800 megabytes, and it is
super slow, so we can only use that in the cloud. So we use some techniques to reduce the size of these models and to make them faster. We use knowledge distillation, basically using the
large model that we trained as a teacher for a student model, which is much smaller. So you can see that from 800 megabytes, we got to 216; I think now we're around five, six, something like that. So it's much smaller, and you can actually download it on demand from our servers.
And we use quantization for further compression and performance improvements, moving the model's data from float32 to int8.
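As a sketch of what post-training int8 quantization does to a weight matrix (real engines typically quantize per row or per column and fuse the scales into the matrix multiply):

```python
# Store int8 weights plus a float scale; reconstruct approximate float32 on use.
# Storage drops 4x, and int8 arithmetic is faster on most hardware.
import numpy as np

def quantize(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize(w)
print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error
```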
Then we compile the machine translation engine to WebAssembly in order to be able to use it inside Firefox. We introduced some SIMD extensions into WebAssembly and into Firefox in order to be even faster when translating, even though we translate a bit at a time. So it's pretty fast.
And the engines are downloaded and updated on demand. Let me show you a demo. So you can see
my Firefox is in Italian, but it automatically detected that the page is in French, and it is suggesting that I translate it to Italian. I will change it to English. Oh, fuck. So it is downloading the model. Now it's translating.
So while it was translating, you already saw that the contents of the first part of the page were already translated. So it's super quick in the end. And the translation seems to be pretty good. I don't speak French, but I think it makes sense.
You can also use it from the toolbar. So you can choose a language and translate it to another. Let's do Italian to French. It works.
So if you know any data set that we can use, in addition to the ones that we already use, or
if you're interested in building a great new feature in Firefox, or if you want to add support for your language, or improve support for your language, come and talk to us at our booth. We would be really happy if you could help us. And before we come to an end, let me show you how far we've come. The dogs have grown, and
we have learned that it is possible to tame the complexity of large-scale software. It is possible to use the past history of development to support the future of development.
And it is possible to use machine learning in a privacy-friendly way and in the open. What else could we do with the data and the tools that we have at our disposal? I don't know. I'm looking forward to finding out. I'm looking forward to seeing what other wild ideas you and we at Mozilla can come up with. Thank you.
Thank you very much, Marco, for the amazing talk. Now we're open for questions. If anyone would like to ask a question, please raise your hand so you can take the microphone.
Questions, questions, questions. Hands up. There. Okay, okay. I'm sorry, I'm learning. I'm new to this. I'm coming up.
Hello. I actually have two questions. The first question is: have you actually thought about the idea of using this mechanism to automatically translate
the interface of Mozilla products? Sorry? Testing? Yes. Yeah. So the question is: have you thought about a mechanism for automatically translating the interface of Mozilla Firefox products, or maybe the documentation you already have, like MDN?
Because there's still a demand to translate this stuff. I'm sorry, I'm still not hearing well. Can you maybe come closer?
From here? Okay. Is it better now? Yes. Okay. So my question is: have you tried to use this mechanism of automatic translation
for the existing interface you have in the products, and especially also the documentation part? Because it's kind of a vital part: when you need to translate new functionality, or you have to translate something new in the interface, you need the help of a translator. But if you already know how to translate and are doing this stuff, that means you already have a data set, so you could actually automatically translate new parts of the interface
without a translator. Yes. So it is definitely something that could be used to help translators do their job. We could translate parts of the interface automatically, and of course there would always be some review from an actual translator to make sure that the
translation makes sense in the context, especially because in the Firefox UI you sometimes have very short text and it needs to make sense. But yeah, it's definitely something that we have considered, and actually one of the data sets that we use from the list is called Mozilla L10N, and it contains sentence pairs from our browser UI. People are actually
using it in research for automating translations. Does anyone have any other question? Please raise your hands if you have any other question for Marco. Okay. If not,
thank you very much again, Marco. Thank you.