Web compatibility
Formal Metadata
Title: Web compatibility
Title of Series: FOSDEM 2020
Number of Parts: 490
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47366 (DOI)
Language: English
FOSDEM 2020, 439 / 490
Transcript: English (auto-generated)
00:05
And our next talk will be about one of the most important buzzwords of machine learning, and how his project helped the web compatibility team, who have to parse around 1,000 issues from volunteers about the open web and all.
00:22
The speaker, Yanis Yanelos, also known as John, or famously, Nemo, and yes, we found him, is part of Open Innovation and has this project, again, as I mentioned, together with the web compatibility team at Mozilla. Please welcome John.
00:43
So hi. Thank you for having me here. My name is Yanis Yanelos. I'm working in the Open Innovation team as a software engineer. And I'm here to talk to you about web compatibility and machine learning. So here's a rough outline.
01:01
And yeah, Open Innovation. So my team is trying to bring innovation in an open way across the org and experiment with other teams and ideas and prototype stuff that might work, might not work, iterate and try to improve things in an open way
01:22
and also trying to keep track of the value that it brings back. And one of the projects we worked on was with the web compatibility team. So the team's initiative is to tackle the issue of web compatibility on the web, which means that we are trying to reduce the issues where
01:42
different websites render differently or behave differently in different browsers or where apps or projects or websites don't work in different browsers. And part of it is gathering feedback from users and getting input and triaging the input
02:01
and providing feedback back to the browser vendors. So here's how it looks, a very basic reporting workflow. So let's say we are on example.com, and there is a compatibility issue. Then users click on "report site issue". Then they fill in a form.
02:21
And after that, it goes through our process, and it generates an issue on GitHub. And the issue usually has some sort of basic title saying what the website is and what the generic type of issue is. And then if you dive in, there are more details on how to reproduce it
02:40
and where, like with which browsers or devices, the users have tried this, and even some trace bugs from the browser. And as you can see, part of the process of triaging is putting labels on, and people doing manual work in order
03:00
to see what's wrong, or whether something is a false positive, or whether it's something that is valuable, or something that is very valuable. So for example, let's say Wikipedia is broken on a browser. Then it's probably much more important than, I don't know, a very, very tiny local website. So this is where most of the triaging happens.
03:21
It's on GitHub issues, using labels and milestones. And disclaimer, I'm not any sort of data science or machine learning expert, but we tried to see what innovation looks like in that direction. And here's some context. We worked closely with the Webcompat team.
03:41
The idea is based on Mozilla BugBug, which is Firefox release engineering's machine learning initiative, where they were trying to introduce some machine learning principles and concepts into the whole triaging process. And the problem statement that came up is that the Webcompat project's reporting cadence is so fast.
04:01
So many people submit Webcompat reports. Firefox targets millions of users, and this button is accessible to all sorts of different websites and different users. So people submit a lot of feedback, but only a tiny fraction of the feedback is important,
04:22
and only a tiny fraction of the user feedback ends up being valuable to any browser vendors. So this leads us to another problem statement, that the Webcompat reporting signal to noise ratio is too small. We have a lot of spam. We have a lot of abusive content.
04:40
People don't necessarily understand the concept of web compatibility, so they might submit content like, oh, I have a virus, or spammy words, stuff like that. So only a small fraction of the stuff that people submit actually ends up being valuable. And one thing that we came up with while trying to figure out
05:01
how to improve the project is that we have a lot of historic data. We have around 50k report entries, and on top of that, we also have all the events and the historic data around the reports. So we have like five times more events: when something was labeled, or when something was milestoned,
05:22
or who got assigned an issue, and stuff like that. So we said that it might be a good idea to train a model using the Webcompat data as input to improve the triaging process. But then we came back to reality,
05:42
and we figured out that we're not machine learning experts, and we barely know what these things look like. But everything is going to be fine. It's going to work out all good. So here's what an example data point looks like. We have the title from the issue. We have the content of the issue in free form.
06:04
And yeah, everything lives on GitHub. So first steps. We figured out that data living on GitHub is not very future proof, because we hit all sorts of different issues, like throttling and failures of the API.
06:20
Sometimes we couldn't get the data, like they were not available. It was so much that, even within the GitHub API policies, it was barely doable to actually use this as a data storage. So we came up with the idea of using some other data storage.
06:41
And apparently Elasticsearch and Kibana were a good fit. So what we did is we took all the Webcompat issues from the API in JSON format, we fed everything into Elasticsearch, and then we used Kibana for the analytics. And it was very useful, because we never had the opportunity to actually have analytics around Webcompat reports and Webcompat data.
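As a rough sketch of what that ingestion step could look like (not the team's actual code; the repository path, index name, and connection details are assumptions, and authentication and rate-limit handling are omitted), issues can be pulled page by page from the GitHub REST API and bulk-indexed into Elasticsearch so Kibana can be pointed at the index:

```python
import requests
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Assumed repository path for the webcompat reports.
url = "https://api.github.com/repos/webcompat/web-bugs/issues?state=all&per_page=100"
issues = []
while url:
    resp = requests.get(url)
    resp.raise_for_status()
    issues.extend(resp.json())
    # Follow GitHub's Link-header pagination until there is no "next" page.
    url = resp.links.get("next", {}).get("url")

# Bulk-index the raw issue JSON so Kibana dashboards can be built on top of it.
helpers.bulk(
    es,
    (
        {"_index": "webcompat-issues", "_id": issue["number"], "_source": issue}
        for issue in issues
    ),
)
```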
07:01
That's because usually everything used to live on GitHub, issues were in a silo, and we didn't have any metadata or any analytics around them. So after that, and after we figured out what we needed to achieve and what the data look like, we did some research.
07:21
And we figured out there's a big ecosystem around automatic machine learning tools. Two of the most popular ones are Ludwig from Uber, which got a bit of hype because it's Uber, a big company, and it's using TensorFlow and builds up deep networks, deep learning models,
07:40
based on just a CSV. And the other one is automl-gs, which does pretty much the same in a more scrappy way. And we figured out that even with basic tooling and basic machine learning models, we had some decent results. And even the decent results that we got, the basic accuracy that we got, was probably good enough.
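For illustration, the kind of flat CSV such AutoML tools consume could look like the sketch below; the column names and example rows are hypothetical, but the idea is one row per report, with the free-text fields plus a target label derived from triaging:

```python
import pandas as pd

# Hypothetical flattened export: one row per report, free-text fields plus the
# triage outcome used as the target label.
issues = pd.DataFrame(
    [
        {"title": "example.com - site is not usable",
         "body": "Steps to reproduce: open the page in Firefox ...",
         "label": "valid"},
        {"title": "I have a virus",
         "body": "please help",
         "label": "invalid"},
    ]
)
issues.to_csv("webcompat_issues.csv", index=False)
# Tools like Ludwig or automl-gs can then train a baseline classifier directly
# from this CSV (exact CLI flags and API arguments vary between versions).
```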
08:06
So yeah, we needed to figure out what our data is. One of the most important things that we dealt with was actually coming up with a proper data set that actually fit the purpose of our project
08:21
and our problem statement and actually would help us. And we figured out that, by default, even if we had tons of data structured in a way that is readable by the tooling around the ecosystem, and even though it felt like we had something valuable, we ended up having nothing in the first place because it didn't work.
08:43
We felt that there was some correlation, but things didn't work. And the weirdest example is that, given the data set that we had, we came up with a model that always provided the right results, which is suspicious by default. And we figured out that the data were kind of lying, and they were biased.
09:00
So one of the most important things is that we worked closely with the teams that provide the data and the teams that do the actual triaging and all the day-to-day work, just to figure out how to deal with the problem. And after a couple of failures, and after a couple of attempts that were very biased,
09:23
very off, and very suspicious in their results, we figured out that the value of the data was not in the actual data set, but in the process and the events around the data. So apparently, what did the trick in our case was going through all the historic data, finding all the events,
09:41
and seeing how triaging translates into actions from the users. And based on that, we defined what a good or a bad report is, and we built the data set from that.
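A minimal sketch of that labeling idea, assuming a hypothetical event schema and placeholder milestone names (the real mapping of triage actions to good and bad reports came from working with the triage team), could look like this:

```python
import pandas as pd

# Hypothetical event log: one row per GitHub issue event (labeled, milestoned,
# closed, assigned, ...), pulled from the issues' timelines.
events = pd.read_json("webcompat_issue_events.json")

def label_from_events(issue_events: pd.DataFrame) -> str:
    """Derive a training label from what triagers actually did.

    The milestone names below are placeholders, not the project's real ones.
    """
    milestones = set(issue_events.loc[issue_events["event"] == "milestoned", "milestone"])
    if "needsdiagnosis" in milestones:          # triagers judged it worth diagnosing
        return "valid"
    if milestones & {"invalid", "duplicate"}:   # triagers judged it not actionable
        return "invalid"
    return "unknown"                            # not enough signal; drop from the data set

labels = events.groupby("issue_number").apply(label_from_events)
```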
10:03
And after that, we came to the conclusion that we have a decent data set and we have a model that works, but we need something that works in production. And we need something that's not just a script based on automatic tooling that we barely know how it works; we had some indications that it worked well,
10:20
but not really. And yeah, we tried to see what the ecosystem looked like. And we came up with this kind of basic tooling: apparently Python is leading in this world of machine learning, and pandas is very good for data frames
10:43
and handling data like that, and scikit-learn, even though it's researchy and more educational, did the trick. And by taking a look at the actual tooling around the machine learning ecosystem, we came up with this basic set.
11:01
And we figured out that things are very simple. And actually, even though we were expecting a complicated code base and some external resources to tell us if things are fine, we actually came up with something very basic. All we did was have a proper basic dataset,
11:22
we feed it to our model, and it provides some good results. And even under the hood, it's not that complicated. What we're pretty much doing is a very dumb approach: we just concatenate all the free-text data that we have, we tokenize everything, and we pass it to a well-known off-the-shelf classifier,
11:43
which is XGBoost, which does gradient boosting, and it actually provided amazing results.
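A minimal sketch of that approach with scikit-learn and XGBoost, assuming hypothetical column names for the exported reports, might look like this:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Hypothetical columns: the free-text fields get concatenated into one string,
# and the triage outcome ("valid"/"invalid") is the target.
data = pd.read_csv("webcompat_issues.csv")
text = data["title"].fillna("") + " " + data["body"].fillna("")
y = (data["label"] == "valid").astype(int)

X_train, X_test, y_train, y_test = train_test_split(text, y, test_size=0.2, random_state=42)

model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000)),  # tokenize and weight the free text
    ("clf", XGBClassifier(n_estimators=200)),        # off-the-shelf gradient boosting
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

The classification report at the end prints accuracy, precision and recall per class, which is roughly the kind of basic metric discussion that follows.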
12:00
But even after that, we were kind of challenging the whole idea of metrics and what success looks like. And metrics came into the game, and we tried to figure out what kind of metrics we need to have in order to make sure that we're doing well. So there are many things that you probably know
12:21
as a machine learning developer and as a data scientist, and might be very confusing and very complicated, but in the end, what matters is having a basic understanding of the different metrics, making sure that you know what the dataset
12:40
and what the results look like, and being consistent with the things that you track. So yeah, understand the problem you're solving and what the metrics mean for it. And we came up with these results, and everyone was very happy. So with basic tooling and with basic stuff,
13:06
we got 90% accuracy. And that's not bad, right? So our current stack looks like this: it's mostly Python-based. We have a project called webcompatml, which is the Python package for all the machine learning
13:21
stuff that we're doing. It's based on XGBoost. We release Docker images for automation, and we use GitHub events to orchestrate the whole flow. And this is our pipeline. So how things work is, every time we have a new GitHub issue,
13:41
every time a user reports something, we trigger the automation, we send the payload to a simple HTTP endpoint. Then from that, we spin off a machine learning task based on Docker. This provides some results that we feed back to our data storage. We have some analytics. And then if we pass a basic threshold and have basic confidence
14:03
that the results we have are good, we just post back to GitHub and either close the issue, or write a comment, or just add a label to say that this issue doesn't look really good or really valuable, or that this issue is very good and we should go for it.
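A minimal sketch of that "post back to GitHub" step, assuming the pipeline from the earlier sketch is serialized with joblib, and with placeholder label names, thresholds, and repository path (the real setup spins off a Dockerized task and writes results to the data storage first), might look like this:

```python
import os

import joblib
import requests
from flask import Flask, request

app = Flask(__name__)
GITHUB_API = "https://api.github.com/repos/webcompat/web-bugs"  # assumed repo path
TOKEN = os.environ["GITHUB_TOKEN"]
MODEL = joblib.load("webcompat_model.joblib")  # e.g. the TF-IDF + XGBoost pipeline

def confidence_label(p_invalid: float) -> str:
    # Map the predicted probability to coarse confidence buckets for triagers.
    # Thresholds and label names here are placeholders.
    if p_invalid >= 0.99:
        return "ml-invalid-confidence-very-high"
    if p_invalid >= 0.90:
        return "ml-invalid-confidence-high"
    return "ml-invalid-confidence-low"

@app.route("/events", methods=["POST"])
def handle_new_issue():
    issue = request.get_json()["issue"]
    text = issue["title"] + " " + (issue.get("body") or "")
    # With the earlier pipeline, class 1 is "valid", so column 0 is P(invalid).
    p_invalid = MODEL.predict_proba([text])[0][0]
    requests.post(
        f"{GITHUB_API}/issues/{issue['number']}/labels",
        headers={"Authorization": f"token {TOKEN}"},
        json={"labels": [confidence_label(p_invalid)]},
        timeout=10,
    )
    return "", 204
```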
14:21
And I think the most important thing that, at least my takeaway message from all this, at least what I'm trying to convey here, is that the ecosystem is very, very big. In open source right now,
14:40
and machine learning is becoming a commodity. So it's not the case, as in the past, that you needed extensive knowledge of very deep research and very highly skilled people to do this work. Pretty much, the tooling that we have right now and the open source ecosystem around machine learning
15:00
and data science is very, very approachable. So what I'm trying to say here is that if you have a basic problem statement that you understand and you have data to back up this problem statement, then quick hacks can bring a lot of value. And in our case, a quick hack like that, which is pretty much a very, very basic NLP
15:24
machine learning model, provided 90% accuracy of all the input that we have, which means that for, I don't know, 50K reports, we can have signals that 10% is okay and 90%, we can skip it, which in the end,
15:44
brings up more opportunities for the project, because if you have a way to get the input and to throttle the input to the pipeline that does the actual work,
16:00
it brings up a lot of opportunities like opening it up to more people. Right now, we're targeting a specific Firefox release. What if we target all the Firefox releases or what if we target people outside Firefox to contribute to the pipeline? So yeah, quick hacks bring a lot of value. We saw this in real life in our project.
16:23
We have results that show that what we did actually saves time and effort for the people doing the manual work, without invading their pipeline. And yeah, it was a quick, contained, easy experiment
16:41
that turned out with good results. So yeah, my moment of wisdom here is that I highly encourage people to try this type of experiment, at least to get comfortable with the tooling and the idea of introducing machine learning to your project, especially in the open source world where we have open processes, we have issue management in the open,
17:03
we have user feedback, especially in big projects that cannot really easily be triaged by a few people. And yeah, by introducing that, you can get good results. And also, the world has a lot of buzz around it
17:21
and there are companies and projects that are very into machine learning and try to promote deep learning and more complicated stuff, but in the end, even basic tooling works. We tried a support vector machine, which is the most basic notion of a classifier for this type of problem,
17:40
and it actually provided amazing results. So try basic tooling, try XGBoost. It's like the industry standard for these kinds of problems, and see the results. And in the end, this is all you need. This is the most useful thing I've seen in this machine learning journey that we had.
18:00
It pretty much guides you through the problems that you need to solve and what kind of tooling you can use around. So yeah, I'm highly encouraging you to introduce machine learning to your project, see how things work, try to be a little bit more innovative without breaking your whole workflow
18:20
and validate the results. And that's it. Thank you so much. Questions now.
18:41
Don't be shy. Yes, I'm gonna run with the mic. Christos, I will need you to help me. Hi, thanks for the talk. So what's your expectation of issues that are incorrectly classified? Sorry, can you repeat it? Valid issues that are incorrectly classified
19:02
as spam or not of good quality. You expect them to be re-raised by people or how do you handle that? Can you repeat that? Yeah, so for issues that are incorrectly classified, what's your expectation around that? How do you handle those?
19:21
Do you expect the users to re-raise them, or? Yeah, so the people doing the manual triaging know about the process and they try to flag things that are not correct. So every time we retrain the model, the false positives are fed back into the training data. So we just keep this as part of the workflow.
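A tiny sketch of what that feedback loop could amount to, assuming triager corrections are exported to a hypothetical CSV with the same columns as the training data:

```python
import pandas as pd

# Hypothetical retraining step: reports that triagers flagged as misclassified
# are appended to the training data with their corrected labels before refitting.
train = pd.read_csv("webcompat_issues.csv")
flagged = pd.read_csv("triager_corrections.csv")  # same columns, corrected "label"
train = pd.concat([train, flagged], ignore_index=True).drop_duplicates()
train.to_csv("webcompat_issues.csv", index=False)
# ...then rerun the training pipeline sketched earlier on the updated CSV.
```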
19:49
And someone was here, I forget. Okay, yeah, upper. I was missing some steps today, don't worry, okay. Hello, thank you for your talk.
20:01
You said you used the feedback to improve your model. Do you have any kind of statistics on how that has influenced the quality of the model over time? We don't have any statistics right now because everything is very new. So it started being in production in late November.
20:20
So we don't have a lot of statistics or metrics around how it improved over time. But yeah, right now, all we know is that the people doing the manual work are also tasked with giving back feedback about how this process works, and if we have a false positive,
20:41
just let us know and we're gonna train everything again. But yeah, no metrics. Can I ask one more? Did you get feedback from the people that work with your system, that work with those predictions, whether it has affected their productivity? Yeah, so it was very interesting because one of the things that we identified early on
21:01
is that working in a silo and working on your own is very bad in this kind of work, because we built a basic model that we were very happy provided results. And then we started running it in a more experimental way on the actual workload, and people didn't really like it because they didn't trust it. So we asked for feedback
21:20
from the people doing the manual triaging and one of the most important things that came up is that it looks okay, it gives results, but we don't trust it, like what is it doing? So part of it was writing documentation, writing some examples of how it works, understanding the metrics, like giving some sort of TLDR
21:40
about the metrics in machine learning, and also what kind of classifiers we use and what a confidence threshold is, and after that, people were more familiar with it and they were very happy. Like, triagers say that they are happy that there is something cleaning up the pipeline so they can focus on more important stuff. So yeah, also something that relates
22:01
to the previous question is that we found the confidence threshold very important in this project. So given the accuracy that we have, and given that the metrics are high for the class that we care about, which is the bugs that don't need diagnosis,
22:24
the accuracy is so high that the confidence threshold can be very high. And if you tell people, you know what, there's like a 60% chance that the result is bad, they don't like it, but if you say that the model says it's like 99% correct,
22:40
then people trust it. So we introduced the idea of classifications and labels with low, high, and very high confidence. Yes. It's gonna be so hard to pass the mic. Yes, nice presentation.
23:01
Regarding the data points that you showed, I noticed the title of the issue was there as well. Sorry? The title of the issue, so it was something written in natural language. How did you handle that? Like using NLP methods or something? And the next one: on the internet there are different languages, how did you handle that?
23:21
Or did you just limit it to English? What about multiple languages? So for titles of reports that are in other languages, did you filter those out, or did you just focus on English? First of all, most of the content we have is in English, so this wasn't an issue so far. About cleaning up the content,
23:40
we used NLP methods to tokenize: we used TF-IDF and count vectorizers and they gave good results. One of the things that we might try in the future is introducing some sort of more highly performant NLP library like spaCy. But even with the stuff from scikit-learn
24:01
and even with the basic text extraction methods, the results are very good, so my approach is to not complicate things. TF-IDF, yes. I want to address the second part, about languages. The tool he presented, webcompat.org, provides you with suggestions when you enter a bug,
24:22
even if it's a website in a different language. It will suggest things like: site is broken, mobile version doesn't work, there are glitches, so it's easy to just note that. Even if you add a description in another language, we can identify the issue from the suggestion you chose when you reported it. Also, part of the general improvement in this project,
24:43
one of the things that my team also did, which is not in that scope, is trying to improve the reporting process. So right now we have a free-form report that is written in Markdown and has some data, and what we're trying to move toward is having structured fields. So for example, what's the browser,
25:01
or what's the description and what are the steps, because right now it's completely free text. Any other questions? Thank you so much.