Linespots: Predicting Bugs in your Code
Formal Metadata
Title: Linespots: Predicting Bugs in your Code
Title of Series: EuroPython 2017 (Part 61 of 160)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/33763 (DOI)
Transcript: English (auto-generated)
00:04
Thank you everyone. So, as a start, I would like to ask you what your feelings are if you see a picture like this. Any positive or negative feelings? Fear. Okay, so this is what a pull request, or in this case, just a diff from a pull request in GitHub looks like.
00:26
And this probably means you are reviewing it, which is hard for a lot of people. So, what would you say if I told you there was something that could turn this confusing, look-everywhere view into something more like this, and give you a heat map of problem areas,
00:46
where you should look, where there's probably a bug. And what if I told you that there was a thing that would predict 50% of the bugs in your project
01:01
by just looking at 0.5% of the lines of code in your project. Does that sound good, or helpful? I see some thumbs up. So, that's what Linespots, hopefully, can do, ta-da. First, who knows what a Git commit is?
01:21
Who knows what a hunk is, or who doesn't know what a hunk is in a commit? Some hands, okay. So, just to be able to explain the algorithm later: the blue colored lines are the metadata, which tell you who wrote the commit and when, and give you a hash. And a hunk is then what's marked as the big hunk.
01:44
That's just one part of the code that's close enough together that Git decided to make it into one unit. And that's going to be used in the algorithm, so I just wanted to explain that.
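For reference, here is a minimal, made-up `git show` excerpt (an illustration, not the slide from the talk): the header lines are the commit metadata, including the hash, and everything from the `@@` marker onwards is one hunk.

```
commit 3f2a1bc97e4d5a6b8c0f1e2d3a4b5c6d7e8f9a0b
Author: Jane Doe <jane@example.com>
Date:   Mon Jul 10 14:00:00 2017 +0200

    Fixes: https://github.com/example/project/issues/42

@@ -10,2 +10,3 @@ def parse(line):
     tokens = line.split()
-    return tokens[1]
+    if len(tokens) > 1:
+        return tokens[1]
```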
02:01
So, this is how Linespots works. The idea is you take a project, and it has to be a Git project at this point, but generally you could use any version control system, I guess. You iterate over all commits, and you have to decide for each commit whether it is a bug fix or not, and this is not trivial; I'm going to talk about that later.
02:25
But let's say we somehow know if it is a bug fix or not. If it's not a bug fix, we just keep track of all the movement of what happened in the commit, and that's it. But if it was a bug fix, we not only track the changes,
02:40
but we also update the scores, because the idea is that if there was a bug fix somewhere, that means there was a bug. And if there was a bug somewhere, that means that area is probably complicated, or someone wrote it while they were tired, or it's just a place where bugs will obviously occur.
03:00
And research has shown that if an area had a bug in the past, the probability is increased that there will be more bugs in the future. So this is the whole basis of the algorithm. For my study, I just used a simple keyword finder for the commit messages to decide if a commit was a bug fix or not.
03:21
So, for example, at coala we use "Fixes:" followed by the issue link, and I think at GNOME it's something like "Bug:" followed by the issue link, stuff like that. So that's individual to each project, and it felt like it kind of worked.
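A minimal sketch of such a keyword finder, assuming commit messages arrive as plain strings; the patterns below are illustrative, not the thesis code:

```python
import re

# Project-specific conventions mentioned in the talk: "Fixes: <issue link>"
# at coala, something like "Bug: <issue link>" at GNOME.
FIX_PATTERNS = [
    re.compile(r"\bfixes:?\s+\S+", re.IGNORECASE),
    re.compile(r"\bbug:?\s+\S+", re.IGNORECASE),
]

def is_bugfix(commit_message: str) -> bool:
    """Crude keyword classifier: does this message look like a bug fix?"""
    return any(pattern.search(commit_message) for pattern in FIX_PATTERNS)

print(is_bugfix("Fixes: https://github.com/example/project/issues/42"))  # True
print(is_bugfix("Add a shiny new feature"))                               # False
```

So this is how the scoring works.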
03:43
The idea is that every line in your project starts with a score of zero. Every time there's a bug that is connected to the hunk that line is in, we increase the score. Ideally we would increase the score every time we wrote a bug,
04:00
but there's no way of knowing when I wrote a bug; it's only implicitly known once it's solved later. So in this case, there's a red line that is removed, so we also drop its score, and there are two green lines that are added, and those scores are increased. The first two lines are not changed in any way, so we just give them a flat increase,
04:24
and the bottom lines that are added use something like the small calculation at the bottom; it's just the average of the hunk. If you're very interested in the ins and outs of how the algorithm works, you can come to me later, but I don't think that's very interesting for the whole group,
04:43
so I'm not going to go too much into detail on that. The whole thing is then weighted by age, so newer commits change the score more strongly than older commits. That's just to make sure that if you fixed bugs in the past and there are no new bugs in that area,
05:00
we kind of decide that that area probably is fixed for now, and we shouldn't keep looking at it. This is what the result of the algorithm looks like: you get a list of line scores and the corresponding lines and files, and this can then be used to, for example, create a heat map of your project, or do whatever you think helps your project.
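Putting that together, here is a minimal sketch of the per-hunk score update as described; the function names, the exponential decay constant, and the 0.25 base increase (the number from the worked example in the Q&A at the end) are assumptions, not the thesis implementation:

```python
import math

def age_weight(age_in_commits: int, decay: float = 0.01) -> float:
    """Newer commits move scores more than older ones; the exact decay
    constant is an assumption, the talk only says the weighting is
    exponential in age."""
    return math.exp(-decay * age_in_commits)

def apply_fix_hunk(old_scores, removed, added_count, weight, base_increase=0.25):
    """Update the line scores of one hunk touched by a bug-fix commit.

    old_scores  -- scores of the hunk's lines before the fix
    removed     -- indices into old_scores of lines the fix deleted
    added_count -- number of new lines the fix added
    """
    increase = base_increase * weight
    hunk_avg = sum(old_scores) / len(old_scores)   # average of the old hunk
    kept = [s + increase for i, s in enumerate(old_scores) if i not in removed]
    new = [hunk_avg + increase] * added_count      # new lines start at the average
    return kept + new

# The worked example from the Q&A at the end of this talk: old scores
# 0.3, 0.5, 0.5; the third line is removed and two new lines are added.
print(apply_fix_hunk([0.3, 0.5, 0.5], removed=[2], added_count=2, weight=1.0))
# [0.55, 0.75, 0.6833..., 0.6833...]  (0.43 + 0.25 ≈ 0.68 for the new lines)
```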
05:26
Those scores are not absolute; they're relative to all the other scores in your project. So, for example, you could use them to look at the highest-scoring 10% of lines of code or something.
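Selecting that top slice could look like this; the (file, line, score) record format is an assumption about the output just described:

```python
def top_fraction(line_scores, fraction=0.10):
    """Pick the highest-scoring fraction of lines; only the relative
    order of the scores matters, not their absolute values."""
    ranked = sorted(line_scores, key=lambda rec: rec[2], reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]

scores = [("a.py", 1, 0.9), ("a.py", 2, 0.1), ("b.py", 7, 0.5)]
print(top_fraction(scores, fraction=0.34))  # [('a.py', 1, 0.9)]
```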
05:41
The interesting part for me was figuring out how to evaluate whether this works or not. Ideally, I would want a lot of data: have one group use this, have a control group that doesn't, and then have hundreds and hundreds of people reviewing stuff and see if it works or not,
06:02
and that wasn't really feasible for my bachelor thesis. So what I did was create something I call the pseudo-future. This is what the history of a Git project looks like: you have an init commit where you started your project, and at some later point there's your head commit, which is your now state.
06:22
You can use this to go back in time and decide: I'm just going to go back, let's say, 100 commits, and call this now. Everything after those 100 commits is now a future you don't know anything about, because in this case, at the middle point, I have no idea what's coming after that,
06:42
so I can use the following commits as a future that I have in my project and can check against, but that can't influence my algorithm. And what I did then was take all the lines that were proposed by my algorithm
07:02
and see if in those lines there would be bug fixes in the future. And what I used for this, and I wasn't really sure if this is the best thing to use, but it kind of sounded sane, is called hit density, and it just shows the ratio between the number of bugs that occurred in the lines I checked
07:27
and the lines of code that I used to find those bugs. So if this number is higher, it just means that I found more bugs per line of code I looked at, and I think that's what efficient review looks like.
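The two evaluation ingredients just described fit in a few lines; this is a sketch assuming commits are ordered oldest to newest, and the names are illustrative rather than the thesis API:

```python
def pseudo_future(commits, lookback=100):
    """Split history into a visible 'now' prefix and a pseudo-future
    suffix used only to check the algorithm's predictions."""
    return commits[:-lookback], commits[-lookback:]

def hit_density(bugs_hit, loc_inspected):
    """Bugs found per line of code inspected: higher means the
    proposed lines make for a more efficient review."""
    return bugs_hit / loc_inspected

# E.g. finding 40 future bug fixes within 2,000 inspected lines:
print(hit_density(40, 2000))  # 0.02 bugs per inspected line
```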
07:41
I want to find as many bugs while looking at as few lines as possible. So when you plot this, it kind of looks like this. The blue line is the proportion of bugs I've hit, so it goes between 0 and 1, which is then 100%. So in this case, I arrived somewhere around, let's say, 19% of all the bugs that happened in the future.
08:07
And the red line is the hit density, which is the really interesting thing. So you can see that the hit density spikes very early, meaning that around 0.0001 or something like that,
08:21
we have the best ratio of bugs found to lines looked at. And this was very consistent across a lot of projects; I'm going to show you a table with some numbers later. And it means that we can use this algorithm while looking only at a very small number of lines of code.
08:45
We don't have to look at half the project to find everything; we only have to look at a very, very small number of lines of code, which is good, because that's what code review usually consists of. So this is the comparison with the Bugspots algorithm.
09:00
This whole thing I did was based on Google's Bugspots algorithm, which does the same thing I did but on a file basis, so you don't get a score for each line but for each file. And if you compare both graphs, the Bugspots algorithm finds more bugs, but that's just because it proposes whole files.
09:21
And it assumes knowledge about parts of the code that it actually knows nothing about. And the Linespots results on the right side spike much earlier and reach their plateau much earlier. For using this, you can also see the scale on the right side.
09:41
For Bugspots, we have a ratio of 16% of bugs found per percent of lines of code, and for Linespots, we have 2,000% of bugs found per percent of lines of code, which is pretty awesome. There's another picture which is completely zoomed out,
10:03
and you can't really see how steeply it rises, so this is just to show that overall, Bugspots seems to find more bugs, but at the cost of you having to look at something like 20% of your project's code, which is probably not something you want to do.
10:22
So these are some projects I tested it with, and the green color is just an indicator of which ranking algorithm worked best. What this shows is that I looked at 0.5% of lines of code, because that was kind of a sweet spot.
10:41
Under that, there are just the parameters I ran the algorithm with, and then there are some projects: you probably all know Atom, you probably all know GIMP and GTK, and Nautilus is the file manager for GNOME. And I ran a few algorithms over them and tried to find out which one works best.
11:01
So what the numbers show you, for example for BERT, because it's the best case that I encountered in my whole thesis, was that by looking at 0.5% of lines of code, I found almost 57% of the bugs that would occur in the next 150 commits.
11:20
So by just looking at a very small portion of the project, I could find almost all, or at least three-quarters, of the bugs that would occur in the near future, which sounds pretty cool. So, what can be improved about the project? And this is where it becomes really interesting,
11:41
because although it kind of sounds awesome to find bugs in your code, so far we haven't really, or I haven't really, figured out how this can be used in real life, because everyone I've talked to about this was like: this is really awesome, this sounds super cool, but no one had an idea how it could help someone,
12:00
although everyone thinks it's awesome. So first, some academic stuff: what can be improved. First, the decision whether a commit is a fix or not is a very vague thing, and there's the idea of using machine learning to increase the accuracy of that; ideally you would reach an accuracy of 100%
12:24
for deciding if something is a bug fix or not, because that's what the whole thing is based on. If these decisions are not made precisely, then everything after them becomes imprecise. The next thing, which a lot of people suggest,
12:44
is tracking logical chunks, so something like functions, or logical blocks like if statements and loops, because if there's a bug somewhere in a function, that probably is a sign of that function being complicated, so you'd better take a look at the whole function.
13:01
This was just too much for the nine weeks I had for my thesis, but generally I really want to try this, although it gets very complicated and has to be implemented for each language separately. So far the algorithm is completely language agnostic and runs on anything you have that is text-based.
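For Python specifically, mapping lines to their enclosing functions is easy with the standard library, which hints at what per-function tracking could look like; this is a sketch of the suggested extension, not something the thesis implements:

```python
import ast

def function_spans(source: str) -> dict:
    """Map each function in a Python source file to its line range,
    so per-line scores could be aggregated per function."""
    return {
        node.name: (node.lineno, node.end_lineno)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

print(function_spans("def f():\n    return 1\n\ndef g():\n    pass\n"))
# {'f': (1, 2), 'g': (4, 5)}
```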
13:22
Finally, we can apply some machine learning voodoo, because at the moment you can solve every problem by throwing machine learning and big data at it. One part would be the is-it-a-fix-or-not decision, but generally, I have received suggestions to use machine learning in every area of this algorithm.
13:43
So if you have cool ideas on how you could use machine learning to make this better, please tell me. There's a list somewhere that starts with machine learning voodoo and then contains a whole lot of ideas. And finally, I would really like to test this with people.
14:03
I would really like to get projects to work with me to find a way to make this thing useful in real life, because I think it could be helpful with code review, which is a thing we at coala struggle with, because we just have so much to do.
14:23
And if you would be interested in somehow collaborating with me, or just in me running the algorithm over your project and then seeing if it somehow helps or not, come forward and talk to me here or at the coala stand. Thank you.
14:45
Thanks Maximilian for a nice talk. We can take some questions. Thank you for the talk. Is it only you who can run this algorithm or is it open sourced or something?
15:03
My complete thesis is open source, so you can use the first implementation I wrote. I'm currently rewriting it in a much nicer way, but the algorithm is open source, and the thesis and all the data I gathered are also open source.
15:25
Can you talk again a little more about why you're looking only at the bug fix sections rather than like feature commits and why the bug fix sections would be more prone to new bugs?
15:41
Yes, so that is just based on the research I based my thesis on, because they found the connection between bug fixes occurring and there being more bugs in the future. I'm pretty sure there is a lot more science telling you where there might be problems. Yes, new features probably are prone to have bugs, because they're just newly implemented or not yet tested.
16:03
So that would be another thing to look for. But, like, I only had nine weeks for my thesis, so there were just time constraints.
16:21
Thanks very much. It was really interesting. So first, about deciding whether a commit is a bug fix or not: well, probably if all the commits are linked to some issue tracker, and issues are labeled as bugs or not, that could solve the problem entirely, right?
16:42
Yes. And the other question I had: can you tell what kind of bugs you are looking at? I mean, purely logical bugs, or also bugs like misunderstood requirements, stuff like that?
17:04
So there are really many different kinds of bugs, right? So I guess the algorithm doesn't care what type of bugs you feed it. And the research this is based on didn't go into detail
17:24
on whether there is a difference between, let's say, typos and other bugs; some people argue that typos are bugs and some people don't. I guess it probably doesn't matter, because what the research I based this on
17:41
told me was that if there is a bug, there is some reason for that bug to be there, and that also raises the probability that there are just more bugs there. So we, for example, try to label our issues as bugs and non-bugs based on what kind of bugs we want to find, because we kind of use this.
18:03
So for example, we don't label typos as bugs, because we are not interested in having the typo bugs in this algorithm's data set. But if that were something you were trying to look for, then I guess labeling them as bugs would also work.
18:20
Did that answer the question? Hi, thank you, Max. Just wondering: if you can locate high-risk lines of code, can you also suggest fixes? No, that would be something that maybe, I don't know, you could then feed into some static code analysis thing,
18:44
but this is just a probability that there is a bug or not. This is not saying there is a bug; it's just saying the probability of there being a bug in this line is higher than in the next one, for example. So no fixes, sadly.
19:03
Thank you for your talk. I was wondering, are you planning on making things available, like, for instance, when you push a pull request on GitHub, so you would have a nice tool that highlights the lines quite nicely so reviewers can look at them? There are the folks from GitLab running around here somewhere, thinking about using this.
19:22
So far I haven't built anything in that direction, but I think it could be cool, although all the people I have talked to so far weren't sure if they would use it. So the idea sounds awesome, and I agree, but all the people with money that I tried to get information out of,
19:42
about whether this could be useful somehow, weren't really into the put-this-into-GitHub-and-make-your-pull-request-colorful thing. More questions?
20:03
Thank you. There is one thing: usually when you fix bugs, you sometimes write tests. As I understand how it works, most probably, as the code added in the tests will be in a bug-fix commit,
20:20
it will just increase the score, the probability of having a bug there. Maybe you write bad tests, but probably those aren't bugs; the tests are just there to ensure that the bug you fixed doesn't happen anymore. So is there a way to exclude some parts of the code? Tests are the most obvious one,
20:40
but maybe there are other kinds of things like this? The implementation that exists is too simple for stuff like that, but I guess it would be rather easy to have an ignore list that just looks for test files, or maybe documentation and such, and just doesn't track them and doesn't offer them
21:00
in the lists of stuff you should look at.
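Such an ignore list might be as simple as a few glob patterns; this is a sketch, and the concrete patterns are assumptions, since the talk only suggests skipping tests and documentation:

```python
from fnmatch import fnmatch

# Hypothetical patterns for files the scorer should skip entirely.
IGNORE_GLOBS = ["test_*.py", "tests/*", "docs/*", "*.md"]

def is_tracked(path: str) -> bool:
    """Return False for paths that should never show up in the results."""
    return not any(fnmatch(path, pattern) for pattern in IGNORE_GLOBS)

print(is_tracked("linespots/score.py"))   # True
print(is_tracked("tests/test_score.py"))  # False
```

So when you, say, make a bug fix, then the score around there goes up;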
21:22
is there any way for the score to go down afterwards? Or, if you've fixed a bug a few times and the code is probably actually fixed now, will the score go down, or will it stay up? So that's what the weighting function is for. I didn't show it because it's just an exponential function,
21:43
but what it does: this isn't run incrementally; it's just run once and you get one set of scores, and then for the next commit you would have to run it again. So, for example, if you analyze the last 500 commits, then the 500th commit in the past
22:02
weighs less than the newest ones. So if there was a bug fix 500 commits ago and there are no bug fixes after it, the score is reduced by the weighting function. Thank you.
22:25
So with this exponential decay over time, how much do you just pick up the most active areas of the code, rather than the most error-prone areas of the code? If you analyzed the CPython project, wouldn't you just have all the asyncio stuff light up,
22:43
and all the POSIX libraries converge to zero? If there are a lot of bug fixes in that area, probably, because that means there are a lot of bugs in that area. I mean, because it's time-dependent, so it's about how recent the changes are, right?
23:00
Yeah, that's kind of a problem. Generally, this weighting by time is not ideal, because time is not a very good measure of progress: there are projects where a lot of commits happen in a short duration, and there are projects with weeks or months of no commits. So even though you could have one commit two weeks ago
23:21
and the next commit today, and in the sense of the project they are not very far apart, they would be weighted very differently because time is used to weigh them. So I thought about using the number of commits to weigh things, but then again, the number of commits is not a very good measure either,
23:40
because some projects use small commits and some projects use large ones, so that's also a thing where there is no ideal solution so far. We still have time for some questions.
24:06
So you're assigning each line some kind of weight or score. I didn't get exactly how some lines were scored higher than the others.
24:20
Some were 0.3, some were 0.5. Oh yes, you mean the example scoring, this one. Okay, so to go through it in a little bit more detail. Can you hear me without the mic?
24:42
Okay, so here are the old scores, which are 0.3 for the first line and 0.5 for the following two, and here are the new scores. The way the new scores are calculated is that, at the top, it says the score increase is 0.25, which is just a random number I chose.
25:02
This would be decided by where we are on the weighting function, so by how far back in time we are. The first two lines are not changed, so they just get the 0.25 increase flat. The third line is removed, so its score is also removed, and the bottom ones are added as new lines,
25:23
so what is done is I calculate the average score of the complete hunk, which is done at the bottom and is 0.43. That is then used as the old score and increased by 0.25 to get the new score.
25:42
Does that answer your question? Okay so let's all thank Maximilian again for his talk.