Finding bugs for free: The magic of static analysis.
Formal Metadata

Title: Finding bugs for free: The magic of static analysis
Number of Parts: 160
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/33756 (DOI)
Transcript: English (auto-generated)
00:06
Good morning, can everyone hear me? Wave at the back, great. Okay. I'll just introduce myself and the company. My name is Mark Shannon, I've been playing around with Python for over
00:20
a decade, and most of that I spent either working out how to optimise it or how to break it as a side effect of those two things. I authored PEP 412 and I was the BDFL-delegate for PEP 484, I think those are the right numbers, you'll have to Google
00:42
them to check for me. Anyway, before we get into the depth of the stuff, and while people are still arriving, we shall have a mini space opera. So this is the Curiosity lander. In February 2012 it was in flight to Mars, and NASA found
01:04
a bug in the lander software. Now, the lander software is written in C, and in C you can pass arrays around, but the C language doesn't bother passing the size around, so it's very easy to pass an array of one size to a function expecting an array of another size. So NASA found
01:25
one of these cases in testing, and they asked us to investigate, to see if we could find any more cases of this. So we were able to write a nine-line query, and we
01:43
were able to find another 30 instances of this. We don't know exactly what NASA did, because all their software is secret and closed source and sort of military grade, but we presume they fixed them, and we do know that Curiosity landed safely. So NASA is quite keen on static analysis, but NASA is somewhat risk
02:04
averse. I want to try and convince you that code analysis is something you can use even if you are less risk-averse than NASA; it's still worth your effort. I want to convince you that in the long run it can save you time
02:20
and money. The little bit of effort you need to put into using it will pay off. Now, obviously NASA's code was written in C, but we support a bunch of languages at LGTM, and the one I believe you're probably most interested in is Python. So what is code analysis?
02:44
Code analysis is finding out facts about your program. Basically anything that you might think is interesting. So that could be metrics: you could look at the complexity of the code, look for hot spots of complexity,
03:04
all sorts of things. But one particularly interesting set of facts is bugs, or likely bugs. So I would like to contrast code analysis with a couple of quality assurance things you probably already do. The first one is testing. I'm going to assume everyone tests their code;
03:25
I'm not going to bother asking, because I'm not going to embarrass anyone. So testing is obviously very important for checking that your code is safe to be released, but testing is very specific to a code base, and you need to write tests for everything you're interested
03:46
in. If you write a new piece of code, you're going to have to write new tests for it. You can't rely on pre-existing tests to find what you want. Whereas with code analysis, you can often rely on pre-existing stuff. Another quality assurance thing you'll
04:03
probably use is code review. Now, code analysis is much more like code review. A human code reviewer is going to look at the diff in your code and attempt to see what issues they can find with it. Hopefully they'll do it in a positive and constructive fashion, but essentially they are looking for flaws in your code. So they might be looking for
04:24
design flaws, but also just looking for smaller-scale errors. Now, code analysis can take away a lot of that work. Code analysis can find those errors, and it's much, much more meticulous than any human could ever be. It can also work out whether your
04:41
changes interact with other pieces of code and double-check the interaction, which is something that a code reviewer could very easily omit. So what does make for good code analysis? If it isn't flexible, accurate and insightful, it's not very useful. Okay, so let's go through these
05:02
one by one. It needs to be flexible. Given the NASA example, that was an error they had not anticipated. Had that been a sort of general-purpose check they already had, that would have been fine, but because they didn't anticipate it, we needed to create a new analysis on the fly, reasonably quickly and reasonably easily. That's an important
05:26
part of analysis. Another one is that it's accurate. After all, imagine you have a watch and whenever you look at it, it's right half the time. Well, what do you do with the watch? You just bin it. What about if it's right 90% of the time? Well, it's kind of useful, but not really that useful,
05:44
but you'd probably keep it if you didn't have any other means of checking the time. But what about a watch that's 99.5, 99.8% accurate? Then yeah, sure, you'd double check if there was a flight to catch, but otherwise you'd pretty much rely on it. And assuming you can rely on something, it just makes your life easier, because you're not
06:04
double checking it or doubting it all the time. So accuracy is very important. And finally, it needs to be insightful. So PEP 8 is all very well and good. We all love PEP 8, but does it really matter if there are 81 characters
06:21
in a line, or these minor little things? But code analysis can find things like a cross-site scripting vulnerability or something like that, and that's really valuable. So obviously there's a huge range of things that we can find, and you want the code analysis to find interesting things. So can we do this for Python? Well,
06:51
it can be tricky compared with a statically typed language. Python doesn't require any type annotations or declarations. Of course, type annotations exist and they are used, but
07:03
there are relatively few of them. Also, because it has a history of being dynamically typed, people tend to just pass values around and then locally check for things: is something None? Does something have a particular attribute? Is it callable? And so on, before they do some operation on it. And we need to understand these sorts of things. Also, people
07:25
do things that are genuinely dynamic in Python, things that a code analysis tool is always going to struggle with, like creating classes from a database schema on the fly. That's pretty difficult to analyze. If you have the database schema to hand, then maybe you can integrate analysis of that, but generally you're not going to
07:44
be able to do that. So in order to keep things accurate, we also need to know what we don't know. Now, I've said flexibility is kind of important, so what is it about our tool, LGTM, that
08:08
makes it flexible? At heart it contains an object-oriented query language. The advantage of a query language is that because it's declarative, you can just say: I'm looking for this sort
08:22
of problem. And that allows you to make fairly brief queries that will find what you want. Given the NASA example, I said there was a nine-line query. I'm not going to bother you with that, because it uses the C libraries, so we'll stick with Python. I will give you a Python example though.
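Before the query itself, it may help to see the runtime shape of the bug it targets: iterating over a value that isn't iterable. This snippet is my own minimal sketch, not code from the talk:

```python
# Iterating over an int fails at runtime; the static query's job is to
# find this without ever running the code.
numbers = 1          # an int, not an iterable
try:
    for n in numbers:
        print(n)
except TypeError as exc:
    message = str(exc)
print(message)  # 'int' object is not iterable
```

At runtime you only hit this error when the line actually executes; the point of the query is to flag it on any path, executed or not.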
08:43
So here's an example query. Basically, what we're looking for here is a for loop where the thing it's iterating over is not an iterable. And this query is pretty short. Obviously at first glance it may not make a lot of sense, so I'll explain how
09:03
it works. There are three clauses, much like any SQL query: we have a from clause, which describes the program elements that we're interested in; a where clause, which relates them; and a select, which just gets us our result. In the from clause we're looking for a for loop, an expression, a class and an AST node. An AST node, at
09:27
this point, let's just say is some point in the program. We're interested in that so that we have some marker we can look at when we see a result, so we know what we're looking at and how to fix it. The key thing is that we're not looking for any old combination of those; we're looking for
09:43
a specific relation between them. The first thing is that the expression is the thing iterated over in the for loop: that's our first line in the where clause, basically saying that the iterable in the for loop is iter. And then the next one is probably the key point. That's basically saying, for that expression, what it
10:03
refers to, or the set of values it could hold. We don't care about the value itself, but we are interested in the class of those values, and the origin, so we know where it came from, for producing results. And the last line says, obviously, that it's not iterable. But you'll note that
10:23
we're not saying that it isn't iterable; we're saying that we don't know that it is iterable, which is why we have the second clause, which says we do know something about it. I have a hand up. Oh sorry, the underscore is a convention meaning "ignore this value". I believe it's an SQL convention as well,
10:46
but I don't really use SQL, so anyone can correct me later if I'm wrong. Yeah, it means that we don't care about that value. And then we select the loop, and the origin, which is usually a useful helper to tell us where the value came
11:03
from, so we can fix the errors. So that's flexibility: you can write these brief queries, and you can write your own queries. Now, what makes it accurate, or precise? It's that refers-to that we had on the previous
11:21
slide; that's basically wrapping up all our analysis in the library. So I'm now going to go through the library. Just check my time. Yeah, that's good. OK. So I'm going to go through some examples, and I'm going to show some example code.
11:45
It's kind of nice to show real code, but there are a couple of reasons not to. Real code is far too big; it won't fit on a slide. And usually what tends to happen is that the cause of an error and the manifestation of the error are not necessarily in the same place, which makes it awkward to follow. And there's
12:02
another, equally important reason, which is that I don't really want to choose some arbitrary piece of code and point at it and say, look, there are bugs in your code. So it's much better that the finger points at myself, and the code is clearly intentionally buggy, as you can probably guess from the name "buggy code" there. So hopefully I won't
12:23
upset anyone. So the first piece of analysis we can do is to parse the source code and produce an abstract syntax tree. An abstract syntax tree is basically a tree that describes
12:42
the structure of the source code. The very simple code on the left produces the abstract syntax tree on the right. Now, assume this is a whole module, not just a snippet of code. In the tree on the right, the top level is the module itself, and that contains two
13:01
statements: an assignment and a for loop. The assignment is broken down into the target of the assignment, which is the left-hand side, which is numbers. Note that numbers is just a name, rather than a string: so it's a Name node containing "numbers". And then the value is just the value one. Now, this is an abstract syntax tree, rather than what's
13:24
otherwise called a concrete parse tree. A concrete parse tree would contain things like the parentheses around the one, which we've omitted from the abstract syntax tree because they don't affect the meaning; they're just extra syntax. Likewise there's no actual marker for the for or the in tokens. They're just omitted.
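You can poke at the same tree yourself with Python's standard `ast` module (a rough stand-in for the tool's own representation, which is not the stdlib one):

```python
import ast

# Parse the example module and inspect the tree. The parentheses in
# "(1)" leave no trace: the abstract tree records only the constant 1,
# and the for/in keywords get no nodes of their own.
source = "numbers = (1)\nfor n in numbers:\n    print(n)"
tree = ast.parse(source)
print([type(node).__name__ for node in tree.body])  # ['Assign', 'For']
print(ast.dump(tree.body[0]))                       # the assignment subtree
```

The dump of the first statement shows exactly the shape described: a target Name "numbers" and a constant value 1.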
13:47
The for loop is a little bit more complicated. The target, again, is the thing that gets assigned, which is n; the iter, which, if you recall the loop.getIter() from our query, is the thing iterated over; and then there's a body, which is just a list of
14:02
statements. In this case, one statement. That statement is an expression statement whose value is a call, and the call calls a name, which is print, and has the argument Name n. Those dot-dot-dots are just because there are maybe a few other bits to do with the presence or lack of star-args and star-star-args and
14:22
annotations and so forth. Except there are no annotations in a call; forget the annotations bit. Okay. So if we have a look at that piece of code and we run it through our tools, it tells us that indeed we have an error there. This is a screenshot, because I didn't
14:44
trust the Wi-Fi, but I think I should have it here. Where is it? Alerts. Non-iterable in for loop. Why is this
15:13
not working? Here we are. Right. Okay. So this is
15:21
actually on the web. So if I click on that, it will highlight the origin, which is where the origin bit comes in. So basically we can say that numbers is an integer, and you
15:40
shouldn't iterate over an integer. Okay. So that's the AST; that's the first thing we do in our analysis, the first representation. Our next step is a control flow graph. If you look at the program on the left, you will
16:04
see that this one is somewhat redundant, but it is correct, and that the numbers we are iterating over this time is the tuple one, two, three. Now, the AST would have just said we have two assignments to numbers, and it doesn't
16:21
give us any information about ordering. It sort of gives us information about ordering, but not accurate enough to be generally useful. Whereas a control flow graph does. A control flow graph is basically a graph that emulates the way that the interpreter actually executes the code. So,
16:40
first of all, if you look at the octagonal elements for the module, there's an entry and an exit point for this whole flow graph. The orange one at the top is the entry, and the gray one near the bottom is the exit point. And then it simply flows through the code. So essentially what the interpreter evaluates is: we evaluate the constant
17:01
one, then we assign it to numbers, then we evaluate the constants one, two, three, create a tuple, and then assign that to numbers, and then we go through the for loop. And the for loop basically says: load numbers and then loop over its items, printing each one at a time.
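The control flow graph itself is the tool's own structure, but you can see the execution ordering it captures, and which the AST lacks, by looking at the bytecode stream (my illustration, using the stdlib `dis` module):

```python
import dis

# The interpreter executes these instructions in order; a control flow
# graph groups them into blocks and adds the branch edges between them.
source = (
    "numbers = 1\n"
    "numbers = (1, 2, 3)\n"
    "for n in numbers:\n"
    "    print(n)\n"
)
code = compile(source, "<example>", "exec")
ops = [ins.opname for ins in dis.get_instructions(code)]
print(ops)  # includes STORE_NAME (the assignments) and FOR_ITER (the loop)
```

Crucially, the two stores to `numbers` now appear in a definite order, which is exactly what lets the analysis know the second assignment wins before the loop.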
17:21
I hope it's reasonably clear. Google Docs won't allow you to put SVGs up, so I'm not sure the resolution of the PNG is great. Okay. So that's great: we're no longer coming up with a false positive by thinking numbers could be one when we should
17:44
know it's a tuple. But things can get more complicated. In this case, if this was just the code, obviously we would stop as soon as we hit random with a NameError, but let us assume that random is defined somewhere else as something that is either genuinely random, or something
18:03
the analysis can't work out whether it's true or false; it doesn't really make a lot of difference. Now, at the beginning of the second if statement, we know that numbers is either one, or it's one, two, three. So it's either an int or a tuple. Now, if it's an int or a
18:26
tuple and we track it through, we're going to see some errors. We're going to think that, well, in either for loop, numbers could be an int or a tuple, so there are errors in both loops. But of course, there's only an
18:41
error in the second loop, because the first loop is guarded by the check to see if it's a tuple, so the one cannot reach it. But the control flow graph doesn't really show us that, because the control flow merges. So what we need is what's called data flow. In data flow,
19:01
essentially, we track the values, or the set of values, or some approximation to the exact values, obviously, because we can't execute the code, and they give us the information we need. In this case, because these are simple constants, we can just track those. I should mention, if
19:21
you can see, and apologies to those people who are color blind, the green and the blue arrows: the green corresponds to the case where the condition is true, and the blue corresponds to the case where it's false. And we can track the value through. So basically, as we go through the second branch, that test will eliminate one
19:45
of the values on either side, such that for the first for loop we know that the value one will not get there. So we track the value of one from the assignment. And then, as we go through the test and the test is true, obviously
20:02
one is not a tuple, so at that point that value is discarded. And then, obviously, there's no error in the first for loop. But the second for loop is merely guarded by testing that it isn't a tuple, and one isn't a tuple, so then we hit that, and there's an error there. And we can see that on our website.
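As a toy model of that data-flow filtering, written as plain Python (my own sketch, not LGTM's implementation): track the set of possible values and let each branch's test filter the set on the edge into that branch.

```python
# Possible values of `numbers` where the two branches start:
possible = {1, (1, 2, 3)}

# Edge into the true branch of the tuple check keeps only tuples:
true_branch = {v for v in possible if isinstance(v, tuple)}
# Edge into the false branch keeps only non-tuples:
false_branch = {v for v in possible if not isinstance(v, tuple)}

print(true_branch)    # {(1, 2, 3)} -> the first loop is safe
print(false_branch)   # {1}         -> the second loop really is an error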
20:24
And you can see there's an error in the second loop, but not in the first, as we're able to work out from the data flow that we're seeing an integer for the second one but not for the first
20:40
one. But of course there are cases where that doesn't work. This is a slightly contrived case, but you do see things like this, where people do a test, conditionally set some value, then do the same test later on, and then use the thing they set in
21:02
the first test. This can happen with imports as well. You might see code like: try: import foo, and then except ImportError: foo = None, and later on there's a test that says if foo (because modules are always true), do something using foo. And we
21:21
need to track that. So here's the program, and the control flow graph on the right. And our data flow is insufficient to prevent us getting a false positive here, because flag and numbers are different variables.
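The try/import variant mentioned just above, spelled out (the module name is invented for illustration, so the import deliberately fails):

```python
# The optional-dependency shape of the flag pattern: the same condition
# (did the import succeed?) guards both the assignment and the use.
try:
    import hypothetical_dep  # made-up name; not a real package
except ImportError:
    hypothetical_dep = None

def process():
    # Only reached when the import succeeded, which is exactly the
    # correlation between the two tests that the analysis must track.
    if hypothetical_dep is not None:
        return hypothetical_dep.run()
    return "fallback"

print(process())  # "fallback" (the module does not exist here)
```

An analysis that doesn't connect the `except` branch with the later guard will wrongly report that `None.run()` can be called.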
21:41
So all we know about numbers at the point we hit if flag is that it's one or the tuple. flag has nothing to do with that, so both values pass through that test, and then we get a false positive. So what we can do is
22:01
split the control flow graph. Basically, what we do is this transformation: we don't rejoin the flow after the first if statement, and then we can duplicate the if flag test and move it into each branch. And then it should be obvious
22:24
that it fairly trivially falls out which loop gets run and which doesn't; you know, if False and if True are pretty straightforward. And then it's fairly clear that on one side the loop is never going to be executed, and on the other
22:41
side it's safe to execute, because numbers is one, two, three. Now, obviously we don't do this transformation on the source code, because that messes up everything else; we do it on the control flow graph, and here's the transformation. Now, the other thing to note here is that you might think this has a tendency to just
23:02
blow things up horribly. It's actually not so bad. We do limit the amount of splitting we do, in order to avoid it blowing up. But often, once you split on cases with repeated tests, you can then prune some of those branches. So you will note that on the branch which follows
23:24
the true case of the first test, we also know the outcome of the second test, and on the other side it's the other way around. So consequently we're able to
23:41
prune those extra branches, and you'll see the right-hand side is really no larger than the left-hand side. Sometimes it does expand, particularly in cases where you have an early test, then a whole lot of code, and then another test that matches; then there tends to be some duplication. But generally we don't see much of an increase in the size of the control flow graphs with this.
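A toy rendering of the split (my sketch): instead of merging control flow after the first `if flag:`, analyze each assumed value of `flag` as its own path, on which the repeated test resolves trivially.

```python
# Per-path analysis: with flag fixed, `numbers` has exactly one
# possible value, and the later `if flag:` is a foregone conclusion.
def values_reaching_loop(flag):
    numbers = {1}              # after: numbers = 1
    if flag:
        numbers = {(1, 2, 3)}  # after the guarded reassignment
    # On this path the second `if flag:` test is already decided:
    return numbers if flag else set()  # empty set: loop never entered

print(values_reaching_loop(True))    # {(1, 2, 3)}: iterating is safe
print(values_reaching_loop(False))   # set(): no false positive
```

The pruning described above corresponds to the branches that become unreachable once `flag` is fixed on each path.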
24:06
So far so good, but you will have noticed that all of that stuff was very localized, and data tends to flow around a program through calls and so on. So have a look at this code. Is this correct?
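Roughly, the code under discussion looks like this (my reconstruction from the talk; the exact names on the slide are guesses):

```python
# flag says whether to loop; the call sites keep flag and numbers
# correlated, so no call actually iterates a non-iterable.
def print_numbers(flag, numbers):
    printed = []
    if flag:
        for n in numbers:
            printed.append(n)
    return printed

print(print_numbers(False, 1))          # []: the loop body never runs
print(print_numbers(True, (1, 2, 3)))   # [1, 2, 3]
```

Merging both call sites without context would allow flag=True together with numbers=1, which is the spurious combination discussed next.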
24:21
Well, yes it is, because we're either calling print_numbers with False and an integer, in which case flag is false and we don't loop, or with True and a tuple, and if flag is true, then it's okay to loop. Now, if
24:42
we just track the calls from the code at the bottom into the function, that's not quite sufficient, because what values can flag have in the function? It could be True or False; numbers could be one or the tuple; and then we're unable to distinguish the combinations.
25:03
So what we need is something called call context, which is where we basically pass along the context from which we call something, and that enables us to disambiguate this. So, in order to be correct, either we need to not track any values through calls,
25:21
or only through a very restricted set, or we can use call context. Again, call context is something where you can potentially blow up the size of what we're analyzing, so we need to carefully limit it, to try and find interesting stuff without killing performance too much. And there's more, but we
25:44
don't really have time, because I could carry on like this all day. So hopefully I've convinced you that our analysis is reasonably accurate as a result of using all these techniques. Let me just check my notes.
26:06
Cool. Okay. So now a quick run over LGTM.com. Here's the front page for Django.
26:22
We analyze a large number of open source projects. Not quite as many as we would like, so if yours isn't up there and you think it ought to be, come and talk to me. But generally, I think, we aim to analyze most
26:41
stuff on GitHub and Bitbucket. I'll just quickly run through this so you can see the number of contributors; the number of alerts, some of which are errors, while others are just recommendations and warnings, which are probably less important; lines of code; and a whole bunch of other stuff. I'm
27:02
picking Django because it's reasonably good code, so hopefully I won't embarrass anyone, and it's something pretty much everyone has heard of. Now, another project, one slightly less famous and of lower quality. I want to highlight that we do pull request integration. So if you
27:22
think this might be valuable for your project, you want to look through the alerts, and you want to know about this every time you have a pull request, we do pull request integration. All you need to do is log in to LGTM, get a user ID, then go to your project and
27:41
click pull requests. Right. Now, here's the sort of interesting stuff. As I said, flexibility is key. Flexibility means you have a problem that I haven't anticipated, and you
28:02
would like to find other instances of that problem. Well, you can write your own query. Now, this is a custom query language; it's declarative programming, somewhat different to the sort of programming you might be used to. So I appreciate that although these queries look
28:21
very concise, it's sometimes not entirely straightforward to get your head around them. But I would recommend you have a go. And so, just to recap: good code analysis should be flexible, accurate, and give you
28:41
valuable results. Insightful, in some ways, is up to you: if you need to know something about your code, and you can write a query to do it, then that's pretty insightful. Okay. I think that is it. There's an open space on code
29:02
analysis; I want to invite other people who are interested, the Coala people and so on, to that as well. And if you want to come along and chat to me, I'll be around after the talk, and I'll be at the sprints. And if you thought that was all kind of cool, we're hiring. Does anyone end a
29:24
talk without saying that? But we are. Oh, Larry's put his hand up. Okay. But we are hiring. If you think this is cool and you'd like to work on this, we're hiring for Python analysis, from the web front end to the core infrastructure.
29:42
And if you want to do C++ or Java analysis or the like, we're also hiring for those. Okay. I think that's it. Are there any questions? So we have something like 15 minutes for
30:06
questions. And I already see some questions over there; I can go round with the mic from the side of the room. "I just tried it in parallel and tried to add my project, and it
30:23
says it can't build it because some headers are missing. So how is that stuff handled if you have non-Python dependencies? It also did not find our requirements.txt, because it's not in the top directory. We have a requirements.d, and in there are the
30:41
requirements files, because we don't have just one, we have multiple." Okay. I think it's probably better if we discuss that offline. But, yeah, the problem is we do essentially just scan everything, and if it's pretty standard, we build it. And if it's not, we'll need a bit of
31:01
custom configuration. But I'll look into that. Because we're always wanting to fix these issues. Okay. So static analysis is great. And especially if you're doing C, C++, Java, any of those static
31:22
languages: if you're programming in those and you aren't using static analysis, start using it yesterday, because it will find massive numbers of bugs that you otherwise wouldn't. But in Python the problem is that I've used several of these tools, and what always seems to happen is that you enable them and, because
31:43
Python is so dynamic, you get massive numbers of false positives; they just overwhelm you and all of the valuable findings get lost. So do you have any numbers or estimates on what your false positive rate is? Well, this is the "accurate"
32:01
thing. This is what I was saying earlier about knowing what you don't know. To put it technically, I would say we attempt to ensure that our knowledge, the set of facts that we present to you and on which you can base your queries, is as large a
32:22
strict subset of the truth as we can manage. With that in mind, there should in theory be no false positives at all. Obviously we're not perfect, so we do have some false positives, but we generally regard any false positive as a bug, with a
32:41
handful of exceptions where doing the analysis perfectly accurately would give us essentially no results, and we're prepared to trade a few false positives to get meaningful results on most projects. But as to numbers, it's kind of hard to say,
33:01
because we don't know how many errors there are in a program, so we can't come up with a number as to what our false positive rate is. But as I say, we aim for zero. It's not an achievable goal but it's definitely something we take pretty seriously.
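To illustrate why dynamic Python makes low-false-positive analysis hard, here is a minimal, hypothetical sketch (the Config class is invented for this illustration, not taken from the talk): attributes created via setattr() never appear in the class body, so a naive checker would flag their use as errors, while an analyzer that only reports what it can prove stays silent.

```python
# Hypothetical example: dynamically created attributes.
# A checker that only reads the class body sees no 'timeout' or
# 'retries' attribute, so it might flag the accesses below as
# errors (false positives). An analyzer that models setattr()
# conservatively, reporting only what it can prove, stays silent.

class Config:
    def __init__(self, **options):
        # One attribute per keyword argument, created at runtime.
        for name, value in options.items():
            setattr(self, name, value)

cfg = Config(timeout=30, retries=5)
print(cfg.timeout)   # 30
print(cfg.retries)   # 5
```

An analysis built on a strict subset of the truth, as described above, treats such attributes as unknown rather than missing, which is exactly the trade-off being discussed here.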
33:20
So you have it as a service online. Do you offer it as a self-hosted service, for example for companies that have proprietary reasons for not putting their code online? Okay. Basically, the answer is I don't do sales. My boss has told me off for trying to do sales, because I'm an engineer and I'm not very good at it. So don't ask me. But this is all
33:41
free for open source. If you want to use it commercially, then you should contact us directly through sales, at semmle.com.
34:00
So, with the new type annotations in Python 3 now, do you also make use of those when you run the analysis? Okay. So I didn't cover type annotations, and we don't take much information from them. Basically, we're
34:20
sort of taking it as it comes: development is an ongoing process, and we focus on what gives us the best improvement at the time. So including type hints is definitely on the cards, and we particularly want to do it for analyzing stub files for the standard library and so on, because
34:40
our standard library analysis is, at the moment, just a one-off analysis we did of the C code, on which we then tried to write some queries to generate type information; that's relatively weak compared with what's now in the typing stubs. So we want to use those. Of course, the worry is that any error in the stub files is just going to manifest itself as a
35:01
false positive, so that's definitely a concern. But yeah, the plan is to take advantage of those: not to do type checking as such, but merely to use them as sources of information for the more general
35:21
data flow. Thank you. Any other question? Do you have examples of bugs that were fixed
35:42
in open source projects thanks to LGTM, or even projects that embraced the tool? Not off the top of my head; I can dig out a list if you want some evidence later on. But, I mean, going back a long time, we found
36:01
a bug in the 2.7 standard library, which is a kind of funny little one. In Python 2.7, if you implement __eq__ (dunder-equals) but not __ne__ (dunder-not-equals), then the interpreter gives you different results for equality and inequality. I think it was a weakref set
36:22
in 2.7, a while ago; it's a long time ago, four or five years ago I think, when it was fixed. The object was both equal and not equal to itself. We used that as a little demo, and unfortunately someone from the core developer team was at that demo, and they fixed it within an hour. So we had to find
36:40
a different demo. And we've found bugs in various other projects, various things. Interestingly, we had something that looked like a false positive in Flask, but it turned out that there was actually something wrong in the requirements file for Flask: it said it needed Click 2.0 or higher,
37:01
and it was giving us an error saying a function was being called with an argument that didn't exist in the function being called. But when you corrected the requirement to Click 4, the error went away. So sometimes you find rather indirect errors like that. So I can claim the standard library and Flask,
37:20
so, you know. Any other questions? Great talk. Thank you. Just wondering: on one of the slides, you appeared to generate code that was also Python, but an improved version. So I'm just wondering, if you put in a project,
37:43
can you spit out improved code that we can then use? When you say improved code, do you mean the transformation on the flow graph? Well, I wouldn't say that's an improvement; it's got duplicated code in it, so I would say it's definitely not an improvement. Analysis tools, much like compilers, are going to transform the code
38:00
and do all sorts of weird things with it during their analysis. Some of the intermediate representations are going to be truly atrocious, incomprehensible code, but they make things more explicit as far as the analysis is concerned internally. You should leave the code as it is. I mean, if there are errors in your code, fix them, obviously.
38:21
And if things are unclear, make them clearer. But you shouldn't really do that for the benefit of the tools. Any other question?
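The Python 2 __eq__/__ne__ pitfall from the earlier answer can be sketched as follows. In Python 2, defining __eq__ without __ne__ left != falling back to the old default comparison instead of negating __eq__; Python 3 derives != from __eq__ automatically. The sketch below runs on Python 3 and spells out the old fallback by hand, so it is a reconstruction of the behaviour, not the actual weakref code:

```python
# Reconstruction (not the actual stdlib code) of the Python 2 pitfall:
# __eq__ is defined, but __ne__ is written to mimic Python 2's old
# fallback (an identity-style comparison) instead of negating __eq__.

class Py2Style:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Py2Style) and self.value == other.value

    def __ne__(self, other):
        # Python 2's default != did not consult __eq__; emulate that.
        return self is not other

a, b = Py2Style(1), Py2Style(1)
print(a == b)  # True: __eq__ compares values
print(a != b)  # True as well: inconsistent, exactly the kind of bug a query can find
```

Two objects being both equal and not equal to each other is precisely the inconsistency the analysis flagged in the standard library.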
38:43
Well, I do have a question. OK. So I will ask away. So can you give us an idea about how many projects your company is processing in general? And also, I'm kind of curious what kind of infrastructure, if you can share
39:01
anything about that, you folks have to run your system on all these projects. There are some numbers here. Yeah, how heavy is it? So: 50,094 projects, apparently. I hope that number's right.
39:21
Yeah, so this is including Java, JavaScript, and Python. I think probably roughly half of those, or maybe even more, will be JavaScript. I'm not sure of the exact ratios, but JavaScript is the largest number,
39:43
then Python, then Java. But I think that's just because the proportion of projects we can analyze at the moment is lower for Java, because of the build issues and so on. But yeah, this number should go up, and if we add languages, it'll definitely go up further.
40:03
Any other question? Going once, going twice, going, go on. Well, let's thank the speaker again.