
Finding bugs for free: The magic of static analysis.


Formal Metadata

Title
Finding bugs for free: The magic of static analysis.
Title of Series
EuroPython 2017
Number of Parts
160
Author
Mark Shannon
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt, copy, distribute and transmit the work or content, in adapted or unchanged form, for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content, also in adapted form, is shared only under the conditions of this license.

Content Metadata

Abstract
Finding bugs for free: The magic of static analysis. [EuroPython 2017 - Talk - 2017-07-14 - Arengo] [Rimini, Italy] Overview: Static analysis is a powerful technique for improving the quality of software. It allows you to find bugs in your Python code without the need for any annotations. At lgtm.com we provide state-of-the-art static analysis for a number of languages, including Python. Our Python analysis can find bugs without the annoyance of many false positives. We have already found bugs in the standard library, requests, numpy and many others. In this talk I will briefly describe what static analysis is and how it can be useful to you, and then give an overview of the techniques we use. Intended Audience: This talk is aimed at all Python developers, although I expect that those who have used pyflakes, pep8, pylint or mypy in the past will find it the most accessible. I hope that this talk will inspire developers to start using static analysis tools (hopefully ours) and to produce better code as a result. I will assume that the audience knows Python well, but knows little or nothing about static analysis.
Transcript: English (auto-generated)
Good morning, can everyone hear me? Wave at the back. Great. Okay. I'll just introduce myself and my company. My name is Mark Shannon. I've been playing around with Python for over
a decade, and most of that I've spent working out either how to optimise it or how to break it, as a side effect of those two things. I authored PEP 412 and I was the BDFL-Delegate for PEP 484. I think those are the right numbers; you'll have to Google
them to check for me. Anyway, before we get into the depth of things, and while people are still arriving, we shall have a mini space opera. So this is the Curiosity lander. In February 2012 it was in flight to Mars, and NASA found
a bug in the lander software. Now, the lander software is written in C, and in C you can pass arrays around, but the C language doesn't bother passing the size around, so it's very easy to pass an array of one size to a function expecting an array of another size. So NASA found
one of these cases in testing and they asked us to investigate, see if we could find any more cases of this. So we were able to write a nine-line query and we
were able to find another 30 instances of this. We don't know exactly what NASA did, because all their software is secret, closed-source and sort of military-grade, but we presume they fixed them, and we do know that Curiosity landed safely. So NASA is quite keen on static analysis, but NASA is somewhat risk-averse.
I want to try and convince you that code analysis is something you can use even if you are less risk-averse than NASA; it's still worth your effort. I want to convince you that in the long run it can save you time
and money; the little bit of effort you need to put into using it will pay off. Now, obviously NASA's code was written in C, but we support a bunch of languages at LGTM, and the one I believe you're probably most interested in is Python. So what is code analysis?
Code analysis is finding out facts about your program; basically anything you might think is interesting. That could be metrics: you could measure complexity in the code, look for hot spots of complexity,
all sorts of things. But one particularly interesting set of facts are bugs, or likely bugs. I would like to contrast code analysis with a couple of quality-assurance things you probably already do. The first one is testing. I'm going to assume everyone tests their code;
I'm not going to bother asking, so as not to embarrass anyone. Testing is obviously very important for checking that your code is safe to be released, but testing is very specific to a code base, and you need to write tests for everything you're interested
in. If you write a new piece of code, you're going to have to write new tests for it; you can't rely on pre-existing tests to find what you want. Whereas with code analysis, you often can rely on pre-existing stuff. Another quality-assurance thing you'll
probably use is code review. Now, code analysis is much more like code review. A human code reviewer is going to look at the diff in your code and attempt to see what issues they can find with it. Hopefully they'll do it in a positive and constructive fashion, but essentially they are looking for flaws in your code. They might be looking for
design flaws, but also just for smaller-scale errors. Code analysis can take away a lot of that work. It can find those errors, and it's much, much more meticulous than any human could ever be. It can also work out whether your
changes interact with other pieces of code and double-check the interaction, which is something a code reviewer could very easily miss. So what makes for good code analysis? Without these properties, it's not very useful. Okay, so let's go through these
one by one. First, it needs to be flexible. In the NASA example, that was an error they had not anticipated. Had it been a sort of general-purpose check they already had, that would have been fine, but because they didn't anticipate it, we needed to create a new analysis on the fly, reasonably quickly and reasonably easily. That's an important
part of analysis. Another property is accuracy. After all, imagine you have a watch, and whenever you look at it, it's right half the time. What do you do with that watch? Well, you just bin it. What if it's right 90% of the time? It's kind of useful, but not really that useful;
you'd probably keep it if you didn't have any other means of checking the time. But what about a watch that's 99.5%, 99.8% accurate? Then, sure, you'd double-check if there was a flight to catch, but otherwise you'd pretty much rely on it. And once you can rely on something, it just makes your life easier, because you're not
double-checking it or doubting it all the time. So accuracy is very important. And finally, it needs to be insightful. PEP 8 is all very well and good, and we all love PEP 8, but does it really matter if there are 81 characters
in a line, or other minor little things like that? But if code analysis can find something like a cross-site scripting vulnerability, then that's really valuable. So obviously there's a huge range of things we can find, and you want the code analysis to find interesting things. So can we do this for Python? Well,
it can be tricky compared with a statically typed language. Python doesn't require type annotations or declarations. Of course, type annotations are used, but
there are relatively few of them. Also, because it has a history of being dynamically typed, people tend to just pass values around and then locally check for things: is something None? Does it have a particular attribute? Is it callable? And so on, before they do some operation on it. We need to understand these sorts of things. Also, people
do things that are genuinely dynamic in Python, things that a code analysis tool is always going to struggle with, like creating classes from a database schema on the fly. That's pretty difficult to analyze. If you have the database schema to hand, then maybe you can integrate analysis of that, but generally you're not going to
be able to do that. So in order to keep things accurate, we also need to know what we don't know. Now, I've said flexibility is important. In our tool, LGTM,
what makes it flexible is that at its heart it contains an object-oriented query language. The advantage of a query language is that, because it's declarative, you can just say: I'm looking for this sort
of problem. That allows you to make fairly brief queries that will find what you want. Given the NASA example, I said there was a nine-line query; I'm not going to bother you with that, because it uses the C libraries, so we'll stick with Python. So I will give you a Python example though.
So here's an example query. Basically, what we're looking for here is a for loop where the thing it's iterating over is not an iterable. The query is pretty short; obviously, at first glance it may not make a lot of sense, so I'll explain how
it works. There are three clauses, much like in any SQL query: a from clause, which describes the program elements we're interested in; a where clause, which relates them; and a select, which just gets us our result. In the from clause we're looking for a for loop, an expression, a class and an AST node. An AST node, at
this point, is just some point in the program. We're interested in it so that we have a marker we can look at when we see a result, so we know what we're looking at and how to fix it. The key thing is that we're not looking for any old combination of those; we're looking for
a specific relation between them. The first line in the where clause basically says that the iterable in the for loop is iter. The next one is probably the key point: it says that the expression refers to some set of values it could hold.
We don't care about the values themselves, but we are interested in the class of those values, and in the origin, so we know where the value came from when producing results. The last line says that the class is not iterable, but note that
we're not saying that it isn't iterable; we're saying that we don't know that it is iterable, which is why we have the second clause, which says we do know something about it. I have a hand up. Oh, sorry: underscore is a convention meaning "ignore this value". I believe it's an SQL convention as well,
but I don't use SQL much, so anyone can correct me later if I'm wrong. So it means we don't care about that value. And then we select the loop and the origin, which is a useful helper to tell us where the value came
from, so we can fix the error. So that's flexibility: you can write these brief queries, and you can write your own queries. So what makes it accurate, or precise? It's that refers-to relation we had on the previous
slide; that's basically wrapping up all our analysis in the library. So I'm now going to go through the library. Just check my time. Yeah, that's good. Okay. So I'm going to go through some examples and show some example code.
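As an aside, the non-iterable-loop check the query expresses can be roughly approximated in plain Python with the standard `ast` module. This is only a crude, purely syntactic sketch: it flags loops whose iterable is literally a non-iterable constant, with none of the refers-to value tracking the real analysis does, and all the names in it are invented for illustration:

```python
import ast

# Literal types that are definitely not iterable.
NON_ITERABLE = (int, float, bool, complex)

def non_iterable_loops(source):
    """Yield (line, value) for each 'for' loop over a non-iterable constant."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.For) and isinstance(node.iter, ast.Constant):
            if isinstance(node.iter.value, NON_ITERABLE):
                yield node.lineno, node.iter.value

buggy = "for n in 1:\n    print(n)\n"
print(list(non_iterable_loops(buggy)))  # [(1, 1)]
```

Unlike the query, this sketch misses the `numbers = 1` case from the talk, because catching that requires tracking what the name refers to, which is exactly the point of the refers-to machinery.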
It's kind of nice to show real code, but there are a couple of reasons not to. Real code is far too big; it won't fit on a slide. And what usually tends to happen is that the cause of an error and the manifestation of the error are not in the same place, which makes it awkward to follow. There's
another, equally important reason, which is that I don't really want to choose some arbitrary piece of code, point at it, and say: look, there are bugs in your code. So it's much better that the finger points at me, and that the code is clearly intentionally buggy, as you can probably guess from the name buggy_code there. So hopefully I won't
upset anyone. The first piece of analysis we can do is to parse the source code and produce an abstract syntax tree. An abstract syntax tree is basically a tree that describes
the structure of the source code. So the very simple code on the left produces the abstract syntax tree on the right. Now, assume this is a whole module, not just a snippet of code. In the tree on the right, the top level is the module itself, and that contains two
statements: an assignment and a for loop. The assignment is broken down into the target of the assignment, which is the left-hand side, numbers. Note that numbers here is a name rather than a string. And then the value, which is just the value one. Now, this is an abstract syntax tree, rather than what's
otherwise called a concrete parse tree. A concrete parse tree would contain things like the parentheses around the one, which we've omitted from the abstract syntax tree because they don't affect the meaning; they're just extra syntax. Likewise, there's no actual marker for the for or the in tokens. They're just omitted.
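That point, that purely syntactic detail leaves no trace in the abstract syntax tree, is easy to check with Python's own `ast` module (used here only as an illustration; it is not the tool's internal representation):

```python
import ast

# The parentheses around the 1 leave no trace in the abstract syntax tree:
with_parens = ast.dump(ast.parse("numbers = (1)"))
without = ast.dump(ast.parse("numbers = 1"))
print(with_parens == without)  # True: '(1)' and '1' parse identically
```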
The for loop is a little bit more complicated. The target, again, is the thing that gets assigned, which is n; the iter (which, if you recall the loop.getIter() from our query, is the iterable); and then there's a body, which is just a list of
statements, in this case one statement. That statement is an expression statement whose value is a call; the call calls a name, which is print, and has as its args the name n. Those dot-dot-dots are just because there are a few other bits regarding the presence or lack of star-args and star-star-args and
annotations and such forth, except there are no annotations in a call; forget the annotations bit. Okay. So if we have a look at that piece of code and run it through our tools, it tells us that indeed we have an error there. Now, this is a screenshot, because I didn't
trust the Wi-Fi, but I think I should have, where is it? Alerts. Non-iterable for loop. Hmm, why is this
not working? Here we are. Right. Okay. So this is
actually on the web. If I click on that, it will highlight the origin, which is where the origin bit comes in. So basically we can say that numbers is an integer, and you
shouldn't iterate over an integer. Okay. So that's the AST; that's the first thing we do in our analysis, the first representation. Our next step is a control flow graph. If you look at the program on the left, you will
see that this one is somewhat redundant, but it is correct, and that what we are iterating over this time is the tuple one, two, three. Now, the AST would have just said we have two assignments to numbers; it doesn't
give us useful information about ordering. It sort of gives us information about ordering, but not accurately enough to be generally useful. Whereas a control flow graph does. A control flow graph is basically a graph that emulates the way the interpreter actually executes the code. If you
look for the octagonal elements in the module, there's an entry and an exit point for the whole flow graph: the orange one at the top is the entry and the grey one near the bottom is the exit point. And then it simply flows through the code. So, essentially, what gets evaluated is: we evaluate the constant
one, then we assign it to numbers; then we evaluate the constants one, two, three, create a tuple, and assign that to numbers; and then we go through the for loop. The for loop basically says: load numbers and then loop over its items, printing each one at a time.
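A control flow graph like the one on the slide can be sketched as a plain adjacency mapping. This toy version (all names invented for illustration) shows the ordering information the AST alone lacks: a forward walk finds that the second assignment is the one that reaches the loop.

```python
# Hand-built control flow graph for:
#     numbers = 1
#     numbers = (1, 2, 3)
#     for n in numbers: print(n)
# Each key flows to its list of successors; 'for-loop' has a back edge.
edges = {
    "entry": ["numbers = 1"],
    "numbers = 1": ["numbers = (1, 2, 3)"],
    "numbers = (1, 2, 3)": ["for-loop"],
    "for-loop": ["for-loop", "exit"],
}

def last_assignment_before(edges, start, target, name):
    """Walk forward from start to target, returning the last assignment
    to `name` seen on the way: the definition that reaches the loop."""
    node, last = start, None
    while node != target:
        if node.startswith(name + " ="):
            last = node
        node = edges[node][-1]  # follow the fall-through edge
    return last

print(last_assignment_before(edges, "entry", "for-loop", "numbers"))
```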
I hope that's reasonably clear. Google Docs won't let you put SVGs up, so I'm not sure the resolution of the PNG is great. Okay. So that's good: we're no longer coming up with a false positive by thinking numbers is one when we should
know it's a tuple. But things can get more complicated. In this case, if this were the whole code, obviously we would stop as soon as we hit random with a NameError, but let us assume that random is defined somewhere else, as something that is either genuinely random or something
the analysis can't work out to be true or false; it doesn't really make a lot of difference. Now, at the beginning of the second if statement, we know that numbers is either one, or it's one, two, three. So it's either an int or a
tuple. If we just track that through, we're going to see some errors: we're going to think that, in either for loop, numbers could be an int or a tuple, so there are errors in both loops. But of course, there's only an
error in the second loop, because the first loop is guarded by the check to see if it's a tuple, so the one cannot reach it. The control flow graph doesn't really show us that, because the control flow merges. So what we need is what's called data flow. With data flow,
essentially we track the values, or the set of values, or some approximation to the exact values (obviously, because we can't execute the code), that give us the information we need. In this case, because these are simple constants, we can just track those. I should mention, with apologies to those who are colour-blind, that of
the green and the blue arrows, the green corresponds to the case where the condition is true and the blue to where it's false. And we can track the value through. As we go through the second branch, that test will eliminate one
of the values on either side, such that for the first for loop we know the value one will not get there. So we track the value one from its assignment, and as we go through the test, and the test is true, obviously
one is not a tuple, so at that point that value is discarded, and then obviously there's no error in the first for loop. But the second for loop is merely guarded by the test that it isn't a tuple, and one isn't a tuple, so it does reach that, and there's an error there. We can see that on our website.
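The branch-by-branch elimination just described can be sketched as set filtering. This is a toy model (not the real analysis, which works on abstract values rather than concrete ones) in which each branch keeps only the values consistent with the test that guards it:

```python
# numbers is known to be either the int 1 or the tuple (1, 2, 3).
possible = {1, (1, 2, 3)}

def narrow(values, test, taken):
    """Keep only the values consistent with the branch that was taken."""
    return {v for v in values if test(v) == taken}

is_tuple = lambda v: isinstance(v, tuple)

first_loop = narrow(possible, is_tuple, True)    # guarded: tuple only
second_loop = narrow(possible, is_tuple, False)  # the int 1 gets through
print(first_loop, second_loop)  # alert only on the second loop
```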
And you can see there's an error in the second loop, but not in the first, as we're able to work out from the data flow that we're seeing an integer for the second one but not for the first
one. But of course there are cases where that doesn't work. This is a slightly contrived case, but you do see things like this, where people do a test, conditionally set some value, then do the same test later on and use the thing they set in
the first test. This can happen with imports as well: you might see code like try: import foo, then except ImportError: foo = None, and later on there's a test that says if foo: (because modules are always true) do something using foo. We
need to track that. So here's the program, with its control flow graph on the right, and our data flow is insufficient to prevent us getting a false positive here, because flag and numbers are different variables.
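The pattern being described looks roughly like this (a reconstruction, not the actual slide code): the same condition both chooses the value and guards its use, so the bad combination can never occur at runtime, even though a naive data-flow analysis sees it.

```python
def guarded(flag):
    """The 'repeated test' pattern: numbers is an int only when the
    guarded loop is skipped, so iterating it can never actually fail."""
    if flag:
        numbers = (1, 2, 3)
    else:
        numbers = 1
    total = 0
    if flag:               # same test again: correlates with the branch above
        for n in numbers:  # only reached when numbers is the tuple
            total += n
    return total

print(guarded(True), guarded(False))  # 6 0
```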
All we know about numbers at the point we hit if flag: is that it's one or the tuple. flag has nothing to do with that, so both values pass through that test, and then we get a false positive. So what we can do is
split the control flow graph. Basically, we do this transformation: we don't rejoin the flow after the first if statement, and then we can duplicate the if flag: test and move it into each branch. Then it should be obvious
that it fairly trivially falls out which loop gets run and which doesn't; if False and if True are pretty straightforward. And then it's fairly clear that on one side the loop is never going to be executed, and on the other
side it's safe to execute, because numbers is one, two, three. Now, obviously we don't do this transformation on the source code, because that would mess up everything else; we do it on the control flow graph, and here's the transformation. The other thing to note here is that you might think this has a tendency to just
blow things up horribly. It's actually not so bad. We do limit the amount of splitting we do, in order to avoid it blowing up; but because you split on cases where tests are repeated, once you have split you can often then prune some of those branches. So you will note that on the left-hand branch, which follows
the true case of the first test, we also know that the second test can therefore only be false, and on the other side it's the other way around. Consequently, we're able to
prune those extra branches, and you'll see the right-hand side is really no larger than the left-hand side. Sometimes it does expand, particularly in cases where you have an early test, then a whole lot of code, and then another test that matches; then there tends to be some duplication. But generally we don't see much of an increase in the size of the control flow graphs with this.
So far so good, but you will have noticed that all of that was very localized, and data tends to flow around a program through calls and so on. So have a look at this code. Is this correct?
Well, yes it is, because we're either calling print_numbers with False and an integer, in which case flag is false and we don't loop, or with True and a tuple, in which case it's okay to loop. Now, if
we just track the calls to print_numbers in the code at the bottom into the function, that's not quite sufficient, because what values can flag have in the function? It could be True or False, and numbers could be one or the tuple, and then we're unable to distinguish.
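The example presumably looked something like this reconstruction. Each call site pairs flag and numbers consistently, which is exactly the correlation a context-insensitive analysis loses when it pools all the callers' arguments together:

```python
def print_numbers(flag, numbers):
    # Safe only because every caller passes a consistent pair:
    # flag=True always comes with an iterable numbers.
    if flag:
        for n in numbers:
            print(n)

print_numbers(False, 1)         # flag is False: the int is never iterated
print_numbers(True, (1, 2, 3))  # flag is True: numbers is a real tuple
```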
What we need is something called call context, where we basically pass along the context from which we call something, and that enables us to disambiguate this. So, in order to be correct, we either need to not track values through calls at all,
or only through a very restricted set, or we can use call context. Again, call context is something that can potentially blow up the size of what we're analyzing, so we need to carefully limit it, to try and find interesting stuff without killing performance too much. And there's more, but we
don't really have time, because I could carry on like this all day. So hopefully I've convinced you that our analysis is reasonably accurate as a result of using all these techniques. Let me just check my notes.
Cool. Okay. So now a quick run over LGTM.com. Here's the front page for Django.
We analyze a large number of open-source projects. Not quite as many as we would like, so if yours isn't up there and you think it ought to be, come and talk to me. But generally we aim to analyze most
of the stuff on GitHub and Bitbucket. I'll just quickly run through this: you can see the number of contributors; the number of alerts, some of which are errors, while others are just recommendations and warnings, which are probably less important; lines of code; and a whole bunch of other stuff. I'm
picking Django because it's reasonably good code, so hopefully I won't embarrass anyone, and it's something pretty much everyone has heard of. Now, another project, one slightly less famous and of lower quality. I want to highlight that we do pull-request integration. So if you
think this might be valuable for your project, if you look through the alerts and decide you want to know about this every time you have a pull request, we do pull-request integration. All you need to do is log in to LGTM to get a user ID, then go to your project and
click pull requests. Right, excuse the bit of formatting there. Now here's the sort of interesting stuff. As I said, flexibility is key. Flexibility means that when you have a problem I haven't anticipated, and you
would like to find other instances of that problem, you can write your own query. Now, this is a custom query language, and it's declarative programming; it's somewhat different from the sort of programming you might be used to. So I appreciate that, although these queries look
very concise, it's sometimes not entirely straightforward to get your head around them. But I would recommend you have a go. Just to recap: good code analysis should be flexible, accurate, and give you
valuable results. Insightful is, in some ways, up to you: if you need to know something about your code and you can write a query to find it, then that's pretty insightful. Okay, I think that is it. There's an open space on code
analysis; I want to invite other people who are interested, the coala people and so on, to that as well. If you want to come along and chat to me, I'll be around after the talk, and I'll be at the sprints. And if you thought that was all kind of cool, we're hiring. Does anyone end a
talk without saying that? But we are. Oh, Larry's put his hand up. Okay. But we are hiring: if you think this is cool and you'd like to work on it, we're hiring for everything from Python analysis to the web front end to the core infrastructure.
And if you want to do C++ or Java analysis or the like, we're also hiring for those. Okay, I think that's it. Are there any questions? We have something like 15 minutes for
questions, and I already see some questions over there. I can go round with the mic from the side of the room. I just tried it in parallel with the talk and tried to add my project, and it
says it can't build it because some headers are missing. So how is that handled if you have non-Python dependencies? It also did not find our requirements.txt, because it's not in the top directory; we have a requirements.d, and in there are the
requirements files, because we don't have just one, we have multiple. Okay, I think that's probably better if we discuss it offline. But, yeah, the problem is we do essentially just scan everything, and if it's pretty standard, we build it; if it's not, we'll need a bit of
custom configuration. I'll look into that, because we're always wanting to fix these issues. Okay. So, static analysis is great, especially if you're doing C, C++, Java, any of those static
languages, if you're programming in those and you aren't using static analysis, start using it yesterday, because it will find massive amounts of bugs that you otherwise wouldn't. But in Python the problem is, I've used several of these tools, and what always seems to happen is that you enable them and, because Python is so dynamic, you get massive amounts of false positives. They just overwhelm you, and all of the valuable things get lost. So do you have any numbers or estimates on what your false positive rate is? Well, this is the accuracy point. This is what I was saying earlier about knowing what you don't know. To put it technically, I would say we attempt to ensure that our knowledge, the set of facts that we present to you and on which you can base your queries, is as large a
strict subset of the truth as we can manage. With that in mind, there should in theory be no false positives at all. Obviously we're not perfect, so we do have some false positives, but we generally regard any false positive as a bug, with a handful of exceptions where doing the analysis perfectly accurately would give us essentially no results, and we're prepared to trade a few false positives to get meaningful results on most projects. But as to numbers, it's kind of hard to say
because we don't know how many errors there are in a program, so we can't come up with a number as to what our false positive rate is. But as I say, we aim for zero. It's not an achievable goal but it's definitely something we take pretty seriously.
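To make the "only report what you can prove" idea concrete, here is a small sketch in plain Python (not lgtm's actual implementation; all names here are invented for illustration): a check that flags a keyword argument only when it can resolve the callee's signature and that signature provably lacks both the name and a `**kwargs` catch-all. When nothing can be proved, it stays silent, so silence never turns into a false positive.

```python
# Toy illustration of reporting only a provable subset of the truth:
# flag a keyword argument only when we can resolve the callee and
# prove the name is absent; otherwise report nothing at all.
import inspect

def bad_keywords(func, kwargs):
    """Return keyword names that provably don't exist on func."""
    try:
        sig = inspect.signature(func)
    except (TypeError, ValueError):
        return []            # can't resolve a signature: prove nothing
    params = sig.parameters.values()
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return []            # **kwargs swallows everything: no proof
    names = {p.name for p in params}
    return sorted(set(kwargs) - names)

def greet(name, punctuation="!"):
    return name + punctuation

def flexible(**kwargs):
    return kwargs

print(bad_keywords(greet, {"name": "hi", "shout": True}))  # ['shout']
print(bad_keywords(flexible, {"anything": 1}))             # []
```

The deliberate asymmetry is the point: the checker under-approximates, so every report it does make corresponds to a real mismatch.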
So you have it as a service online. Do you offer it as a self-hosted service, for example, for companies that have proprietary reasons for not putting their code online? Okay. Basically the answer is: I don't do sales. My boss has told me off for trying to do sales, because I'm an engineer and I'm not very good at it. So don't ask me. But this is all free for open source. If you want to use it commercially, then you should contact us directly through sales, via semmle.com.
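Returning to the "write your own query" idea from the end of the talk: lgtm uses its own declarative query language, but the general shape of such a check can be sketched with nothing more than Python's standard `ast` module. The checker below is an invented example, not an lgtm query: it flags function definitions whose default arguments are mutable literals, a classic Python pitfall, without ever running the code.

```python
# A minimal sketch of a custom static check using only the stdlib
# ast module: flag function definitions with mutable default values.
import ast

CODE = """
def append(item, bucket=[]):   # mutable default: shared across calls
    bucket.append(item)
    return bucket

def safe(item, bucket=None):
    return (bucket or []) + [item]
"""

def mutable_default_args(source):
    """Return (function name, line) pairs for mutable default values."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for default in node.args.defaults + node.args.kw_defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append((node.name, default.lineno))
    return findings

print(mutable_default_args(CODE))  # [('append', 2)]
```

A real query language adds relations between facts (data flow, call graphs, types) rather than a single tree walk, but the declarative flavour, describing the pattern you want and letting the engine find it, is the same.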
Now that there are type annotations in Python 3, do you also make use of those when you run the analysis? Okay. So I didn't cover type annotations, and we don't currently take much information from them. Development is an ongoing process, and we focus on whatever gives us the best improvement at the time. But including type hints is definitely on the cards, and we particularly want to do it for analyzing stub files for the standard library and so on, because our standard library analysis is fairly basic: we did a one-off analysis of the C code and then tried to write some queries on that to generate type information, but that's relatively weak compared with what's now in the typing stubs. So we want to use those. Of course, the worry is that any error in the stub files is just going to manifest itself as a false positive, so that's definitely a concern. But yes, the plan is to take advantage of them, not to do type checking as such, but merely to use them as sources of information for the general
data flow. Thank you. Any other question? Do you have examples of bugs that were fixed in open source projects thanks to lgtm, or even projects that embraced the tool? Not off the top of my head; I can dig out a list for you if you want some evidence later on. But going back a long time, we found a bug in the 2.7 standard library, which is a kind of funny little one. In 2.7, if you implement __eq__ but not __ne__, then the interpreter can give you inconsistent results for equality and inequality. I think it was WeakSet in 2.7, a while ago, something like four or five years ago, when it was fixed: an object was both equal and not equal to itself. We used that as a little demo. And unfortunately, I think someone from the core developer team was at that demo, and they fixed it within an hour, so we had to find a different demo. And we found things in various other projects. Interestingly, we had something that looked like a false positive in Flask, but it turned out that there was actually something wrong in the requirements.txt for Flask. It said it needed click 2.0 or higher, and we were reporting an error that said a function was being called with an argument that didn't exist in the function being called. But when you corrected the requirement to click 4, the error went away. So sometimes you find rather indirect errors like that. So I can claim the standard library and Flask,
so, you know. Any other questions? Great talk. Thank you. Just wondering, on one of the slides there, you appeared to generate code that was also Python, but an improved version. So I'm just wondering: if you put in a project, can you spit out improved code that we can then use? When you say improved code, did you mean the transformation on the flow graph? Well, I wouldn't say that's an improvement. It's got duplicate code in it, so I would say it's definitely not an improvement. Analysis tools, much like compilers, are going to transform the code
and do all sorts of weird things with it during their analysis. Some of the intermediate representations are going to be truly atrocious, incomprehensible code, but they make things more explicit as far as the analysis is concerned internally. You should leave the code as it is. I mean, if there are errors in your code, fix them, obviously.
And if things are unclear, make them clearer. But that's not, you shouldn't really do that for the benefit of tools. Any other question?
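As an aside, the Python 2.7 equality bug mentioned a couple of answers back is easy to demonstrate. Under Python 2, a class defining `__eq__` but not `__ne__` effectively fell back to identity-style comparison for `!=`, so an object could compare both equal and not equal to another. The `BrokenBox` class below (class names invented for illustration) simulates that old behaviour in Python 3; the plain `Box` class shows that Python 3 derives a consistent `__ne__` from `__eq__` automatically.

```python
# Recreating the 2.7 pitfall: __eq__ without __ne__. BrokenBox
# simulates Python 2's effective behaviour by spelling out an
# identity-based __ne__; Box relies on Python 3's automatic
# derivation of __ne__ from __eq__.
class BrokenBox:
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, BrokenBox) and self.value == other.value
    def __ne__(self, other):       # what Python 2 effectively did:
        return self is not other   # identity, ignoring __eq__

class Box:
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        return isinstance(other, Box) and self.value == other.value
    # no __ne__ needed: Python 3 negates __eq__ automatically

a, b = BrokenBox(1), BrokenBox(1)
print(a == b, a != b)   # True True  -- equal AND not equal at once
c, d = Box(1), Box(1)
print(c == d, c != d)   # True False -- consistent
```

This is exactly the kind of property a static check can spot without running anything: a class body that defines one dunder but not its partner.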
Well, I do have a question. OK. So I will ask away. Can you give us an idea of how many projects your company is processing in general? And also, I'm curious what kind of infrastructure you folks have to run your system on all these projects, if you can share anything about that. There are some numbers. Yeah, how heavy? So 50,094 projects, apparently. I hope that number's right.
Yeah, so this is including Java, JavaScript, and Python. I think roughly half of those, or maybe even more, will be JavaScript. I'm not sure of the exact ratios, but JavaScript is the largest, then Python, then Java. I think that's just because the proportion of projects we can analyze at the moment is lower for Java, because of the build issues and so on. But yeah, this number should go up, and if we add languages, it'll definitely go up further.
Any other question? Going once, going twice, gone. Well, let's thank the speaker again.