
Tooling for Static Analysis of Python Programs


Formal Metadata

Title
Tooling for Static Analysis of Python Programs
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
In spite of the dynamic nature of their favorite language, some Python developers have a huge desire to statically analyse it. This can indeed be useful for linters, type inference, auto-completion, and all the tooling some developers expect from modern IDEs. We all know that lazy binding prevents even the simplest function call or attribute lookup from being reliably analyzed. Yet we try. And Python has this fabulous ``ast`` module, saving us from writing a parser! Isn't that a strong incentive to do static analysis? This talk presents two modules developers can build upon to write such analyzers:
- gast, a generalization of the Python AST that provides a common API for all the variants of the Python AST, from Python 2.7 to Python 3.8
- beniget, an analyzer of the Python AST (built upon gast) that provides a useful and well-known abstraction to understand programs: use-def chains
Built upon these two modules, memestra is a static analyzer of deprecated function calls, developed in partnership with QuantStack: a tool which, given a module, reports any use of deprecated APIs. Let's explore how such a module can be built and unveil the mysteries of static analysis.
Transcript: English (auto-generated)
Okay, thanks. Let's start. So, hi folks, I'm Serge, and we're going to spend 30 minutes or so speaking about Python and static analysis. But, well, to be honest, the title is a bit misleading, because it's impossible to use full Python and at the same time have very accurate static analysis. So the real title should be "Python or static analysis, pick one". Still, just because we can't do it as a whole doesn't mean we can't do relevant stuff, and that's what we are going to explore together. I'm a compiler engineer, but I have a lot of other hobbies, and I'm going to present some tooling to do partial static analysis of Python programs, based on my previous experience
on that topic. That previous experience is mostly based on the Pythran project. Some of you may know it: it's an embedded DSL, embedded into Python, targeted at scientific computing, which makes it possible to statically compile Python kernels into efficient native libraries that you can still import from Python, and get a good speedup when you're using NumPy and this kind of stuff. So it's not a generic compiler at all, it's very specific, but it was, and still is, a good project to develop. While developing it, I obviously met a lot of issues, and some of them were relevant enough to be split out into new packages solving specific problems that I thought would be useful to other people, which proved to be true.
So, the two major reusable components of Pythran are gast and beniget. These two packages take their names from a small language spoken in the west of France called Breton, and I will give the translations as we go through the tools. The first one gives you an abstraction of the Python abstract syntax tree across Python versions, and the second one provides some basic analyses of Python programs that other compilers or analyzers can use. I'm going to go through the two of them and then describe another project, memestra, which uses these two building blocks to write a simple but efficient static analyzer. Let's move on. So, gast: it's available on the cheese shop (PyPI), on GitHub and such.
It could mean generic AST, which is totally relevant, but in Breton it also means a good-time girl. This tool started four years ago and was first presented at PyConFR, and its original goal was to ease the transition from Python 2.7, which was the original target of the Pythran compiler, to Python 3. When you're
migrating a compiler across different Python versions, it's not as easy as it is for a regular application, because you're changing the input of your program: you're going to parse Python 2 programs, or you're going to parse Python 3 programs. That's very different from adding parentheses around print statements, right? So, instead of reworking the compiler for each new Python version, I wrote this small abstraction layer to ease the manipulation of the AST. Basically, it gives you a generic AST that you can manipulate whatever the underlying Python version you're using. It turns out to be very successful: there are more than 100,000 downloads per day according to pypistats, but you shouldn't trust numbers. It's actually a dependency of TensorFlow, not my fault, so really only one package is using it a lot in addition to mine, but that's an important package. So, it looks like a very useful package, but it's not as useful as it seems. To give you a hint about what it tries to do:
if you run this simple Python line, you just import the ast package and dump the result of parsing a simple subscript expression, which is valid Python from Python 2.7 up to 3.9, and we're going to see how the Python compiler sees that expression. It changes a lot depending on the actual Python version. In Python 2.7, at the top level we've got a Module, because even a single expression has to live in a module. This module has a single statement, which is an expression statement, and this expression is the actual Subscript. The important part is that the slice field of that Subscript is an ExtSlice, because it has several elements: the first dimension is an Index and the second one is an Ellipsis. That's how it's represented in Python 2.7, and you can build your compiler based on that. But if you happen, and I hope you did, to move to Python 3, say Python 3.6, then you still have the Module, the expression and the Subscript, but the slice is no longer an ExtSlice.
The node is now an Index wrapping a Tuple. Okay, so if you want your compiler to work on both Python 2.7 and Python 3.6, you need to handle two different representations of a subscript. Even worse, if you go up to Python 3.9 because you live on the bleeding edge, then you no longer have the Index: the slice is a plain Tuple. It means that whenever there is a new Python version, be it a major or a minor version change, you need to update your whole compiler, because you're no longer matching the same abstraction, the same abstract syntax tree.
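A sketch of that experiment; the dumps below are abbreviated and reconstructed from memory, so the exact field spellings may vary slightly between interpreter versions:

```python
import ast

print(ast.dump(ast.parse("a[1, ...]")))

# Python 2.7 prints (roughly):
#   Module(body=[Expr(value=Subscript(value=Name(id='a', ctx=Load()),
#       slice=ExtSlice(dims=[Index(value=Num(n=1)), Ellipsis()]),
#       ctx=Load()))])
#
# Python 3.6: ExtSlice is gone, an Index now wraps a Tuple (roughly):
#   ... slice=Index(value=Tuple(elts=[Num(n=1), Ellipsis()], ctx=Load())) ...
#
# Python 3.9: Index is gone too, the slice is a plain Tuple (roughly):
#   ... slice=Tuple(elts=[Constant(value=1), Constant(value=Ellipsis)],
#                   ctx=Load()) ...
```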
So what I did was put all that glue code in the gast module: whatever Python version you're using, if you dump the AST using gast instead of ast (they expose exactly the same API, to ease the transition), you get a representation that is very close to the 3.9 one, because we try to match that one, while staying compatible with all the previous versions. If you run that exact code in Python 2, you get the exact same abstract syntax tree, which means you can reason on the gast representation instead of the official ast representation, and your compiler is fine with that. That's a super nice property, and it isn't obsoleted by the doom of Python 2: it's just as relevant for all the minor Python 3 versions.
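A minimal sketch of the gast side, assuming gast is installed (pip install gast); gast mirrors the ast module's API:

```python
import gast  # the abstraction layer presented in the talk

# gast.parse / gast.dump mirror ast.parse / ast.dump, but the resulting
# tree has the same shape (close to the 3.9 layout) on every supported
# Python version, so a compiler can reason on one representation only.
tree = gast.parse("a[1, ...]")
print(gast.dump(tree))
```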
So that's gast: a very simple building block. When you do that, you accept some trade-offs. The first one is that, because of legacy compatibility with Python 2 (which we're still keeping), our AST representation is slightly more verbose, especially for Name nodes; that shouldn't be big trouble, but it's good to know. The second is that we rely on the official ast parser: we're not reimplementing it from scratch, so there is a translation from the Python ast module to gast. It's an extra processing step, which shouldn't matter even for large modules, but it does matter if you're trying to parse all the Python files on your system, which is something I tried to do; in that case you get something like a factor-two slowdown, because you're creating a bunch of new objects. So you have to pay for it, but then again you're doing Python, so you don't care that much about paying for doing things in an easier way.
So that's it for gast: very useful, at least to me, and it proved useful to other people too. Then, beniget. Same thing: available on GitHub and on the cheese shop, used by a few other packages in addition to Pythran. This time it's not a bad word; it means blessed. Just as you would say "holy shit", you say "beniget!" when you're surprised or when you want to protect yourself, something like that. To understand this analysis package you need some compiler background, but a very basic one, and I'm going to introduce it in the next slides. It statically computes the use-def chains of Python programs, which is a foundation for a lot of more advanced analyses, especially the Pythran ones. But to understand how cool it is, you need to understand use-def chains. And when you need to understand something, you go to Wikipedia. Wikipedia tells you that a use-def chain is basically a link, or a tree, between a definition of a variable and its usages at other points in the program. So, for instance, if I define a variable a and then use it, there is a chain between that definition and that usage:
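In code, the simplest possible chain looks like this:

```python
a = 1         # definition: binds the name a
print(a + 1)  # use: the chain links this use back to the definition above
```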
Because of the dynamicity of Python, there are a lot of ways to create chains between identifiers, and some of them are a bit tricky; we'll see some. But first, what is it useful for?
If you've got a definition that has no usage, it means that the definition is not useful at all. So you can use beniget to write some kind of linter that would detect that you're defining a variable somewhere, or assigning, or creating a binding to be accurate, and that binding is never used elsewhere. To detect unused imports or useless assignments, it can be a useful tool. You do have to take a few things into account. For instance, underscore is conventionally used to name an assignment that is useless, for instance when doing tuple destructuring. And when you import a module, even if that module is never used, there are side effects to consider: the module may not exist, and you may have imported it just to try the import and handle the ImportError; or the import itself may have a side effect, because code actually runs at import time, so it may not be safe to remove that import. But that's up to the user, and it's technically not something a static analyzer can decide for you. Also, just a reminder: you're not defining a variable when you use an assignment in Python, you're creating a binding between a value and a string identifier which is the name of the variable. It's somewhat pedantic,
but once you've got that model in mind, it helps a lot to understand all the limitations of use-def chains for Python. For instance, here is a small loop: I'm iterating over l, and in some cases I'm printing g, and in other cases I'm assigning to it.
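A reconstruction of the example on the slide (the names l and g come from the talk):

```python
l = [0, 1]        # the runtime content of l decides whether this is faulty
for x in l:
    if x:
        print(g)  # use of g
    else:
        g = x     # definition of g: the only def that can reach the use
```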
So, is the print statement faulty? The answer is the same as for all the tricky cases, actually: it depends on the runtime content of l. Depending on whether l is empty or not, and whether l has values that evaluate to false or not, the print statement is going to be faulty. But according to the use-def chains, there is a possible valid def-use: the else clause defines g and the true branch uses it. So that's fine: it looks like perfectly valid code, and it may not be. That's a dynamic behavior that beniget is not able to capture. Well, it could emit a warning, actually, but
we're not doing it like that: it's static analysis. Another case: you're defining a global using the cursed keyword global, doing the assignment inside one function, and using x in another function. Is that faulty or not?
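A reconstruction of that example:

```python
def foo():
    global x
    x = 1      # definition of the global x

def bar():
    print(x)   # use of x: only valid at runtime if foo() was called first
```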
Well, according to the static analysis of the def-use chains, there is a def in foo and a use in bar, and that's fine, or at least we can represent it. But it's not capturing all the possible usages: it depends on whether bar is called after or before foo, and we don't know that, because we don't know how the module is used. We're only doing a per-module analysis, not a whole-package or whole-program analysis, which anyway would depend on the Python path, which we don't control. So it's not possible to decide whether that's the only possible chain or not. Still dynamic behavior. Another case, which looks super simple: we've got an assignment to x, then we iterate over a sequence, possibly reassigning x to each value of that sequence, and then we print x. So the print is a use, but which def is it referencing? It may reference the first assignment or the loop assignment, and we don't know, so we represent both: both values are potential definitions for the print usage.
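A sketch of that case:

```python
seq = [1, 2, 3]  # hypothetical sequence
x = 0            # first definition of x
for v in seq:
    x = v        # second definition, executed only if seq is non-empty
print(x)         # use: statically, both definitions may reach this print
```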
So, with that in mind, what does beniget do? It tries to compute, for any usage in your module, all the possible definitions, even though a possible definition may not be the actual one. That's super useful, but it's only a superset of the actual definition-use links that may happen at runtime. That's the difference between dynamically created bindings and statically analyzed assignments. But although I may sound super pessimistic, it turns out that if you're writing a compiler for a subset of Python with static assumptions, which a lot of people are doing, then it is useful.
It is useful, for instance, to write a simple linter. You may know Pylint or Pyflakes, these kinds of linters. With beniget, you can write a much simpler linter in only a few lines: you basically iterate over all the def-use chains, and whenever you've got a def that has no use, you check whether the def name is the conventional underscore (even if someone in the chat told me it may not be as conventional as I would expect), and if it's not, plus a few other checks that I removed here for the sake of fitting into one slide, you issue a warning. That's a good first step towards a linter, and the good thing is that it does not depend on the Python version, because beniget itself is built on gast, and we already saw that gast is agnostic to the Python version. So with that we've got something really generic, with all the concerns split across several packages, which is a nice property from an engineering point of view. Such a linter looks like the sketch below.
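A minimal unused-binding linter, adapted from beniget's documentation; the extra checks mentioned in the talk are omitted, as on the slide:

```python
import gast as ast
import beniget

code = "from math import cos, sin\nprint(cos(3))"
module = ast.parse(code)

duc = beniget.DefUseChains()
duc.visit(module)

# Any module-level definition without users is a candidate warning.
for var in duc.locals[module]:
    if not var.users() and var.name() != "_":  # skip the conventional underscore
        print("W: '{}' is defined but never used".format(var.name()))
# -> W: 'sin' is defined but never used
```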
First, though, you obviously understood that beniget can't do anything if you've got a call to the globals builtin, or to eval, or to locals; anything like __import__, all this kind of stuff, we can't do anything about. Well, what we could do, and that's what we actually do, is: whenever you reference globals, you reference the world, so it creates links from all the possible globals available in the program to that reference; but that's still only a subset of the possible globals. So that's not enough, but we're doing our best. Just be aware: Python or static analysis, not Python and static analysis. That's the key. Cool.
And so, as an illustration of how to write a useful tool based on this generic toolset without too much effort, let me introduce memestra. memestra means two things. People from Brittany are known to drink a lot, so it could mean "the same, please" when you're ordering a beer or a cider or whatever; but it's also what you say when you're upset by children or by other people: "oh please, stop", you would say "memestra". So: stop, stop writing
deprecated code. That's what memestra is trying to do: it checks your code for places where a deprecated function is used. And what is a deprecated function? A function decorated with a deprecated decorator. That decorator is not standard, but you can have your own in your package, or you can use the one from PyPI, which is nice. Based on the tooling I introduced, you can already have a good hint that it's easy to write that kind of useful linter on top of beniget. Basically, it's three steps. The user gives you the signature of the decorator, say decorator.deprecated or something like that, and you track the usage of that decorator in your code.
Tracking the usages of a definition: that's beniget. Once you have all the usages, you have all the functions decorated by this deprecation decorator, and then you can track all the usages of these deprecated functions. If you have none, that's cool. If you have one: is that usage inside a function that is itself deprecated? Then that's okay, because you're still inside the deprecated world. But if it's not, then you're using a deprecated function from a non-deprecated function, and that's something the user needs to be warned about, so you print it, and that's it. That's how you write a deprecated-function linter using beniget. As with all commercials and simplistic talks, it's slightly more difficult than this, but the idea is the same. For instance, in the test.py code below, you import the decorator package and deprecate foo; foo is used both at top level and inside a function bar, and bar is not marked as deprecated itself, so if you run memestra on this module you get two warnings, which is what we would expect.
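A reconstruction of that test.py; the decorator module here is the talk's placeholder name, not a specific PyPI package, and the command line is an assumption:

```python
# test.py (reconstructed from the talk's description)
import decorator  # placeholder module assumed to provide a `deprecated` decorator

@decorator.deprecated
def foo():
    pass

def bar():   # bar is not marked as deprecated itself...
    foo()    # ...so memestra flags this call

foo()        # top-level use: flagged as well
```

Running something like `memestra test.py` should then report two uses of the deprecated function foo.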
But if you think a bit, you may think about NumPy. NumPy's API is quite large, and it marks a lot of functions as deprecated, because the API evolves over time. In your client code, you want to parse the NumPy code to check whether anything you use is deprecated or not. It means that instead of a per-module analysis you want a cross-module analysis. That's a bit more difficult than what I've described, because you have to assume that the Python path is known; then you can start to resolve the imports statically and apply the very same method as the one described on the previous slide. To do that efficiently you need a cache which correctly represents how modules are imported in Python. With that, you can process the whole scientific Python stack, including NumPy, in a matter of minutes with memestra, tracking all the deprecated usages. So it's still possible, and can be done efficiently. When you want to advertise deprecated usage, I should mention the deprecated package, which is not official but just does the job; it's used in a lot of places, and you can use it just as you would expect on your deprecated functions, even giving a version or a reason why you're deprecating them. So feel free to use it, and memestra is obviously compatible with it.
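For instance, a sketch using the Deprecated package from PyPI (pip install Deprecated):

```python
from deprecated import deprecated

@deprecated(version="1.2.0", reason="use shiny_new_api() instead")
def old_api():
    pass

old_api()  # emits a DeprecationWarning mentioning the version and reason
```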
Actually, memestra is independent of the decorator: you can configure that in a decent way. As usual there are limitations, and in this case typing is in our way. Here the foo method is marked as deprecated, but only within the class Foo. If we have another class with a foo method that is not deprecated, then in bar we can't resolve the call to foo and say "okay, that's a deprecated call", and then we are doomed:
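A sketch of that limitation, with hypothetical class names, reusing the PyPI package from above:

```python
from deprecated import deprecated

class A:
    @deprecated(reason="use A.new_foo() instead")
    def foo(self):      # deprecated method
        pass

class B:
    def foo(self):      # same method name, not deprecated
        pass

def bar(obj):
    obj.foo()  # which foo is this? without type information,
               # a static analyzer cannot resolve the call
```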
memestra does support flagging a class as deprecated, and in that case, whenever you instantiate that class you'll get the warning; but for methods it's not the case. We could imagine a coupling with mypy, or using type annotations to improve the analysis, but that's far beyond this talk; still, it may be a relevant goal for memestra. So, we are getting close to the end. Again: Python is not made for static analysis. Have a look at mypy and how they are trying to do static typing on Python. There are limitations there, and for any static tooling there are limitations, and you need to be aware of them. Once you're aware of them, that's okay: you can start building tooling and sharing it, as I do and as other people are doing, obviously. And Python is still a perfect host for embedded DSLs, which a lot of people in the Python numeric computing community are using. To finish the talk, I'd like to thank Sylvain Corlay for the original idea of memestra, Mariana for reviewing a lot of my pull requests on memestra, and Lancelot Six for proofreading this talk. And I'm handing it over to you, Nicolas, or to anyone who has questions. Okay, so we have two questions, so I'm going to take the first one. David is asking:
can you use the Python ast module to parse code from other languages, such as C? If not, maybe you know some specialized libraries for creating ASTs for other languages? So, no, you can't. The Python ast module is specific to Python and, as I said, it's even specific to the actual Python version you're running. You can't parse any other language, unless it is syntactically strictly embedded into Python: if you've got a subset of Python, you can use the Python ast module to parse it, and that would work, because that's only parsing, not trying to interpret anything. It happens that I'm also an LLVM developer, so I can answer your second question: if you want to parse C or C++, Clang has Python bindings to do that, so you can use the Python bindings of the Clang compiler and get a representation of C or C++ that you can manipulate from Python. Cool, so, last question. Artem is asking:
is it applicable only to functions? What about class-level or global variables? I didn't get the first sentence. Is it applicable only to functions? If the question is about memestra, then yes, it's applicable to both functions and classes, but it's not applicable to functions inside classes, so not to methods. Okay, cool, so we're on time. Thank you very much, thank you for presenting, have a good day. Thanks, Nicolas, and thank you all.