Exploring the Python AST Ecosystem

Formal Metadata

Title
Exploring the Python AST Ecosystem
Number of Parts
132
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Content Metadata

Abstract
This session will introduce attendees to Python's rich ecosystem of abstract syntax tree tooling and libraries, with an emphasis on practical applications in static analysis and metaprogramming. Attendees should be fully comfortable with Python syntax and semantics, but familiarity with the ast module itself will not be necessary. The talk will begin with a conceptual overview of ASTs, including a brief look at Python's built-in introspection capabilities. It will introduce tools for AST visualization (astor, showast, python-ast-explorer), creation (asttools, meta), and transformation to source code (codegen). How the AST can be used for static analysis will be covered; this will include discussion of Python's built-in facilities (NodeVisitor) as well as of the 3rd party tools astsearch, astpath, and bellybutton. The talk will demonstrate the advantages and limits of these tools in comparison to other static analysis tooling (pylint, mypy); particular attention will be paid to how these tools can be incorporated into attendees' workflows and existing codebases and projects. Tooling for Python AST manipulation and metaprogramming will be the final topic covered, focusing on the use of the NodeTransformer built-in. The talk will cover practical applications and examples of metaprogramming, such as metaprogramming for DSLS (pony, xpyth), runtime code manipulation (patterns, yield-from), and others (e.g. assertion rewriting in pytest). While the talk will touch only briefly on each of the applications discussed, by the end of the session attendees should have a firm grasp of the kinds of problems the AST can be used to solve, what existing AST tooling can accomplish, and what resources are available for the development of their own AST tools.
Transcript: English (auto-generated)
Well, thank you very much for the wonderful introduction. As I said, I'm Chase Stevens. I do want to take a moment before I really start my talk in earnest to just acknowledge what an honor and a privilege
it is to be here, getting to listen to a speaker such as myself, a real master of the craft. So you should all feel very lucky. I do want to just make a small shout-out to my company, Teikametrics, a Boston-based company. We build a machine learning platform
that helps online sellers optimize for profitability across their entire business. We use really interesting tools, so TensorFlow, Jupyter Notebooks, Hypothesis for testing. It's a really cool place to work. We have currently, I think, $3 billion of retail revenue that we manage per year, and that's
across thousands of sellers that include some of the hottest, hottest brands. So absolutely, if you are looking for a new place to work and looking for a place where you can have a real challenge, but also get to work with cool people, then drop me a line. All right. I also want to say that this entire talk is up on GitHub.
So if you'd like, after the talk, you can go and check all this out. There's a Docker image you can build that will allow you to run this. So hopefully you'll find that very useful. Right, so the premise of the talk here, basically, is that first of all, I'm not going to be showing you, I promise, anything more complex or difficult
than parsing JSON or manipulating the DOM tree or generating XML, right? It's really basic stuff. People get very, I don't know, flustered about ASTs, I find. But it's simple. I'm going to be showing you a lot of tools
that are out there already on PyPI, pip-installable, that will allow you to leverage the AST to do static manipulation and runtime manipulation of code, which is obviously very cool. And I'm also hopefully going to be giving you, not enough information that you can just stand up
and start hacking on AST stuff, but enough to get started, right? Places to look, resources to go to. And really all I'm asking in return is to be revered as a modern-day Prometheus. Just someone who's coming down from Olympus and bringing the knowledge to all you ignoramuses about what the AST can do.
So it's a very simple quid pro quo. I hope we can all agree to this. So let's start off with the AST. AST stands for abstract syntax tree, but that can be a little bit of a misnomer, I find. Really what you want to think about conceptually here is this is a way of representing Python code in Python,
right, a way of programmatically viewing, inspecting and manipulating your source. So for instance, really simple, right? x = 1 + 2. We can take this and use the built-in ast module in Python,
parse this source code as a string, and what we get back is basically a tree that encapsulates all of what's going on there. And you can see here that we've been returned a Module object, not obvious at first what that is. The reason why it's a Module object is because the central conceit behind the ast module
is that you're meant to be parsing source code files, entire files rather, that you're going to then end up using. But we can do other things to inspect this, right? So there's a built-in dump function that will allow us to see what that tree looks like
and you can see here beyond the module, there's a body that has the assignment that we've just made. We're assigning to this variable x. The value of the assignment is a binary operation where the left-hand side is the number one, the right-hand side is the number two, and the operation we're performing is addition. So hopefully pretty intelligible.
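The parse-and-dump step described here can be sketched with nothing but the standard library. This is a minimal illustration, not the speaker's notebook cell. One assumption to flag: on Python 3.8 and later, number literals appear as ast.Constant nodes with a .value attribute, while older versions used ast.Num.

```python
import ast

# Parse a one-line source string. By default ast.parse returns an ast.Module,
# because it treats its input as the contents of a whole file.
tree = ast.parse("x = 1 + 2")
print(type(tree).__name__)  # Module

# ast.dump renders the tree as text so it can be inspected.
print(ast.dump(tree))

# The same information is available as plain attributes on the nodes.
assign = tree.body[0]   # the only statement in the module body
binop = assign.value    # the right-hand side of the assignment
print(assign.targets[0].id, type(binop.op).__name__)
print(binop.left.value, binop.right.value)
```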
Little difficult to read maybe, so there's this awesome tool called astor. Primarily its use is in round-tripping ASTs, right? So allowing you to go from source code to the AST representation, then back to source code, but here I'm using it just to get a slightly terser printout of what the AST is,
something that's a little more readable. And beyond that, there's also a tool called showast for Jupyter Notebooks. It'll show you a visualization, right? Something graphical that is the same thing as the previous two cells, but just a little easier to inspect sometimes
if you want to see what's going on. Now, why did I say that AST as abstract syntax tree is a little bit of a misnomer possibly? I think when people hear the word abstract in this, they tend to think, ooh, there's like monads probably,
or like oblique references to platonic ideals or something like that. I don't want to do that. I don't want to have to worry about that. It's not like abstract, like super conceptual. It's abstract syntax. So what I mean by that is, again, very simple expression, one plus two plus three. You can see in the AST, it's being represented as one plus two plus three,
but you'll notice this second cell here, we have some brackets around the one plus two, and that isn't represented in the AST. So what's being abstracted away is the syntax, right? We're not able to go from the AST that we get back to the precise original source code
because those brackets have been basically stripped from this. And likewise, there's no representation of a colon, for instance, in the AST, or of whitespace per se, right? You can't tell how many tabs or spaces I used to indent something. It's a more genericized, I guess, version
of the syntax of the source code. You'll also notice that as I go along here, that what the AST doesn't contain is any sort of runtime information, which includes types. It does include type annotations, which are part of the syntax, right, but not types themselves.
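A small sketch of what "abstract" means here, using only the standard library. The example strings are illustrative choices, not taken from the talk's notebook: redundant brackets leave no trace in the tree, while a type annotation is kept as syntax without being checked.

```python
import ast

plain = ast.parse("1 + 2 + 3")
bracketed = ast.parse("(1 + 2) + 3")

# The brackets are redundant, so both sources produce the same tree.
same_tree = ast.dump(plain) == ast.dump(bracketed)
print(same_tree)  # True

# Brackets that change the grouping do change the tree.
regrouped = ast.parse("1 + (2 + 3)")
different_tree = ast.dump(plain) != ast.dump(regrouped)
print(different_tree)  # True

# Annotations are part of the syntax, so they are stored in the tree,
# but nothing is evaluated or type-checked while parsing.
annotated = ast.parse("x: int = 'not an int'").body[0]
print(type(annotated).__name__, annotated.annotation.id)  # AnnAssign int
```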
Why do you care about the AST just in general, right? Why would you want to look at this? Well, here's a good example of why knowing the AST or looking at the ASTs for different pieces of source code is super useful, right? This is a list comprehension here, right? item for group in groups for item in group if check(group).
This is how I thought list comprehensions had to be written obligatorily, right? You had, here's what I'm gonna return, here are all my fors, here are all my if conditions. But if you look at the AST, you'll notice it's a list comprehension, right? What we're returning is this item expression,
and we have a list of generators. These are comprehensions, and each of these comprehensions has a list of ifs. So what this means is actually I can do this instead, right, I can have the if directly follow a for, and this is something I didn't know before I looked at what is the representation in the AST
of this list comprehension. And what this does is means that I'm only running that check call for every group and not for every item in the group, right? So you get a little bit more efficiency. You can also have multiple if statements, so you don't have to join them together with and, which can get a little messy syntactically.
So there are things you can learn about how Python works just by looking at what the representation of the syntax is. But this is the real pièce de résistance, right? This is where I'm going to take some source code, have it parsed as this parse variable,
and then use the built-in function compile, and what I get back is this code object. What is the code object? Basically, we've taken that source and compiled it, and now we have some bytecode, right? So the ast module is not just something that someone put into Python on a lark. This is deeply embedded into the system, right?
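A minimal sketch of the parse-then-compile step, assuming a two-line source string like the one on the slide. It also runs the resulting code object, using a plain dict as the environment. The names env and env2 and the "<ast>" filename are illustrative choices.

```python
import ast

source = "x = 1\ny = 2\n"
parsed = ast.parse(source)

# compile() accepts an AST as well as a source string.
code = compile(parsed, filename="<ast>", mode="exec")
print(type(code).__name__)  # code

# Execute the code object with an ordinary dict as its global namespace,
# then read the results back out of that dict.
env = {}
exec(code, env)
print(env["x"], env["y"])  # 1 2

# The namespace can also be pre-populated for the code to read from.
env2 = {"base": 10}
exec(compile(ast.parse("z = base + 1"), "<ast>", "exec"), env2)
print(env2["z"])  # 11
```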
This is very, very powerful, because what we can do with that code object is we can then create an environment, execute the code within that environment, and then pull things out of the environment, right? So like I said, here we have this code that's basically assigning one to x and two to y,
and we can pull x and y out from the environment I passed in. And of course, if I wanted to, I could have begun by putting things into that environment for that code to access. So again, super, super integrated. One question you might have at this point is, well, if I'm running a program
and I want to have access to the source of some of the functions that I'm running, how would I do that? One method is to use the inspect module. So inspect has a getsource function defined that will give you back the source for a lot of Python objects, not all of them, right? So functions and classes, definitely. If you want to access lambda expressions
or if you want to access generators, it's a little bit more difficult. You have to use what's called a decompiler, which basically will take the bytecode, inspect it and sort of reverse engineer an AST out of that. But for most applications, inspect.getsource is what you want, right?
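A hedged sketch of that round trip: fetch a function's source with inspect.getsource, parse it, and compile it back into a new function. The function double is a made-up example. Because getsource needs the defining source file to be available, the sketch falls back to a literal copy of the source when it is not.

```python
import ast
import inspect
import textwrap

def double(n):
    return n * 2

try:
    # dedent guards against the function being defined in an indented block.
    source = textwrap.dedent(inspect.getsource(double))
except (OSError, TypeError):
    # Assumption: no source file is available (for example, code run from
    # a string), so use a literal copy of the definition instead.
    source = "def double(n):\n    return n * 2\n"

tree = ast.parse(source)
func_def = tree.body[0]
print(type(func_def).__name__, func_def.name)  # FunctionDef double

# At this point the tree could be inspected or modified. Here it is
# compiled back unchanged and executed in a fresh namespace.
namespace = {}
exec(compile(tree, "<ast>", "exec"), namespace)
rebuilt = namespace["double"]
print(rebuilt(21))  # 42
```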
And again, now that I have the source for the function I've just declared above, I can take that, parse out the AST, do whatever I want to it and recompile it back into the function if I wanted to. Right, so there are obvious applications for this in static analysis. Just to be clear, what is static analysis?
Well, consider the counterexample of a profiler. So a profiler, you have to be executing the code intrinsically to be profiling it, right? You want to see what runs quickly, what runs slowly, what's calling what. And that is something you can only do if you actually execute the code.
Static analysis is when you're doing an analysis without executing any code, right? So very trivial example would be like counting the lines of code that you have. Another example that sort of highlights the distinction here is if you had a program that was checking the syntax of all the code
in your code base, right? In that case, you are parsing the code, right? You're going to be making abstract syntax trees, but you're not executing it, right? You have the difference between the representation of the code and the actual running code itself. So how might we use the AST module to accomplish this?
Well, here's a really simple, simple program, I guess. But basically I'm assigning to these three variables, a, b and c, right? And if we read this file and we load it in as an AST, you can see here, I have my three assignment nodes. Shouldn't be anything surprising. But built into Python, we have this ast.NodeVisitor class
that I can subclass. And basically what this allows you to do is to override methods whose names all take the form visit_ followed by the node name. And when I instantiate this new class and then call .visit on some AST, it's basically going to traverse that tree.
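A minimal sketch of such a subclass, assuming a small source string in place of the file from the talk. The last two lines of the sample, and the class name AssignedNames, are additions for illustration. It collects names instead of printing them so the result is easy to check.

```python
import ast

# Stand-in for the small file from the talk; the last two lines are
# extra cases added here (multiple targets, a non-name target).
source = "a = 1\nb = 2\nc = a + b\nd = e = 3\nitems[0] = 4\n"

class AssignedNames(ast.NodeVisitor):
    """Collect every plain name that is the target of an assignment."""

    def __init__(self):
        self.names = []

    def visit_Assign(self, node):
        # An assignment can have several targets, as in `d = e = 3`.
        for target in node.targets:
            # Skip targets such as `items[0]` or `obj.attr`.
            if isinstance(target, ast.Name):
                self.names.append(target.id)
        # Keep walking in case there are nested nodes of interest.
        self.generic_visit(node)

visitor = AssignedNames()
visitor.visit(ast.parse(source))
print(visitor.names)  # ['a', 'b', 'c', 'd', 'e']
```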
And then every time it reaches an Assign node, in this case, it's going to call the code that I've given. So in this case, what I'm doing is I'm looking at every target of the assignment, because assignments can have multiple targets, right? You can have x = y = 1 or something like that. Checking to see whether it's a Name node, as opposed to say, assigning to an element of a list
or a key in a dictionary, something like that. And if so, then just printing that out. And you can see here, when I run this, it prints out a, b, c. Already, there are like serious, legitimate applications for even something this simple, right? Like if anyone's familiar with Alembic, it's a migration system for databases,
but basically it requires you to declare at module level, I believe, this is what this revision's hash or number is, and this is the previous revision that it's based off of, right? So you could take something like this, go through all of your migrations, and then create like a graph to ensure that there are no cycles in them
or something like that, right? There are already, even with this like very minimal built-in functionality, a lot of things you can do that are interesting. But further to that, you can do even more cool stuff, right? So this is a tool called astsearch. And if at this point in the talk, you're thinking, boy, I wish I hadn't come here, then this is the tool for you
because basically, astsearch is using the robustness and the power of the AST, but abstracting away for you all the sort of internal complexity of having to learn about ASTs, right? So you can see here, I'm using this as a command line tool. I'm basically running astsearch
and saying ?=1. So that's to say some wildcard equals one, and searching for that sort of pattern in this protobuf repo, right? And what I get back, you can see, I have some things where we're assigning to an attribute of an object. I have some things where I'm just assigning to a name, right?
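This is not astsearch itself. It is a rough standard-library approximation of the same "anything = 1" search, to show why the two kinds of match differ internally. The sample source is made up, and ast.Constant assumes Python 3.8 or later.

```python
import ast

# Made-up sample standing in for files in a repository.
source = "count = 1\nconfig.retries = 1\nother = 2\nflag = True\n"

matches = []
for node in ast.walk(ast.parse(source)):
    if not isinstance(node, ast.Assign):
        continue
    value = node.value
    # Match only the integer literal 1 (True == 1 in Python, so check the type).
    if isinstance(value, ast.Constant) and type(value.value) is int and value.value == 1:
        for target in node.targets:
            # Record the line and which kind of node is being assigned to.
            matches.append((node.lineno, type(target).__name__))

matches.sort()
print(matches)  # [(1, 'Name'), (2, 'Attribute')]
```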
In the AST and the internal representation, those two things are very different, but astsearch allows you to not have to worry about that and just do this sort of very simple pattern matching, which is great. However, obviously there are a lot more things you can exploit with the AST. So another tool out there is called astpath and this is a little more complex.
You have to know what you're doing in terms of what the AST looks like that you're hoping to match on. But basically what this does is allow you to supply an XPath expression. And by doing that, you can capture really interesting properties of your code and search for them through your code base, right? So this is something that you can't do
with like a regular expression, for instance. I'm looking for numbers where the number's value is greater than a hundred, right? And because this is the AST, this is going to handle integers, it's going to handle floats, it doesn't care whether there are underscores in the numbers, it doesn't care whether you're using scientific notation or not,
it's all handled for you. And I want to find those only if they're not assigned to something, right? This is a sort of deeper structural property of the code that is difficult to capture if you don't have access to the AST, but with access to the AST, it's almost trivial to capture, right?
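The query described here can be approximated without astpath, using only the ast module. This sketch is an assumed equivalent rather than the XPath expression from the slide: it finds numeric literals greater than 100 whose direct parent is not an assignment. The sample source is invented, and ast.Constant assumes Python 3.8 or later.

```python
import ast

source = (
    "LIMIT = 500\n"             # assigned, so it should be skipped
    "wait(1_000)\n"             # underscores in the literal
    "scale(2.5e3)\n"            # scientific notation
    "small(7)\n"                # too small
    "retry(3, timeout=101)\n"   # keyword argument, not an assignment
)
tree = ast.parse(source)

# The ast module does not store parent links, so build a child -> parent map.
parents = {}
for parent in ast.walk(tree):
    for child in ast.iter_child_nodes(parent):
        parents[child] = parent

hits = []
for node in ast.walk(tree):
    if not isinstance(node, ast.Constant):
        continue
    value = node.value
    # Keep ints and floats only; bool is a subclass of int, so exclude it.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        continue
    if value > 100 and not isinstance(parents[node], ast.Assign):
        hits.append((node.lineno, value))

hits.sort()
print(hits)  # [(2, 1000), (3, 2500.0), (5, 101)]
```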
And you can even do things like not just capturing one line, but here I'm searching for all function definitions that have at least one decorator and have a for loop in the body, right? So I can capture really, really deep structural properties of my code that are very, very difficult,
if not impossible to capture in any other way. So PEP 572, rest in peace, Guido. In case people aren't aware of this, basically, what it allows you to do is, as opposed to saying match = pattern.search(data) and then using match in this if statement later on,
you can use this assignment expression syntax to basically assign to match within your if statement. There are a bunch of other use cases, but this is sort of the primary one that's touted. Obviously, I don't have a very strong opinion on this, mostly because I have a strong interest later on in not being thrown to the back of a car,
but if you want to see, well, where can I apply this in my code base, you can use astpath. You can say, all right, I wanna find all assignments where the very next statement is an if statement and the name being assigned is used in the test for that if statement.
So a few of these lines, because it's not respecting whitespace or anything like that, a few of them, the second line is blank, but you can see here, for instance, here we have this callable being assigned to, and then if callable is None, that's a candidate for replacement with the new assignment expressions.
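The same search can be sketched with the standard library alone. This is an assumed re-implementation of the idea, not the astpath expression from the talk: find an assignment to a single name that is immediately followed by an if statement whose test uses that name. The sample source and the helper name walrus_candidates are hypothetical.

```python
import ast

source = '''
match = pattern.search(data)
if match:
    handle(match)
total = compute()
print(total)
callback = registry.get(key)
if callback is None:
    raise KeyError(key)
'''

def walrus_candidates(tree):
    """Return (name, line) pairs that could use an assignment expression."""
    found = []
    for parent in ast.walk(tree):
        # Only statement lists stored in `body` are checked in this sketch;
        # `orelse` and `finalbody` blocks are ignored for brevity.
        body = getattr(parent, "body", None)
        if not isinstance(body, list):
            continue
        for first, second in zip(body, body[1:]):
            if not (isinstance(first, ast.Assign) and isinstance(second, ast.If)):
                continue
            if len(first.targets) != 1 or not isinstance(first.targets[0], ast.Name):
                continue
            name = first.targets[0].id
            # Is the assigned name referenced anywhere in the if test?
            if any(isinstance(n, ast.Name) and n.id == name
                   for n in ast.walk(second.test)):
                found.append((name, first.lineno))
    return found

candidates = walrus_candidates(ast.parse(source))
print(candidates)  # [('match', 2), ('callback', 7)]
```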
Now, I know some of you are thinking, ooh, I'm gonna go back, I'm gonna use that astpath tool, and I'm gonna start creating this sort of programmatic way of capturing things that I don't like in my code base. You know, the sort of things where you have PR
after PR after PR of just commenting the same thing over and over and over again, and you just want to have some sort of tool that you can run that catches all of those things that you don't like. Don't worry, fam, I got you covered. Let's say that this is my code base, pretty small.
The function that I'm interested in here is this call to deprecated function. I have just written in my PR a super sexy new function that's gonna replace this, but it turns out there's like 3,000 uses of deprecated function. I don't really want to do all that in my PR, and some of them are a little difficult to undo,
and let's just leave that for someone else, right? So I can use this tool called bellybutton, right? And this is basically a wrapper around astpath, but I can define different rules. So in this case, a deprecated function call rule. I give it a description that's going to be basically the error message
if this is caught in my code base. I give it the expression, which is basically this XPath expression that I want to run against the ASTs of all the modules in my code base, and I can give an example of what this looks like, what not to do, and something to do instead, both of which are validated against that expression. So when you run this tool,
you can be sure that the example and counterexample you've given actually adhere to what you're looking for, and then also some stuff about like where do I want to run this, which is not super important, but basically I can then take this and run this on my code base, and as you can see here, it gives me back,
oh, this linting has failed because I'm using this deprecated function in this particular code.py. And obviously, this is sort of just the tip of the iceberg, right? This is a pretty superficial thing to look for, but you could also do things like, let's say I have an enum with a bunch of different values, and I want to make sure
that if I'm sort of working with one of those values, that all of the other cases are handled, right? I could write an expression that says, if I have an if statement where I'm checking against one of the values, then I am obliged to also have elifs for every one of the other values. This is something you could implement relatively easily with an astpath expression,
and then all of a sudden, you're getting some of those really cool compile time almost guarantees that super functional languages like Haskell and Scala will give you, but specific to your project in Python. Just a little brief interlude. So these are some of the tools I've talked about so far.
Two that I didn't mention, but which are very cool, you should look at them. One is called Green Tree Snakes. So basically, this is like pitched as the missing guide to the Python AST, but it is a riveting read and I highly recommend it. The other one is Python AST Explorer. Basically, what this tool lets you do is,
it's an online tool, you paste in some code on the left-hand side of this page, and the right-hand side gives you like this collapsible view of the tree that basically the code will be represented as in the AST. Again, super, super useful, especially if you're going to go and write like astpath expressions, for instance.
The thing is, because this is on GitHub, you can just go take a look through these, maybe give them a few stars, I don't know. I know that it's a little bit of a vanity metric, but do consider that in our post-apocalyptic, post-singularity future, aka Q1 2019, GitHub stars will be used to determine
whether or not you have to work in the silicon mines. So give generously. Right, so now we're going to be shifting a little bit. Before, we were just looking at, what's the structure here? Can we query on this structure? Now we're going to be looking at, can we take the underlying AST, do some manipulation on it, make some changes to it,
and then run that code? So this is where a lot of people get very uncomfortable and there's a good reason why. So there are basically two forms of AST manipulation. I'll call them static and dynamic. Static AST manipulation is basically, so you have some code base, right?
You have a program that's going to run on that code base and that program reads each of the files, does some manipulation to it, and then writes it back out to source, right? What's the problem with this? Well, from a development standpoint, it's a little weird, right? Because if I have this pre-processed
and then post-processed code, am I meant to be working with the post-processed code? Is that what I check into my version control? If not, then I guess I have to build that during my build and deploy process, but then how do I find out what the code ended up looking like at the end? Like it's a mess.
Dynamic AST manipulation is no less of a mess, but basically the premise here is that as opposed to doing that once in sort of this preprocessing step, you're going to be doing it with code you have access to in whatever scope you're in, right? So sort of live code objects that you're going to be manipulating, which obviously has a similar problem
in that how do you know what you got out? And like if there's an exception that ends up being raised and you look at the traceback, it's going to be pointing to a line that doesn't match up to anything in your code base. So again, a little bit difficult to debug. So with that being said, how many people's initial reaction is,
oh man, get that as far away from my code base as possible. Can I get a show of hands? All right, a few, a few. Now, how many people use pytest? Hmm, I think I see some people that raised their hand twice. Well, let me tell you something. pytest, what a great tool, right?
So simple. I just wrote this little test. It's like super easy. I don't have to like create a class and use self.assertNot is false or whatever in unittest. And hey, when I run it, look at this error message I get, right? It says, hey, you know, this failed,
but not only did it fail, like I tried rerunning it and there's some weird stateful stuff going on and like, you should really look into this. Or maybe I had different tests, right? Maybe I had a test where at the end of it, I was like saying, oh, assert this dictionary equals this dictionary. Oh, and look, pytest gives me back this like super, super, super nice diff of saying, oh, these keys don't match these keys
and it's just so easy to use. How does it do it? Well, it takes the code that I have there in cell 32 and transforms it into that monstrosity, which no one wants to write or read. And this is not meant to be like a bait and switch. Like this is a really good application
for AST manipulation. The sort of tooling that you are going to run as a developer locally is a fantastic candidate, right? Because you may want to do manipulations like this that make your life a whole lot easier, but you wouldn't necessarily be comfortable doing in like a super critical production system. How many people here are familiar with Protobuf?
It's not super critical that you are, but a lot of hands, that's great. So essentially what it is is a schema language for serializing messages, right? So very simple schema I have here. I have like a latitude, a longitude and a message.
And this is like something I want to log out to some system, right? What's the problem with Protobuf? Well, it's super strongly typed. And Python is pretty dynamic. So let's say this is my function that I'm going to write to create this Protobuf message. I have my lat, I have my long, I create the message. I'm starting to populate these attributes.
Maybe I have to do some conversions or something. Maybe I want to format the message. And then I want to return the Protobuf file. Looks good. But what happens when I put in a float instead of an int? Protobuf goes berserk when you try to assign to that particular attribute.
And this is a problem, right? Because the last thing you want is for your logging system that's meant to log your errors, also creating errors that then don't get logged. That's a poor outcome. So the other, you know, this is not just, I mean, this is sort of a trivial example
because you'd say, oh, well, you could use mypy and you could just make sure that everything's an int and yada, yada, yada. But how do you know your int is within the int32 range? It's a little more difficult, a little more onerous. What you really want that function to look like, the function that constructs the Protobuf message, is more like this second cell, right? Where you're creating it and then you don't want,
like if there's something going wrong in the conversion, you don't want that to be caught. So you want to assign that to like this temporary variable and then, oh, you want to assign to the attribute and you want to catch a type error, I guess. And maybe you want to like print out a warning or something. And then you have to do that for every one of these and then you return it and it's like, oh, man,
this is like not clear to me as some other developer who has to go into this code and figure out what's going on. So how might we solve this? Well, the ast module also has a built-in NodeTransformer class, which you can subclass, very similar to the NodeVisitor class. And this is a pretty long bit of code here
but just to make it clear, basically here I'm checking, are we assigning to an attribute of this protobuf message? Here is what I'm going to be replacing that assignment with, right? So I'm saying basically the same thing I just had in the previous slide. I have to assign to a temporary variable and then I have this try/except
which is going to take that temporary variable and assign it to the protobuf attribute. And then finally here, this is the only part that's a little strange. Basically the visit_Assign method is expected to return a single node but we want to replace the previous single Assign node we had with multiple nodes.
So we wrap them all in an if statement that is always going to run. And what do we get out of this? Well, if this is our original function, we can do the same stuff that I've shown you already, right, get the source of it, parse that into an AST and then call the AssignReplacer visit method on this,
then transform that back into our source code. Of course, we'd also compile it if we want to in some environment. And we get back essentially what we wanted. And I think it's an open question whether this is more or less maintainable than just writing the code as output, right?
Because the code that's output is really oblique and opaque and not a great experience, whereas the other one is pretty terse. One thing you should ask is: is this testable? Yeah, totally, as testable as any decorator, right? You could in fact wrap this up in a decorator. You don't have to have this at the bottom of your module or something.
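The transformer described above can be sketched roughly as follows. This is a minimal reconstruction, not the code from the slides: `StrictMessage` is a hypothetical stand-in for a protobuf message, the names `AssignReplacer`, `_tmp`, and `build` are assumptions, and the source is passed in as a string rather than read with `inspect.getsource` so the example stays self-contained.

```python
import ast


class StrictMessage:
    """Hypothetical stand-in for a protobuf message: `lat` only accepts ints."""

    def __init__(self):
        self._lat = 0
        self.text = ""

    @property
    def lat(self):
        return self._lat

    @lat.setter
    def lat(self, value):
        if not isinstance(value, int):
            raise TypeError("lat must be an int")
        self._lat = value


class AssignReplacer(ast.NodeTransformer):
    """Rewrite `<name>.<attr> = value` into a guarded assignment."""

    def __init__(self, message_name="message"):
        self.message_name = message_name

    def visit_Assign(self, node):
        target = node.targets[0]
        if not (
            len(node.targets) == 1
            and isinstance(target, ast.Attribute)
            and isinstance(target.value, ast.Name)
            and target.value.id == self.message_name
        ):
            return node  # leave every other assignment untouched
        # _tmp = <original right-hand side>; errors in the conversion still propagate
        assign_tmp = ast.Assign(
            targets=[ast.Name(id="_tmp", ctx=ast.Store())], value=node.value
        )
        # print("skipped field <attr>") is the warning emitted on failure
        warn = ast.Expr(ast.Call(
            func=ast.Name(id="print", ctx=ast.Load()),
            args=[ast.Constant(value="skipped field " + target.attr)],
            keywords=[],
        ))
        # try: message.attr = _tmp / except TypeError: <warn>
        guarded = ast.Try(
            body=[ast.Assign(targets=[target],
                             value=ast.Name(id="_tmp", ctx=ast.Load()))],
            handlers=[ast.ExceptHandler(
                type=ast.Name(id="TypeError", ctx=ast.Load()),
                name=None,
                body=[warn],
            )],
            orelse=[],
            finalbody=[],
        )
        # Wrap both statements in `if True:` so a single node is returned.
        wrapper = ast.If(test=ast.Constant(value=True),
                         body=[assign_tmp, guarded], orelse=[])
        return ast.copy_location(wrapper, node)


SOURCE = '''
def build(lat, text):
    message = StrictMessage()
    message.lat = lat
    message.text = text.upper()
    return message
'''

tree = ast.fix_missing_locations(AssignReplacer().visit(ast.parse(SOURCE)))
namespace = {"StrictMessage": StrictMessage}
exec(compile(tree, "<transformed>", "exec"), namespace)
build = namespace["build"]
```

Calling `build(1.5, "hi")` on the transformed function prints the warning and leaves `lat` at its default instead of raising. As a side note, the standard library's `NodeTransformer` also accepts a list of nodes as the return value for statement nodes, so the `if True:` wrapper is one option rather than the only one.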
It's just another thing where you need to pass in a bunch of functions, make sure that the functions you get back from it are performing what you expect them to. It's something that you can use in code. The other thing about that previous AST transformer
that I showed you is that it's quite verbose. You might say, well, you have to know a lot about the AST to figure out what's going on there, or to write it in fact. Here's another tool called asttools, fantastic library. You can use this quoted template decorator. And basically this gives you a function where I can pass in the protobuf attribute I'm going to assign to as an AST node
and the value I'm going to assign to it as an AST node. And specify what I want to get back. And when I run this, I'll get basically an AST, a list of AST nodes that'll have this assignment, this try/except, yada, yada, yada. So it makes it a lot cleaner, a lot easier to follow.
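The exact asttools API is not reproduced here. The following is a standard-library-only sketch of the same template idea, under the assumption that placeholders are written as ordinary names (`__TARGET__`, `__VALUE__`) and swapped out with a small `NodeTransformer`; `guarded_assign`, `_Fill`, and `Box` are hypothetical names introduced only for illustration.

```python
import ast

# The replacement code is written as ordinary Python with placeholder names.
TEMPLATE = '''
_tmp = __VALUE__
try:
    __TARGET__ = _tmp
except TypeError as exc:
    print("warning:", exc)
'''


class _Fill(ast.NodeTransformer):
    """Replace placeholder names in the template with caller-supplied nodes."""

    def __init__(self, replacements):
        self.replacements = replacements

    def visit_Name(self, node):
        return self.replacements.get(node.id, node)


def guarded_assign(target_src, value_src):
    """Return the template's statements with both placeholders filled in."""
    # Parsing "<target> = None" yields a target node with ctx=Store().
    target = ast.parse(f"{target_src} = None").body[0].targets[0]
    value = ast.parse(value_src, mode="eval").body
    tree = _Fill({"__TARGET__": target, "__VALUE__": value}).visit(
        ast.parse(TEMPLATE)
    )
    return tree.body


class Box:
    """Plain attribute holder used only for the demo."""


box = Box()
module = ast.Module(body=guarded_assign("box.x", "1 + 2"), type_ignores=[])
ast.fix_missing_locations(module)
exec(compile(module, "<template>", "exec"), {"box": box})
```

The design point is the one made in the talk: the shape of the generated code is readable as code, and only the holes are filled in programmatically.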
DSLs, everyone loves a good DSL, right? We are infatuated with them in programming. We love SQL, we love regular expressions. What's the problem with DSLs? We represent them as strings in our code. What does that mean? Well, supposing for instance that this is the space of all strings,
the space of all regular expressions might look like this, right? They're just so, so many strings that aren't valid regular expressions. What's the practical impact of that? The practical impact is you're not actually leveraging this tool you have, the Python syntax checker, basically,
to make sure these are valid. So you have no guarantees about whether your regular expression, when it gets run, it's gonna compile or not, right? Another example, strings, right?
XPath, strings, boom, boom, boom, boom, boom, boom, boom, SQL, boop, not a great situation, right? There's a little bit of a mismatch here. And what's the practical effect of this? I'll tell you what the practical effect of this is. The practical effect is that Sam who sits next to you just checked this into the code base and some branch is never gonna run
until you go on holiday and then it's exclusively going to run and you find out that Postgres actually doesn't understand Shakespearean English. But on a serious note, I once had a project I was working on where it had a lot of XPath expressions and I ran a little tool on them that I made
to see what was going on and about 20% of them were invalid syntactically. So this is not just like made up, like academic problem. This is something that real people have wept themselves to sleep over. So this is a serious issue. Ah, but here's the solution. What a fantastic use of AST manipulation.
A little tool called Pony ORM. What this does is it lets you use generator syntax to create SQL queries. How does it do that? Well, you have this very simple expression, customer for customer in customers, if the sum of the customer's order total prices is greater than
or equal to, or sorry, greater than 1000. What does it do? It takes that generator, decompiles its bytecode to get the AST, and then does the manipulation from the AST into SQL, basically. And what's the nice thing about this? Well, okay, maybe it doesn't give you
the full expressive capacity of SQL, but what it does give you is a guarantee that Python, when you import this module or whatever, is going to throw up and say, oops, syntax error. And you're gonna know that from the start, right? If you have syntactically valid generator expressions, you know that you can get syntactically valid SQL.
It's a guarantee. And likewise, you can make sure that if you want to later on, you have a programmatic way of manipulating this using the ast module, as opposed to having to do like weird string manipulation. There's also a tool called xpyth that will do this for XPath expressions. I don't know of one for regular expressions, but I'm sure one is forthcoming.
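To make the generator-to-SQL idea concrete, here is a toy translator. It is not how Pony ORM is implemented (Pony works from a live generator object, while this sketch parses source text), and it handles only a single `for` clause with simple `attribute OP constant` filters rather than the aggregate in the slide's example. The function names are assumptions, and the output is for illustration only, not safely escaped SQL.

```python
import ast

# Python comparison operators mapped to their SQL spellings.
_OPS = {
    ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<",
    ast.LtE: "<=", ast.Eq: "=", ast.NotEq: "<>",
}


def _condition(test):
    """Translate `x.col OP constant` into a SQL condition string."""
    if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
        raise ValueError("unsupported condition")
    left, op, right = test.left, test.ops[0], test.comparators[0]
    if not (isinstance(left, ast.Attribute) and isinstance(right, ast.Constant)):
        raise ValueError("expected <item>.<column> compared to a constant")
    return f"{left.attr} {_OPS[type(op)]} {right.value!r}"


def genexp_to_sql(source):
    """Translate `(x for x in table if ...)` into a SELECT statement."""
    # ast.parse raises SyntaxError here if the expression is not valid Python.
    genexp = ast.parse(source, mode="eval").body
    if not isinstance(genexp, ast.GeneratorExp) or len(genexp.generators) != 1:
        raise ValueError("expected a single-clause generator expression")
    comp = genexp.generators[0]
    if not isinstance(comp.iter, ast.Name):
        raise ValueError("expected a plain name as the iterable")
    sql = f"SELECT * FROM {comp.iter.id}"  # iterable name doubles as table name
    conditions = [_condition(test) for test in comp.ifs]
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


query = genexp_to_sql("(c for c in customers if c.total_price > 1000)")
```

A malformed expression such as `(c for c in customers if c.total_price >)` fails in `ast.parse` with a `SyntaxError` before any SQL is produced, which is the guarantee the talk is pointing at.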
Testing, right? This is kind of a broad topic, but you ever notice in testing, like, all right, I wrote my cool, like, quick sort of implementation. I don't know why I wrote it, but I got a check mark on my PR, so that's all cool. I'm using modern CI practices, so of course I had some tests in there
and I made sure those tests passed and everything's copacetic, and I'm about to merge this into master. Since my tests all pass, before I merge into master, why do I have the tests still? I mean, obviously, maybe when I was writing this, I wanted to have them to make sure that my code ran properly,
but now I know it runs properly. Why don't I just delete those? Well, it's not really that you care about whether your code you have runs properly, right? It's actually more that you want to capture if someone else comes along and manipulates this and makes some changes that this still works, right? So it's not that you care about
whether your code now works. You care about whether if your code were wrong, that this test would fail, right? And that's how you know whether it's a good test or not. If the test would fail, if your code were wrong. Is that hard to capture?
No, because we can use the AST, right? We can use this cool tool called CosmicRay. We can have it automatically make a bunch of random permutations to the code that are almost certainly gonna make it fail. And then at that point, we can validate our test suite against that and see how often are we actually capturing those failures versus how often are we actually writing tests
that don't do anything, just happen to run the code and give us like a nice code coverage metric but aren't making sure our behavior's enforced that we want. So again, super great use on like the dev side, not on the like production side of a tool that you can use to do something you would have a lot of difficulty doing
with some other, you know, if you're just trying to manipulate this as a string, right? And this will do things like, oh, I'm gonna switch these branches around, I'm gonna invert these conditions, all sorts of good stuff. And again, here's some of the things that I've just been talking about. If you wanna go back at some later point and check these out, that'd be awesome.
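The mutation idea can be sketched with nothing but the ast module. This is not Cosmic Ray's implementation or API; `InvertComparisons`, `weak_test`, `strong_test`, and `survives` are hypothetical names, and the only mutation applied is inverting comparison operators.

```python
import ast

SOURCE = '''
def is_adult(age):
    return age >= 18
'''

# Each comparison operator is swapped for its logical opposite.
_SWAPS = {
    ast.GtE: ast.Lt, ast.Lt: ast.GtE, ast.Gt: ast.LtE,
    ast.LtE: ast.Gt, ast.Eq: ast.NotEq, ast.NotEq: ast.Eq,
}


class InvertComparisons(ast.NodeTransformer):
    """Produce a mutant by inverting every comparison in the tree."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [_SWAPS.get(type(op), type(op))() for op in node.ops]
        return node


def load(tree):
    """Compile a module tree and return its `is_adult` function."""
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_test(fn):
    # Runs the code (good for coverage) but barely checks behaviour.
    return isinstance(fn(30), bool)


def strong_test(fn):
    # Pins down the boundary, so an inverted comparison is caught.
    return fn(18) is True and fn(17) is False


def survives(test):
    """Return True if the mutant still passes the given test."""
    mutant = ast.fix_missing_locations(
        InvertComparisons().visit(ast.parse(SOURCE))
    )
    return test(load(mutant))


original = load(ast.parse(SOURCE))
```

Here the mutant survives `weak_test` but is killed by `strong_test`, which is the signal a mutation-testing run reports: a surviving mutant points at a test that executes code without enforcing its behaviour.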
Okay, so what's next for AST manipulation? What do I see as being the problems? Well, number one, source mapping. If you're familiar with front-end development, you might know this term, but the essential conceit here is like I have some JavaScript and I want to minify that JavaScript or I have some TypeScript that I wanna compile into JavaScript.
But when I run that minified version, I want to then be able to actually debug or see what's going wrong with the original code that I had, right? The same sort of traceback issue that I talked about before. The JavaScript community has solved this problem, right? We should just steal this. This is a serious hindrance to us using AST manipulation in production.
Like I just said, neither dynamic nor static AST manipulation has a good answer for this. So we should have a tool that lets us do this, right? Also, AST manipulation in general needs to be easier, right? The AST transformer, sorry, NodeTransformer that I showed you earlier
was pretty verbose, required you to know a lot about what's going on underlyingly, right? Not something that most people are really willing to dive into. We need something that's like the AST equivalent of find and replace that is super simple to use and produces good results. And then finally, this is sort of a consideration,
but backwards compatibility has not been great, right? If you think about the minor releases of Python that have been going on, there's a lot of new syntax added. Some of it's really awesome. Some of it you wouldn't wanna live without. But the problem is that although to most people that's like a totally unnoticeable,
backwards compatible change, on the AST level, there can be very, very significant changes. And a lot of the tools I wanted to show during this presentation, I found out weren't compatible with 3.6, right? Which is a shame. We're losing out on a lot of what we could have, a lot of the tooling we could have by not having some like intermediary form, right?
That's a backwards compatible AST that will work over different versions of Python, for instance, or even just by considering more carefully what changes to the AST will end up resulting in for different AST based tools. So another area that moving forward, I think that we could really benefit from looking at.
So let's talk about what you learned in this presentation. First of all, AST is easy, useful, tons of tools out there, right? Now you can go take this presentation, get started with stuff, start reading a little bit more about it, start using these in your own code bases, right? But more importantly, you also learned that I am a visionary,
the likes of which an Elon Musk figure sort of would like to aspire to become one day. And yes, those posters are available online, just send me an email. So here are my closing thoughts. Alan Turing once said there need be no real danger of programming
ever becoming a drudge, for any processes that are quite mechanical may be turned over to the machine itself. A beautiful thought, right? A thought that was expressed almost 80 years ago. And every time I find myself writing a bunch of boilerplate code, I cry a little inside, because this is not where we're at right now.
If this is still a place we want to get to, and I think it is, then we are going to need to have code that understands code. And ASTs are the best way of doing that. Thank you very much.
Thank you. Thank you, Chase, for this insightful talk. Hands up, who got some inspiration to clean up his own code base now? Quite a few people. Questions?
So you did mention that AST manipulation can still be quite tricky, and it's backwards incompatible and things like that. Do you think that Python would benefit from having a macro syntax of some sort and exposing that to some of the other things and bringing that more to the developer? You know, I really enjoy Python
being a pretty minimalist language, although it's getting to be less so, maybe. I don't think that that is necessary to be built in. I think that we already have a lot of the tools that you need to do that if you wish to. And then if you don't want to get your hands
onto any of this sort of stuff, then it's not something you have to worry about, right? So I like not having that capability in Python just to keep things simpler for people who don't want to mess with it. Do you think we should go full JavaScript and have extra layers on top of Python then?
Maybe. I mean, so I think the thing that you, hopefully, will take away from this is like, let's say there's a PEP you really liked that didn't go through, like pattern matching or something, right? Or something you really need that's specific to your project but isn't universally applicable, right?
You can do that through AST manipulation. So it could be argued that maybe for some domains you have that extra layer that goes from something that's more appropriate to what you're doing to Python. Not a bad idea, I think.
So what problems at your company did you actually solve with those AST tools? Well, I'm about to be crucified here, but not at my current company, but at a previous company,
I was stuck with a Python 2.7 code base, which I know no one here has ever had to dirty themselves with working with. However, I needed to have yield from syntax, which isn't in Python 2.7, and helpfully the PEP says, hey, do you know that yield from is equivalent to this block of Python code?
So basically, more generally though, I've used AST querying and search more to do like custom linting stuff, which I found very, very, very helpful because Pylint, for instance, has some good general rules,
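As an illustration of that kind of rewrite, the sketch below replaces statement-level `yield from` with a plain loop. It is a deliberate simplification and not the speaker's tool: the full expansion in the PEP also handles `send`, `throw`, `close`, and the delegated return value, and this version runs on Python 3 purely to show the tree manipulation. `YieldFromRewriter` and `_item` are hypothetical names.

```python
import ast

SOURCE = '''
def chain(first, second):
    yield from first
    yield from second
'''


class YieldFromRewriter(ast.NodeTransformer):
    """Rewrite `yield from it` statements as `for _item in it: yield _item`."""

    def visit_Expr(self, node):
        if not isinstance(node.value, ast.YieldFrom):
            return node  # any other expression statement is left alone
        loop = ast.For(
            target=ast.Name(id="_item", ctx=ast.Store()),
            iter=node.value.value,  # the iterable being delegated to
            body=[ast.Expr(ast.Yield(value=ast.Name(id="_item", ctx=ast.Load())))],
            orelse=[],
        )
        return ast.copy_location(loop, node)


tree = ast.fix_missing_locations(YieldFromRewriter().visit(ast.parse(SOURCE)))
namespace = {}
exec(compile(tree, "<rewritten>", "exec"), namespace)
chain = namespace["chain"]
```

After the rewrite the tree contains no `YieldFrom` nodes, and `chain` still yields the items of both iterables in order.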
but not ones that are going to be specific to your project. And then just on top of that, being able to do introspection to see like, oh, what's going on in this? I wrote something once that went through, like I said, a code base to see what are all the XPath expressions in this, which was based off how was the string being used in terms of what method was calling it, things like that.
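A project-specific check of this kind might look like the sketch below. It is an assumption-laden stand-in for the XPath tool mentioned here: to stay within the standard library it inspects string literals passed to `re` functions instead of XPath calls, and `RegexLiteralChecker` and the sample `CODE` are hypothetical.

```python
import ast
import re

# Sample code to lint; it is only parsed, never executed.
CODE = '''
import re
ok = re.compile(r"[a-z]+")
bad = re.search(r"(unclosed", text)
'''


class RegexLiteralChecker(ast.NodeVisitor):
    """Record invalid string literals passed as the pattern to re.<func>()."""

    FUNCS = {"compile", "match", "search", "fullmatch", "findall", "sub"}

    def __init__(self):
        self.problems = []  # (line number, pattern, error message)

    def visit_Call(self, node):
        func = node.func
        if (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "re"
            and func.attr in self.FUNCS
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            pattern = node.args[0].value
            try:
                re.compile(pattern)
            except re.error as exc:
                self.problems.append((node.lineno, pattern, str(exc)))
        self.generic_visit(node)  # keep walking into nested calls


checker = RegexLiteralChecker()
checker.visit(ast.parse(CODE))
```

The string is classified by how it is used (which function receives it), which mirrors the approach described above of identifying expressions by the method that consumes them.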
So it's nifty to be able to pull out for those sorts of things. Okay, thanks. Hi, do you have any experience with using full syntax tree tools like RedBaron or similar? So I haven't personally,
but I know other people who have. And that's, yeah, definitely, if you want to do the sort of static manipulation, full syntax tree tools will preserve sort of that white space, the colons, all that good stuff, so that the code that you get out will be closer to what you originally put in. So I think definitely if you're going to be
looking into doing static AST manipulation, that's the way to go. Hi. Hi. There's ctx=Load() in one of the first examples. When you do an ast.dump, what does that mean? So yeah, I don't actually know whether the Python interpreter cares about that,
but basically, it's giving you some context: for a specific Name node, there are a few different things it can be. The major ones are Store and Load, which represent whether you're loading that value from the environment or from the scope, or whether you're storing that into the scope. If that makes sense.
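A short example shows where each context appears. The variable names are arbitrary, and the exact text printed by `ast.dump` varies between Python versions, so no output is shown.

```python
import ast

# In `total = price`, `total` is being stored and `price` is being loaded.
assign = ast.parse("total = price").body[0]
target = assign.targets[0]
value = assign.value

# `del total` uses a third context, Del.
deleted = ast.parse("del total").body[0].targets[0]

print(ast.dump(assign))
print(type(target.ctx).__name__, type(value.ctx).__name__, type(deleted.ctx).__name__)
```

When building nodes by hand, the context has to match the position: a `Name` used as an assignment target needs `ctx=ast.Store()`, and one used in an expression needs `ctx=ast.Load()`.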
One final question. I see you made a reference to SQLAlchemy, and then later on you put up a slide with Pony ORM, and you suggested that Pony ORM can generate SQL statements. Are you suggesting that SQLAlchemy does not do it in the same fashion?
SQLAlchemy doesn't, as far as I know, maybe I'm wrong, doesn't give you the ability to transform generators into SQL. Maybe I'm incorrect on that front. I haven't really kept up with SQLAlchemy specifically, but if it does, that's awesome, and if not, I think pony is a great tool for you.
Being able to go into a code base, not necessarily even knowing SQL, but just use the Python that you know to be able to really guarantee that you're going to have valid SQL expressions. Okay, so the next thing will be the lunch break. If you want, we can continue the question and answer session for a couple of minutes.
There are still, I think, two questions. You put the source mapping thing under what's next, so does that mean there is nothing now, or are there some projects already,
because that's the part I'm interested in. I mean, if I get an error with a line number, I'd like to map that line number to my original code. No, totally, I mean, I definitely think that it's a place where, as far as I know, the ecosystem is deficient, right? A lot of these things, most people wouldn't feel comfortable doing outside of experimentation without being able to tell
where is this error coming from, and as far as I know, there aren't any solutions for that in Python right now. Pytest doesn't go both ways, right? So Pytest will just produce for you the new code, and I think basically what they've done is test the code they generate well enough
so you'd never have internal errors happening in that, and I think it just keeps track of what line it replaced to show you the error on that line. Great talk. Thank you. So there is a Haskell monad bind operator, which is represented by two arrows
to the right and an equals. Can we support this in Python using the AST? Yeah, I mean, you could, so it depends on what you mean by support, right? It's not valid Python. It's, oh, sorry, sorry, I understand your question. So that specific syntax, no, right? So if you're using the, at least if you're using the AST module,
you're beholden to actually writing syntactically valid Python code, right? You could do something if you want to, like Scala has for comprehensions, which will basically do that sort of monadic binding. You could have a similar thing in Python, right, where you could say, I'm going to take this generator comprehension,
run that, get the AST, and transform that into a series of flat maps and maps if you wanted to, but you can't support syntax that doesn't already exist in Python. Thanks. So let's thank Chase again for this insightful talk. Thank you.