We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Hack The CPython

00:00

Formal Metadata

Title
Hack The CPython
Subtitle
Hack The Interpreter At Runtime
Title of Series
Number of Parts
118
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Have you ever realized how dynamic CPython interpreter is? Maybe it is the most dynamic interpreter you may see. It gives interfaces to internal things like garbage collector or AST, allows to alter functions code, modify built-in functions etc. This talk will go beyond that dynamism. From adding a new syntax to hooking the evaluation loop, it will show how to hack parts of python. Before understanding these hacks, you will learn internals of CPython step-by-step. Steps are important because in every step we have at least one hacking option. Also it gives the audience a short brief of how python works. After learning how cpython works, we'll cover how to hack (use things that is not their main purpose) the interpreter and the interfaces it gave. For an example we will disassembly the bytecode and then assemble it again with adding our statements or adding a new syntax for python at runtime with AST. Talk will hack these steps: - AST - Bytecode - CTypes - CPython evaluation loop
Keywords
20
58
GoogolIdentical particlesHacker (term)Graph (mathematics)DataflowCodeAbstractionSyntaxbaumUniverse (mathematics)ParsingToken ringComa BerenicesReading (process)ParsingConstructor (object-oriented programming)Arc (geometry)Data structureNetwork topologySoftware testingSimultaneous localization and mappingAbstract syntax treeBytecodeControl flow graphPerformance appraisalOperator (mathematics)Hacker (term)Student's t-testInterpreter (computing)Pauli exclusion principleWebsiteAbstract syntax treeCodePoint (geometry)Line (geometry)Function (mathematics)Variable (mathematics)Control flowBytecodeComputer fileFile formatCodierung <Programmierung>Stack (abstract data type)Active contour modelStreaming mediaExpressionIdentifiabilityToken ringData storage devicePerformance appraisalSlide ruleAssembly languageDisassemblerTwitterIntegerMyspaceParsingObject (grammar)Virtual machineEndliche ModelltheorieSyntaxbaumMathematical optimizationConnectivity (graph theory)Online helpOperator (mathematics)Set (mathematics)Electric generatorResultantMereologySheaf (mathematics)Hecke operatorComa BerenicesFreewareSymbol tablePerturbation theoryPiReal numberLattice (order)Source codeObservational studyVideoconferencingInformationProjective planeState of matterSoftware testingForcing (mathematics)Theory of relativitySpacetimeForm (programming)Network topologyMultiplication signDirection (geometry)Lecture/ConferenceComputer animation
Hacker (term)Strategy gameModule (mathematics)SequenceToken ringSource codeStreaming mediaUniform resource locatorError messageLocal GroupEquals signNumberCodeFinite element methodPrice indexCorrelation and dependenceData typePartial derivativeCodierung <Programmierung>Function (mathematics)ReliefInterior (topology)SpacetimeRun time (program lifecycle phase)ParsingGradientAbstract syntax treeElectronic mailing listExecution unitPosition operatorWebsiteRule of inferenceParameter (computer programming)Statement (computer science)Functional (mathematics)Self-organizationExpressionComputer fileObservational studyError messageMereologyRoutingTransformation (genetics)Multiplication signCodeEndliche ModelltheorieCore dumpProper mapToken ringReading (process)Uniform resource locatorType theoryPattern languageData storage deviceGraph coloringNamespaceSource codeFamily5 (number)Lie groupAreaRepresentation (politics)CASE <Informatik>QuicksortOpen setView (database)Physical lawInsertion lossCodecLattice (order)Turbo-CodeMechanism designModule (mathematics)Streaming mediaLine (geometry)Codierung <Programmierung>Hash functionStrategy gameIntegerElectric generatorBranch (computer science)Installable File SystemInformationComputer animation
Mathematical optimizationState observerMultiplication signFunctional (mathematics)BlogExterior algebraMathematical optimizationCompiler
Hacker (term)Strategy gameFunction (mathematics)BytecodeMathematical optimizationStrategy gameMathematical optimizationFunctional (mathematics)BytecodeComputer animation
Data bufferFunction (mathematics)System callWrapper (data mining)ReliefDialectLocal ringMereologyHookingFunctional (mathematics)System callFunctional (mathematics)EvoluteModule (mathematics)Revision controlInternet forumCore dumpDirected graphSign (mathematics)PiLie groupVariable (mathematics)Multiplication signAddress spaceImplementationMereologyEndliche ModelltheorieTrailState observerComputer-assisted translationChaos (cosmogony)Pattern languageObject (grammar)BytecodeLibrary (computing)CodePerformance appraisalSpeicheradresseSymbol tableMathematical optimizationHookingExtension (kinesiology)VarianceField (computer science)Standard deviationAssembly languageXML
TwitterComputer animation
Transcript: English(auto-generated)
Hello everyone, thanks for coming. My name is Batwan, I'm a high school student from Turkey and in this talk I'm going to show you a few tricks of how to hack CPython interpreter. Before we start you can ask me any question through Twitter that is identical handle
and get slides from speaker deck with that same handle. So we are here to hack CPython but what is hacking? When you say hack or hacking or hacker, people often think a black wearing guy who serves for evil forces, who works for money, tries to break into your server, yes they are hackers.
But hacking is not just doing an illegal thing. Think about torrenting. When you say torrenting, people think you are doing something illegal but you can torrent with someone's permission. So we are going to do a legal hacking of CPython.
The definition of hacking is using something in a way you are not supposed to do. Like in Turkey we use old cheese containers to plant, it's a recycling hack. And we are going to hack Python for our freedom. You all know this guy Richard Stallman who saved us from slaving to the bad guys.
We are all grateful to him. He hacks for his freedom, so do we. Anyone heard pep 313? No? It's an old pep about adding Roman literals as Python integers. It's rejected with some good reasons in 2005.
But what if you want to use pep 313? Isn't it your freedom? Yes it is. It is why we are going to hack. For doing such hacks, we need to learn how CPython works. Every step in that execution model is a break point to us where we can inject our code,
our alter the code output. For an example, if you want to implement pep 313, you need to replace all capital X with 10 in the AST to compiling step.
Okay. The first step is tokenizing. When you say python file.py, Python will read your file in the encoding you have specified and then stream it to the tokenizer. Think about two plus two.
There are three tokens, identifiers. Like two is an integer, plus is an operator, two is an integer. And by the way, the first encoding is the, first token is the encoding token. You may use it when you are hacking.
But tokens doesn't know about relationship. It doesn't know if something in the ifs body or ifs test. So we need something more relational like concrete syntax trees. The parser of Python generates concrete syntax trees from the tokens that stream it into it.
And this is the last step. You can convert bidirectional. You can convert a code into the CST concrete syntax tree and concrete syntax tree into the CST. The CSTs of two same expressions are different. Like you can write two plus two with 15 myspace
between the operator and integer or without myspace. These two CSTs are different. But we need to know they are the same. So we need something more abstract. Like abstract syntax tree. It is generated by ASTL from the 2.5, I think.
And the AST just keeps relevant information about component. It just keeps nodes, line information, and colon offsets. AST, we can hack AST by Python set. Python says don't take the parser, don't take the bytecode.
If you want to hack, use it AST. It offers a great API for hacking ASTs. This is three lines of code. It's A to beginning of every variable. Like X plus Y plus Z. When you transform the AST with this code,
Python will understand it like AX plus AY plus AZ. And there is a great documentation about this AST on a site called the Green Tree Snakes. You can check that out. I think Python refers to it on DevGuide or the official documentation.
And this is the last step of compiling bytecode generation. Python uses a format called bytecode store instructions. Python have a disassembler for this bytecode objects, but doesn't have an assembler. I proposed an assembler on Python ideas last month and it is rejected like the people optimizer API.
The people is the optimizer for the bytecode objects. It has only a few optimizations. And if you want anything, you need to write your own people. We are going to do that too. And this is the last step of execution model evaluation.
Python will go through every instruction you specified in that bytecode with a for loop, also, and a help of GCC feature called labeled go-tos. And it will push and pop to the stack. See Python virtual machine is stack-based. So it uses push and pop method.
When you say two plus two, it means it will push two to the stack, two to the stack, and then an instruction called binary that come in and pop the last two value from the stack and merge them into each other and push it back, push the four back.
Yes. So let's hack. The first hack I'm going to tell you is using Valor's operator on Python 3.7. It's set on Python 3.8, but you can use this on Python 3.7. For doing such a hack,
we need to interfere between tokenization and reading file. The only step between tokenization and reading file is the encoding. So we are going to add our encoding, and when we are decoding, we will alter this code with Python 3.7 compatible code. Our series, we should run before the tokenization happen.
We need a new tokenizer, or we can modify the Python tokenize module, and we will tokenize the source with that module. We will alter this code, and alter this code by changing tokens, changing positions of tokens,
and then we are going to un-tokenize it and stream back to the real tokenizer. For modifying tokens, we need something called token module. I import that as tokens. When you add a new token to the C Python,
you need to add an ID for it, and a name for it. Our token names is Chronicle and ID is 255. I edit the tokens to the token model, and I edit the token itself to the exact token types. It's a dictionary where tokenize module
uses that to find the ID of and token. And for changing the tokenize mechanism, you should look at the source code of tokenize module. I looked and I saw there is one rule, one main rule that use it for tokenize module. So I am altering that with adding my cute valsopator.
The second step is writing a decode function for our encoding, valsopator. This decode function has an extra parameter beside the input and errors. It is encoding. It is for because you can specify your own encoding with valsopator. For an example, you can say valsopator seven dash utf two
and it will decode the encoding you have specified. Also, it will alter the code. I'm just streaming it back to the generate vals source code and return with the encoding you have specified.
And this is the last step, is adding a new encoding. We need a search function, codecs.register uses the search function. And if we return a codec info, codecs.register allows us to register this codec info.
I'm just streaming vals 37 and dash for seeing if you have specified an encoding or not. If you don't specify, it will use utf eight and return a codec info. By the way, you need to register this encoding every time a Python session starts. For this, we need to hack this site module. Anyone heard site module?
Site, no. We don't care about site module, actually. We care about its behavior. Site module uses PT hash files and we are going to inject our code at the PT hash files. This is the code for implementing rejected paths like three, one, three.
We are going to use Roman literals as Python integers. Our strategy is we are going to run when we are imported. We are going to only be effective inside of this LSCOPE and we are going to erase proper error messages when it is used outside of this LSCOPE.
For implementing this PEP, we need to use AST modules, not transformer. It goes through to every name definition like A, X, test, obtainer, and capital X, I, V,
and checks for is it a Roman literal or not. If it is, it is going to return its integer representation by returning a new node. For scoping, we need an extra transformer that will go through every with statement and checks the with statement name is hello, like this.
If it's hello, it will get the first argument and then it will get to the transformer we wrote with that argument and transform only with body, not the whole file. It just transform only with body and then it will copy the locations of nodes
to the new node and fix missing line nodes and return the new with node. The runtime, we should run when we are import, so we are going to call a function called hello with lower case. This will import ourself and get the file and read the file and pass the file to AST.
Then it will call our transformer, the transformer we wrote here, to the file and then we are going to compile that transformed AST by manual and execute it under the modules namespace. Another funny hack is Rust return. Rust runs implicitly, saw the Python with that RLR decorator.
We are going to return implicitly. We need two things, return the last statement, last expression and we should allow infinite branch like if-as inside of a if-as, we should return all possible last expressions.
The first thing to transform AST is we need to go through to every function definition and check that if there is a RLR decorator. If there is, we need to remove that decorator, we don't want recursion, and then call the adjust method. The adjust method will find the last statement
of a function and then if it's an expression, it will pop that statement and return it as that return. If it is a if statement, it will adjust the ifs body with that same function
and this is how we support infinite branching. The other thing is people, an alternative to people optimizer, people optimizer. We are going to optimize eliminating local bars. Maybe you ever, there was a blog post about, it's called C Python's bytecode compiler is dumb
and it doesn't optimize eliminating local variables, it doesn't make optimization and we are going to make it ourselves. Our strategy is when we are going to run when a function decorated, we just only make the optimizations user have specified and we are going to re-put the source,
the bytecode of a function back when it's done. For optimizing, we need a decorator. This decorator, we create a bytecode object from function. The bytecode object comes from this module and it is invisible so we are going to return
a new bytecode object every time we make an optimization and then we will re-put the code of bytecode object at the end and function will be optimized. I'm going to show an example of optimizations. This is eliminating local variables. We will go through the bytecode
and find which symbol value is a constant and which symbol uses var and then keep a track of that and find the unused symbols to eliminate local variables and then we are going to remove them manually by using a for loop, remove all the unnecessary symbols.
You can't assemble bytecode in standard library modules but you can use a module called bytecode as a third byte model from Victor Steiner. It has its own people implementation you can hack
and it's as some of bytecode great. This is the last hack. I'm going to do is, this is Catalyzer v1 extended. Catalyzer is a module for hooking functions like autotalks but it's a more general concept. It will hook your functions.
The first version of Catalyzer uses decorators to hook your functions like before prehooks, on call hooks and post hooks but in that extended version, I didn't want to mutate the functions you gave to us so I am using an methodology to hack the C Python calling function
instead of user's function. So this is the code for hooking into a py function fast call cables. This function uses when a bytecode comes to call a Python function.
I'm using James Powell's code hooking, libhook for this, a slightly modified version called adpyhook and this field overwrite the memory address of C Python's function with my modified C Python's function and it is how we are going to inject our code
into the C Python's evaluation step. For modifying the C Python's function, I directly copied the code of it and then checked for Catalyzer sign. If a function using Catalyzer, Catalyzer will sign it. If the function has a Catalyzer sign,
I will call the Catalyzer before the function call, on the function call and after the function call. Yeah, this is it. Thank you for coming and listening. You can contact me through Twitter. Okay.