Bytecodes and stacks

Video in TIB AV-Portal: Bytecodes and stacks

Formal Metadata

Title
Bytecodes and stacks
Subtitle
A look at CPython’s compiler and its execution model
Title of Series
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
So, you wrote some Python code. What needs to happen before it starts running? And once it's running, how does Python keep track of what it's doing? I'll talk about CPython's tokenization, parsing, bytecode and its serialization and cache, the stack-based virtual machine, line number tables, and code, frame and function objects. Don't worry if you've never heard of these concepts. While even experts should learn something new, the talk is aimed at anyone who's worked on a Python project or two.
Loading...
Parsing Multiplication sign Source code Stack (abstract data type) Computer programming Formal language Data model Coefficient of determination Different (Kate Ryan album) Core dump Information Endliche Modelltheorie Social class Token ring Interior (topology) Bit Variable (mathematics) Parsing Abstract syntax tree Process (computing) Phase transition Website MiniDisc Self-organization Whiteboard Freeware Asynchronous Transfer Mode Laptop Bytecode Functional (mathematics) Computer file Token ring Line (geometry) Mathematical analysis Streaming media Code Field (computer science) 2 (number) Revision control Network topology Operator (mathematics) String (computer science) Motion blur Representation (politics) Data structure Compilation album Module (mathematics) Addition Information Equals sign Expression Stack (abstract data type) Line (geometry) System call Frame problem Compiler Film editing Personal digital assistant String (computer science) Network topology Speech synthesis Object (grammar)
Logical constant Functional (mathematics) Numbering scheme Presentation of a group Implementation Module (mathematics) Computer file Line (geometry) Augmented reality Parameter (computer programming) Code Tracing (software) Metadata Network topology Object (grammar) Flag Module (mathematics) Execution unit Computer virus Logical constant Information Interior (topology) Letterpress printing Enumerated type Variable (mathematics) Abstract syntax tree Code Type theory Subject indexing Numeral (linguistics) Network topology Free variables and bound variables Convex hull Object (grammar) Tuple Local ring Asynchronous Transfer Mode Flag
Logical constant Parsing Serial port Structural load Multiplication sign Execution unit Source code Sheaf (mathematics) Set (mathematics) Parameter (computer programming) Function (mathematics) Computer programming Different (Kate Ryan album) Kernel (computing) Object (grammar) Core dump Flag Drum memory Multiplication Parsing Email Logical constant Computer virus Mapping Numbering scheme View (database) Structural load Electronic mailing list Data storage device Parameter (computer programming) Bit Variable (mathematics) Abstract syntax tree Compilation album MiniDisc Right angle Pattern language Modul <Datentyp> Representation (politics) Sinc function Resultant Bytecode Reading (process) Trail Numbering scheme Mobile app Functional (mathematics) Module (mathematics) Computer file Line (geometry) Electronic mailing list Code Attribute grammar Revision control Network topology Operator (mathematics) Energy level Maß <Mathematik> Module (mathematics) Slide rule Information Interface (computing) Bytecode Correlation and dependence Core dump Letterpress printing Line (geometry) Binary file Code Cache (computing) Word Personal digital assistant Function (mathematics) Statement (computer science) Object (grammar) Table (information) Marginal distribution Tuple Flag
Numbering scheme Email Computer file View (database) File format Structural load Electronic program guide Source code Timestamp 2 (number) Code Revision control Cache (computing) Mathematics Different (Kate Ryan album) Personal digital assistant Kernel (computing) Flag Library (computing)
Intel Context awareness System call Multiplication sign Source code Parameter (computer programming) Mathematics Different (Kate Ryan album) Computer configuration Flag Endliche Modelltheorie Logic gate Position operator Exception handling Computer virus Logical constant Wrapper (data mining) Interior (topology) Attribute grammar Bit Variable (mathematics) Hash function Chain Convex hull Hacker (term) Laptop Bytecode Ocean current Frame problem Functional (mathematics) Numbering scheme Module (mathematics) Letterpress printing Code Attribute grammar Gastropod shell output Module (mathematics) Default (computer science) Information Cellular automaton Electronic program guide Line (geometry) Frame problem Code CAN bus Loop (music) Personal digital assistant Function (mathematics) Object (grammar) Local ring
Cache (computing) Regular graph Computer programming
Module (mathematics) Functional (mathematics) Information Multiplication sign Right angle Computer Frame problem
Wrapper (data mining) Library (computing)
Bytecode Source code Module (mathematics) Information View (database) Wrapper (data mining) Source code Attribute grammar Hierarchy Type theory Function (mathematics) Object (grammar) Interpreter (computing) Normal (geometry) Social class Right angle Information
Bytecode Revision control Word Machine code Right angle Bit Parameter (computer programming) Code
Bytecode Installable File System Functional (mathematics) Loop (music) Machine code Multiplication sign Statement (computer science) Interpreter (computing) Videoconferencing Code Frame problem
Laptop Information Link (knot theory) Multiplication sign Maxima and minima Website Bit Thumbnail
Email Link (knot theory) Compiler Usability Stack (abstract data type) Software bug Data model Inclusion map Latent heat Bargaining problem Bloch wave Formal grammar Convex hull Gamma function Default (computer science)
Source code Noise (electronics) Fluid Type theory Function (mathematics) Object (grammar) Interface (computing) Attribute grammar Code
[Applause] hello everyone so I'm Peter I work for Red Hat on Python great job by the way and today I will be talking about byte codes and stacks you might not have seen the talk on the board downstairs because that's late addition to the conference also you might have been to a talk that is a bit more advanced than this one already so I apologize for that hopefully this will get some of the some of the basics a bit clearer it's not very good timing at the end of the conference but bear with me so I will be talking about how Python actually takes the codes you've written and manages to do what the code says all right so we'll be looking at the compiler and the abstract syntax tree and tokenization at the execution model which is frames and so on this is for python version two seven there are always little tweaks between different Python versions to all this so if you're using a different version not everything will work exactly the same also I've uploaded the notebooks I will be using two the Euro Python website so if you have some more time then then you can just play along right so I have this little Python module it it does very simple stuff it assigned some variables prints something then it defines a function right so this will we'll be using this for examples throughout the talk now this is just a piece of text this is just just some just some bytes or some Unicode characters sitting on your disk then you tell python to run it so what actually happens python opens the file reads the text and the first thing needs to do is to tokenize which means cut the cut the text up into very small pieces actually how many of you have taken a compilers class in school almost half I guess so I guess you'll be a bit bored how many have you taken a compilers class on Python none okay there are a few differences to the traditional model so tokenization cutting cutting the source text into little pieces that we can then use there's a mode module for it you can run the tokenize module it will give you something like this so all the information from the original file let's say this this equal sign is an operator it occurs on line 1 between characters 2 & 3 so like this the text is cut up into little pieces that that that are called tokens that the rest of the process then uses now a little difference from the traditional model that you will not find in other languages is if you have indentation traditional languages would have braces at the beginning and end which then gets converted to brace tokens in Python we have indent and detent tokens right so this this is actually the indentation is actually detected in the tokenization phase if you have some text between parentheses where indentation doesn't matter then the tokenizer doesn't emit these indent and detent tokens so there's a bit of cleverness already built into the tokenization step if you'd like to do this from Python you just open the file you called tokenized tokenize and you get all the information to play with this is really nice if you're for example writing a Python like language that has all the same tokens you can just use tokenization for free and then build some kind of parser on top so speaking of parsers that's actually the next step that happens now you have a stream of tokens you need to turn it into an abstract syntax tree now there was a talk about abstract syntax trees before is there anyone who does not know what an AST is okay maybe I should have written my abstract a little bit clearer so an AST as I guess you all know is is a representation of what will actually happen in the program there's an ast module which can parse parse the text so there's just organization and the parsing in one step gives you a module object which has some fields in this case it has a body and inside that body we have my assignment assignment expression which is a function call and a function definition Python itself has an ast dump function which is not really useful because it just puts everything in a single string there are lots of lots of modules on on pi PI that make this a little bit better to work with ast pretty as something I just I just found it previous versions of this dog had the code actually in there it's not hard to find so here's the structure of my code there's an assignment with a target and a value an assignment with the target and a value expression which is a call I assume all of you have I've seen things like this so now we have an IOC tree the next step to do with this ast tree is to compile it into bytecode again our ask is there anyone who does not know about compiling to bytecode a
few people okay so let's run through this again I have this module I have my
ast and I can call the built-in compile fire function which will take this ast and package it up into into a code object what this takes is the AAC tree a file name suggests that it can tag the code object you know this is where it's coming from back traces then can use that information and there's a mode I'm not getting into that in this talk so this will give us a code object a code object is something Python can run so if I do exec of the code object that will actually run all the code contain it there now what is inside a code object there's a little function called code info that gives me most of the things inside there formatted pretty nicely so it has some meta information some file name it's also used for functions so it has lots of stuff about arguments what kind of information goes in some internals number of local variables stack sizes as an implementation detail and as flags to say what kind of features this this code object has for example if there are any free variables or non-locals that so it needs to reach out or if in Python 2 there were two kinds of functions one worse faster so the type of the code object is is encode in the in these flags we also have a tuple of constants so whatever whatever constants are our present are put into a tuple and then they're referred by by an index we also have names because inside the Kojo object usually Python refers to everything by numerical index when it
needs to print a bad trace or generate the mapping of variables to two values it will look here and get get the names get the names of the actual variables if I write my own function now I guess don't spend too much time reading this this just prints out all the attributes of my code object there's a bit more there's lots of the same information we had before but there are something the dis marginal hides from us one of those is the code which we couldn't read anyway but this is the actual instructions this is what Python will then execute as your code runs another thing we have is the L note app which is a mapping of of the byte codes to line numbers so when again when a stack trace is generated Python doesn't keep track of lines but of we're in the byte code it is and then it uses this line number tab table to map back to two individual lines so that's that's mostly it whenever I have a function in my module as I had had before it gets put in well the code object for that function is compiled separately and put in as a constant for for the enclosing module so then what happens is when that function is created when the DEF statement is actually executed this code object it gets packaged into a function so this is where where that is taken from if you want to look for information on that you can just get it out of the constants and you know flags are a little bit for different for functions than they are for modules for example right so I talked a bit about the byte code let's go a bit more in depth if I have my ast again my byte code looks a bit like this it's it's bytes it's just numbers from 0 to 225 if I print them out as actual numbers they become a bit more readable you can see a bit of pattern here it's always some big number and some small number big number and small number right actually that the bytecodes come in pairs of two bytes so it's not actually a byte code it's a word code it's has units of two bytes it's always an instruction and an argument the disk module can tell us what all these numbers mean so first we have the 100 and 0 the 100 means load a constant and the zero is which constant this will load so this was for the statement a equals 3 would this needs to do is load the number 3 that puts it on onto a stack and then it stores that value into the variable a all right so this this does two instructions load a number store the number similar for for the second line for calling a function it just gets all the arguments of the function who gets the function itself and in this case I was multiplying something so it calls multiply and then it calls the function and because we're not using the return value it gets rid of the return value right so bytecode is just a set of instruction that Python follows linearly as it as it goes along you can also run to this module from the command line get the exact same result you can also use it programmatically so if you do get instructions you get you get nice little objects that tell you what each each individual bytecode means so this is pretty similar output to what we had before right so now that we have our bytecode or our information about what the program should do what python does is it saves that because all this all this parsing is is quite an expensive operation Python will save this cash this so that next time the same module is loaded it doesn't have to parse it again right so serialization is a bit like I guess at this level you most of you know it but it's a bit like JSON you get some values you can write them into a file this doesn't you use JSON it uses a module called Marshall so Marshall is has pretty much the same interface as JSON you call dumps you get some some bytes corresponding to whatever you put in Marshall is specially made for for Python code objects so it can only serialize simple stuff like tuples numbers code objects themselves anything mutable will will not appear in this dump so if your program creates a list there is no constant for that list in the bytecode there's just an instruction to create a new list for immutable stuff like like to pose numbers the codes are already in there so I skipped a bit where does where does Python put all this it puts it in a pyc file since anybody not familiar with pyc files everybody is great so this is this is what a pyc file looks inside it's you
know again just a bunch of bytes if I mark if I list would the would the serialized version of the code object is also a list of bytes if I compare them the the pyc file will have header and the first section and the rest is just the marshalled the serialized code object right so like verify the first 16 bytes are a header the rest is is just the code object what is this header the header is mostly there for cache invalidation so when you change the source code python has to check if if the old if the olds cache is still valid if you changed it it's not valid and it has a bit of a bit of information to check if we're still fresh so the first
four numbers are what's called a magic number it's the same for every python version it's called import Lib butyl magic number and this just tells python you know this is a pyc file for this particular version of the bytecode if
your run a pyc file with a different version of python it's it won't load the next two files are in python 3 7 all usually all zeros these are flags because there are a few different versions few different formats for pyc files that use do a different kind of cache invalidation usually what you get would be the zeros the next 4 bytes are are a number of seconds from from 2000 something from the UNIX epoch so it's a timestamp so whenever you change the source code its timestamp won't change in python we'll know that's something change of course if you do many changes in the same second you might run into trouble which is why we have the last last thing in the header and that is the size of the original source code so if you add something the size should hopefully change and the cash will be invalidated this is not perfect but it works in most cases as anybody had problem with pyc cache invalidation one oh wow that's many more than I
expected so for you who raised their
hands there is now in Python 37 an option to use to use actual hashes of the source code for this for this you set some flags used the compiled out compile all module it has an option to use to use hashing instead or to not check anything at all right so if this is not not okay for either there are options now right so that was caching how much time do I have about five minutes all right so with that out of the way I will talk a little bit about about the execution model so we have the code object which has all the instructions and the code object is immutable which of course is is different from functions if you have a Python function you know I have this this funk in my module I can put custom attributes on it it will remember them I can do mostly anything in that I can do the same thing with with a code object that's that's immutable and that's actually shared between all functions that that are defined defined on the same line so if I have a loop and define functions in that loop there'll be different function objects with they might have different default defaults for their arguments they might have even different different names the attributes we set on there might be different but they're sharing the same same object so they're they're different if I call them they can do different things but the code the the actual instructions are the same because for each function definition that's that serialized code object that that module will go just have one code object that will be shared now so I have an immutable code object I have a mutable function wrapper on top of it that supplies the name and the and the default arguments and every time I call a function there's another object on top which is called a frame now the frame will remember what function is executing where in the function Python is currently and where to go when that function returns so let's see the best way to look at this stuff is in the import module I have a function that will get the current frame and write some write some information from it right so what we got is is a frame object it has some code it has the last instruction so this is the position in the bytecode where where this frame is it has a line number which is actually not stored in the frame it's I think it's computed on the fly when I ask for it from from the position it has some local variables and it has the frame of the function that called this so in this case this is a frame for my jupiter notebook cell right now the interesting thing about frames is that they are mutable so when when the function is executed the current instruction number changes so when whenever I printed this line number that corresponded to this line if I do this multiple times I will get different line numbers right so as the function is executed the frame gets updated also the locals might get updated if I set a variable I will get that get that and addictive of the locals all right so all this was packaged up in my frame object for the back the back is the function calling calling this frame so that will that will in this case give me my my notebook shell that that's a place to to return when my function is is done this is of course used used for printing trace backs when you get an exception this chain of frames is walked back and you get information from all the frames visited and I think I don't have too much time left so if you have any kind of questions I'm I'm aware that okay so I'm aware that this was a bit quick it was originally I give the gate this in three hours and I could go into a lot more detail so I hope to do this at the start of the conference to get some kind of background for everything but that didn't work out so if you have any questions if you want to go into more detail if there's something that's that sparked your interest then talk to either ask a question or talk to me later yes of course but let's first thank the speaker [Applause]
thank you better okay who's the who's gonna first ask your question who's gonna be first okay thank you for the talk actually I would replicate person asking such a question but I will try to and do you think the things shown could be useful in like regular Python programming for optimizing or debugging or just interesting yes they they will definitely be useful when I asked how
many people had trouble with with pyc cache invalidation I was really surprised and obviously when you're debugging that it helps to know you know what the pyc actually is and why it's
giving me stale stale information right also for debugging obviously it's a lot more useful if you are writing a debugger than if you're using one but it's it still helps you know how the things are executed right what happens when you define a function right there's there's something that's compiled ahead of time and then it's wrapped up in a function and then you get then you get a frame whenever a function is is called I think this is this is good knowledge to have you know to to have an idea how the computer works it it definitely helps in debugging if you have you know a better knowledge of the internals yes any of
the modules that you've imported to give us to talk and to show us all all those amazing amazing things actually written in Python or are the old written see it
depends a surprising amount of them are C wrappers our Python wrappers on on C libraries right one one of them I've shown you is inspect and inspect I believe has okay if I call
it from the command line Python - M
inspect is it inspect yeah then it'll actually give you the source code of Python modules right so this is how you could you could check but yeah most of them are are Python wrappers my question is regarding the interpreter the information that you present it regarding byte codes and so on are they
specific to normal Python interpreter or it's shared by pi PI or is that not it has nothing to do with the interpreter at all all right
this is something I actually forgot to say at the beginning all of this is specific to see pythons specifically its 4c Python 3 7 right it should stay the same four three seven one three seven two but it can change dramatically even across python versions for example I think in Python three six we got the word code where each instructions it's two bytes long which improved improved speed a bit before the the bytecode sizes were variable according to if the instruction took an argument or not so that just threw off all the tools that deal that had to deal with bytecode right all of this all of this specific to see python and also usually subject to change so anyone else
so you talked about how it gets to bytecode from the original code what happens after that to get to the machine code that actually runs Oh was that too
big a question okay so what happens if I'm if I'm in the frame if you know python is actually executing this there is no machine code involved it stays bytecode and the Python interpreter is inside there is a giant loop with a switch statement so it's just a bunch of ifs and it just looks at the current instruction and calls a c function called some C code that values an instruction nothing gets compiled to machine code in C Python so we have time
for one more question I think okay do you have the three hour talk somewhere I said video or something I don't think I do okay but if if you organize a meet-up
then let me know and I can maybe come over it's really interesting stuff and I always get interesting questions from people who have a bit of a bit more time to look into this more deeply so I would be interested in maybe getting that recorded one short question maybe better shot do
you have any links or places we can go to to look up more information about this topic yeah yes if you found this interesting then it's uploaded on the Europe I thumb sites so notebooks are
there and I have a little summary that
sometimes has a link for example the full grammar specification also most of this you can you can find in the Python documentation that's supposed to be very informative if it's not then file bug
maybe or or contribute or if you have a question I'm always happy to to discuss things either here on the conference tomorrow maybe or or by email thank you
thank you very much let's make some noise [Applause]
Loading...
Feedback

Timings

  704 ms - page object

Version

AV-Portal 3.21.3 (19e43a18c8aa08bcbdf3e35b975c18acb737c630)
hidden