We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Bytecodes and stacks

00:00

Formal Metadata

Title
Bytecodes and stacks
Subtitle
A look at CPython’s compiler and its execution model
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
So, you wrote some Python code. What needs to happen before it starts running? And once it's running, how does Python keep track of what it's doing? I'll talk about CPython's tokenization, parsing, bytecode and its serialization and cache, the stack-based virtual machine, line number tables, and code, frame and function objects. Don't worry if you've never heard of these concepts. While even experts should learn something new, the talk is aimed at anyone who's worked on a Python project or two.
35
74
Thumbnail
11:59
Data modelStack (abstract data type)Mathematical analysisString (computer science)InformationLine (geometry)Motion blurToken ringInterior (topology)Network topologyFrame problemBitFormal languageEndliche ModelltheorieExpressionMultiplication signModule (mathematics)Core dumpAbstract syntax treeCodeStack (abstract data type)Token ring2 (number)Computer fileRevision controlInformationFunctional (mathematics)Coefficient of determinationLine (geometry)Different (Kate Ryan album)Data structureBytecodeFreewareNetwork topologySocial classParsingEquals signRepresentation (politics)Phase transitionStreaming mediaCompilation albumParsingLaptopSpeech synthesisSystem callObject (grammar)WebsiteSource codeComputer programmingAsynchronous Transfer ModeString (computer science)MiniDiscVariable (mathematics)Self-organizationOperator (mathematics)Film editingCompilerProcess (computing)Field (computer science)CASE <Informatik>AdditionWhiteboardSign (mathematics)Computer animation
Module (mathematics)Object (grammar)CodeNetwork topologyComputer virusInterior (topology)Convex hullExecution unitAugmented realityFlagEnumerated typeLine (geometry)Logical constantLetterpress printingModule (mathematics)Object (grammar)Computer fileLogical constantCodeInformationAbstract syntax treeTupleFlagFunctional (mathematics)MetadataPresentation of a groupFree variables and bound variablesAsynchronous Transfer ModeTracing (software)Subject indexingNumeral (linguistics)Functional (mathematics)Local ringParameter (computer programming)Numbering schemeNetwork topologyType theoryImplementationVariable (mathematics)Stack (abstract data type)Computer animation
CodeModule (mathematics)Parameter (computer programming)Numbering schemeLogical constantSlide ruleFlagKernel (computing)Computer virusLine (geometry)Correlation and dependenceMaß <Mathematik>Letterpress printingNetwork topologyParsingBytecodeReading (process)Modul <Datentyp>Compilation albumStructural loadMultiplicationBinary fileFunction (mathematics)Drum memoryRepresentation (politics)Core dumpView (database)Electronic mailing listObject (grammar)Electronic program guideData compressionSerial portAbstract syntax treeBitComputer fileLine (geometry)CodeMathematicsNumbering schemeTimestampMobile appBytecodeDifferent (Kate Ryan album)Execution unitParameter (computer programming)Functional (mathematics)Table (information)TupleEmailAttribute grammarInformationFunctional (mathematics)Object (grammar)Revision controlPattern languageModule (mathematics)Computer programmingTrailCore dumpRight angleCache (computing)CASE <Informatik>WordStructural loadSinc functionInterface (computing)Function (mathematics)ResultantLogical constantFlagMiniDiscStatement (computer science)Electronic mailing listParsingData storage deviceLibrary (computing)Multiplication signOperator (mathematics)Variable (mathematics)MappingFile formatSet (mathematics)Marginal distributionLevel (video gaming)Source code2 (number)Computer animation
IntelModule (mathematics)Function (mathematics)Computer virusInterior (topology)CodeConvex hullAttribute grammarSystem callCAN busElectronic program guideLogical constantFrame problemHacker (term)outputData compressionObject (grammar)Attribute grammarCodeFunctional (mathematics)Functional (mathematics)InformationDifferent (Kate Ryan album)MathematicsMultiplication signLocal ringLine (geometry)Frame problemNumbering schemeDefault (computer science)LaptopCASE <Informatik>Computer configurationHash functionGastropod shellWrapper (data mining)ChainException handlingParameter (computer programming)Logic gateLoop (music)BitLetterpress printingContext awarenessPosition operatorVariable (mathematics)Endliche ModelltheorieFlagModule (mathematics)Cellular automatonBytecodeOcean currentLecture/ConferenceComputer animation
Regular graphComputer programmingComputer animation
Cache (computing)Functional (mathematics)NeuroinformatikRight angleFrame problemMultiplication signInformationComputer animationLecture/Conference
Module (mathematics)Wrapper (data mining)Library (computing)Computer animationLecture/Conference
Attribute grammarFunction (mathematics)Module (mathematics)InformationObject (grammar)Source codeType theoryView (database)HierarchySocial classInformationInterpreter (computing)Data compressionWrapper (data mining)BytecodeComputer animationXML
Normal (geometry)Right angleInterpreter (computing)BytecodeRevision controlCodeWordBitParameter (computer programming)Lecture/ConferenceMeeting/Interview
CodeMachine codeBytecodeInterpreter (computing)Multiplication signLoop (music)Installable File SystemFunctional (mathematics)Frame problemStatement (computer science)Lecture/ConferenceMeeting/Interview
VideoconferencingBitMultiplication signRow (database)Lecture/ConferenceMeeting/Interview
Maxima and minimaLaptopLink (knot theory)InformationThumbnailWebsiteLecture/Conference
Stack (abstract data type)CompilerData modelDefault (computer science)Bloch waveInclusion mapGamma functionUsabilityConvex hullBargaining problemFormal grammarSoftware bugLink (knot theory)Latent heatEmailComputer animationLecture/ConferenceMeeting/Interview
Function (mathematics)Interface (computing)Attribute grammarCodeObject (grammar)Source codeFluidType theoryNoise (electronics)Computer animation
Transcript: English(auto-generated)
Hello everyone. So I'm Peter. I work for Red Hat on Python. Great job by the way. And today I will be talking about byte codes and stacks. You might not have seen the talk on the board downstairs because that's a late addition
to the conference. Also you might have been to a talk that is a bit more advanced than this one already, so I apologize for that. Hopefully this will get some of the basics a bit clearer. It's not very good timing at
the end of the conference, but bear with me. So I will be talking about how Python actually takes the codes you've written and manages to do what the code says. So we'll be looking at the compiler and the abstract syntax tree at tokenization, at the execution model which is frames, and so on. This is for
Python version 2.7. There are always little tweaks between different Python versions to all this, so if you're using a different version not
everything will work exactly the same. Also I've uploaded the notebooks I will be using to the EuroPython website, so if you have some more time then you can just play along. Right, so I have this little Python module. It does
very simple stuff. It assigns some variables, prints something, then it defines a function. So this will be using this for examples throughout the talk. Now this is just a piece of text. This is just some
bytes or some Unicode characters sitting on your disk. Then you tell Python to run it, so what actually happens? Python opens the file, reads the text, and the first thing it needs to do is to tokenize, which means cut the text up into very small pieces. Actually, how many of you have taken a
compilers class in school? Almost half, I guess, so I guess you'll be a bit bored. How many have you taken a compilers class on Python? None. Okay, there are a
few differences to the traditional model. So tokenization, cutting the source text into little pieces that we can then use. There's a module for it. You can run the tokenize module. It will give you something like this. So all the information from the original file, let's say this equals sign is
an operator. It occurs on line one between characters two and three. So like this, the text is cut up into little pieces that are called tokens that
the rest of the process then uses. Now a little difference from the traditional model that you will not find in other languages is if you have indentation, traditional languages would have braces at the beginning and end, which then gets converted into brace tokens. In Python, we have indent and
deadend tokens. So the indentation is actually detected in the tokenization phase. If you have some text between parentheses where indentation doesn't matter, then the tokenizer doesn't emit these indent and deadend tokens. So there's a bit of cleverness already built into
the tokenization step. If you'd like to do this from Python, you just open the file. You call tokenize.tokenize and you get all the information to play with. This is really nice if you're, for example, writing a Python-like language that has all the same tokens. You can just use tokenization for free
and then build some kind of parser on top. So speaking of parsers, that's actually the next step that happens. Now you have a stream of tokens. You need to turn it into an abstract syntax tree. Now there was a talk about abstract
syntax trees before. Is there anyone who does not know what an AST is? Okay, maybe I should have written my abstract a little bit clearer. So an AST, as I guess you all know, is a representation of what will actually happen in the
text. So this does tokenization and the parsing in one step. Gives you a module object which has some fields. In this case, it has a body. And inside that
body, we have my assignment, assignment, expression, which is a function call, and a function definition. Python itself has an AST dump function, which is not really useful because it just puts everything in a single string. There are lots of modules on PyPI that make this a little bit better to work
with. AST pretty is something I just found. Previous versions of this talk had the code actually in there. It's not hard to find. So here's the structure of my code. There's an assignment with a target and a value, an assignment with the target and a value, expression, which is a call. I
So now we have an AST tree. The next step to do with this AST tree is to compile it into bytecode. Again, I'll ask, is there anyone who does not know
about compiling to bytecode? A few people, okay. So let's run through this. Again, I have this module. I have my AST, and I can call the built-in compile function, which will take this AST and package it up into a code object.
What this takes is the AST tree, a file name suggested it can tag the code object. This is where it's coming from. Backtraces then can use that information. And there's a mode. I'm not getting into that in this talk.
So this will give us a code object. A code object is something Python can run. So if I do exec of the code object, it will actually run all the code contained in there. Now, what is inside a code object? There's a little function called code info that gives me most of the things
inside there formatted pretty nicely. So it has some meta information, some file name. It's also used for functions, so it has lots of stuff about arguments, what kind of information goes in. Some internals, number of local variables, stack size is an implementation detail.
It has flags to say what kind of features this code object has. For example, if there are any free variables or non-locals, so it needs to reach out. Or if in Python 2 there were two kinds of functions,
one was faster, so the type of the code object is encoded in these flags. We also have a tuple of constants. So whatever constants are present are put into a tuple, and then they're referred by an index.
We also have names, because inside the code object, usually Python refers to everything by numerical index. When it needs to print a back trace or generate the mapping of variables to values, it will look here and get the names of the actual variables.
If I write my own function, now I guess don't spend too much time reading this. This just prints out all the attributes of my code object. There's a bit more, there's a lot of the same information we
had before, but there's something that this module hides from us. One of those is the code, which we couldn't read anyway, but this is the actual instructions. This is what Python will then execute as your code runs. Another thing we have is the ELNO tab, which is
a mapping of the byte codes to line numbers. Again, when a stack trace is generated, Python doesn't keep track of lines, but of where in the byte code it is, and then it uses this line number table to map back to individual lines. That's mostly it.
Whenever I have a function in my module, as I had before, it gets put in, well, the code object for that function is compiled
separately and put in as a constant for the enclosing module. Then what happens is when that function is created, when the def statement is actually executed, this code object gets packaged into a function. This is where that is taken from. If you want to
look for information on that, you can just get it out of the constants, and the flags are a little bit different for functions than they are for modules, for example. I talked a bit about the byte code. Let's go a bit more in depth.
If I have my AST again, my byte code looks a bit like this. It's bytes. It's just numbers from 0 to 225. If I print them out as actual numbers, they become a bit more readable. You can see a bit of pattern here. It's always some big number and some
small number, big number and small number. Actually, the byte codes come in pairs of two bytes. It's not actually a byte code. It's a word code. It has units of two bytes. It's always an instruction and an argument. The disk module can tell us
what all these numbers mean. First, we have the 100 and 0. The 100 means load a constant, and the 0 is which constant this will load. This was for the statement A equals 3. What this needs to do is load the number 3. That puts it onto a stack,
and then it stores that value into the variable A. This does two instructions, load a number, store the number. Similar for the second line, for calling a function, it just gets all the
arguments of the function. It gets the function itself. In this case, I was multiplying something. It calls multiply, and then it calls the function. Because we're not using the return value, it gets rid of the return value. Byte code is
just a set of instructions that Python follows linearly as it goes along. You can also run the disk module from the command line, get the exact same result. You can also use it
programmatically. If you do get instructions, you get nice little objects that tell you what each individual byte code means. This is pretty similar output to what we had before.
Right? Now that we have our byte code or our information about what the program should do, what Python does is it saves that, because all this parsing is quite an expensive operation. Python will save this, cache this, so that next
time the same module is loaded, it doesn't have to parse it again. So serialization is a bit like, I guess at this level most of you know it, but it's a bit like JSON. You get some values. You can write them into
a file. This doesn't use JSON. It uses a module called Marshall. So Marshall has pretty much the same interface
as JSON. You call dumps. You get some bytes corresponding to whatever you put in. Marshall is specially made for Python code objects, so it can only serialize simple stuff like tuples, numbers, code objects themselves. Anything
mutable will not appear in this dump. So if your program creates a list, there's no constant for that list in the byte code. There's just an instruction to create a new list. For immutable stuff like tuples, numbers, the codes
are already in there. So I skipped a bit. Where does Python put all this? It puts it in a PYC file, since anybody not familiar with PYC files? Everybody is? Great. So this is what a PYC file looks inside. It's, again, just
a bunch of bytes. If I list what the serialized version of the code object is, also a list of bytes. If I compare them, the PYC file will have a header on the first
section, and the rest is just the serialized code object. So we can verify the first 16 bytes are a header, the rest is just the code object. What is this header? The header is mostly there for
cache invalidation. So when you change the source code, Python has to check if the old cache is still valid. If you change it, it's not valid, and it has a bit of
information to check if we're still fresh. So the first four numbers are what's called a magic number. It's the same for every Python version. It's called import libutil magic number, and this just tells Python, you know, this is a PYC file for this particular version of
the byte code. If you run a PYC file with a different version of Python, it won't load. The next two files are, in Python 3.7, usually all zeros. These are flags because there are few different versions, few different formats for PYC files that do a different kind
of cache invalidation. Usually what you get would be the zeros. The next four bytes are a number of seconds from 2000 something from the Unix epoch, so
it's a time stamp. So whenever you change the source code, its time stamp will change, and Python will know that something changed. Of course if you do many changes in the same second, you might run into trouble, which is why we have the last thing in the
header, and that is the size of the original source code. So if you add something, the size should hopefully change, and the cache will be invalidated. This is not perfect, but it works in most cases. Has anybody had
problem with PYC cache invalidation? Oh, wow, that's many more than I expected. So for you who raised their hands, there is now in Python 3.7 an option to use
actual hashes of the source code. For this, you set some flags, use the compile all module. It has an option to use hashing instead or to not check anything at all.
So if this is not OK for you, there are options now. So that was caching. How much time do I have? About five minutes. So with that out of the way, I will talk a little bit about the execution model. So we have
the code object, which has all the instructions, and the code object is immutable, which of course is different from functions. If you have a Python function, you know, I have this func in my module. I can put custom
attributes on it. It will remember them. I can do mostly anything in that. I can do the same thing with the code object. That's immutable, and that's actually shared between all functions that are defined on the same line. So if I have a loop and define functions in
that loop, there will be different function objects with they might have different defaults for their arguments. They might have even different names. The attributes
we set on them might be different, but they're sharing the same object. So they're different. If I call them, they can do different things, but the code, the actual instructions are the same because for each function definition, that serialized code object, that module,
will just have one code object that will be shared. Now, so I have an immutable code object. I have a mutable function wrapper on top of it that supplies the name and the default arguments. And every time I call a function, there's another object on top which is
called a frame. Now, the frame will remember what function is executing, where in the function Python is currently, and where to go when that function returns. So let's see. The best way to look at this stuff is
in the import module. I have a function that will get the current frame and write some information from it. So what we got is a frame object. It has some code. It has the last instruction, so this is the position in
the bytecode where this frame is. It has a line number which is actually not stored in the frame. I think it's computed on the fly when I ask for it from the position. It has some local variables, and it has the frame of the
function that called this. So in this case, this is a frame for my Jupyter notebook cell. Now, the interesting thing about frames is that they are mutable. So when the function is executed, the current instruction number changes. So whenever I
printed this line number, that corresponded to this line, if I do this multiple times, I will get different line numbers. So as the function is executed, the frame gets updated. Also, the locals might get updated if I set a variable.
I will get that in the dict of the locals. All right, so all this is packaged up in my frame object. For the back, the back is the function calling
this frame. So that will, in this case, give me my notebook shell. That's a place to return when my function is done. This is, of course, used for printing tracebacks when
you get an exception. This chain of frames is walked back, and you get information from all the frames visited. And I think I don't have too much time left. So if you have any kind of questions, I'm
aware that this was a bit quick. Originally, I gave this in three hours, and I could go into a lot more detail. So I hope to do this at the start of the
conference to get some kind of background for everything. But that didn't work out. So if you have any questions, if you want to go into more detail, if there's something that sparked your interest, then either ask a question or talk
to me later. Yes, of course. But let's first thank the speaker. Thank you, Peter. OK, who's going to first ask a question? Who's going to be first? OK.
So thank you for the talk. Actually, I would probably hate the person asking such a question, but I will try still. Do you think the things you've shown could be useful in like regular Python programming for optimizing or debugging, or just
interesting? Yes, they will definitely be useful. When I asked how many people had trouble with the PYC cache invalidation, I was really surprised. And obviously, when you're debugging that, it helps to know what
the PYC actually is and why it's giving me stale information. Also for debugging, obviously it's a lot more useful if you are writing a debugger than if you're using one. But it still helps to know how the things are executed. What happens when
you define a function? There's something that's compiled ahead of time, and then it's wrapped up in a function, and then you get a frame whenever a function is called. I think this is good knowledge to have an idea how the computer works. It
definitely helps in debugging if you have a better knowledge of the internals. Is any of the modules that you've imported to give us the talk and to show us all those amazing things actually written in Python, or are they all written in C? It depends. A surprising amount of them are Python
wrappers on C libraries. One of them I've shown you is inspect, and inspect, I believe, has...
Okay, if I call it from the command line, python-m-inspect, is it inspect? Yeah. Then it'll
actually give you the source code of Python modules, right? So this is how you could check. But, yeah, most of them are Python wrappers. My question is regarding the interpreter. The
information that you presented regarding byte codes and so on, are they specific to normal Python interpreter, or it's shared by PyPy, or it has nothing to do with interpreter at all? Oh, right. This is something I actually forgot to say at the beginning. All of this is specific to
CPython. Specifically, it's for CPython 3.7. It should stay the same for 3.7.1, 3.7.2, but it can change dramatically even across Python versions. For example, I think at Python 3.6, we got the
word code where each instruction is two bytes long, which improved speed a bit before the byte code sizes were variable according to if the instruction took an argument or not. So that just threw off all the tools that had to deal with
byte code. All of this is specific to CPython and also usually subject to change. So anyone else? So you talked about how it gets to byte code from the original code. What happens after that
to get to the machine code that actually runs? Was that too big a question? Okay. So what happens if I'm in the frame, if Python is actually executing this? There is no machine code involved. It stays byte code, and the Python interpreter is inside,
there is a giant loop with a C switch statement, so it's just a bunch of ifs, and it just looks at the current instruction and calls a C function, calls some C code that values the instruction. Nothing gets compiled to machine code in CPython.
So we have time for one more question, I think. Okay. Do you have the three-hour talk somewhere? Is it a video or something? I don't think I do. Okay.
But if you organize a meetup, then let me know and I can maybe come over. It's really interesting stuff, and I always get interesting questions from people who have a bit more time to look into this more deeply. So I would be interested in maybe getting that recorded.
One short question, maybe. Very short. Yeah. Do you have any links or places we could go to to look up more information about this topic?
Yeah, yes. If you found this interesting, then it's uploaded on the EuroPython site, so notebooks are there. And I have a little summary that sometimes has a link. For example, the full grammar specification. Also, most of this you can find
in the Python documentation. That's supposed to be very informative. If it's not, then file a bug, maybe, or contribute. Or if you have a question, I'm always happy to discuss things, either here on the conference tomorrow, maybe, or by email.
Thank you. Thank you very much. Let's make some noise.