Bytecodes and stacks
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Subtitle |
| |
Title of Series | ||
Number of Parts | 132 | |
Author | ||
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/44896 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
EuroPython 2018109 / 132
2
3
7
8
10
14
15
19
22
27
29
30
31
34
35
41
44
54
55
56
58
59
61
66
74
77
78
80
81
85
87
91
93
96
98
103
104
105
109
110
111
113
115
116
118
120
121
122
123
125
127
128
129
130
131
132
00:00
Data modelStack (abstract data type)Mathematical analysisString (computer science)InformationLine (geometry)Motion blurToken ringInterior (topology)Network topologyFrame problemBitFormal languageEndliche ModelltheorieExpressionMultiplication signModule (mathematics)Core dumpAbstract syntax treeCodeStack (abstract data type)Token ring2 (number)Computer fileRevision controlInformationFunctional (mathematics)Coefficient of determinationLine (geometry)Different (Kate Ryan album)Data structureBytecodeFreewareNetwork topologySocial classParsingEquals signRepresentation (politics)Phase transitionStreaming mediaCompilation albumParsingLaptopSpeech synthesisSystem callObject (grammar)WebsiteSource codeComputer programmingAsynchronous Transfer ModeString (computer science)MiniDiscVariable (mathematics)Self-organizationOperator (mathematics)Film editingCompilerProcess (computing)Field (computer science)CASE <Informatik>AdditionWhiteboardSign (mathematics)Computer animation
06:26
Module (mathematics)Object (grammar)CodeNetwork topologyComputer virusInterior (topology)Convex hullExecution unitAugmented realityFlagEnumerated typeLine (geometry)Logical constantLetterpress printingModule (mathematics)Object (grammar)Computer fileLogical constantCodeInformationAbstract syntax treeTupleFlagFunctional (mathematics)MetadataPresentation of a groupFree variables and bound variablesAsynchronous Transfer ModeTracing (software)Subject indexingNumeral (linguistics)Functional (mathematics)Local ringParameter (computer programming)Numbering schemeNetwork topologyType theoryImplementationVariable (mathematics)Stack (abstract data type)Computer animation
08:31
CodeModule (mathematics)Parameter (computer programming)Numbering schemeLogical constantSlide ruleFlagKernel (computing)Computer virusLine (geometry)Correlation and dependenceMaß <Mathematik>Letterpress printingNetwork topologyParsingBytecodeReading (process)Modul <Datentyp>Compilation albumStructural loadMultiplicationBinary fileFunction (mathematics)Drum memoryRepresentation (politics)Core dumpView (database)Electronic mailing listObject (grammar)Electronic program guideData compressionSerial portAbstract syntax treeBitComputer fileLine (geometry)CodeMathematicsNumbering schemeTimestampMobile appBytecodeDifferent (Kate Ryan album)Execution unitParameter (computer programming)Functional (mathematics)Table (information)TupleEmailAttribute grammarInformationFunctional (mathematics)Object (grammar)Revision controlPattern languageModule (mathematics)Computer programmingTrailCore dumpRight angleCache (computing)CASE <Informatik>WordStructural loadSinc functionInterface (computing)Function (mathematics)ResultantLogical constantFlagMiniDiscStatement (computer science)Electronic mailing listParsingData storage deviceLibrary (computing)Multiplication signOperator (mathematics)Variable (mathematics)MappingFile formatSet (mathematics)Marginal distributionLevel (video gaming)Source code2 (number)Computer animation
17:45
IntelModule (mathematics)Function (mathematics)Computer virusInterior (topology)CodeConvex hullAttribute grammarSystem callCAN busElectronic program guideLogical constantFrame problemHacker (term)outputData compressionObject (grammar)Attribute grammarCodeFunctional (mathematics)Functional (mathematics)InformationDifferent (Kate Ryan album)MathematicsMultiplication signLocal ringLine (geometry)Frame problemNumbering schemeDefault (computer science)LaptopCASE <Informatik>Computer configurationHash functionGastropod shellWrapper (data mining)ChainException handlingParameter (computer programming)Logic gateLoop (music)BitLetterpress printingContext awarenessPosition operatorVariable (mathematics)Endliche ModelltheorieFlagModule (mathematics)Cellular automatonBytecodeOcean currentLecture/ConferenceComputer animation
23:46
Regular graphComputer programmingComputer animation
24:29
Cache (computing)Functional (mathematics)NeuroinformatikRight angleFrame problemMultiplication signInformationComputer animationLecture/Conference
25:30
Module (mathematics)Wrapper (data mining)Library (computing)Computer animationLecture/Conference
26:10
Attribute grammarFunction (mathematics)Module (mathematics)InformationObject (grammar)Source codeType theoryView (database)HierarchySocial classInformationInterpreter (computing)Data compressionWrapper (data mining)BytecodeComputer animationXML
26:46
Normal (geometry)Right angleInterpreter (computing)BytecodeRevision controlCodeWordBitParameter (computer programming)Lecture/ConferenceMeeting/Interview
27:54
CodeMachine codeBytecodeInterpreter (computing)Multiplication signLoop (music)Installable File SystemFunctional (mathematics)Frame problemStatement (computer science)Lecture/ConferenceMeeting/Interview
28:41
VideoconferencingBitMultiplication signRow (database)Lecture/ConferenceMeeting/Interview
29:26
Maxima and minimaLaptopLink (knot theory)InformationThumbnailWebsiteLecture/Conference
29:49
Stack (abstract data type)CompilerData modelDefault (computer science)Bloch waveInclusion mapGamma functionUsabilityConvex hullBargaining problemFormal grammarSoftware bugLink (knot theory)Latent heatEmailComputer animationLecture/ConferenceMeeting/Interview
30:23
Function (mathematics)Interface (computing)Attribute grammarCodeObject (grammar)Source codeFluidType theoryNoise (electronics)Computer animation
Transcript: English(auto-generated)
00:04
Hello everyone. So I'm Peter. I work for Red Hat on Python. Great job by the way. And today I will be talking about byte codes and stacks. You might not have seen the talk on the board downstairs because that's a late addition
00:22
to the conference. Also you might have been to a talk that is a bit more advanced than this one already, so I apologize for that. Hopefully this will get some of the basics a bit clearer. It's not very good timing at
00:41
the end of the conference, but bear with me. So I will be talking about how Python actually takes the codes you've written and manages to do what the code says. So we'll be looking at the compiler and the abstract syntax tree at tokenization, at the execution model which is frames, and so on. This is for
01:10
Python version 2.7. There are always little tweaks between different Python versions to all this, so if you're using a different version not
01:21
everything will work exactly the same. Also I've uploaded the notebooks I will be using to the EuroPython website, so if you have some more time then you can just play along. Right, so I have this little Python module. It does
01:44
very simple stuff. It assigns some variables, prints something, then it defines a function. So this will be using this for examples throughout the talk. Now this is just a piece of text. This is just some
02:01
bytes or some Unicode characters sitting on your disk. Then you tell Python to run it, so what actually happens? Python opens the file, reads the text, and the first thing it needs to do is to tokenize, which means cut the text up into very small pieces. Actually, how many of you have taken a
02:25
compilers class in school? Almost half, I guess, so I guess you'll be a bit bored. How many have you taken a compilers class on Python? None. Okay, there are a
02:41
few differences to the traditional model. So tokenization, cutting the source text into little pieces that we can then use. There's a module for it. You can run the tokenize module. It will give you something like this. So all the information from the original file, let's say this equals sign is
03:07
an operator. It occurs on line one between characters two and three. So like this, the text is cut up into little pieces that are called tokens that
03:22
the rest of the process then uses. Now a little difference from the traditional model that you will not find in other languages is if you have indentation, traditional languages would have braces at the beginning and end, which then gets converted into brace tokens. In Python, we have indent and
03:44
deadend tokens. So the indentation is actually detected in the tokenization phase. If you have some text between parentheses where indentation doesn't matter, then the tokenizer doesn't emit these indent and deadend tokens. So there's a bit of cleverness already built into
04:03
the tokenization step. If you'd like to do this from Python, you just open the file. You call tokenize.tokenize and you get all the information to play with. This is really nice if you're, for example, writing a Python-like language that has all the same tokens. You can just use tokenization for free
04:25
and then build some kind of parser on top. So speaking of parsers, that's actually the next step that happens. Now you have a stream of tokens. You need to turn it into an abstract syntax tree. Now there was a talk about abstract
04:43
syntax trees before. Is there anyone who does not know what an AST is? Okay, maybe I should have written my abstract a little bit clearer. So an AST, as I guess you all know, is a representation of what will actually happen in the
05:09
text. So this does tokenization and the parsing in one step. Gives you a module object which has some fields. In this case, it has a body. And inside that
05:22
body, we have my assignment, assignment, expression, which is a function call, and a function definition. Python itself has an AST dump function, which is not really useful because it just puts everything in a single string. There are lots of modules on PyPI that make this a little bit better to work
05:45
with. AST pretty is something I just found. Previous versions of this talk had the code actually in there. It's not hard to find. So here's the structure of my code. There's an assignment with a target and a value, an assignment with the target and a value, expression, which is a call. I
06:08
So now we have an AST tree. The next step to do with this AST tree is to compile it into bytecode. Again, I'll ask, is there anyone who does not know
06:22
about compiling to bytecode? A few people, okay. So let's run through this. Again, I have this module. I have my AST, and I can call the built-in compile function, which will take this AST and package it up into a code object.
06:48
What this takes is the AST tree, a file name suggested it can tag the code object. This is where it's coming from. Backtraces then can use that information. And there's a mode. I'm not getting into that in this talk.
07:02
So this will give us a code object. A code object is something Python can run. So if I do exec of the code object, it will actually run all the code contained in there. Now, what is inside a code object? There's a little function called code info that gives me most of the things
07:24
inside there formatted pretty nicely. So it has some meta information, some file name. It's also used for functions, so it has lots of stuff about arguments, what kind of information goes in. Some internals, number of local variables, stack size is an implementation detail.
07:46
It has flags to say what kind of features this code object has. For example, if there are any free variables or non-locals, so it needs to reach out. Or if in Python 2 there were two kinds of functions,
08:02
one was faster, so the type of the code object is encoded in these flags. We also have a tuple of constants. So whatever constants are present are put into a tuple, and then they're referred by an index.
08:24
We also have names, because inside the code object, usually Python refers to everything by numerical index. When it needs to print a back trace or generate the mapping of variables to values, it will look here and get the names of the actual variables.
08:47
If I write my own function, now I guess don't spend too much time reading this. This just prints out all the attributes of my code object. There's a bit more, there's a lot of the same information we
09:00
had before, but there's something that this module hides from us. One of those is the code, which we couldn't read anyway, but this is the actual instructions. This is what Python will then execute as your code runs. Another thing we have is the ELNO tab, which is
09:22
a mapping of the byte codes to line numbers. Again, when a stack trace is generated, Python doesn't keep track of lines, but of where in the byte code it is, and then it uses this line number table to map back to individual lines. That's mostly it.
09:48
Whenever I have a function in my module, as I had before, it gets put in, well, the code object for that function is compiled
10:00
separately and put in as a constant for the enclosing module. Then what happens is when that function is created, when the def statement is actually executed, this code object gets packaged into a function. This is where that is taken from. If you want to
10:20
look for information on that, you can just get it out of the constants, and the flags are a little bit different for functions than they are for modules, for example. I talked a bit about the byte code. Let's go a bit more in depth.
10:40
If I have my AST again, my byte code looks a bit like this. It's bytes. It's just numbers from 0 to 225. If I print them out as actual numbers, they become a bit more readable. You can see a bit of pattern here. It's always some big number and some
11:03
small number, big number and small number. Actually, the byte codes come in pairs of two bytes. It's not actually a byte code. It's a word code. It has units of two bytes. It's always an instruction and an argument. The disk module can tell us
11:23
what all these numbers mean. First, we have the 100 and 0. The 100 means load a constant, and the 0 is which constant this will load. This was for the statement A equals 3. What this needs to do is load the number 3. That puts it onto a stack,
11:46
and then it stores that value into the variable A. This does two instructions, load a number, store the number. Similar for the second line, for calling a function, it just gets all the
12:02
arguments of the function. It gets the function itself. In this case, I was multiplying something. It calls multiply, and then it calls the function. Because we're not using the return value, it gets rid of the return value. Byte code is
12:22
just a set of instructions that Python follows linearly as it goes along. You can also run the disk module from the command line, get the exact same result. You can also use it
12:41
programmatically. If you do get instructions, you get nice little objects that tell you what each individual byte code means. This is pretty similar output to what we had before.
13:00
Right? Now that we have our byte code or our information about what the program should do, what Python does is it saves that, because all this parsing is quite an expensive operation. Python will save this, cache this, so that next
13:24
time the same module is loaded, it doesn't have to parse it again. So serialization is a bit like, I guess at this level most of you know it, but it's a bit like JSON. You get some values. You can write them into
13:43
a file. This doesn't use JSON. It uses a module called Marshall. So Marshall has pretty much the same interface
14:01
as JSON. You call dumps. You get some bytes corresponding to whatever you put in. Marshall is specially made for Python code objects, so it can only serialize simple stuff like tuples, numbers, code objects themselves. Anything
14:23
mutable will not appear in this dump. So if your program creates a list, there's no constant for that list in the byte code. There's just an instruction to create a new list. For immutable stuff like tuples, numbers, the codes
14:40
are already in there. So I skipped a bit. Where does Python put all this? It puts it in a PYC file, since anybody not familiar with PYC files? Everybody is? Great. So this is what a PYC file looks inside. It's, again, just
15:04
a bunch of bytes. If I list what the serialized version of the code object is, also a list of bytes. If I compare them, the PYC file will have a header on the first
15:25
section, and the rest is just the serialized code object. So we can verify the first 16 bytes are a header, the rest is just the code object. What is this header? The header is mostly there for
15:43
cache invalidation. So when you change the source code, Python has to check if the old cache is still valid. If you change it, it's not valid, and it has a bit of
16:02
information to check if we're still fresh. So the first four numbers are what's called a magic number. It's the same for every Python version. It's called import libutil magic number, and this just tells Python, you know, this is a PYC file for this particular version of
16:21
the byte code. If you run a PYC file with a different version of Python, it won't load. The next two files are, in Python 3.7, usually all zeros. These are flags because there are few different versions, few different formats for PYC files that do a different kind
16:43
of cache invalidation. Usually what you get would be the zeros. The next four bytes are a number of seconds from 2000 something from the Unix epoch, so
17:04
it's a time stamp. So whenever you change the source code, its time stamp will change, and Python will know that something changed. Of course if you do many changes in the same second, you might run into trouble, which is why we have the last thing in the
17:23
header, and that is the size of the original source code. So if you add something, the size should hopefully change, and the cache will be invalidated. This is not perfect, but it works in most cases. Has anybody had
17:41
problem with PYC cache invalidation? Oh, wow, that's many more than I expected. So for you who raised their hands, there is now in Python 3.7 an option to use
18:01
actual hashes of the source code. For this, you set some flags, use the compile all module. It has an option to use hashing instead or to not check anything at all.
18:20
So if this is not OK for you, there are options now. So that was caching. How much time do I have? About five minutes. So with that out of the way, I will talk a little bit about the execution model. So we have
18:44
the code object, which has all the instructions, and the code object is immutable, which of course is different from functions. If you have a Python function, you know, I have this func in my module. I can put custom
19:02
attributes on it. It will remember them. I can do mostly anything in that. I can do the same thing with the code object. That's immutable, and that's actually shared between all functions that are defined on the same line. So if I have a loop and define functions in
19:25
that loop, there will be different function objects with they might have different defaults for their arguments. They might have even different names. The attributes
19:40
we set on them might be different, but they're sharing the same object. So they're different. If I call them, they can do different things, but the code, the actual instructions are the same because for each function definition, that serialized code object, that module,
20:03
will just have one code object that will be shared. Now, so I have an immutable code object. I have a mutable function wrapper on top of it that supplies the name and the default arguments. And every time I call a function, there's another object on top which is
20:22
called a frame. Now, the frame will remember what function is executing, where in the function Python is currently, and where to go when that function returns. So let's see. The best way to look at this stuff is
20:41
in the import module. I have a function that will get the current frame and write some information from it. So what we got is a frame object. It has some code. It has the last instruction, so this is the position in
21:04
the bytecode where this frame is. It has a line number which is actually not stored in the frame. I think it's computed on the fly when I ask for it from the position. It has some local variables, and it has the frame of the
21:22
function that called this. So in this case, this is a frame for my Jupyter notebook cell. Now, the interesting thing about frames is that they are mutable. So when the function is executed, the current instruction number changes. So whenever I
21:42
printed this line number, that corresponded to this line, if I do this multiple times, I will get different line numbers. So as the function is executed, the frame gets updated. Also, the locals might get updated if I set a variable.
22:04
I will get that in the dict of the locals. All right, so all this is packaged up in my frame object. For the back, the back is the function calling
22:20
this frame. So that will, in this case, give me my notebook shell. That's a place to return when my function is done. This is, of course, used for printing tracebacks when
22:41
you get an exception. This chain of frames is walked back, and you get information from all the frames visited. And I think I don't have too much time left. So if you have any kind of questions, I'm
23:03
aware that this was a bit quick. Originally, I gave this in three hours, and I could go into a lot more detail. So I hope to do this at the start of the
23:22
conference to get some kind of background for everything. But that didn't work out. So if you have any questions, if you want to go into more detail, if there's something that sparked your interest, then either ask a question or talk
23:40
to me later. Yes, of course. But let's first thank the speaker. Thank you, Peter. OK, who's going to first ask a question? Who's going to be first? OK.
24:05
So thank you for the talk. Actually, I would probably hate the person asking such a question, but I will try still. Do you think the things you've shown could be useful in like regular Python programming for optimizing or debugging, or just
24:22
interesting? Yes, they will definitely be useful. When I asked how many people had trouble with the PYC cache invalidation, I was really surprised. And obviously, when you're debugging that, it helps to know what
24:41
the PYC actually is and why it's giving me stale information. Also for debugging, obviously it's a lot more useful if you are writing a debugger than if you're using one. But it still helps to know how the things are executed. What happens when
25:03
you define a function? There's something that's compiled ahead of time, and then it's wrapped up in a function, and then you get a frame whenever a function is called. I think this is good knowledge to have an idea how the computer works. It
25:22
definitely helps in debugging if you have a better knowledge of the internals. Is any of the modules that you've imported to give us the talk and to show us all those amazing things actually written in Python, or are they all written in C? It depends. A surprising amount of them are Python
25:47
wrappers on C libraries. One of them I've shown you is inspect, and inspect, I believe, has...
26:12
Okay, if I call it from the command line, python-m-inspect, is it inspect? Yeah. Then it'll
26:23
actually give you the source code of Python modules, right? So this is how you could check. But, yeah, most of them are Python wrappers. My question is regarding the interpreter. The
26:42
information that you presented regarding byte codes and so on, are they specific to normal Python interpreter, or it's shared by PyPy, or it has nothing to do with interpreter at all? Oh, right. This is something I actually forgot to say at the beginning. All of this is specific to
27:01
CPython. Specifically, it's for CPython 3.7. It should stay the same for 3.7.1, 3.7.2, but it can change dramatically even across Python versions. For example, I think at Python 3.6, we got the
27:21
word code where each instruction is two bytes long, which improved speed a bit before the byte code sizes were variable according to if the instruction took an argument or not. So that just threw off all the tools that had to deal with
27:40
byte code. All of this is specific to CPython and also usually subject to change. So anyone else? So you talked about how it gets to byte code from the original code. What happens after that
28:01
to get to the machine code that actually runs? Was that too big a question? Okay. So what happens if I'm in the frame, if Python is actually executing this? There is no machine code involved. It stays byte code, and the Python interpreter is inside,
28:23
there is a giant loop with a C switch statement, so it's just a bunch of ifs, and it just looks at the current instruction and calls a C function, calls some C code that values the instruction. Nothing gets compiled to machine code in CPython.
28:40
So we have time for one more question, I think. Okay. Do you have the three-hour talk somewhere? Is it a video or something? I don't think I do. Okay.
29:00
But if you organize a meetup, then let me know and I can maybe come over. It's really interesting stuff, and I always get interesting questions from people who have a bit more time to look into this more deeply. So I would be interested in maybe getting that recorded.
29:26
One short question, maybe. Very short. Yeah. Do you have any links or places we could go to to look up more information about this topic?
29:43
Yeah, yes. If you found this interesting, then it's uploaded on the EuroPython site, so notebooks are there. And I have a little summary that sometimes has a link. For example, the full grammar specification. Also, most of this you can find
30:01
in the Python documentation. That's supposed to be very informative. If it's not, then file a bug, maybe, or contribute. Or if you have a question, I'm always happy to discuss things, either here on the conference tomorrow, maybe, or by email.
30:23
Thank you. Thank you very much. Let's make some noise.