Writing a Python interpreter from scratch, in half an hour
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 141 | |
Author | ||
Contributors | ||
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/68770 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
EuroPython 20233 / 141
8
17
22
26
27
31
42
48
52
55
56
59
64
66
67
72
73
77
79
83
86
87
95
99
103
105
113
114
115
118
119
123
129
131
135
139
140
141
00:00
WritingSource codeFluid staticsMathematical analysisInterpreter (computing)Bit rateFocus (optics)Electronic visual displaySkewnessNormed vector spaceControl flowComputer networkSeries (mathematics)Token ringPrice indexString (computer science)DecimalInterpreter (computing)DampingTerm (mathematics)Network topologyPoint (geometry)CodeFormal languageParsingCore dumpProgramming language2 (number)BitMultiplication signComputer animationLecture/Conference
02:03
System of linear equationsGenderToken ringObject (grammar)String (computer science)Recursive descent parserParsingOpen setMereologyFunctional (mathematics)Key (cryptography)Message passingNumberCASE <Informatik>Token ringSubject indexingParsingNeuroinformatikLevel (video gaming)InformationResultantPasswordRecursionClosed setWritingPower (physics)Library (computing)WebsiteoutputPoisson-KlammerMathematical optimizationPoint (geometry)Loop (music)SpacetimeDemo (music)Queue (abstract data type)Formal grammarBridging (networking)CodeBitImplementationType theoryElectronic mailing listDoubling the cubePlastikkarteEscape characterException handlingRule of inferenceData dictionarySystem call2 (number)Phase transitionComputer animation
11:53
Token ringString (computer science)ParsingRecursionImplementationParsingInterpreter (computing)Complex (psychology)Goodness of fitLevel (video gaming)Interpreter (computing)Line (geometry)SpacetimeRun time (program lifecycle phase)Token ringModule (mathematics)CodeMathematicsElectronic mailing listStatement (computer science)MultiplicationString (computer science)ParsingMereologyData structureFormal languagePoisson-KlammerPhase transitionComputer programmingClosed setOpen setEntire functionOperator (mathematics)System callParsingVariable (mathematics)Functional (mathematics)Message passingAdditionComputer animation
17:34
Statement (computer science)MathematicsToken ringString (computer science)ParsingLetterpress printingStatement (computer science)Functional (mathematics)Subject indexingParsingLetterpress printingSpacetimeLevel (video gaming)Ocean currentSingle-precision floating-point formatSequenceParsing2 (number)Closed setCodePoisson-KlammerOpen setLine (geometry)MultiplicationObject (grammar)Block (periodic table)ImplementationError messageModule (mathematics)Electronic mailing listBitData structureFormal grammarSyntaxbaumExpressionDifferent (Kate Ryan album)TrailLengthAbstract syntax treeToken ringNumberStack (abstract data type)ConsistencyClassical physicsMereologySystem callComputer programmingCountingLoop (music)DigitizingParameter (computer programming)Message passingComputer animation
23:14
Electronic mailing listParsingStatement (computer science)Block (periodic table)Interpreter (computing)Letterpress printingInterior (topology)Glass floatData typeModule (mathematics)Object (grammar)Function (mathematics)Independence (probability theory)Statement (computer science)System callPattern languageOvalEntire functionInterpreter (computing)MereologyLine (geometry)SyntaxbaumSingle-precision floating-point formatBlock (periodic table)Loop (music)Green's functionGUI widgetFunctional (mathematics)Module (mathematics)Computer fileSocial classRecursionMessage passingDiagonalNetwork topologyService (economics)Level (video gaming)Point (geometry)CodeParsingNatural numberQuicksortRight angleEstimationDigital electronicsDot productPhase transitionComputer programmingLetterpress printingSequenceParsingCellular automatonMultiplicationRandom matrixComputer animation
28:57
ParsingInterpreter (computing)Video game consoleObject (grammar)Vertex (graph theory)Module (mathematics)Letterpress printingExpert systemObject (grammar)Operator (mathematics)ExpressionLetterpress printingBinary codeFunctional (mathematics)Sign (mathematics)System callType theorySubject indexingKey (cryptography)BitOcean currentCASE <Informatik>Electronic mailing listMereologyParameter (computer programming)Statement (computer science)Network topologyLogical constantModule (mathematics)Binary filePhase transitionComputer programmingString (computer science)Observational studyParsingNumbering schemeCodeCellular automatonComputer animation
34:33
Function (mathematics)Interpreter (computing)Letterpress printingError messageVideo game consoleRandom numberWeb pageElectric generatorSynchronizationThread (computing)Social classPattern matchingAlgebraic closureException handlingBytecodeExtension (kinesiology)Reading (process)Functional (mathematics)Bookmark (World Wide Web)Type theoryLine (geometry)Message passingParsingSoftware frameworkFerry CorstenInterpreter (computing)Exception handlingStatement (computer science)Single-precision floating-point formatParameter (computer programming)Inheritance (object-oriented programming)System callObject (grammar)CodeUser-defined functionImplementationEntire functionLocal ringThread (computing)BitToken ringMultiplication signSocial classBytecodeLibrary (computing)Goodness of fitMereologyFlow separationCovering spaceComputer animation
40:09
Sign (mathematics)Block (periodic table)ParsingParsingToken ringLine (geometry)QuicksortPhase transitionLecture/ConferenceMeeting/Interview
40:47
Interior (topology)Line (geometry)Token ringSpacetimeSemiconductor memoryCASE <Informatik>SubsetInterpreter (computing)ImplementationLecture/Conference
41:47
Library (computing)DampingFormal grammarWritingParsingLecture/Conference
42:21
ParsingFormal languageFormal grammarImplementationParsingMultiplication signLecture/Conference
43:19
Interior (topology)Lecture/ConferenceComputer animation
Transcript: English(auto-generated)
00:04
I'm going to try to attempt to explain how a Python interpreter works in about 30 minutes. And I'll be doing that by showing you how to build, like, a really simple Python interpreter. In technical terms, it's called a tree walk interpreter.
00:22
From scratch in Python. But the first thing that should come to your mind is why? Why would you want to do that? Well, the point of this exercise is kind of twofold. The first point is that I want to show that it's not some crazy, like, black magic wizardry
00:43
to create a programming language. It's kind of just like writing any other piece of code. And the second point is that Python is like a 32-year-old language. It has been contributed to by thousands of people. So I think that a lot of people would assume that it would be too hard to contribute to.
01:05
And yeah. I wanted to show that added score, like, the Python interpreter is still fairly simple. And I think that you still can, and that you should try to contribute to it. So yeah.
01:21
Let's start the thing with a bit of a story. A long time ago, I wanted to build a JSON parser. Yeah. It would be something like this. Give me one second. Can I stop it from being...
01:43
Do some mirroring. I'd like it to be... Can you just mirror what I see? Yeah. That's better. Yeah.
02:00
Should be good. Should be good. Yeah. That's better. Yeah. So I wanted to build a JSON parser, just like the one that's built into the standard library. I just tried to write something that's a parse function. It takes in a string, and it gives you whatever information that's available inside
02:25
the string, directly available to you. The fact that the results had two objects inside it, which is the information of two people, and that John here is a vegan, but Rob is not.
02:41
And yeah. Like... The thing is... I'm sure that all of you have used a parse function like this. But it gets often looked over that what is it doing exactly?
03:00
So this might look like structured data to you. But to the computer, it's just a string of characters. And yeah. So how do we make the computer understand things like that there are two keys here, and that the values are strings, as opposed to, like, floats and numbers? How do we do that?
03:21
Well, the first step of it is to differentiate the different parts of the string that you have. Like the fact that we have an opening and a closing curly brace here, which are colored in yellow, there's a red colon surrounded by these two strings in purple.
03:41
The first level of segregating the string input that we get is called the tokenizer. The step comes from the JSON spec, and the spec is really important. Because, like, without the spec, there would be no definitive way to be able to figure
04:02
out what exactly it means to be a JSON string. And thankfully, the JSON.org website has an exact spec which defines what it means to be a JSON object. It goes something like this. In JSON, an object is defined as starting with a curly brace token, and then there can
04:24
be many items followed by a comma, and then it ends with another item with, like, no comma at the end. Like, sadly, JSON doesn't have tailing commas. And then it closes with a closing curly brace. And what exactly are these items?
04:41
An item can be defined as a key and then a colon and a value. For now, you can assume that the key and the values are strings in this example. And, yeah, with that grammar properly defined, we are able to tell that, yeah, that is supposed to be a JSON object, that these are items, there are two items in the object, and that's
05:05
a key and a value. Yeah. So that's pretty much all there is to parsing. Once we have the tokens and we have a spec, we can take and structure the input, and we can, like, very easily extract the information out of it.
05:24
Yeah, I think that's enough talking in the air. Let's actually look at some code. Yeah, so we start with a tokenize function, which takes an adjacent string and turns it into a queue of tokens. What we do is fairly simple.
05:40
We just start with the index 0, and we keep going until the end of the string. We take out one character. First, we check if it's any white space, then we just jump over it by doing index plus equals 1, then continue on to the next character. Now, if it's a character, like, a curly brace or a bracket or something like that,
06:04
we do the same, but we create this token object, and we append it to a queue of tokens, and then we move on to the next character. There's, like, more of the same, but there's, like, the three things left here, which is
06:22
a string, like, a number, and then everything else, which is the true, false, and null. That's the only possible values that you can have in JSON. So for those, since we don't know how long that token is going to be, we just pass in the current index to the function, and then we get the new index after the thing has been passed.
06:41
Let's take a look at the string function for it. So, like, what it does is actually pretty similar to the top level tokenized function. We start a while loop until the end of the string, but instead of starting from zero, we start from the current index, and we skip over, like, the starting code, and we keep
07:07
going until we see the end code, at which point we go ahead of the code, we, like, take the entire quoted string out, we append the token to our queue of tokens, and then we return, like, the index of the thing that we have, like, after the string has
07:25
been passed. The only exception to this rule is when we see a backslash, in which case we jump over two tokens, and this allows you to escape things like double quotes if you want to have that in your JSON string, and, yeah, that's pretty much all there is to tokenizing.
07:46
We just collect tokens until we run out of the JSON string, and then we return the tokens that we have collected. Yeah, let's look at the parse phase next. The parser will, like, take out the first token from the tokens queue, and then, like,
08:05
based on some of the properties, such as it being a square bracket, we call parse array, like, if it's, like, an opening curly brace, we call parse object, like, if it's a string or a number, we can just call that, and for true, false, and null, we just return
08:22
the value true, false, or none. But, like, the real magic that's happening here, like, that's happening inside the parse array and the parse object functions. So let's actually take a look at parse object. So to be able to parse a JSON object, we start with an empty dictionary.
08:47
There is the one special case here, which is that if we see, like, a closing curly brace immediately, we just return the empty object, but apart from that, we take a look at the first token, we expect it to be a key, so we, like, parse it as a string.
09:05
Then we expect to see a colon token, which we just discard, and then for the value, we just call parse again. But, hey, the parse function is calling parse object, and now parse object is calling
09:20
parse once again. How exactly is that supposed to work? Well, the thing is, the parse function, we saw that it only looks at the tokens that it needs to look at. If we parse it, the queue of tokens, it will take a look at the first one, and say that if it happens to be a string, it will just take that one token and return.
09:47
But the benefit from that is that we can just keep calling it. This may be a bit hard to visualize, so let's actually try to see what's going on. So say that we call the top-level parse function with these tokens.
10:04
There's an opening, the curly brace, the string, colon, string, closing curly brace. Okay? The parse looks at the first token, like it happens to be an opening curly brace, so it calls parse object. The parse object expects the first token to be a key, and then it expects a colon,
10:22
which it gets, and then it calls parse again for the value. And parse will look at the first token, and since, like, is it an opening bracket? No. Is it a curly brace? Is the type string? Yes. So we call parse string.
10:41
From there, parse string, like, parses that token into a Python string. And then our parse object function expects either a comma or a closing curly brace. And since it finds a closing curly brace, we break out, and we return that object.
11:01
Yeah. Since this technique relies on recursion a lot, like take parse, type eventually ends up calling parse again, this technique is called recursive descent parsing. But yeah. Like, with that implementation, our JSON parser is effectively complete.
11:25
Let's actually try to... Like, this is how you would use something like that. By defining a tokenized function, which can turn the string into tokens, and then calling
11:43
parse over those tokens, we basically have a fully functional JSON parser now. Let's actually look at a demo of that. Yeah. So it's about 300 lines of code, and if we try to call the parse function with this
12:05
JSON string, yeah. So we're effectively able to parse JSON. It works. And the good thing about JSON is that the spec is so simple, but JSON is so widely
12:24
used that even something as complex as this API call that came from, I think, random user.me or something, this entire API call should be parseable.
12:42
So we have a fully featured JSON parser from, I suppose, 300 lines of Python code. But yeah. That's not what the talk is about. The talk is about writing a Python interpreter. So yeah.
13:00
Now we're gonna build a Python interpreter for this JSON parser that we just wrote. So we're gonna build a package called interpreter, which is a Python interpreter, and the Python interpreter is gonna run the JSON parser. So yeah. Let's try to do that.
13:20
Okay. First things first. What exactly is what we call an interpreter? Well, that's actually three things. So the first part is the tokenizer. It turns your Python code into tokens. Like, if you were to have a hello world program, the tokens will be print, and then an opening
13:41
bracket, the string hello world, the closing bracket, and the new line. And the new line is important, because Python doesn't have semicolons or anything like that. The only way for the parser to be able to tell that a statement has finished is for there to be a token like a new line.
14:03
So then there's the parser phase, which takes in these tokens, and it turns it into this nested data structure, and then there's the interpreter, which runs this data structure from top to bottom, and then we get our program execution.
14:22
Yeah. The third phase is called the interpreter, even though it's part of the interpreter I know. Yeah. Let's look at the tokenizer first. Well, the Python tokenizer is actually pretty much the same as the tokenizer that we wrote for the JSON module. Like, there's just a few additions.
14:41
There's two major changes. The first one is that the new lines are important. And how exactly do we build that into the Python tokenizer? We do something like this.
15:01
So say we have a method to scan one token. It reads one character. Then if that character happens to be a new line, and we are not inside some bracketed statement, like a list or a multi line statement, then we add a new line token. If we have any adjacent new lines, those can just be ignored, because even though
15:26
white space exists in a code base, for the runtime, the white space is pretty much meaningless. And the same holds true for all the other white spaces that exist.
15:40
Those can also be skipped. And then the tokenizer goes on to tokenize all the operators and variables and strings and brackets and everything, just like the JSON tokenizer. The second big change is that indentation matters in Python. As opposed to pretty much every other language that people write.
16:03
Well, to be able to support that, all we have to do is whenever we add a new line token, we make sure to detect the indentation on that new line. So let's see how we're gonna do that.
16:21
So the thing with indentation is that it's kind of like a stack. Because if a line is indented by two levels, you need to know what both those levels are. Because things like this can happen. So say that we start a statement, there's the one level of indentation, say four spaces.
16:44
Then there's a bunch of empty lines. The comments are essentially empty lines for the runtime. Then we see another statement with the same level of indentation. So we need to be able to detect that there has been no change in the indentation.
17:02
Then we go to two levels of indentation. And now we have the indentation of eight spaces. When we go back, we need to be able to tell that we have gone back to indentation level one. So to track that, we need a stack. And similarly, right here, we have gone from the two levels of indentation to zero.
17:27
And that is really important to be able to tell that we haven't gone from here to here. We have gone completely outside. So to track that, we can do something like this.
17:42
So we take out the current level of indentation by keeping looking at characters as long as there is a space or a tab, so we add that to our level of indentation. The current indentation will be at the top of the stack.
18:01
And if there's a mismatch of the sequence of the spaces and tabs in the indentation, we get the classic error of inconsistent use of tabs and spaces. If the current level of indentation stays the same as the indentation from the new
18:21
line, we do nothing. But if the new indentation is greater than the current level of indentation, so we add one indent token, then the latest level of indentation has to be added to the stack. But if we have a less number of indent characters than the current indentation, then we have
18:44
to detect how many dedents we have to do. Say that the previous statement was five indents in, and we have gone to two level of indents now. So the previous indent will match the index one because of zero indexing.
19:01
So we do a plus one to get to two. And the current length five minus two, that's three. So we add three tokens, then we take out those three indentations. So that's how we keep track of the current level of indentation. Yeah. On to the parser. The parser is actually pretty similar to the JSON parser.
19:26
The only difference is that instead of making a dictionary, that we make this nested Python object. One sec. Yeah. So we take in our Python program, we tokenize it, and then we turn it into this structure.
19:49
For the hello world, we know that there's a Python module. And inside the body, there is this thing called an expression statement. The function called to the function print with the one argument, which is the constant hello
20:08
world. This structure is called the parse tree for the abstract syntax tree. Well, the parsing bit is best explained live. So I'm just gonna go through the code on the parser.
20:24
Let's see. Yeah. So this is the current grammar of the parser that we're implementing. So just like the JSON spec, we have a spec for Python as well.
20:40
And yeah. Like, at first glance, it's fairly straightforward. Like, what is a Python module? Like, just a bunch of statements. So a statement can be a single line statement or a multi line statement. What's a multi line statement? That's a function definition. And if statement, a while loop, for loop, and so on.
21:00
The function definition? Well, that starts with the def token. And then we see a name. And then we see an opening bracket. And then some parameters, if there are any. Then a closing bracket, the colon, and a block. We'll come to block in a second. But the parameters, they'll just be a bunch of names.
21:25
And then we can have a comma and then a name and a comma and a name and so on forever. Well, that's pretty much how the parser is implemented in practice. We just take the grammar and turn that into Python code.
21:42
How? Well, when the parser starts, we simply call parse module. Let's try to go to the parse function. Yeah. So the parse function will create a list of statements. And then we just keep going until the whole module has been parsed.
22:01
Then we call parse statement. We get back one statement. Then we append that to the list of statements that we parsed. Then we return a module. That is basically what that's saying. For a statement, let's look at parse statement. Parse statement is just like either we call parse single line statement.
22:27
Yeah. Parse multi line statement will eventually go on to call parse block. Let's take a look at parsing an if statement. So for the if statement, we call parse expression.
22:43
So we expect a colon after it, and then we call parse block. And parse block is kind of interesting. Let's look at the grammar of it. Well, a block is defined as a new line, then indent, like a bunch of statements, then a didn't. And yeah, that makes sense.
23:02
That's a block of code. I think it happens to be on a new line. There's like a new level of indentation. We have some statements, and then we didn't back. So yeah. Do you realize that? Yeah. The thing with blocks is we can implement it something like this.
23:24
So parse block will expect a new line, then an indent. They can make a body for that block. And as long as it's not like a fully parse or we see a didn't, we keep calling parse the statement, then append those.
23:41
But parse statement can call parse multi line statement. And multi line statements have blocks inside them. So what's going on there? Well, let's try to visualize that as well. So we have a function here, then we find a block.
24:01
And say that's the block that we are currently parsing. We'll find one level of indent, and then we find a first statement, which happens to be some multi line statement. Like that's a while loop. The while loop itself goes on to call parse block. Like we find a new line, find the one indent,
24:23
and then we go on to parse the statements. Like that's a single line statement, that's a single line statement. And then we find one more multi line statement. So we call parse block again. So we find a new line and one indent, the single line statement, and then finally we find one didn't.
24:43
Like at that point, the parse block in yellow will return. And then immediately after that, this like third statement in the green block has the finally finished parsing, and then the green parse block will find a didn't and then return.
25:02
Like then for the blue one, like it has finally parsed its first statement, which is the while statement. Then it will parse the second statement, then the third file that, yeah, the file has ended. Like let's return. So that's how the parsing phase goes. Yeah, from there we can move on to the interpreter.
25:26
And yeah, how does code run? Like that's the question. Say we have a parse tree, like the one that we saw, how do we make it run? Well, the thing is, like just how you might have been taught
25:40
how Python programs run. They run top to bottom, left to right. It actually is that simple. Like a Python module is a bunch of statements. Like how do you run those statements? One by one, top to bottom. Say that, like that's the program. The problem comes from the deeply nested,
26:01
sort of the recursive nature of the AST. The problem here is that we have a module, but inside the body there can be like any kind of node, really. Then inside that node there can be any kind of node. So how do you deal with that? Like how many statements would you like to write?
26:22
Not many, I hope. But yeah, like to deal with that, there's this thing called the visitor path. Like that's what will help us do the entire interpretation thing. Let's actually take a look at the interpreter.
26:45
So we have an interpreter class, and it happens to have a global scope. The scopes are really important. We have a bunch of built-in functions that exist in the global scope. And then we set the current scope,
27:02
which happens to currently be the global scope as well. And then we have the top-level visit method. Now the visit is what will be called with the module, and then the visitor is supposed to interpret the entire parsed tree.
27:22
What we do is fairly simple. We look at the kind of node it is. Like right now this will return module, and we try to look for a function with the same name. So we try to look for the visit module, and then we just run that.
27:40
And what is the visit module doing? Well, it goes over all the statements in the body, and it just calls self.visit statement. Notice the use of self.visit. This is the thing that's very interesting about the visitor pattern, is that we can just keep visiting the children
28:02
in the sequence that we want them to be run. So you can kind of think of self.visit as just self.run. So we just run that statement. Then how does that get run? Well, say that it happens to be a print statement, so it will be a function call.
28:22
So this will be call. Then the ticket will look for a visit call, and visit call will be called with that node. So, yeah. Let's actually go through a simple example of this. I think that would be better.
28:41
Yeah, say that we are doing something like this. Python dash m interpreted dot parser. Yeah, let's do x equals two. No, let's do x equals two plus two.
29:01
Print x. Yeah, so that's what the parse of that looks like. And if you want to make it a bit more like a tree, we can just run black on it. So yeah, that's the transform code. And yeah, the module, like a body, has an assign node,
29:20
which that's the first statement. And the second statement is called an expert statement, because the thing inside it is an expression. Now function calls can return values, but since we don't care about the value in this case, we just say that, hey, treat this expression as a statement, don't worry about the return value.
29:43
So let's try to walk through this. So when we visit the first statement, we'll go through visit assign. And the visit assign will try to visit the value,
30:01
and what's the value? The value is a binary operation. It's like doing a binary operation between the constant two and the constant two. Then the operation there is plus. So let's try to look at visit bin op.
30:22
Yeah, so we get the way run, like the left and the right side of the binary operation by just calling cell.visit. And since these happen to be constants, let's look at visit constant. Yeah, that just takes the value out, which is two.
30:41
It wraps it in a value object, and then just returns that. If we go back up to bin op, yeah. So we get a value object with the value two, then a value object with the value two. Now we just start like a define all the operations, since the operator is plus.
31:01
Now we return a new value with the value two plus two, which is four. Now we have to go back, take up to visit assign. We have only visited the value part of it. Yeah, so this gives us a value object with the value four.
31:25
And for names, doing an assignment is fairly simple. And we happen to be assigning it to a name, X. So in that case, we just set the value to the name
31:41
in the current scope. Yeah, fairly simple. For like trying to assign to like a dictionary key or to a list index, it gets much more complicated. But we're not doing that, so I'm not going to worry about that. Yeah, like now we have set the value four
32:03
to the name X in the current scope. So what happens next? Well, like the next thing is a call to the name print with one argument, which is the name X. Yeah, so we do a visit expert statement. And yeah, like that's going to be fairly anti-climatic.
32:22
We just visit the expression inside, then we don't return the value. So we return none. OK, let's look at visit call then. So we are visiting a call. The call is to the function print.
32:42
And the name print and visit name. So what does that do? So we take the name out. We check like if it's in the current scope. Like if it is not set in the current scope, we try to look for it in the global scope. And if that's not set, we just say that the name
33:06
has not been defined. But in our case, like both the scopes happen to be the same. And we have defined, I believe, cell.globals.
33:20
Yeah, so we have defined print to be the print function. So what do we do when we actually find it? Where were we? Visit call. Yeah, we found the function, which happens to be an object that we created called print. Since it is a type function, we collect all the arguments
33:45
by visiting every single arg. Like now the argument is also a name, so it will be taken out from the current scope. And if it's not there, that will be found in the global scope. So we take out the value x from the scope, which is 4.
34:01
And that becomes the value 4 in these arguments. And then we take this function object, and we call it with the arguments. So let's try to look at the print function. When we call it, we just convert all the arguments to strings, and we print it. So yeah, that's how this program should run.
34:23
So if I were to move it out of the parsing phase and just run it, oh yeah, the pretty part is not there. Yeah, that should print 4. So yeah, now one glaring thing that's still left
34:42
to be discussed is what about functions? There's a global scope, but what about function scopes and things like that? Well, interpreting functions is also not very tricky. So we have this thing here called a user-defined function
35:04
or a user function. Well, whenever we see a function definition, we just create a user function with the definition, which is the body, all the statements, and everything. We wrap that inside the user function object,
35:25
and we just store that in the current scope. So you could define the function at the global scope or inside a local scope or something like that. And when we run this function, and we get a user function,
35:41
the object from the scope, when we call it, we are calling this part, user function. Yeah, when we do a call, the call we do is we store like the current scope of the interpreter. We create a new scope for the function.
36:02
We set that to be the entire interpreter scope for a bit. Then for every single argument that was passed to the function call, we take out these arguments,
36:21
the values. We take the function parameters from the function definition that we stored when the user function object was created. And in that new scope, we assign these arguments to the parameters. And then for every single statement that was defined
36:41
in the function, we visit those statements. Like if we encounter a return statement somewhere in there, we catch that as an exception, and we just return that as the value of that function call. And then finally, we set the scope of the interpreter back to the parent scope when the function exits.
37:03
So yeah, with this, we should have a working interpreter. So yeah, like with suppose, like around, like this takes implementing a visit function
37:22
for every single type of node that you have in your parser. But the implementation of most of them is fairly simple. Like for break and continue, we just do an exception. My favorite, however, is a visit pass where we just pass. But yeah, with like around 600, 700 lines of interpreter code,
37:45
we should be able to run this code inside our interpreter. And yeah, it works. And yeah, I think that's it.
38:01
But the one thing that maybe a few of you might have in your mind is that, wait, that's a toy implementation that doesn't really do anything. Like where's my favorite Python feature in this implementation? Like there's a bunch of things that we haven't even talked about.
38:23
Like the bytecode that gets generated in the actual implementation. Things like the classes are not implemented. Threads in IO, like I haven't even looked at that code, to be honest. But the thing is, most of these can be implemented.
38:44
Like fairly simply, the current framework that you have with this tokenizer parser and the interpreter, this can be expanded upon to add all of these one by one. And you know the best part?
39:03
There's all these other libraries that are written purely in Python. Like asyncio, for example, is like a giant blob of Python. It's like 15,000 lines of code. And you don't have to re-implement that to be able to re-implement Python. It's just like it's Python code.
39:21
You can run it. And the same holds true for all of these libraries as well. Like CPython as a whole is I think one third C code and two thirds Python code. But yeah, that's basically everything I have.
39:42
Like if you want to look at some resources, like the interpreter that I have created, the implementation has a large amount of influence from the book Crafting Interpreters. And if you want to learn about the exact internals of CPython, there's like a book by Anthony Shaw.
40:01
It's called CPython Internals. It's like a really good book. But yeah, thanks for your time. We have around like four minutes for questions. So if anyone has any questions, please sign up.
40:23
Yeah, hi. When you were talking about parsing blocks, you said that the parser looks for the dident and then sort of returns. Yeah. Does that mean that if you add a bunch of empty lines with just indentation, for example, would that make the parser actually take a bit longer to parse?
40:44
Well, the parser won't, but the tokenizer will. So like at the tokenization phase, we essentially ignore the color, like duplicate the new lines that we see. So if there's just a bunch of white space, that would just be looked over by the tokenizer, and there'll be only one new line token to look at.
41:04
So yeah, not really, but sort of. Okay, so this may be an unfair question, but taken to the natural conclusion, are there any benefits to writing your Python interpreter?
41:21
And for that, I mean, can you maybe implement a specialized subset of Python that is optimized for memory, optimized for performance, something along these lines? Yeah, most definitely. And if you look at the implementations of the Stackless Python or the MicroPython, that's exactly what they've done.
41:43
They've taken the Python interpreter and specialized it to the use case that they want. So yeah. Okay, thank you. Hi, do you know if you're familiar with it, but I've been doing some work recently with Lark that I think is a Python library
42:00
that lets you write custom grammars and parsers. I was curious if you've used anything like that and what the trade-offs might be for using something like that versus doing it from scratch like you've done in the talk. I'm sorry, what does the tool do? It adds the custom grammars to Python. It lets you, you can basically define your own grammar
42:22
and then it will give you a parser for that grammar. So you could write your own language with it quite quickly. Yeah, so if you were to write a non-toy implementation of something like this, you should probably use these things called the parser generators.
42:41
Now, I didn't use them because I wanted to do the entire talk in 30 minutes. But there are a bunch of these things that can take in your grammar and give you a parser out of it. But yeah, there are also huge languages like Ruby, for example, for the longest time
43:02
had a handwritten parser and I don't think that Ruby is a toy language. But yeah, both the approaches, they exist. They have their own trade-offs. But yeah, they both are valid ways to do the same thing.
43:21
Cool, great. Thank you so much, Tushar, for the talk. So we're going to have the keynote happening here next really soon. So stick around for that.