
Formalizing a Language


Formal Metadata

Title: Formalizing a Language
Number of Parts: 115
License: CC Attribution - NonCommercial - ShareAlike 4.0 International. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract: What is formal grammar, how does it create a programming language, and what does Python's grammar look like? In Python 3.9 the grammar parsing engine was changed; find out some of the reasons why and what this change has allowed in terms of future language evolution.
Transcript: English (auto-generated)
Okay, so for the next talk, we have Formalizing a Language by Jeremiah Paige. He is a software engineer based in the United States, working for ActiveState.
Hello, Jeremiah. Hi, how are you? Thank you very much. We're very happy to have you here. Thanks. Okay, welcome everyone to my talk, formalizing a language. Let's get right to it now. Python is a language that has been designed since the very beginning to be very readable.
It was observed that people spend more time reading their code than they do writing it, so it was desired to build a language that could be easily read. Humans read Python code and they parse it. They consume the information in a way that best fits in their own head, and that can be different per person.
Each of us might read a piece of Python code differently depending on the situation or our background. For instance, this line of code I would probably read out loud as: comp equals requests dot get, https ep2021 dot europython dot eu. When I read that line of code out loud, most of you probably understand exactly what I'm trying to do, which is pretty great, because I left out some very crucial pieces of information that a computer might want to know. I didn't tell you all the characters that were in the string. I didn't tell you that there was a string. I didn't tell you where the string started or ended, or that it was an argument, or that we were even making a method call.
You all just kind of inferred that, which is great, but the computer doesn't infer that. The computer wants to know precisely what I'm doing. So although humans read code more than they write it, the computer reads our code even more, and it wants to understand the meaning I'm trying to convey when I write
this line of code. So it also parses this code, but it parses it into something called an abstract syntax tree or an AST. This is the output of the AST module that comes with Python if I put in this line of code.
It may not look quite like a tree, but if you squint a little bit, you can see that these nested levels of object calls look kind of like a tree. There's a root, there are some children, and there are some sibling nodes. And our flat line of code has been given a little bit of structure by the computer. And then the computer uses this AST to get the output it really wants, a stream
of ones and zeros. So this last line here is the bytecode of the same line of Python code. So first Python makes an AST, then it makes the bytecode. So how does it know what kind of structure to give all of these strings that I wrote out earlier?
Well, Python compiled it into bytecode. Yes, we're going to talk about compiling Python code today, and no, I won't have time to get into interpreting Python at all. Compiling Python starts with a step called lexing. This is where the computer goes through the string I wrote out and figures out where all the word boundaries are.
Words to Python are things that you and I might call words like comp and get, but it's also things like significant whitespace, punctuation, and strings, whatever may be contained within the quotes. Things that don't matter to the computer like insignificant whitespace or comments are thrown out at this point.
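If you want to poke at this token stream yourself, the standard library ships a pure-Python tokenize module. A minimal sketch, using a reconstruction of the line read out earlier (this module is a stand-in for, not the same thing as, the C tokenizer the compiler uses; for one, it keeps the COMMENT and NL tokens that the compiler's tokenizer drops):

import io
import tokenize

# The example line from the talk, reconstructed from how it was read out loud.
src = 'comp = requests.get("https://ep2021.europython.eu")\n'

# Print the name and text of every token the lexer produces.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))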
Then that stream of tokens is passed into the next step, which is parsing. Parsing takes the tokens and generates the AST that we saw earlier. This AST tells the computer a lot of information about the atoms of this statement. It gives it the relationship between different pieces that I wrote out.
It gives a precedence and a relationship between all the different components. The computer needs to know which parts are more fundamental than others. For instance, the computer can't assign a value to the comp variable unless it has a value to assign, so it has to get something back from the get method call.
But it can't call the get method unless it knows where the get method lives in memory, so it has to go and look it up. So this tree structure relates directly to the order of operations in which all of this has to happen. Next, and finally, is assembling.
Python takes the AST that it built, and it flattens it back out into a 1D representation, a stream of ones and zeros. But you'll notice, if you look here, that the order, even though it reads left to right again instead of as a tree, is not the same as the order we wrote it in. Computers don't actually always appreciate the way that we write code. They don't follow our ordering; they need to do the more fundamental actions before the more complicated ones. And so the ordering of bytecode actually more closely follows the AST, if you were to read it from the bottom up, than it does the way that you and I write code, left to right.
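Here is a minimal sketch of both artifacts for the same reconstructed line; nothing is executed, so requests does not need to be installed, and the exact AST dump and opcode names will vary a little between Python versions:

import ast
import dis

src = 'comp = requests.get("https://ep2021.europython.eu")\n'

# The tree the parser builds.
print(ast.dump(ast.parse(src)))

# The flattened bytecode the compiler emits.
dis.dis(compile(src, "<example>", "exec"))

# The disassembly order (load `requests`, look up `get`, load the URL constant,
# make the call, then store into `comp`) mirrors the AST read from its leaves
# upward rather than the left-to-right order of the source line.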
So how did the computer apply all this structure to our code? It did it through formal grammar, and specifically grammar rules. So what is formal grammar? Well, formal grammar is a set of rules that describe a language.
In this case, I mean set in a pretty mathematical sense, in that it defines the entire space of all combinations of words that make up Python programs. And by defining the entire space of valid Python programs, it simultaneously defines the space of invalid Python programs. Take for instance this line of code that we saw earlier.
We saw that it was lexically valid. When we passed it through the tokenizer, or the lexer, we got a series of tokens. And they were all correct. And then we passed those tokens to the parser, and the parser produced an AST. It was syntactically valid. And if we continued down that route, this code would have actually run.
Compare that to the next line of code, which I've changed a little bit. If you didn't look too closely, you might think that this next line of code was also valid Python. But I've made some changes, and it's no longer syntactically valid. It is still lexically valid. If I were to pass this line of code to the lexer, it would produce a stream of tokens,
no problem. After all, I've only used valid Python characters. It doesn't care what the meaning is, it doesn't know what the meaning is. It passes those tokens on to the parser, but the parser can't find any resolution. It can't parse this line of code, because it's not part of the Python language.
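You can watch that split between the two stages yourself. A small sketch, where the doubled dot is just one arbitrary way to break the syntax while keeping every character and every token legal:

import ast
import io
import tokenize

bad = 'comp = requests..get("https://ep2021.europython.eu")\n'

# The lexer is perfectly happy: a clean stream of tokens comes out.
for tok in tokenize.generate_tokens(io.StringIO(bad).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The parser is not: there is no grammar rule this sequence of tokens fits.
try:
    ast.parse(bad)
except SyntaxError as exc:
    print("SyntaxError:", exc.msg)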
Up to and including Python version 3.9, Python used a specially defined subset of formal grammar called a deterministic context-free grammar. This is just a very formal definition. Deterministic just means that this grammar has only one solution whenever it reads a
program, which is good. We want our programming languages to always produce the same code whenever we write the same line in Python. And context-free just represents how the rules are able to be written in the metagrammar.
Speaking of metagrammar, we have to have some way to write down these rules so that we can all agree what they are. The older version of Python grammar used eBNF notation, which is just a very formal way of writing grammar rules that's used for various different languages.
It's actually a very special notation that was designed just for Python, but it follows other eBNF notations out in the wild. It's sometimes called a metagrammar because this is a way to write down rules for a language that tell you how to write another language. And all this was actually parsed using an LL(1) parser.
The parser is the engine that takes in the grammar rules and the token stream and produces the AST as a result. LL(1) is just shorthand: L for left to right, which is how it reads the code; L for leftmost derivation, which is the way in which it decides which rules to follow in the grammar; and the 1 represents the size of its lookahead buffer. So as the parser is going along the token stream, it has exactly one peek it can use to try and resolve its different grammar rules. And starting in version 3.9, Python got a parsing expression grammar, usually called PEG.
It got a new notation to represent this PEG, but we'll see later that it was heavily inspired by the original grammar, while still borrowing from other PEGs used in other languages. And it got a new parser, a packrat parser, which was specially designed for PEG.
The most significant difference between this and the previous parser for our purposes is that PEG, or the packrat parser, has an infinite lookahead. So it can continue looking down the stream of tokens until it resolves any ambiguity in the rules, and then make a decision.
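To make that difference concrete, here is a toy sketch of the PEG idea of ordered choice with backtracking. This is not CPython's generated parser; the two-alternative grammar is invented, and it loosely foreshadows the with-statement example later in the talk:

# Toy grammar:
#   stmt  : 'with' '(' items ')' ':'   |   'with' items ':'
#   items : NAME (',' NAME)*
# Each alternative is tried in order; on failure the parser rewinds to the
# start and tries the next one, no matter how far ahead it already looked.

def _expect(tokens, pos, value):
    return pos + 1 if pos < len(tokens) and tokens[pos] == value else None

def _name(tokens, pos):
    tok = tokens[pos] if pos < len(tokens) else ""
    return pos + 1 if tok.isidentifier() and tok != "with" else None

def _items(tokens, pos):
    pos = _name(tokens, pos)
    while pos is not None and _expect(tokens, pos, ",") is not None:
        pos = _name(tokens, pos + 1)
    return pos

def _alt_parenthesised(tokens, pos):
    for step in ("with", "(", _items, ")", ":"):
        pos = step(tokens, pos) if callable(step) else _expect(tokens, pos, step)
        if pos is None:
            return None          # this alternative failed partway through
    return pos

def _alt_bare(tokens, pos):
    for step in ("with", _items, ":"):
        pos = step(tokens, pos) if callable(step) else _expect(tokens, pos, step)
        if pos is None:
            return None
    return pos

def parse_stmt(tokens):
    for alternative in (_alt_parenthesised, _alt_bare):   # ordered choice
        if alternative(tokens, 0) == len(tokens):          # restart at 0 each time
            return True
    return False                                           # no rule matched

print(parse_stmt(["with", "(", "a", ",", "b", ")", ":"]))  # True, first alternative
print(parse_stmt(["with", "a", ",", "b", ":"]))            # True, second alternative
print(parse_stmt(["with", "(", "a", "b", ")", ":"]))       # False

An LL(1) parser cannot work this way: after seeing with and peeking one token ahead, it has to commit to a single alternative immediately, which is exactly the limitation we will run into with the with statement later.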
If we go back to our original example, compiling Python code starts with text, and that text is lexed into tokens, and those tokens are parsed into an AST, and that AST is finally assembled into bytecode. The only part of this whole process that's changing in Python 3.9 is going from tokens to the AST: the parsing.
The tokens that represent your Python programs in 3.9, and the AST that results from parsing it, haven't changed at all. It's just this little process. Now we'll look at some of these rules that we've been talking about for so long.
These are the older eBNF grammar rules for the import statement. Each of these lines is a grammar rule, and they start with a name, then a colon, and then a definition. A grammar rule definition can include other rule names, string literals, or token literals, which are the all-caps ones, and they can also include some metagrammar syntax. So the import statement is defined in the grammar as matching if the import name rule matches or the import from rule matches. So to understand this further, we have to descend into another rule.
In this case, we'll just descend the left-hand branch so we can understand it a bit better. The next line down is the import name rule, which is the 'import' string followed by the dotted as names rule. And dotted as names is defined below as a dotted as name followed by zero or more dotted as names. You'll see that star at the end, the Kleene star, represents zero-or-more repetition, just like in our regular expressions. So we descend again to dotted as name, which is a dotted name followed by an optional 'as' string and NAME token. Square brackets indicate the optional presence of some part of the rule. And NAME again, in all caps, means it's a token. So, last rule on the list: dotted name is a NAME token followed by zero or more '.' NAME pairs. So we can put these rules together to see that together they define the import statement as you and I know it and use every day.
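To make that concrete, here are a few everyday import statements with an informal note of which of the rules above does the matching:

import os                   # dotted_as_names -> one dotted_as_name -> dotted_name (a single NAME)
import os.path              # dotted_name: NAME '.' NAME
import os.path as p         # dotted_as_name: dotted_name 'as' NAME
import os, sys as system    # dotted_as_names: several dotted_as_name entries separated by commas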
So now we'll see how PEG defines the same import statement. It looks pretty similar. In fact, three of these rules haven't really changed at all. The import statement is still matched if either the import name rule or the import from rule matches.
Import name is the same. Dotted as names is the first one to change, and it's just using a new metagrammar syntax that was not possible in the old eBNF. This string-dot-rule-name form is just a shorthand for saying that when there's more than one match of the rule name, they're separated by commas. It works just like str.join in Python code. And at the end of this, we use the Kleene plus, like we would in a regular expression, to indicate one or more. Dotted as name hasn't changed, but dotted name has changed.
It now matches if another dotted name rule matches, followed by a dot and a NAME, or if it's just a NAME token. So in this case we actually have a recursive rule. It keeps matching itself until it finds the terminal branch, which is just the end of a dotted name sequence: just a name. And then when it reaches that point, it follows its recursion back up, and the outermost dotted name contains all the inner ones. So now let's look at a slightly more complicated example to see where PEG really shines. This is the older eBNF definition of an argument list.
This is what you can pass to a Python callable. An arglist is one or more arguments separated by commas, and then there's an optional trailing comma, which is for our readability benefit. So the really interesting part is the argument rule. An argument, if we look at it here, is the test rule followed by a comp_for. We don't define those here, but for the purposes of these rules you can think of test as basically any Python expression. It's actually slightly more inclusive than any Python expression, but it's almost the same. Then a comp_for is a comprehension for loop, so a generator. An argument is one of those, or it is a named expression, or a keyword, or an unpacking, either star or double star. These are the flavors of arguments that you're allowed to pass when you call a Python callable, and an arglist is one or more of them.
How does PEG define this? PEG takes a lot more rules, and longer rules, to define what is actually the same thing. Remember, both of these grammars are in 3.9; they're both recognizing the same language. PEG is just taking a little more time to do it. A little more space, anyway.
So PEG's arguments rule is really just matching when it matches args. That trailing comma and closing parenthesis is just matching its place in the larger program overall. So args is defined, again using this new PEG syntax, as a comma-separated sequence of starred expressions or named expressions, optionally followed by keyword arguments. Or it's just keyword arguments. So keyword arguments is in both branches of the args rule, so we'll descend into that. Keyword arguments is defined as one or more kwarg_or_starred followed by one or more kwarg_or_double_starred. Or it's just kwarg_or_starred, or it's just kwarg_or_double_starred. All of these are defined below, but they're pretty self-explanatory, I think. So why is PEG repeating itself so much here, and taking so much longer to define the same thing?
Well, these rules are enforcing a very specific order. Kwargs says: yes, you can have your keyword arguments, you can have starred expressions and double-starred expressions, you can have either or both, but if you have both, you have to do all your starred expressions before any of your double-starred ones. And we take that rule and plug it back into the args rule above, which says: yes, you can also have named expressions and/or starred expressions and/or double-starred expressions, but if you have all three, they must appear in that order. You have to be finished with your named expressions before you start any of your unpacking operators. This is the actual rule of Python.
We've never been allowed to call something and start with a star-star expression, and then follow up with some arguments, and then some keyword arguments, and then some named expressions. But the old grammar actually didn't really recognize this. It was just looking for any sequence of arguments. So it would accept it, and it would build a parse tree,
and it wouldn't think anything of it. Your program would actually parse. The reason it wouldn't execute is because there's post-processing done on the parse tree from the LL(1) parser. After it was done with your program, the tree would be walked again, and certain conditions, like the argument ordering, would be checked against rules that didn't quite fit in the grammar. And if your arguments were misordered, then you got a syntax error there, instead of from the parser. But with PEG, it just won't parse. If you try to input your arguments in the wrong order, you won't match a complete set of grammar rules, and your program will fail the parse. You'll never get the tree. You'll get a syntax error straight from PEG. So this more clearly captures the actual language of Python by describing exactly the scenarios you're allowed to have.
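You can watch that enforcement from any recent Python; this sketch works on both parsers, the difference being that on the old one the error came from the post-processing pass rather than from the grammar itself (the exact message wording also varies between versions):

import ast

# Allowed ordering: positional, then keyword, then * unpacking, then ** unpacking.
ast.parse("f(a, b=1, *rest, **extra)")

# Disallowed ordering: a positional argument after ** unpacking.
try:
    ast.parse("f(**extra, a)")
except SyntaxError as exc:
    print("rejected at parse time:", exc.msg)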
So what can PEG bring us, besides some more verbose grammar to describe a language that we already have? Well, it brings new things too. First up is the feature I am most looking forward to in Python 3.10: parenthesized with statements. I'm very excited for this. So you might not know, but coming out in 3.10 later this year, you'll be able to logically join multiple lines into one with statement using parentheses. We do this with all our other statements: our long ifs, our long fors, our long import statements. Sometimes they get wrapped in parentheses automatically for us by linters.
But until 3.10, this was not possible with with statements. You might wonder why, and it has to do with the new PEG grammar. So this is the rule in 3.10 for with. It's either the with keyword followed by an opening parenthesis, one or more with items, and a closing parenthesis. Or it's the older way of specifying it, which is just the keyword and one or more with items. So this is the new shiny way. What was the old way? The old way looks pretty similar. It's still the with keyword and one or more with items. We just don't have that second branch.
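This is what that first branch buys you in practice. A sketch that only parses on Python 3.10 and later, with made-up file names:

# One logical `with` statement wrapped in parentheses, trailing comma allowed,
# just like our other bracketed statements.
with (
    open("input.txt") as src,
    open("output.txt", "w") as dst,
):
    dst.write(src.read())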
So why didn't we add it in 3.9 instead? Well, we could have written it. It would have been legal in the grammar. We could have said that this would also be another option for the with statement: the with word, then the opening parenthesis literal, one or more with items, and the closing parenthesis.
But remember that the eBNF grammar is parsed with an LL(1) parser, and that 1 is the size of its lookahead buffer. So when it's looking at a with word in the token stream, it has one peek. It can look one token ahead. So it does, and it finds an opening parenthesis. And what should it do? It has two choices for this rule, and it needs to pick one. Well, one of those rules has a literal opening parenthesis in it, so you'd think it would pick that. But if you look at the other rule, what comes immediately after the with word is a with item. And if we look at the with item rule, the first thing there is a test.
And remember, a test is essentially an arbitrary expression, and Python expressions can be enclosed in parentheses. We can do this for all kinds of reasons, for precedence or to join logical lines for the expression itself. And so there's ambiguity here. We could be following either one of these rule branches,
and it just wouldn't parse very well. But PEG has infinite lookahead, and it will just keep going past the with keyword and the first opening parenthesis until there is some resolution to the ambiguity, and then it will pick one. One other new feature from PEG is soft keywords.
Python's always had keywords, words that you're not allowed to use for your own purposes, and it's always had reserved words, ones that you could reassign if you want to. You probably shouldn't. But now, starting with Python 3.10, we're going to have two new soft keywords.
These are words that, if used in the right place, will start a Python statement. But if used in any other place, they can be used just like a name. You can bind them, make them your function name, or whatever you like. Now, some of you might be saying: wait, we've had soft keywords before. Async and await were both soft keywords. You would actually be right. Back when they were introduced in Python 3.5, async and await were both soft keywords, even though we didn't have a PEG parser at all. But I told you you have to have a PEG parser for soft keywords. So what happened? Well, the parser in Python 3.5 conscripted the tokenizer
to do quite a lot of extra work for it. When they were introduced as soft keywords, the tokenizer was changed so that if it found the string async, or the string await, it would have a look around its current lexing context. And it would ask: is this a place where async has to be a keyword? If it is, it would replace the token in the stream; it would insert the ASYNC token into the stream. If it was in a place where it didn't have to be a keyword, it would leave it alone as a NAME token, just like all our other strings and names. Then when the parser was actually doing its thing,
it would just check, did I receive the async token? If I did, this is a keyword. No other checking is necessary. However, if it got a name token, and that name token happened to contain the string async or await, that would be fine too, and that would match. And this way, async and await were both allowed to be soft keywords.
This is not normally the realm of the tokenizer. You can see here some more grammar, from Python 3.10. The definition of a function def is the 'def' string followed by a NAME token, or it's the ASYNC token, then the 'def' string, then a NAME token. So this is kind of strange. Def is a keyword we've definitely always had, since the beginning of Python, but it doesn't get a special token. It just matches a string in the grammar. Also, async is not even a soft keyword anymore. It's a hard keyword, and we don't have an LL(1) parser anymore.
We've moved on. But our tokenizer is still tokenizing this word specially, because that was what it took to make soft keywords. So going forward, with match and other words, PEG doesn't need this conscripting of the tokenizer. With its infinite lookahead, it can decide whether the place where it's seeing the match word is appropriate for a statement or for a normal name. So this is the match statement rule, just below the function def rule. You can see it just uses a normal string, like it does for def and every other Python keyword. It's just async and await that are going to continue to get this special treatment, because of a feature that was introduced
back in Python 3.5. So what has PEG brought to the language? What have we gained by completely switching out the grammar of a very popular programming language? Well, our PEG grammar now actually accurately defines the entire space of Python programs. So that's good; that's what a grammar is officially supposed to do. It also means fewer steps between the tokens and the AST. With the old LL(1) parsing step, because we had to walk the tree after we made it to do additional checks, and also in some cases to reshape the tree, there were more steps getting from one to the other. So we've simplified the parsing step. We've added new soft keywords. So far we have match and case, but there's no reason why we can't keep adding soft keywords to the language as we continue to accept new keywords, and we won't break existing code. This is how we get a match statement in 3.10 and keep re.match as a function in the standard library at the same time. And I don't think they'll become hard keywords like async and await did. And PEG just has the ability to express more elaborate ideas than an LL(1) parser could.
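A small sketch of what a soft keyword looks like in practice on Python 3.10 or later (it will not even parse on 3.9): the same word match is an ordinary name on one line and starts a statement on the next.

import re

match = re.match(r"(\d+)-(\d+)", "2021-07")   # `match` bound as a plain variable name

match match:                                   # `match` acting as a soft keyword
    case None:
        print("no match")
    case found:
        print(found.groups())                  # prints ('2021', '07')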
So I'm looking forward to more changes from PEG. Thanks, EuroPython, for putting on this conference. Thank you for attending, and thank you, ActiveState, for sending me.
Thank you very much for your talk. Some questions have come in from the audience, so I'll just ask them. Does PEG have any runtime performance improvements? It's actually slower, I believe. So both grammars are in Python 3.9, mostly to kind of prove that it works. There's lots and lots of testing, but we wanted to be sure. In 3.10 there is no more LL(1) parser, but both of them exist in 3.9. If you didn't realize this, it probably means that you were using the PEG parser the whole time, because that was the default. So there is a performance impact. I believe it's around 10%, or at least it was when it first went out in the very first release of Python 3.9. So it takes a little bit longer, but a lot of improvements have been made so that it takes about the same amount of memory.
But it was decided that computers have come far enough in the 30 years since Python first came around that it was worth it for the extra power. Great. And then there was another question: what are the main reasons why they moved from LL to PEG?
So I think the primary reason it was chosen to finally move parsers is that by the time we got to Python 3.8, we had already accepted rules into the Python language that couldn't be expressed in a grammar that could be purely parsed by an LL parser. So we were already doing rules that absolutely had to have post-processing done on the tree. It wasn't just inconvenient to write them out that way; we were going beyond the formal definition of LL, and with PEG we're not. We can again contain the entire definition of the language in the grammar itself. There are other examples, but, like the with statement, we would never have gotten around to having a parenthesized with statement if we had stuck with LL forever. And thanks a lot.
And then I just wanted to let you know that the audience really enjoyed your talk, and that if you'd like to, you can answer questions in the breakout room or in the conference room. Yeah, I'll jump into the conference room now. Thank you very much.
Thank you.