How to Build a Compiler


Formal Metadata

Title
How to Build a Compiler
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Compilers are all around you: Babel, Handlebars/HTMLBars, Glimmer, Uglify, and more. In this talk we'll walk through every part of a compiler from the parser to the generator. Learn about visitors and traversal, paths, scopes, bindings, and everything else. By the end compilers shouldn't seem like magic, and maybe you'll even want to contribute back to them.
Transcript: English (auto-generated)
You might know me from a little pet project called Babel.
I also work at this company called Cloudflare. I work on the UI team. You should all come work there, because we use this really awesome, terrible, terrible UI framework. Anyways, you can also follow me on Twitter at @thejameskyle.
So you want to know how to build a compiler, or maybe you don't, and now you're just stuck listening to me. Either way, I'm going to teach you. The format of this talk has a bit of a backstory. I got into compilers through Babel, or 6to5 at the time. I always understood what compilers did at a conceptual level, but I wasn't really
able to grok them until the last year and a half. The reason I was able to get into them is because when I first looked at the 6to5 code base, it was the first time I was able to read the code of a compiler and really understand it. After Babel blew up into a massive project,
I became interested in why we weren't also blowing up in the number of contributors. I learned that most people are pretty scared of compilers. It's thought of as this totally unapproachable, computer science-y thing that only the nerds of the nerds can understand. I realized that we need to do a better job of teaching people and making it more approachable to people.
So I decided that I was going to go out and really learn everything there was to know about compilers. I jumped onto Amazon, and I ordered the essential books on compilers. And about three weeks later, I discovered why compilers were totally unapproachable.
Those books are 500 pages of the most dull explanation of compilers that one could ever have written. I love compilers, and I was so bored, so bored reading them.
I couldn't even come close to finishing them. The learning problem with compilers is not that they're complicated. It's that we teach that they are complicated. So what I'm gonna do today is give you an oversimplified but reasonably accurate overview
of what most compilers look like. And then I'm gonna try and make it more real to you by walking through the code of an actual compiler that I've written. So what even is a compiler? A compiler is simply a tool that transforms source code written in a higher level language
into a generally lower level target language. But let's go back to the fundamentals to explain why they exist. A computer's CPU executes millions of tiny little operations. Operations so tiny that you would never want to write them out manually. And in their sort of purest form, they're written as binary.
And this is known as machine code. Obviously, there is no way a programmer could understand and be productive writing code like this. So CPU architectures support a mapping of binary operations to an easier-to-read language like this. And this is known as assembly code.
It's a super low level programming language that gets converted into binary code by a thing called an assembler. Externally, an assembler looks a lot like a compiler. It's taking one type of code and translating it into a lower level type of code. However, internally, they're much simpler than compilers. Ignoring any kind of optimization
an assembler might be doing, they're really just a straight mapping of one syntax to another. However, assembly code is still too low level to be really productive in. So the next level up is source code. And this is where the vast majority of programmers enter.
These are programming languages that you are familiar with and would actually write large pieces of software in. Source code is different than assembly code. In most cases, it does not map directly to machine operations. It's meant as an abstraction to allow humans to understand what is happening and be productive in writing code.
As such, an assembler wouldn't be able to translate source code down into machine operations. We need something that can understand a more complicated syntax and transform it very intelligently into optimized lower level code. And this is where a compiler comes in. Since compilers are doing complicated operations,
they have been broken down into a couple different stages. These stages vary compiler to compiler, but they all fit within these three primary groups. Parsing, transformation, and code generation. Parsing is taking raw code and turning it into a more abstract representation of that same code.
Transformation takes that abstract representation and manipulates it to do whatever the compiler wants it to. And code generation takes the transformed representation of the code and turns it into a new string of code. Parsing gets broken down into a couple different phases.
The two biggest ones that you will see are lexical analysis and syntactic analysis. Lexical analysis is pretty simple. It takes raw code and just splits it up into these little things called tokens. Tokens are an array of tiny little objects that describe an isolated piece of the syntax.
They could be numbers, labels, punctuation, operators, whatever. Syntactic analysis will then take these tokens and reformat them into a representation that describes each part of the syntax and their relation to one another. This is known as an intermediate representation
or an abstract syntax tree. An abstract syntax tree or AST for short is a deeply nested object that represents the code in a way that is both easy to work with and tells a lot of information. Now this might be hard to visualize, so let's look at what this code would be. So we would represent tokens like this.
We have an array of objects with a type of paren or name or number, and each of them has the same format, with a value. An abstract syntax tree gets represented like this. You have a much more descriptive format for them. You have a program with a call expression named add
with parameters that could be a number literal or a nested call expression. And this is the final representation that we'll use for the rest of the compiler.
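Concretely, for a little Lisp expression like `(add 2 (subtract 4 2))`, the two representations might be shaped like this (a sketch; the node names match the ones used in the rest of the talk):

```javascript
// Tokens: a flat array of { type, value } objects.
const tokens = [
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'add'      },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: '('        },
  { type: 'name',   value: 'subtract' },
  { type: 'number', value: '4'        },
  { type: 'number', value: '2'        },
  { type: 'paren',  value: ')'        },
  { type: 'paren',  value: ')'        },
];

// AST: a deeply nested object describing the same code.
const ast = {
  type: 'Program',
  body: [{
    type: 'CallExpression',
    name: 'add',
    params: [
      { type: 'NumberLiteral', value: '2' },
      {
        type: 'CallExpression',
        name: 'subtract',
        params: [
          { type: 'NumberLiteral', value: '4' },
          { type: 'NumberLiteral', value: '2' },
        ],
      },
    ],
  }],
};
```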
The next stage in a compiler is transformation. Again, this just takes the AST from the last step and makes changes to it. It can manipulate the AST in the same language or it can translate it into an entirely new language. And let's look at how we would transform an AST. You might notice that our AST has elements within it that look very similar. These are objects with a type property. Each of these is known as an AST node.
These nodes have defined properties on them that describe one isolated part of the tree. We can have a node for a number literal with a value of that number, or a node for a call expression with a name and the parameters. And when transforming the AST,
we can manipulate nodes by adding, removing, replacing properties. We can add new nodes, we can remove nodes, or we could even leave the existing AST alone and create an entirely new one based on top of it. And since, for our purposes, we're gonna be targeting a completely separate language, we're gonna focus on creating an entirely new AST
that is specific to the target language. In order to navigate through all of these nodes, we need to be able to traverse through them. This traversal process goes to each node in the AST depth first. So for the following AST, we would go to the top level program
and see that it has a body, which is an array of nodes, and we go to the first one and go through them linearly. We hit a number literal which doesn't have children, so we move on to the next one, which is a call expression which does have children, and so forth. The reason I use the word visiting is because there is this pattern
of how to represent operations on elements of an object structure. The basic idea here is that we're going to create a visitor object that has methods that will accept different node types. And when we traverse our AST, we will call the methods on this visitor whenever we encounter a node of the matching type.
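As a minimal sketch, such a visitor might look like this (the method names match the node types above; the parent parameter is explained in a moment):

```javascript
// A visitor: one method per node type we care about. The traversal
// calls the matching method each time it reaches a node of that type.
const visitor = {
  NumberLiteral(node, parent) {
    // react to every NumberLiteral in the tree
  },
  CallExpression(node, parent) {
    // react to every CallExpression in the tree
  },
};
```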
In order to make this useful, we will also pass the node and a reference to the parent node. The final phase of a compiler is code generation. Sometimes, compilers will do things that overlap the transformation, but for the most part,
code generation just means taking our AST and stringifying it out. Code generators work several different ways. Some compilers will reuse tokens from earlier, others will have created a new separate representation of the code so they can print it linearly. But from what I can tell,
most use the same AST we just created, and that's what we're gonna focus on. Effectively, our code generator will know how to print each of the different node types, and it will recursively call itself to print nested nodes until it's printed one long string of code.
And that's everything. That's all the major pieces of a compiler. And that's not to say that every compiler looks exactly like this, but compilers serve many different purposes, so they might have a lot more steps than I've detailed here. But now you should have a general high-level idea of what most compilers look like.
So you can all write your own, right? Of course not. When I was thinking of how to create this talk, I knew I wanted to frame things from a more comfortable perspective, but I also wanted to teach in the same way that I learned, by reading actual working code.
In my proposal, I declared that I would write the world's smallest compiler, thinking to myself, that would be super easy. I just have a couple of requirements. It has to demonstrate what a real-world compiler would do. It has to do something complicated enough
to justify building a compiler. And it has to be a small enough piece of code that it can be explained in a conference talk. It turns out, I was a little bit overconfident, and that's really freaking difficult. I tried all sorts of things,
and eventually I just gave up. There was no way I was gonna write a compiler for this talk, it just wasn't a practical idea. A few months passed by, and now I'm learning about Lisp languages. I play around with a couple different languages, and I discovered how powerful their compilers are.
Interested, I started looking into how Lisp languages get compiled, because their syntax is very different than the C-like language that you're all familiar with. And then I realized that Lisp was the perfect syntax for this compiler. It's a super simple syntax to parse, and it's really easy to translate into a new syntax.
I played around with the idea, and after only a couple of hours, I had a viable candidate for a conference talk. The entire source code is around 250 lines of code. If you're unfamiliar with Lisp, it looks something like this: we basically just put the parens on the outside. So in JavaScript, it's that.
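Since the slides aren't captured in the transcript, this is roughly the kind of side-by-side being shown (a sketch; the exact slide examples may differ):

```javascript
//  Lisp-style:               JavaScript-style:
//  (add 2 2)                 add(2, 2)
//  (add 2 (subtract 4 2))    add(2, subtract(4, 2))
```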
And then you could do a mathematical operation, like that. And the cool thing that Yehuda pointed out to me was that Handlebars syntax is actually a type of Lisp.
I didn't even, that didn't even occur to me, but it is. And so yeah, that's what we're gonna do. We're gonna take a Lisp syntax, and we're gonna translate it into a C-like syntax, JavaScript, which you're all familiar with. So here it is, the code of our compiler.
We're gonna start off with our first phase of parsing, lexical analysis, with this thing called the tokenizer. We're gonna take our string of code, and we're gonna break it down into an array of tokens. We'll start by accepting an input string of code, and we're gonna set up two things.
A current variable for tracking our position in that string, like a cursor. And then we're gonna set up a tokens array for pushing our tokens to. We start by creating a while loop, where we are setting up our current variable to be incremented as much as we want inside of the loop.
We do this because we want to increment current many times inside of the while loop, because our tokens can be any length. It's not one loop per character. We're also gonna store the current character in the input. The first thing that we want to do is check for an open parenthesis.
This will later be used for call expressions, but for now we only care about the character. We check to see if we have an open parenthesis. If we do, we push a new token with the type paren and set the value to an open parenthesis. Then we increment current,
and we continue on to the next cycle of the loop. Next, we're gonna check for a closing parenthesis, and we'll do the same exact thing as before. Check exists, push a new token, increment current, continue. Moving on, we're now gonna check for white space. This is interesting, because we care that white space exists
to separate different characters, but it isn't actually important for us to store as a token. In fact, we would only just throw it out later on. So here, we're just gonna test for existence, and if it does exist, we're just gonna continue on incrementing current. The next type of token is a number.
This is different than we've seen before, because a number could be any number of characters long, and we want to capture this entire sequence of characters as a single token. So we start this off when we encounter the first number in the sequence.
We're going to create a value string that we're gonna push characters to, and then we're gonna loop through each character in the sequence until we encounter a character that is not a number, pushing each character that is a number to our value and incrementing current as we go. After that, we push our number token
to the tokens array, and we continue on. The last type of token will be a name token. This is a sequence of letters instead of numbers. These are the names of the functions, the call expressions in our Lisp syntax. Again, we're just gonna loop through all of the letters,
pushing them to a value just like we did with numbers, and we push the value as a token of the type name and continue. Finally, if we have not matched any character by now, we're just gonna freak out and throw an error.
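Put together, a condensed sketch of that tokenizer might look like this (the structure follows the walkthrough; the regex tests and error message are illustrative choices, not necessarily the verbatim code):

```javascript
function tokenizer(input) {
  let current = 0;   // our cursor into the input string
  let tokens = [];   // the tokens we'll push to

  while (current < input.length) {
    let char = input[current];

    // Open parenthesis: later used for call expressions.
    if (char === '(') {
      tokens.push({ type: 'paren', value: '(' });
      current++;
      continue;
    }

    // Closing parenthesis: the same exact thing as before.
    if (char === ')') {
      tokens.push({ type: 'paren', value: ')' });
      current++;
      continue;
    }

    // Whitespace separates tokens, but isn't stored as one.
    if (/\s/.test(char)) {
      current++;
      continue;
    }

    // Numbers: capture the whole run of digits as one token.
    if (/[0-9]/.test(char)) {
      let value = '';
      while (char && /[0-9]/.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'number', value });
      continue;
    }

    // Names: runs of letters, i.e. the call expression names.
    if (/[a-z]/i.test(char)) {
      let value = '';
      while (char && /[a-z]/i.test(char)) {
        value += char;
        char = input[++current];
      }
      tokens.push({ type: 'name', value });
      continue;
    }

    // Anything else: freak out.
    throw new TypeError('Unknown character: ' + char);
  }

  return tokens;
}
```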
At the end of the tokenizer, we just simply return the tokens array, and that's it, that's the entire tokenizer. So now, the parser. Here, we're gonna take this tokens array that we've generated, and we're gonna reformat it as an AST like that. Okay, so we define a parser function
that accepts our array of tokens. Again, we keep a current variable that we use as a cursor. But this time, we're gonna use recursion instead of a while loop. So we define a walk function. Inside of the walk function, we're gonna start by grabbing the current token.
We're gonna split each type of token off into a different code path, starting off with the number tokens. We test to see if we do in fact have a number token. If we have one, we'll increment current, and we will return a new AST node called number literal, and we'll set its value to the value of our token.
Oops. Next, we're gonna look for call expressions. We'll start this off when we encounter an open parenthesis. We'll increment current to skip that open parenthesis, since we don't care about it in our AST. We create an empty node with the type call expression, and we're gonna set the name as the current token's value, since the next token after the open parenthesis is always gonna be the name of that function. We increment current again to skip the name of the function. And now, we want to loop through each token that will be the params of our call expression
until we encounter a closing parenthesis. So we create a while loop, and we continue it until it encounters a token with a type of paren with a value of a closing parenthesis. Now this is where recursion comes in.
Instead of trying to parse a potentially infinite number of nested sets of nodes, we're gonna rely on recursion to resolve everything inside of it. To explain this, let's look at our Lisp code. You can see that the parameters of add are a number and a nested call expression that includes its own numbers.
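For reference, the running example again (the same shape sketched earlier):

```javascript
// (add 2 (subtract 4 2))
//
// params of `add`: the number 2, and the nested call (subtract 4 2),
// which has its own params and its own closing parenthesis
```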
You'll also notice in the tokens array, you can't actually see it here, but if you look at the code, there are multiple closing parentheses, and we have to make sure that they match. So we're gonna rely on the nested walk function to increment our current variable past the nested closing parentheses.
So back to the code, we'll call the walk function, which will return another node, and we'll push it to the parameters of our node. Finally, we will increment current one last time to skip the closing parenthesis,
and we'll return the node. Again, if we haven't recognized the token type by now, we're gonna freak out again and throw a syntax error. Now, we're going to create an AST, which will be a program node. And you'll notice that it has a body, which has, effectively, the statements of that program.
We're gonna keep calling our walk function in a loop until we've run out of tokens, and just keep pushing those as statements into our AST body. And the reason that we're doing this inside of a loop,
as you can see, is because call expressions can be next to each other rather than nested. At the end of the parser, we will return the AST.
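A condensed sketch of the parser just described (same caveat: the shapes follow the walkthrough, the details are illustrative):

```javascript
function parser(tokens) {
  let current = 0;   // our cursor, this time into the tokens array

  function walk() {
    let token = tokens[current];

    // Number tokens become NumberLiteral nodes.
    if (token.type === 'number') {
      current++;
      return { type: 'NumberLiteral', value: token.value };
    }

    // An open paren starts a call expression.
    if (token.type === 'paren' && token.value === '(') {
      token = tokens[++current];   // skip '('; the next token is the name
      let node = { type: 'CallExpression', name: token.value, params: [] };
      token = tokens[++current];   // skip the name

      // Recurse for each param until we hit the matching ')'.
      while (token.type !== 'paren' || token.value !== ')') {
        node.params.push(walk());
        token = tokens[current];
      }
      current++;                   // skip the closing ')'
      return node;
    }

    throw new TypeError('Unknown token type: ' + token.type);
  }

  // The program body: statements side by side, not nested.
  let ast = { type: 'Program', body: [] };
  while (current < tokens.length) {
    ast.body.push(walk());
  }
  return ast;
}
```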
Next up, the traverser. So now we have our AST, and we want to be able to visit all the different nodes in that AST with a visitor. We need to be able to call the methods on this visitor whenever we encounter a node with a matching type. So we define a traverser function, which accepts an AST and a visitor. Inside, we define two functions: a traverse array function
that will allow us to iterate over the array and call our next function, traverse node. Traverse node will accept a node and its parent node so that it can pass both to our visitor methods. We'll start by testing the existence of a method
with a matching type, and if it does exist, we will call it with the node and its parent. Next, we're gonna split things up again by the current node type. We'll start with our top level program. Since program nodes have a property named body that has an array of nodes, we'll call traverse array and traverse down into them.
Remember, traverse array will in turn call traverse node, causing the tree to be traversed recursively. Lot of words. Next, we'll do the same for call expressions and traverse their params. In the case of number literals,
we don't have any child nodes to visit, so we'll just break. Again, if we haven't recognized the node type, we'll throw an error. Finally, we kick start the entire process by calling traverse node with our AST and no parent, because there is no parent.
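A sketch of that traverser:

```javascript
function traverser(ast, visitor) {
  // Walk an array of nodes, passing along the shared parent.
  function traverseArray(array, parent) {
    array.forEach(child => traverseNode(child, parent));
  }

  function traverseNode(node, parent) {
    // If the visitor has a method for this node type, call it.
    let method = visitor[node.type];
    if (method) {
      method(node, parent);
    }

    // Then recurse into whatever children this node type has.
    switch (node.type) {
      case 'Program':
        traverseArray(node.body, node);
        break;
      case 'CallExpression':
        traverseArray(node.params, node);
        break;
      case 'NumberLiteral':
        break;   // no child nodes to visit
      default:
        throw new TypeError('Unknown node type: ' + node.type);
    }
  }

  // Kick everything off; the root has no parent.
  traverseNode(ast, null);
}
```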
Next up, the transformer. Michael Bay doesn't come in here. Our transformer is going to take the AST that we have built and pass it to our traverser function with a visitor, and it will create a new AST. This new AST is designed for JavaScript. In fact, it's actually the Babel AST right there.
So we have our transformer function, which will accept the Lisp AST. We'll create a new AST, which, like our previous AST, will have a program node with a body that has an array. Next, I'm gonna cheat a little
and create a bit of a hack. We're gonna use a property named context on our parent nodes so that we can keep our traverser flat. Inside of the methods of the visitor, we're just gonna keep pushing nodes to that context and you can see it's basically just mapping the old AST to the new one.
Normally, you would create a better abstraction for this, but for our purposes, this keeps things simple and easy to read. So we'll start by calling the traverser function with our AST and a visitor. The first visitor method accepts number literals.
We'll create a new node also named number literal and we'll push it to the parent context. Next up, call expressions. We'll start by creating a new node call expression with a nested identifier that is the name of the function. Next, we're going to define a new context
on the original call expression node that will reference the expression's arguments so that we can keep pushing arguments to it. Then, we're gonna check if the parent node is a call expression. If it's not, we're going to wrap our call expression with an expression statement.
We do this because the top level call expressions in JavaScript are actually statements. Last, we push our possibly wrapped call expression to the parent node's context and continue to build up our AST. At the end of our transformer function, we will return that new AST that we've just created.
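A sketch of the transformer, context hack included (the `_context` property name here is an illustrative choice; it reuses the traverser sketched above):

```javascript
function transformer(ast) {
  let newAst = { type: 'Program', body: [] };

  // The "context" hack: each old node carries a reference to the array
  // in the new AST that its children should be pushed into.
  ast._context = newAst.body;

  traverser(ast, {
    NumberLiteral(node, parent) {
      // Map the old number literal straight across.
      parent._context.push({ type: 'NumberLiteral', value: node.value });
    },

    CallExpression(node, parent) {
      let expression = {
        type: 'CallExpression',
        callee: { type: 'Identifier', name: node.name },
        arguments: [],
      };

      // Children of the old node go into the new node's arguments.
      node._context = expression.arguments;

      // Top-level calls in JavaScript are statements, so wrap them.
      if (parent.type !== 'CallExpression') {
        expression = { type: 'ExpressionStatement', expression };
      }

      parent._context.push(expression);
    },
  });

  return newAst;
}
```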
Now let's move on to our last phase, the code generator. Our code generator is going to recursively call itself to print each node in the tree into one giant string. We'll break things down by node type and if we have a program node, we'll map through each node in the body
and we'll run them through the code generator, joining them with a new line. For expression statements, we'll call the code generator on the nested expression and we'll add a semicolon because we like to code the correct way. For call expressions, we will print the callee,
which again was that identifier of the name of the function, add an open parenthesis, we'll map through each node in the arguments array and run them through the code generator, joining them with a comma and then we'll add a closing parenthesis. For identifiers, we just want to return the node's name, which is a string and for number literals,
we just want to return the node's value, which is also a string. Again, if we haven't recognized the node type, freak out.
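A sketch of that code generator:

```javascript
function codeGenerator(node) {
  switch (node.type) {
    case 'Program':
      // Print each statement in the body, joined with newlines.
      return node.body.map(codeGenerator).join('\n');
    case 'ExpressionStatement':
      // Print the nested expression, plus a semicolon,
      // because we like to code the correct way.
      return codeGenerator(node.expression) + ';';
    case 'CallExpression':
      // callee, open paren, comma-joined arguments, closing paren.
      return (
        codeGenerator(node.callee) +
        '(' + node.arguments.map(codeGenerator).join(', ') + ')'
      );
    case 'Identifier':
      return node.name;    // already a string
    case 'NumberLiteral':
      return node.value;   // also a string
    default:
      throw new TypeError('Unknown node type: ' + node.type);
  }
}
```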
Finally, we have our compiler function. Here, we're gonna link together each part of the pipeline. We'll take the input and pass it to the tokenizer to create our tokens. We'll take the tokens and pass them to our parser to create an AST. We'll take the AST and pass it to the transformer to create our new AST. We'll take the new AST and pass it to the code generator, which will create our output string, and we'll finally return our output. And that's it. That is all of the code in our compiler.
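And a sketch of the pipeline glued together, with the output you'd expect for the running example:

```javascript
function compiler(input) {
  let tokens = tokenizer(input);        // 1. lexical analysis
  let ast    = parser(tokens);          // 2. syntactic analysis
  let newAst = transformer(ast);        // 3. transformation
  let output = codeGenerator(newAst);   // 4. code generation
  return output;
}

compiler('(add 2 (subtract 4 2))');
// => 'add(2, subtract(4, 2));'
```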
But let me prove that it works to you. The tokenizer, actually I need to move up to that screen. The tokenizer will take our string and generate tokens and, if I can match my cursor, boom, it does. Let me prove it again and boom, it does.
The parser will take our tokens and generate a new AST and boom, it does. It does, yeah, there we go. The transformer will take our generated AST and turn it into a new AST.
The code generator will take the new AST and generate the output and it does. Finally, from beginning to end, running it through all of them, we will run it through the tokenizer, the parser, the traverser, the transformer, the code generator and it generates our code.
And that is it.