Bestand wählen
Merken

How to Build a Compiler

Zitierlink des Filmsegments
Embed Code

Automatisierte Medienanalyse

Beta
Erkannte Entitäten
Sprachtranskript
so the and a fire
out of i and j go
as you might know me from will pet projects
circle babble and I also work
at the company called CloudFlare you team uh you should all come work there because we use is really awesome
terrible terrible UI framework
some I knew ways style you can also follow me
on Twitter at the James Kyle so
you want to know how to build a compiler or maybe you don't and I just stuck listening to me the either way and to teach you this is
the form of this talk is a bit of a back story I
got into compilers through babble word 605 at the time now is understood what compilers dead from a conceptual level but I wasn't really able to grok them until last year and a half the reason I was able to get into them is because when I 1st looked at the 605 codebase it was the 1st time I was able to read the code of a compiler and really understand it after
babble blew up into a massive project I became interested why we were also blowing up in a number of contributors but I learned that most people are pretty scared of compilers is thought of as is totally unapproachable computer science the thing that only the nerds of the nerds can understand I
realize that we need to do a better job of teaching people in making it more approachable people so I decided they were going to go out and really learn everything there was to know about
compilers I jumped on the Amazon and I sort of ordered the essential books on compilers and about 3 weeks later I discovered why compilers were told an approachable I think the to the
those books are 500 pages of the MOS dull explanation of compilers that 1 could ever have read I love compilers and I was still bored soul bordering them I couldn't even come close to finishing up the the learning problem with compilers is not that the complicated it's that we teach that they're complicated so what I'm going to
do today is give you an oversimplified but reasonably accurate overview of of what most compilers look like and then to try and make it more real to you by walking through the code of an actual compiler that I've written so
what even is a compiler the
compiler is simply a tool that transforms source code written in a higher-level language into a generally lower level target language but let's go back to the fundamentals to explain why they exist a
computer's CPU executes millions of tiny little operations operation so tiny that you would never want to write them out manually In the sort of pure form the
written as binary to and this is known as
machine code obviously there is no
way a programmer could understand in productive writing code like this so CPU architectures to support a mapping of binary operations as an easier read language like that's and this is known as assembly code
it's a super low level programming language that gets converted into binary code by in a thing called an assembler externally and some looks a lot like a compiler is taking 1 type of code and translating it into a lower level however internally the much simpler than compilers ignoring any kind of optimization and someone might be doing they're really just a straight mapping of 1 syntax to another however assembly code is still too low too low level to be really productive so the next level up is source
code and this is where the vast majority of programmers and these are programming languages that you were familiar with and what actually write large pieces of software and source code is different
than assembly code in most cases it does not map directly to machine operations it's meant as an abstraction tell of humans to understand what is happening in be productive in writing code as such an assembler wouldn't be able to translate source code down to fishing operations we need something that can understand a more complicated syntax and transform very intelligently into optimized lower-level code and this is
work compiler comes in since compilers are doing complicated operations they haven't broken down into a couple different stages the stages very compiler-compiler but they all fit within these 3 primary groups
passing transformation and code generation passing is taking rock
food and turning it into a more abstract representation of that single transformation takes an abstract representation and manipulates it to do whatever the compiler wants to and
cogeneration takes the transform representation of the code and turns it into a new strain of code
passing gets broken down a couple different phases the 2 biggest ones that you will see on lexical analysis in syntactic analysis it now
lexical analysis is pretty simple it takes a raw code into splits it up into these little things called tokens tokens on
array of tiny little objects that describe an isolated piece of a syntax they could be numbers labels punctuation operators whatever
syntactic analysis will then take these tokens and reformat them into a representation that describes the each part of the syntax and their relation to 1 another this is known as an intermediate representation or an abstract syntax tree an abstract
syntax tree or AST for short is a deeply nested object that represents the code in a way that is both easy to work with In tells a lot of information there this might be hard to visualize source look at what this code would be
so we would represent tokens like this we have an array of objects with a type of her rendered name neighbor number of and they just you should have the same format the value In abstract each syntax tree gets represented by this you have much more descriptive format for them we have a program with call expression named add parameters that could be a number literal or a nested call expressions and this is the final representation that will use the rest the compiler the next state type of
stage the compiler this transformation again this just takes the AST form from the last step and makes changes to it it can manipulate the AST in the same language or can translated into entirely new language let's look at how we would transform St you might
notice that REST has elements within that look very similar these are objects with the type property each of these is known as an air-steam node these nodes have defined properties on them that describe 1 isolated part of the tree we can
have a node for a number of the role with the value of that number word node for a call
expression of the name in the parameters In
when transforming the AST we can manipulate nodes by adding removing replacing properties we can add new nodes we can remove nodes were given leave the existing AST alone and create entirely new 1 based on top of it and since for our purposes retiring completely separate language where focus on creating an entirely new AST that is specific to the target language In
order to navigate through all of these nodes we need to be able to traverse through them this reversal process goes to each node in AST that for so for the following
we would go to the top level program and see that has a body in which an array of nodes and we go to the first one we go from the nearly we had a number literal which doesn't have children so we want next 1 which is called expression which does have children and so forth it the reason
I use the word is because there is this pattern on of how to represent operations on elements of an object structure the basic
idea here is that we're going to create a visitor object that has methods that will accept different node types and when we traverse RAST we will call the methods on this visitor whenever we encounter a node of the matching type the In order to make this more useful we will also pass the node and a reference to the parent node the final phase of a
compiler is cogeneration sometimes compilers will do things that overlap the transformation but for the most part cogeneration just means taking REST and string of finding out code generators work several different ways some compilers will reuse like tokens from earlier others will have created a new separate representation of a code so they can print it literally uh the from what I can tell in most use the same AST we just created and that were folk effectively are code generator will know how to print each of the different node types and it will recursively call itself to print nested nodes until it's printed 1 long string of code we and that's
everything that's all the major pieces of a compiler and that's not to say that every compiler looks exactly like this but compilers and compiler serve many different purposes so they might have a lot more steps in detail here but now you should have a general high level idea of what most compilers look like so you can all write your own
right you know the 1st but when I was thinking of how to create this talk I knew I wonder frame things from a more comfortable perspective but I also wondered teach in the same way that I want by reading actual working code the in my proposal I declare that
I would rates the world's smallest compiler thinking to myself that would be super easy I just have a couple of requirements it has to demonstrate where
real world compiler would do has to do something complicated enough to justify building a compiler it has to be small enough piece of code that can be explained in a conference talk it turns out that I was little confidence and and that's really difficult I tried all sorts of things and eventually I just give up the there was no way I would generate compiler for the stock it just wasn't a path to idea the a
few months passed by and I worry about list languages I play around with a couple different languages and he discovered how powerful their compilers are interested I started looking into how list languages get compiled because the syntax is very different than a C-like language that you're all familiar with and then I realized that list was the perfect syntax for this compiler it's a super simple syntax 2 parts and it's really easy to translate into a new syntax it played around the idea in after only a couple of hours I had a viable candidate for conference talk the entire source code is around 250 when the if you're
unfamiliar with less that looks something like this basically just put the brown on the other side so in JavaScript's is that that that that that question and then you can do in mathematical operations that that that the and I and the cool thing that you would have pointed out to me was that
handle bottles what are the handlebars and text is actually a type of let's I don't even like the binding occurred to me but that is of and so yeah that's organ you where take a list syntax were translated into a C-like syntax they are often with them so here
it is the code of our compiler were going to start off with the 1st phase of passing lexical analysis this thing called the tokenizer we're gonna take our string
of code we're breaking down into an array of tokens will start by accepting an
input string of code there was a set of 2 things the current variable for tracking opposition in that string like a cursor uh where separate tokens array for pushing the tokens to we start by creating the while loop where we're setting up our current variable to be incremented as much as we want inside the loop we do this
because we want to increment current many times the inside of the while loop because ah tokens can be any length it's not 1 group per character the there were also this store the current character in the input the 1st thing that we want to do is check for an open parentheses this will later be used for call expressions but for now we only care about the character we check to see if we have an open parentheses if we do we push a new token with the type around and set the value to an open parentheses the the then we increment currents and we continue on to the next cycle of the loop next were check for a closing parentheses in will do the same exact thing as before check exists pushing a token increment current continue uh moving on were not been checked for wait states this is interesting because we care that white space exists to seperate different characters but it is actually important for us to store as a token in fact we would only just throw it out later on so here was test for existence if it does exist where it's going to continue on incrementing current the next type of token is a number the this is different than we've seen before because number could be many but could be any number of characters long we want to capture this entire sequence of characters as a single token the the so we start this off when we encounter the 1st number in the sequence we're going to create a value string that we're gonna push characters to and then we'll loop through each character in the sequence until we encounter a character that is not a number pushing each character that is the number to our value incrementing current as we go after that we push a number token to the tokens are a and we continue on in the last type of token will be a name token this is a sequence of letters instead of numbers they are the names of the function of the court patterns and the syntax again where just gonna loop through all the letters pushing them to a value just like we do the numbers and we push other value as a token of the tightening and continuing finally if we have not match any character by now resist freak out and from there the end of the the tokenizer we just simply return the tokens array that's step that the entire tokenizer so now is the pastor here we're gonna take this
tokens array that we generated an organ reformatted as an AST with that the OK so we define a part
a function that accepts are a of tokens again we keep a current variable that we use as a cursor but this time we're gonna use recursion instead of a while the so we define a walk function inside of the walk function were a start by grabbing the current token were splits each type of token off into a different code path starting off with number tokens we test to see if we do in fact have a number token if we have 1 will increment currents and we were turned a new AST node called number literal and will set its value to the value of R. token the next we're going
to look forward to call expressions will start this off when we encounter an open parentheses will increment current to skip that open parentheses since we don't care about in we create an empty node with the type expression Foucault's preference I receptor name as the current token value since the next token after Oprah parentheses is always in the name of the function we increment current again to skip the name of the function and now we want to loop through each token be the parameters of our co-expression until we encounter a clothing for clothing closing parentheses so we create a while and then we continue it until it encounters a and token with the type of bran with a value of a core closing professors and this is where workers income then instead of trying trying to pass a potentially infinite number of nested sets of nodes rely on refers to resolve everything in sight to
explain this let's look at with of you can see that the parameters of ad or a number and nested call expression that includes its own numbers you also notice that the tokens erase that we have you can actually see but you would have the mobile closing parentheses and we have to make sure that they match uh organ rely on the nested walk function to increment our current variable perhaps the good the nested closing parenthesis the 2 backs the code
who call the walk function which will return another node and will push it to the parameters of our finally we will increment the current 1 last time to skip the closing parentheses and will return the again if we haven't recognizer token type by now where freak out again throw syntax now were going to create an AST which will be a program note and those that have a body which have effectively statements of that program the and we're gonna keep calling a walk function in a loop until we run out of tokens I just keep pushing those as statements into our AST body and the reason that were doing this inside of a loop as you can see
is because they can be next to each other rather than the the at the end of the parser we
will return the AST next up the traverser so now we have a st
and we want to be able to visit all the different nodes in that NIST with the visitor we need to be able to call them methods on this there there whenever we encounter a node with a matching types so we define a traverser function
which accepts an AST and a visitor and sigh with define 2 functions the a traverse or a function that will allow us to iterate over the raid n called on next functions it that traverse no traversed node will accept a node and its parent node so they can pass both to our visitor methods will stop I checked the they're testing the existence of the method with a matching types type if it does exist we will call the node and its parent next we're going to put things up again by the current node types will stop will start out with our top level program since program nodes have a property being body has an array of nodes will call traverse array and traverse down into them remember traverse will in turn call traversed node through causing the treated because workers at the traversed recursively onwards but next will do the same for call expressions and traverse the operands In the case of number literals we don't have any child nodes visit soldiers break again if we have a mechanism that I will throw an error finally we kick start the entire process by calling traverse node with RA if the with no apparent because about next up the transformer Michael they any Our
traverse to that a transformer is going to take our AST that we have built and pass it to our traverser function with a visitor I will create a new AST this new AST is designed for Dr. In fact to the babble that's right so we have transformer
function will movement which will accept the list AST will create a new AST which like our previous AST will have a program node where the body that as an array next time achieve a little and create a bit of a hack so we're gonna use a property in context on of parent nodes so we can keep our traversed flat are inside of the methods of the visitor words from keep pushing nodes to that contacts and you can see it's basically just mapping the oldest he's the new 1 that but normally you would create a better abstraction for this but but for our purposes as keep things simple and easy to read so I will start by coinage Mercer function with REST and a visitor the 1st visitor method except new at number of literals will create a new node and the number of literal and will push it to the pair context next up call expressions I I will start by creating a new node call expression with an acid identifier that is the name of the next we're going to define a new context on the original call expressed node that referenced expressions of arguments so that we can keep pushing arguments to it then we're gonna check if the parent node is a call expression if it's not were going to wrap our co-expression with an expression statement we do this because the top level called preference and JavaScript are actually state the last week which are possibly wrapped call expression to the parent nodes context I can continue delivery xt at the end of a transformer function or we return that new AST that we just created now let's move on to her last states the code generator are code generator is going to recursively call itself to print each node in the tree into 1 giant strength will break things down by node types if we have a program node will not through each node in the body and will run them through the code generator joining them with a new line the for expression statements will call a code generator on the nested expression and will add a semicolon because we like to call the correct way a for call expressions we will print the collie which again was an identifier of the name of the function add open parentheses will map for each node in arguments array and run them through the code generator joining them with a comma add a clothing practices pr identifiers we just wanna return the node's name which is a string if a number literals we just 1 return the node value which also expect again hammered as no freak out finally we have ah compiler functions the here were linked together each part of the pipeline will take the input and pass it to the tokenizer in crater tokens will take the tokens and pass it to a pot string grain St will take the AST in passage of the transformer to create a new AST will take the newest AST in past the code generator what their output string In will finally return a output the and that's it that is all of the code in a compiler and but let me prove that it works to you the the tokenizer action in
move let's the tokenizer will take our string and generate tokens and if I can match because boom and that's when prove
again and boom it does at the
pasta will take a tokens and generate a new AST and prove does the
right it it does get the
transformer will take are generated AST and turn into a
new the code generator will take the new
AST and generate the output and
finally from beginnings ends running it through all of them we will run it through the tokenizer the parts of the terms to the transform of the code generator and it generates
a code and that is it's the
who you the mind
Streuungsdiagramm
Videokonferenz
Kreisfläche
Code
Gebäude <Mathematik>
Projektive Ebene
Compiler
Twitter <Softwareplattform>
Compiler
Framework <Informatik>
Bildschirmmaske
Bit
Compiler
Wort <Informatik>
Code
Übergang
Prozess <Informatik>
Compiler
Zahlenbereich
Projektive Ebene
Informatik
Compiler
Quick-Sort
Homepage
Compiler
Code
Nichtlinearer Operator
Fundamentalsatz der Algebra
Bildschirmmaske
Compiler
Formale Sprache
Computerunterstütztes Verfahren
Quellcode
Quick-Sort
Übergang
Mapping <Computergraphik>
Nichtlinearer Operator
Programmiergerät
Assembler
Formale Sprache
Mikroarchitektur
Schreiben <Datenverarbeitung>
Maschinensprache
Biprodukt
Binärcode
Code
Lesen <Datenverarbeitung>
Programmiersprache
Programmiergerät
Assembler
Compiler
Minimierung
Formale Sprache
Quellcode
Code
Übergang
Mapping <Computergraphik>
Software
Datentyp
Vererbungshierarchie
Nichtlinearer Operator
Virtuelle Maschine
Subtraktion
Assembler
Abstraktionsebene
Compiler
p-Untergruppe
Quellcode
Code
Compiler
Selbstrepräsentation
Transformation <Mathematik>
Code
Objekt <Kategorie>
Nichtlinearer Operator
Subtraktion
Zahlenbereich
Token-Ring
Parser
Code
Phasenumwandlung
Eins
Analysis
Objekt <Kategorie>
Abstrakter Syntaxbaum
Selbstrepräsentation
Mereologie
Relativitätstheorie
Visualisierung
Zwischensprache
Quellcode
Information
Parser
Code
Programm
Parametersystem
Compiler
Formale Sprache
Mathematisierung
Selbstrepräsentation
Systemaufruf
Zahlenbereich
Token-Ring
Transformation <Mathematik>
Objekt <Kategorie>
Bildschirmmaske
Arithmetischer Ausdruck
Datentyp
Abstrakter Syntaxbaum
Dateiformat
Optimierung
Aggregatzustand
Objekt <Kategorie>
Netzwerktopologie
Knotenmenge
Kategorie <Mathematik>
Datentyp
Mereologie
Systemaufruf
Zahlenbereich
Wort <Informatik>
Element <Mathematik>
Programm
Knotenmenge
Arithmetischer Ausdruck
Kategorie <Mathematik>
Compiler
Formale Sprache
Fokalpunkt
Arithmetischer Ausdruck
Knotenmenge
Prozess <Physik>
Reverse Engineering
Zahlenbereich
Optimierung
Ordnung <Mathematik>
Übergang
Teilmenge
Objekt <Kategorie>
Nichtlinearer Operator
Knotenmenge
Subtraktion
Vererbungshierarchie
Datentyp
Selbstrepräsentation
Vererbungshierarchie
Wort <Informatik>
Element <Mathematik>
Ordnung <Mathematik>
Datenstruktur
Phasenumwandlung
Subtraktion
Knotenmenge
Compiler
Selbstrepräsentation
Hochdruck
Mereologie
Token-Ring
Code
Übergang
Zeichenkette
Rahmenproblem
Perspektive
Compiler
Bitrate
Code
Subtraktion
Reelle Zahl
Compiler
Mereologie
Formale Sprache
Mailing-Liste
Quellcode
Ganze Funktion
Code
Quick-Sort
Selbst organisierendes System
Datentyp
Mailing-Liste
Trennungsaxiom
Compiler
Token-Ring
Ein-Ausgabe
Code
Loop
Token-Ring
Menge
Funktion <Mathematik>
Code
Ein-Ausgabe
Strom <Mathematik>
Phasenumwandlung
Analysis
Zeichenkette
Cursor
Folge <Mathematik>
Selbst organisierendes System
Gruppenkeim
Abgeschlossene Menge
Zahlenbereich
Raum-Zeit
Loop
Arithmetischer Ausdruck
Subtraktion
Existenzsatz
Mustersprache
Speicher <Informatik>
Ganze Funktion
Softwaretest
Lineares Funktional
Dicke
Vererbungshierarchie
Systemaufruf
Strömungsrichtung
Token-Ring
Ein-Ausgabe
Token-Ring
Zahlenbereich
Dreiecksfreier Graph
Mereologie
Ein-Ausgabe
Zeichenkette
Aggregatzustand
Parametersystem
Lineares Funktional
Subtraktion
Vererbungshierarchie
Regulärer Ausdruck
Abgeschlossene Menge
Zahlenbereich
Token-Ring
Strömungsrichtung
Code
Unendlichkeit
Loop
Knotenmenge
Arithmetischer Ausdruck
Token-Ring
Funktion <Mathematik>
Menge
Zahlenbereich
Datentyp
Resolvente
Speicherabzug
Rekursive Funktion
Cursor
Programm
Lineares Funktional
Parametersystem
Befehl <Informatik>
Zustandsmaschine
Selbst organisierendes System
Mobiles Internet
Abgeschlossene Menge
Systemaufruf
Zahlenbereich
Token-Ring
Code
Loop
Knotenmenge
Arithmetischer Ausdruck
Token-Ring
Datentyp
Optimierung
Lineares Funktional
Knotenmenge
Token-Ring
Funktion <Mathematik>
Polygonzug
Vererbungshierarchie
Datentyp
National Institute of Standards and Technology
Parser
Prozess <Physik>
Disk-Array
Zahlenbereich
Transformation <Mathematik>
Übergang
Knotenmenge
Arithmetischer Ausdruck
Polygonzug
Existenzsatz
Datentyp
Vererbungshierarchie
Optimierung
Programm
Softwaretest
Lineares Funktional
Kraftfahrzeugmechatroniker
Datentyp
Kategorie <Mathematik>
Vererbungshierarchie
sinc-Funktion
Systemaufruf
Strömungsrichtung
Matching
Funktion <Mathematik>
Zahlenbereich
Fehlermeldung
Bit
Abstrakter Syntaxbaum
Compiler
Atomarität <Informatik>
Baum <Mathematik>
Gruppenoperation
Regulärer Ausdruck
Zahlenbereich
Befehl <Informatik>
Transformation <Mathematik>
Kontextbezogenes System
Code
Übergang
Netzwerktopologie
Arithmetischer Ausdruck
Knotenmenge
Mapping <Computergraphik>
Vererbungshierarchie
Optimierung
Analytische Fortsetzung
Gerade
Funktion <Mathematik>
Formale Grammatik
Programm
Parametersystem
Lineares Funktional
Befehl <Informatik>
Kategorie <Mathematik>
Vererbungshierarchie
Abstraktionsebene
Default
Systemaufruf
Ausnahmebehandlung
Token-Ring
Mailing-Liste
Kontextbezogenes System
Funktion <Mathematik>
Offene Menge
Zahlenbereich
Mereologie
Wort <Informatik>
Identifizierbarkeit
Compiler
Zeichenkette
Aggregatzustand
Parser
Transformation <Mathematik>
Code
Ein-Ausgabe
Mereologie
Token-Ring
Compiler
Term
Code
Videokonferenz
ATM
Dienst <Informatik>
Ereignishorizont
Menge

Metadaten

Formale Metadaten

Titel How to Build a Compiler
Serientitel Ember Conf 2016
Autor Kyle, James
Lizenz CC-Namensnennung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben.
DOI 10.5446/34701
Herausgeber Confreaks, LLC
Erscheinungsjahr 2016
Sprache Englisch

Inhaltliche Metadaten

Fachgebiet Informatik
Abstract Compilers are all around you: Babel, Handlebars/HTMLBars, Glimmer, Uglify, and more. In this talk we'll walk through every part of a compiler from the parser to the generator. Learn about visitors and traversal, paths, scopes, bindings, and everything else. By the end compilers shouldn't seem like magic, and maybe you'll even want to contribute back to them.

Ähnliche Filme

Loading...
Feedback