Ten simple rules for creating your own compiler on the CLR


Formal Metadata

Title: Ten simple rules for creating your own compiler on the CLR
Number of Parts: 163
Author: Philip Laureano
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language: English

Content Metadata

Abstract: With .NET Core becoming open source, now is the perfect time to create new languages that span the breadth and depth of the Common Language Runtime. In this talk, I will discuss:
- EBNF syntax in a nutshell, and how you can use it to generate your own language
- How to create your own interpreter in ANTLR 4 and have it up and running in C#
- How to parse your custom language into a syntax tree
- How to implement the differences between virtual machines, interpreters, and compilers
The talk will also focus on practical compiler examples in C# rather than burden you with compiler theory. If you're the type of developer that loves learning new languages, then this is definitely the talk you need to attend!
Transcript: English (auto-generated)
Good afternoon, everybody. My name is Philip Laureano, and if you're here, I'm going to teach you how to make a compiler in less than an hour, so let's get started. If you're the average person, this is probably the expression you'd get every time somebody says you're going to write your own compiler. So let's get this out of the way and ask why you would actually attempt to write your own language. Aside from the fact that you get to learn a lot more about how these languages are built, what I'm going to show you within the next hour is that if you have a good knowledge of simple design patterns as well as C#, a lot of the things you would think are scary are not actually scary at all. Quick show of hands: how many people have actually read this book, or have heard of it? Yeah, so there's a good number of people here who have run into the infamous dragon book.
My only concern with the dragon book is that it never really had a git repo or anything practical that you could take away and use to write something useful at the end of the day. To give you an idea of some of the stuff I've been working on: it turns out that over the past few years I've been writing different parts of a compiler, and you can see the list up here. The details don't matter so much; what's important is that I've been writing the bits and pieces of a compiler in one form or another, and I asked myself how difficult it would be to put all this stuff together and make it into something practical. It turns out it's not really that difficult given the tools we have today. Of course, there's a large amount of compiler theory you could go through, but just like anything else in software engineering, you don't have to write everything from scratch anymore. There are good tool sets out there, and we'll look into things like ANTLR that will do most of the work for you, so you can get on with focusing on just the semantics of the language you want to create, rather than worrying about all the nitty-gritty details of creating your own parser.
In general there are three things we need to keep track of that every compiler, no matter how complex or simple, actually does. The first thing compilers typically do is take your code. I've got an expression that looks like this, 7 plus 3, times 5 minus 2, and I sketched up a diagram: on the first line you'll see what it looks like in Lisp, and the second line is what we would normally be used to in C#. The idea is that you've got the parser, and it converts the code into a parse tree. Once it's in that parse tree, there's a process within the compiler itself that optimizes the parse tree and converts it into an abstract syntax tree. So in this case I'll have a parse tree that says 7 plus 3, times 5 minus 2, in a very primitive sense. The abstract syntax tree is the optimized version of that parse tree, where you have things like the multiply operator and the compiler reduces things down: it would simplify it and say 10 times 3. Of course this is a really contrived, simple example, but the gist is pretty much the same.
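To make that concrete, here is a minimal sketch (my own illustration, not the slide's code) of the same arithmetic built directly as a LINQ expression tree, which is roughly the abstract-syntax-tree shape the rest of the talk works towards:

```csharp
using System;
using System.Linq.Expressions;

// (7 + 3) * (5 - 2) built by hand as an expression tree.
Expression ast =
    Expression.Multiply(
        Expression.Add(Expression.Constant(7), Expression.Constant(3)),
        Expression.Subtract(Expression.Constant(5), Expression.Constant(2)));

// A constant-folding step would reduce this to 10 * 3 before execution.
var result = Expression.Lambda<Func<int>>(ast).Compile()();   // 30
Console.WriteLine(result);
```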
Now, the third step: once you have the parse tree and the abstract syntax tree, the next question is what you actually do with it. In general there are really two approaches; you could say three if you count targeting a VM, but that doesn't really apply so much on the CLR, given that the CLR itself is the VM. The two approaches you normally go with are: you could either interpret the tree, or you could compile it down to a DLL and execute it. With that in mind, what I'm going to talk about is some of the things I learned; this is the simplest possible way you could do it in less than an hour.
So the first rule is that you need to make sure the grammar is simple, and I chose Lisp because it's possibly one of the oldest and most stable grammars you could ever think of. I created my own toy language, which I call Not Quite Lisp. It's a toy language in the sense that I put the structure and the grammar together but left out the semantics, because I wanted to see how easy or difficult it would be to implement certain language features. You can think of it as a sandbox. In order to learn other languages and learn how to actually put one together, you basically have to dissect one, or create a dead one, and see how it all works, and that's pretty much what I did. One of the nice side effects of this particular approach is that you can put together a language in a very short period of time.
NQL's basic data types are just numbers and strings. The important thing is that it parses most Lisps, even some Clojure syntax. If you've run into Lisp before, you know how it works: everything is a list. So a typical Console.WriteLine call in C# would look like this in NQL, where the first item is actually a symbol, the next item is a string, and the entire thing you see here is a list, not a Lisp. What that means, effectively, if you've never run into Lisp before, is that the code itself is the data and the data is the code; you could actually swap them, and in memory this is exactly what it would look like. The best part is that when I actually compile and interpret this, these six lines of code are pretty much all I need in order to run it. That's it. What I'll get into as we go along is that it's not really as scary as what you probably went through at university, because the tools are already there; we can do it in a very short amount of time and still deliver something valuable.
The second part, which I have to stress, is that if you try to write your own language and your own compiler, there's quite a temptation to go not-invented-here and start writing your own parser. I recommend against that, simply because there are a lot of good libraries out there, like ANTLR, which already do this very well, and it's a very well-known problem that's been solved many, many times before. But for the sake of completeness, there are really just two types of parsers available: the top-down recursive descent parser and the bottom-up parser. A top-down recursive descent parser looks at your input, matches it against the existing list of rules, and then starts from the rules and works all the way down to the tokens or keywords you have in your language, seeing how they match up with the rest of your grammar. A bottom-up parser does the opposite: it goes from the particular to the general.
It looks at what your input currently is and tries to infer what the rule is based on the tokens it has seen so far. Back to that whole point about not writing your own parser: ANTLR does this amazingly well. The author himself has spent a good 25 years writing and rewriting it, over four times. So, as he said, why spend three days when you've got somebody who has already done it for 25 years for you? It's free, so we're going to get into that. The other interesting thing is that in languages like Roslyn or F#, all the interesting stuff you want to create inside a language is not really in the parse tree. When you think about it, the parser, or the actual form of the code itself, is not as important as the semantics. If you want to do things like type inference or immutability, that's the stuff you actually want to focus on in the abstract syntax tree, because you want to find out what a type is, or what a method call relates to, and you won't be able to do that until you actually have the tree in place and you're able to manipulate it. Now, installing ANTLR is pretty simple.
The first thing you do is go into Visual Studio, open the extensions, search for ANTLR language support, and plug it right in. It's fairly straightforward. One thing I do have to caution is that you need Java 7, because ANTLR is a Java program. The only caveat is that while it might be a Java program, it actually targets C# with this particular tool. All that stuff is done behind the scenes for you, so you don't need to worry about running the Java command line or running the program yourself, which is fairly useful. Once you have Java 7 and the plug-in installed, all you really need to do is install the ANTLR runtime into whatever project you have. NuGet does its thing, and from that point you get support for generating the parser for your language on the fly. That being said, we need to dig into ANTLR and understand how it actually works.
Here's a sample grammar for parsing a CSV file. It's not particularly exciting, but it shows you just how little code you actually have to write in order to parse something like a CSV file. If you've never worked with ANTLR or EBNF before, it's worth noting that the parser rules are in lowercase and the lexer rules are in uppercase. How it works is that you pick a starting rule, in this case the file rule here, and from that it recursively references all the other rules you see in this particular grammar. So you start with the file rule, which references the header rule; the header is, as it would be in a CSV file, the first row. The other thing to keep in mind is that for the row itself you might have a sequence that goes like this: the first thing is a field, which could be either text or a string, and after that it's comma separated, or pipe delimited if you change it to match that way. The star here is similar to what you'd see in old MS-DOS wildcards: it means the item occurs zero or many times, and it still matches even if it's not there. The other interesting thing you'll notice in this particular grammar is that the carriage return is optional; the question mark character essentially means it matches zero or one time, and that's just to account for the fact that some operating systems have the carriage return and others don't. The lexer rules are fairly straightforward. For example, in this case you have a "not" operator, which basically says match everything except these particular characters, and it has to match at least one character. The string rule follows the same principle: it starts with a quote character, either single or double, and ends at the point where it sees the closing single or double quote. So, as I mentioned, parser rules are lowercase, and just to go over the grammar recurrence rules and operators: the plus means a rule has to occur one or more times, the star means it has to occur zero or more times, the question mark means it has to occur zero or one time, and the tilde character means it has to not match the following sequence. If you have the pipe operator, it just means it has to match either rule A or rule B; that's fairly straightforward, and it'll just pick the first one that matches.
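As a rough idea of what you get out of a grammar like that, this is more or less what driving the generated parser looks like from C#; a minimal sketch, assuming the grammar is called CSV with a start rule named file, so the generated types would be CSVLexer and CSVParser:

```csharp
using System;
using Antlr4.Runtime;

// CSVLexer and CSVParser are whatever ANTLR generates from the grammar above.
var input = new AntlrInputStream("name,age\r\nalice,42\r\n");
var lexer = new CSVLexer(input);
var tokens = new CommonTokenStream(lexer);
var parser = new CSVParser(tokens);

// Start the descent from the grammar's starting rule ("file") and get a parse tree back.
var tree = parser.file();
Console.WriteLine(tree.ToStringTree(parser));
```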
The other thing is that you have to be cautious about left recursion. ANTLR is a bit more forgiving in that sense, because it supports direct left recursion but not indirect left recursion. So if you have a rule that effectively calls itself, that's okay, provided it isn't the indirect case where A calls B, B calls C, and then C calls A. If you're not familiar with left recursion inside grammars, it's the equivalent of having an infinite loop inside your program, so you have to be very careful about it; if you run it, it'll just sit there in an infinite loop and you won't know why it hung. In general, though, ANTLR is pretty smart about those things, and it's much easier to deal with. The important thing to know here is that ANTLR generates the code for you: it generates the parse tree nodes, the parser itself, and it also generates the visitors and the listeners. So if you want to traverse an entire tree and figure out what your language actually does, or what it means, most of the work is done for you; you just have to subclass whichever classes they provide as part of the code generation.
Here's the Not Quite Lisp grammar. The idea is that I have a very small set of rules. If you look at the full-size grammars they have in the grammars repo, you'll find everything from Swift to Objective-C and Java, as well as other languages like SQL, and they're very long. In this case I kept it very short, because this pretty much covers most of the Lispy things you would see in something like Clojure. The idea is that I want to have this list, or this tree, of rules, and ANTLR is going to drill through it and give me an entire structure and show me the code, without me having to worry about all the theory involved in creating a recursive descent parser. That being said, recursive descent parsers are pretty easy to write, but when we actually start looking at the generated code, it's quite unreadable; that's the interesting part. When I go to my Not Quite Lisp grammar, you'll see it looks like this, and it looks pretty clean compared to the actual parser, which is here. This is the code that comes out of the parser generator. At first it looks okay, but the more I scroll down, the more you start to see that the code is quite unreadable; you get statements like this. The good part is you don't necessarily have to worry about it. The important thing to remember is that this code, despite its unaesthetic nature, is very, very fast, and the best part is that as soon as I change the grammar, it automatically regenerates, so I don't have to worry about what's hiding under the covers. It pretty much hides the dragons, so I can focus on the semantics of the language rather than all the little things that might happen.
When we look at what it generates, there are two things. The first is the visitor class, and the second is the listener class. The difference between the two is that the visitor class is pretty much your typical visitor pattern: it has multiple methods, and each visit method corresponds to one of the rules you see inside the grammar. So you see VisitSet, VisitKeyValuePair, VisitQuotedList, and so on, all the way down. This in turn implements an INQLVisitor interface, which has all the visit methods, and the idea is that if you want to create your own custom actions and process any parse tree, you just add your own visitor-derived class and run it through the tree. Now, if you're more comfortable with the SAX model, ANTLR also supports that idea through listeners, where you can enter and exit a particular node and be notified as the walker goes through the tree. That one's slightly different: it would say EnterKeyValuePair and ExitKeyValuePair for your rule, and it just goes through the entire tree calling enter and exit, which might be useful when you're checking for certain things like context or scope. From that point it's fairly simple and straightforward, because if you don't have to worry about the actual grammar, or the code that's generated under the hood, it's fairly easy to move on to more interesting things. If you want to see more complex examples, you can check this URL here; they've got quite a few languages you could learn from. If you wanted to write your own Java parser or Lua parser, the code is already there, and it's fairly simple to get started based on the same principles you see here.
The next question is, once you have the grammar in place and you have an idea of what your language should look like, how do I make sure I can trust this code? With code generation, one of the problems is that you need to be able to test everything, and that's basically the gist of it: if I introduce another language feature inside the grammar, how do I test it and make sure everything still works? So with this particular rule I'm going to show you how to get your test fixtures into a certain state and make it easy to test your parser, so that it's always in a good working state. You can find the Not Quite Lisp repo here, and I've got quite a few good code examples in there that show you how to do this fairly simply. Ideally, there are a few goals we want for testing the grammar: I want to make it simple, easy to parse, easy to test, and easy to fail. The idea behind that is that, from what I've seen in the code ANTLR generates, there's quite a bit of ceremony just to run the parser, which is quite surprising considering how much work it does for you.
One of the things you'll see inside the Not Quite Lisp repo is an interface called ILanguageStrategy, and the idea behind it is that if you subclass and implement it, you can pretty much cut all the boilerplate code, all the ceremony, down to one line of code. For example, if I were to implement this for the NQL language, it's going to be fairly similar to what you see here. The reason is that these three methods, in sequence, are what gets called in order to set up the parser. The difference is that if you implement this interface, then this particular extension method on string reduces it down to one line of code, because all you have to do is pass in the class that implements ILanguageStrategy. You can see it references a few classes, and we'll go through them: the lexer is something that's generated by ANTLR, and the common token stream is part of the ANTLR standard library, so you don't have to worry about that, and you can easily refactor this to make it even more efficient. The interesting part is the third method, this particular method call. The reason it's so interesting is that at this point, inside this method, I'm picking the starting rule from which all the other rules come; in this case I'm picking the compileUnit rule, and that's important because that's where the descent actually starts. So when you implement your own language strategy, whatever language or form it takes, all you have to do is implement these three members. It's fairly straightforward, and the best part is that you reduce the ceremony down to this.
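The shape of that strategy is roughly the following; a minimal sketch under my own naming assumptions (ILanguageStrategy, ParseWith, and the generated NQL types are stand-ins for whatever the repo actually uses):

```csharp
using Antlr4.Runtime;

// Hypothetical shape of the strategy interface described in the talk.
public interface ILanguageStrategy<TContext>
{
    Lexer CreateLexer(AntlrInputStream input);       // e.g. new NQLLexer(input)
    Parser CreateParser(CommonTokenStream tokens);   // e.g. new NQLParser(tokens)
    TContext GetStartingRule(Parser parser);         // e.g. ((NQLParser)parser).compileUnit()
}

public static class ParserExtensions
{
    // Reduces the ANTLR ceremony to one call: "some source text".ParseWith(strategy)
    public static TContext ParseWith<TContext>(this string text, ILanguageStrategy<TContext> strategy)
    {
        var input = new AntlrInputStream(text);
        var lexer = strategy.CreateLexer(input);
        var tokens = new CommonTokenStream(lexer);
        var parser = strategy.CreateParser(tokens);
        return strategy.GetStartingRule(parser);
    }
}
```

With something like that in place, a test can just call `"(writeline \"hello\")".ParseWith(new NqlLanguageStrategy())` (the strategy class name is again hypothetical) and get the compile-unit context back.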
Now that we've got code that's easy to execute, and you can run the parser, it should be fairly easy to test this stuff. As you can see, it's fairly straightforward, and at the same time there are other ways you can check for errors inside a test. Looking at what the parser generates, I've created a couple of extension methods. There's one method called Children, which just enumerates the immediate children underneath a parse tree node, and there's another method called Descendants, which gives you all the nodes under a parse tree. That's fairly useful, because if you want to check for certain node types, or in this case check for parse errors, all you need to do is check for the error node in both cases. The test code looks like this. In this particular test I'm checking how it parses a specific terminal node; in this case I'm checking whether it parses the parse tree equivalent of a string node. I have a test-terminal-parse helper method, and the idea behind it is that it's a bit noisy, because in this particular case I had to drill down, but the important part to pay attention to when you're doing this at home is these three lines of code. What I'm doing is going down to the very lowest point in the tree and pulling out the terminal node. When ANTLR generates the code and gives you a terminal node, it gives you an expected rule constant, and you just have to extract the payload. The part that isn't mentioned so much is that when you look at the generated C# code, you have the constants at the top of the generated class. So if I want to look for a string, the constant is already there, and when I try to match it up inside the test it's just a matter of making sure I match against the correct constant. By the time I actually run the test and drill down to the code that checks whether I have the expected rule and the expected text, it's just a matter of inspecting the actual parse tree node to make sure I have the right items in place.
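The extension methods and the test pattern being described look roughly like this; a sketch under my own assumptions (the Descendants name, the generated NQLLexer.STRING constant, the NqlLanguageStrategy class, and NUnit as the test framework are all stand-ins):

```csharp
using System.Collections.Generic;
using System.Linq;
using Antlr4.Runtime.Tree;

public static class ParseTreeExtensions
{
    // Immediate children of a parse tree node.
    public static IEnumerable<IParseTree> Children(this IParseTree node)
    {
        for (var i = 0; i < node.ChildCount; i++)
            yield return node.GetChild(i);
    }

    // Every node underneath a parse tree node, depth first.
    public static IEnumerable<IParseTree> Descendants(this IParseTree node)
    {
        foreach (var child in node.Children())
        {
            yield return child;
            foreach (var grandChild in child.Descendants())
                yield return grandChild;
        }
    }
}

public class GrammarTests
{
    [NUnit.Framework.Test]
    public void Should_parse_a_string_terminal()
    {
        var tree = "(writeline \"hello\")".ParseWith(new NqlLanguageStrategy());

        // No error nodes anywhere in the tree.
        NUnit.Framework.Assert.IsEmpty(tree.Descendants().OfType<IErrorNode>());

        // The lowest terminal we care about carries the generated STRING token type.
        var terminal = tree.Descendants().OfType<ITerminalNode>()
                           .First(t => t.Symbol.Type == NQLLexer.STRING);
        NUnit.Framework.Assert.AreEqual("\"hello\"", terminal.GetText());
    }
}
```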
So at this point, if I haven't lost you, it's fairly easy to generate the parse tree. It's really simple: you don't even have to read the dragon book, you just run the generator, make sure you've got a lot of good test coverage, and you're good to go. The idea is that if you've got enough tests, it should be fairly easy to detect any issues with the grammar. If I go back to my code and introduce something that breaks stuff, for example if I just delete the string part and then save it, it's going to tell me it's changed, because ANTLR is running in the background. If I run the tests, of course I had them failing before, so it runs them again and you're going to see four failing tests. The beauty of this is that if you have these kinds of tests for these small inputs, it's really easy to keep adding new language features without breaking stuff. So if I run this again, let's see if my live coding demo works... then essentially everything should pass. See, it's pretty simple: as long as I have the coverage, I can just keep going and keep working on my language.
So the next question is: now that I have a parse tree, how do I actually convert it into an abstract syntax tree? As I mentioned before, if you're familiar with the design patterns, you're pretty much halfway there. There's no magic here; it's actually quite boring and interesting at the same time, because you've worked with these design patterns many times. Has anybody here not seen this book before? Yeah, that's my point. Just to be thorough, the visitor pattern is essentially the idea that you have a fixed tree structure and you want to introduce operations on the tree without actually changing any parts of the tree. That's important, because if you want to run an interpreter against a parse tree, or you want to convert it into an abstract syntax tree,
then you need to be able to add operations to it without affecting the grammar per se. If we look at the ANTLR-generated visitor code, and I know we've seen this before with the actual base class, this is a bit cleaner because this is the interface. One thing you'll notice, as I mentioned before when you look at my toy language grammar, is that there's a correspondence between the language rules and the visit methods you see inside the interface; it's fairly consistent in that sense. The idea is that I want to convert a parse tree into an abstract syntax tree, simply because, although ANTLR does a great job of generating the parse tree, there are certain things I want to introduce into my abstract syntax tree that don't necessarily map one to one with the syntax. If you wanted to create particular language features, for example if you wanted to come up with a C- or C#-based language, there are certain things that don't map correctly unless you explicitly say how you want them to map. For example, maybe you want a conditional node that says: if this condition is true, then execute this particular piece of code. In a parse tree you could do that, but in an abstract syntax tree, where you actually control the model, it's much cleaner and easier. So what I've done in this particular case is I have a class called the ParseTreeConverter; you can go through it in the repository yourself, and we'll get into that. The idea is that the NQL base visitor, if you remember, has a generic parameter that lets you specify what the return type should be every time you visit a particular node. So, for example, if I go here, you'll notice the parameter is basically a type parameter; you can decide what comes out of it, and in this case what I really want to do is convert it into an INode, an AST node. The node itself is a generic class, and the base class is just the AST node. There are a few members here that are particularly interesting. The first one is the node ID, and that's useful because I want to make sure that no matter what AST node I run into, I can identify which one it is, even if I clone it. The idea is that I want to be able to walk through a tree once, and if I can keep track of all the nodes I've visited, then it's fairly easy to figure out what the context is, or where I need to jump to based on that. The child nodes are fairly self-explanatory; that's where you can see what the immediate nodes are under one particular node. And the third one, my favorite, is that it effectively makes the entire tree immutable.
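The node abstraction being described might look something like this; a sketch of the shape only, with the names (IAstNode, NodeId, Children, Clone) assumed rather than taken from the repo:

```csharp
using System;
using System.Collections.Generic;

// An immutable AST node: everything is set in the constructor, nothing has a setter.
public interface IAstNode
{
    Guid NodeId { get; }                         // identifies this node even across tree walks
    IReadOnlyList<IAstNode> Children { get; }    // immediate child nodes
    IAstNode Clone();                            // copies never mutate the original
}

public abstract class AstNode : IAstNode
{
    protected AstNode(IEnumerable<IAstNode> children)
    {
        NodeId = Guid.NewGuid();
        Children = new List<IAstNode>(children).AsReadOnly();
    }

    public Guid NodeId { get; }
    public IReadOnlyList<IAstNode> Children { get; }
    public abstract IAstNode Clone();
}
```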
The idea there is that you don't want to put any setters on it. If you don't have any setters, it essentially means you can go through the entire tree and spawn sub-compilers and sub-interpreters, and you never have to worry about concurrency. So here's an example of how you could test the parse tree converter. I've got a simple expression here, and I make the call that we're all familiar with: I parse it with my language strategy, and it gives me the compile unit context. I know I cheated in this case, because I know what the type is, but you probably don't want to do that in a test. What I've done with the parse tree converter is simply check the output node to make sure it's of the expected type; in this case I wanted it to map fairly cleanly to the parse tree, so that there's always a root node.
So I know where I start. The other question is: that's great, I have a root node, but how do I actually want the AST to look? If you want a good example of some ASTs that are already out there in the wild, you can look at CodeDOM, which is fairly old but fairly reliable, because nobody changes or touches it anymore, and more recently you can check the Roslyn API. The idea behind these abstract syntax trees is that they're a full representation of all the language features you would see in, say, VB or C#. One of the more interesting things about them is that if you look at the Roslyn compiler, it has one abstract syntax tree for both VB and C#; it's basically the secret sauce behind how they're able to easily translate between VB and C#, because essentially you have one syntax tree that outputs to two different languages. As I mentioned before, your syntax tree should be fairly robust against concurrency, especially in cases like Roslyn where you might have a large code base you want to compile, and there are no guarantees about locking or any mechanism that would make it safe; you want to avoid any kind of deadlocks, especially if you're trying to go through the entire tree and make sure everything works as it's supposed to. Now, the tree hierarchy I have is a fairly straightforward mapping of lists.
When I look at the parse tree converter, you can see that what I'm outputting is the node, and the nice part is that there's nothing special about it; this is just plain C# code. What I'm doing is based on the actual input: for example, when I have a terminal and I call VisitTerminal, I'm drilling down to what the actual text is, checking what the payload is, and based on the generated constant you see there, I'm creating the corresponding abstract syntax tree node. The other thing worth noting is that, since I've created different types of abstract syntax tree nodes for every visit method, it cleanly rolls up into this nice tree: as the converter crawls through the parse tree, it effectively clones it and converts it into the abstract syntax tree node that is exactly what I want it to be.
I think we can breeze past this one, because I've mentioned it a couple of times, but it's worth noting that the simplest possible thing you can do with an abstract syntax tree is make sure it's immutable. Basic rules for immutability: if you have to put setters in, make sure they're never public, and if you've got an abstract syntax tree, you don't modify it; you can clone it, but you don't modify it. As I mentioned before, you've got this interface, and every single class effectively has no setters; if I want to change a value, I set it in the constructor. Inside the Clone method itself, it's basically a new copy with the new contents, and that's pretty much how it is across the entire library. If you were to do this yourself, it's fairly straightforward: you just have to make sure you really don't have any setters or anything that would allow you to change state. I would go as far as to say don't even have private members with shared dictionaries, because that will get you into trouble, especially when you start crawling through scope.
Just for the sake of completeness, this is what a symbol looks like in Not Quite Lisp. If I were to create something like an example symbol inside the code, the symbol would look like this. I also mentioned that I wanted to have the node ID, because I want to identify every single node I come across. The important thing to note, at least for the design I had, is that even if I clone a particular node, the node ID will always be different. You could have two symbol nodes with the same symbol, but the minute I clone one, it gives me a completely different identity.
Now, the other thing is that since I have extension methods here that let me run LINQ queries, it's fairly straightforward to traverse the entire tree or do a quick search. For example, I have the Descendants extension method, and the idea is that I just get the entire list of descendants as one nice long IEnumerable. I can check it, and it's very easy to work with inside a test. Assuming you get to this point, which is not that hard, this is where the fun stuff happens.
Let's say you have the abstract syntax tree and you've got the parse tree; now you want to do something with it. This is how I would do it. I've got two interfaces: a compiler interface and an interpreter interface. The first thing you'll notice is that the parameters for both the Compile and the Interpret methods are exactly the same, because I'm taking an abstract syntax tree node in both cases. For the Compile method I kept it simple: the idea is that I want to generate a LINQ expression tree that rolls up into one expression I can just call Compile on and then run. It's that simple. For the interpreter, that's where the interesting part comes out, because in Lisp the code is the data and the data is the code, so when I take a particular abstract syntax tree node, the coolest part is that the result of the Interpret method is just another node.
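Those two interfaces probably look something like this; a minimal sketch with assumed names:

```csharp
using System.Linq.Expressions;

// Compiling turns an AST node into a LINQ expression tree that can be compiled and executed.
public interface ICompiler
{
    Expression Compile(IAstNode node);
}

// Interpreting evaluates an AST node immediately; because code is data, the result is another node.
public interface IInterpreter
{
    IAstNode Interpret(IAstNode node);
}
```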
Just to recap where we are, let's go back to the three basic steps, and I'm serious when I say this is pretty much all you need to do. There's a lot of stuff that's hidden for you, but essentially, if you can memorize these steps, you can start building your own language and make it do whatever you want. The first part is that you convert the code into a parse tree: you've got your string of foo, whatever you want to parse. Then you take the parse tree and convert it into an immutable abstract syntax tree. And the last part is that I'm going to show you how to feed that same AST through both the compiler and an interpreter, and we'll see what happens.
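Put together, those three steps are roughly this much code; a sketch that reuses the assumed names from the earlier snippets (ParseWith, ParseTreeConverter, and concrete NqlCompiler/NqlInterpreter implementations), so treat every identifier here as illustrative:

```csharp
using System;
using System.Linq.Expressions;

// 1. Source text -> parse tree (ANTLR does the heavy lifting).
var parseTree = "(writeline (+ 1 1))".ParseWith(new NqlLanguageStrategy());

// 2. Parse tree -> immutable abstract syntax tree.
IAstNode ast = new ParseTreeConverter().Visit(parseTree);

// 3a. Feed the AST to the compiler: get an expression tree, compile it, run it.
Expression body = new NqlCompiler().Compile(ast);
Expression.Lambda<Action>(body).Compile()();

// 3b. Or feed the same AST to the interpreter: the result is just another node.
IAstNode result = new NqlInterpreter().Interpret(ast);
Console.WriteLine(result);
```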
So in this particular demo, it turns out I have both a compiler and an interpreter. It doesn't do anything particularly exciting, but if you drill into how it actually does it, that's the more interesting part. There are two essential hierarchies I work from: the base compiler class and the base interpreter class. If I start with the base compiler class, this is where it really gets interesting. You'll notice there's a visitor pattern here, and the return type is a LINQ expression tree, so if you've worked with expression trees before, this is fairly straightforward. It's actually quite deceptive, and the reason it's so deceptive is that I wrote my own late-binding method called Invoke. What Invoke does is that it's smart enough to look into the particular class you're calling it against and match the parameter with the specific type you're looking for. So if I call Visit through the Invoke method, it will match it up with the correct visitor method call.
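A late-binding dispatch like that can be done with a few lines of reflection; this is a sketch of the general idea, not the talk's actual Invoke implementation:

```csharp
using System;
using System.Linq;

public static class LateBinder
{
    // Finds a method on 'target' with the given name whose declared parameter types are
    // assignable from the runtime types of the arguments, then invokes it.
    public static object Invoke(object target, string methodName, params object[] args)
    {
        var argTypes = args.Select(a => a.GetType()).ToArray();
        var method = target.GetType()
            .GetMethods()
            .FirstOrDefault(m => m.Name == methodName &&
                                 m.GetParameters().Length == argTypes.Length &&
                                 m.GetParameters()
                                  .Select(p => p.ParameterType)
                                  .Zip(argTypes, (declared, actual) => declared.IsAssignableFrom(actual))
                                  .All(ok => ok));
        return method?.Invoke(target, args);
    }
}
```

Calling `LateBinder.Invoke(compiler, "Visit", listNode)` would then route to whichever Visit overload declares a compatible parameter type (in this sketch, simply the first compatible one found).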
The cool part, the one I'm always proud of, is what happens when you get down to, say, the list node. If you remember, in Lisp an operation is a list where you have the symbol as the first element, and everything that follows it is essentially just the parameters to that operation. So, for example, I could have a WriteLine: writeline followed by "hello world", where afterwards comes the string, and that string is the parameter passed to the WriteLine. What I've done in this particular case is modify it slightly so that I parse out the name of the symbol. I also have the option in the subclasses to alias the symbol. For example, one common thing is that you don't want to have to call the Add method by spelling it out; you might want to just use the plus operator, or the minus operator for subtraction, and this allows the subclasses to do that. So, assuming I already have the symbol, the cool part is when it gets down here: it breaks out the actual child nodes from the list and then finds the derived-class method that actually does the operation.
It sounds kind of fuzzy, but it actually becomes fairly straightforward. Take, for example, the add operation: if I want to add functionality to my compiler, all I have to do is come up with a matching method that has the two parameters I want to match. Interestingly enough, if I were to take this method and modify it so that it takes a string node instead of a number node, what I'm doing is essentially a form of type coercion: on the fly you convert the string back into a number, then take the expression and add it together as if there were no issues on the other side. And you can see the pattern is pretty much the same if I have a WriteLine method and I want to call Console.WriteLine: it's essentially the same pattern. For example, I have WriteLine, and I'm looking for a string node, and the reflection itself is smart enough to figure out that this is the method you want to go with.
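In the compiler, those handler methods would each return a LINQ expression; here's a sketch of what the Add and WriteLine handlers might look like, where NumberNode and StringNode are assumed to expose a Value property and the method names are my guesses:

```csharp
using System;
using System.Linq.Expressions;

public partial class NqlCompiler /* : BaseCompiler with the late-binding dispatch */
{
    // (+ 1 2): both operands are numbers, so just build an Add expression.
    public Expression Add(NumberNode left, NumberNode right)
        => Expression.Add(Expression.Constant(left.Value), Expression.Constant(right.Value));

    // A crude form of type coercion: if one operand is a string, parse it back into a number first.
    public Expression Add(StringNode left, NumberNode right)
        => Expression.Add(Expression.Constant(int.Parse(left.Value)), Expression.Constant(right.Value));

    // (writeline "hello"): emit a call to Console.WriteLine(string).
    public Expression WriteLine(StringNode text)
        => Expression.Call(
               typeof(Console).GetMethod("WriteLine", new[] { typeof(string) }),
               Expression.Constant(text.Value));
}
```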
Of course, I also had to come up with an overload, and this one's fairly interesting. The idea is that I'm using LINQ expressions to do the type inferencing for me. When I actually go and do this, let's see if I can put a breakpoint here, and then I just run the compiler; of course it's going to work. The interesting part is that when it gets to the WriteLine and it sees an expression, for example if I wanted to have a computed expression, the nice bit is that the type inferencing is pretty much done for me. LINQ is effectively the brains of this; the LINQ expressions do the work for me, where in normal cases I probably would have had to write all the compiler smarts myself, because I'd have to go through the abstract syntax tree. The cool part is that when I get to the expression, its type already tells me that it's an integer. Then I just have to search the Console class for the correct WriteLine method for that particular expression type, so I would use WriteLine with an int in this case, and then I inject the actual expression call so that it calls the correct method.
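That overload, which accepts an already-compiled sub-expression and lets Expression.Type pick the right Console.WriteLine, might look like this sketch (continuing the hypothetical NqlCompiler from above):

```csharp
using System;
using System.Linq.Expressions;

public partial class NqlCompiler
{
    // (writeline (+ 1 1)): the operand has already been compiled into an expression,
    // so its Type property tells us which Console.WriteLine overload to bind to.
    public Expression WriteLine(Expression value)
    {
        var method = typeof(Console).GetMethod("WriteLine", new[] { value.Type });
        return Expression.Call(method, value);
    }
}
```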
How this particular class works is that it goes through every single one of your abstract syntax tree nodes and converts each one into an expression, the same way the parse tree converter was able to convert the parse tree into an abstract syntax tree. At the end of the day, when I actually run this, it's fairly straightforward to see that it does exactly what it's supposed to do. In this case I'm just adding 1 plus 1, nothing too exciting, but the interesting part is that out of this very little code I was able to come up with my own compiler. The other thing to notice is that I didn't actually write the IL myself; I probably would have done that normally, but in this case you don't have to, because this is just as fast as compiled code. On the other side of this, now that we have a good compiler, the base interpreter is actually almost exactly the same.
One thing to remember is that it's basically the same pattern; the difference is that in the methods I go through with the calculator, when I run across the nodes I'm interested in and the operation I'm interested in, I do the evaluation immediately. So in this case I actually have to convert the text into an int, because I'm basically just taking text and doing the conversion. Of course, if you're doing a real language, you'd want to handle cases where you have precision, and that's fine, because you could always change the parse tree grammar to have something like C#, where you can say what the data type is, for example 1.0d for a decimal literal. But in this case, as it goes through every single operation you see here, I automatically roll it up, add it together or subtract, do the operation, and return the result immediately. So what this does is take all the operations and just compute them, until you get to a point where you're only left with one node.
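On the interpreter side the same handler shape evaluates eagerly and hands back a node; a sketch with the same assumed node types (NilNode is also an assumption, standing in for the language's nil):

```csharp
// Same dispatch pattern as the compiler, but evaluation happens immediately
// and the result of every operation is just another AST node.
public class NqlInterpreter /* : BaseInterpreter with the late-binding dispatch */
{
    public IAstNode Add(NumberNode left, NumberNode right)
        => new NumberNode(left.Value + right.Value);

    public IAstNode Subtract(NumberNode left, NumberNode right)
        => new NumberNode(left.Value - right.Value);

    public IAstNode WriteLine(StringNode text)
    {
        System.Console.WriteLine(text.Value);
        return new NilNode();   // Console.WriteLine returns void, so hand back a fresh nil
    }
}
```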
The special case is the WriteLine method: since Console.WriteLine obviously returns void, it's worth noting that I'm going to return a nil instead. Of course, when you start working with your own language you might have different concepts of what a null value is; in my case I wanted to keep it simple, so nil never equals nil, and that's why I always return a new one. The other thing I mentioned is that I have aliasing, so I don't necessarily have to type out the name of the method; I just have to be able to use the plus or the minus. It's just a simple map, and the idea is that every time it sees a particular operator, it maps it to the corresponding method. Based on that, that's pretty much it; it doesn't have to be that complex.
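The alias table is as small as it sounds; a sketch (the exact names are assumptions, the mechanism is just a dictionary lookup):

```csharp
using System.Collections.Generic;

// Operator symbols alias the handler method names, so "(+ 1 2)" routes to Add.
var aliases = new Dictionary<string, string>
{
    ["+"] = "Add",
    ["-"] = "Subtract",
};

string ResolveMethodName(string symbol)
    => aliases.TryGetValue(symbol, out var methodName) ? methodName : symbol;
```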
Even though I presented a very, very simple language, you could do a lot more with this, because this is just a simple form of analysis against an abstract syntax tree. That being said, this is my favorite part: it turns out there's not a lot of code involved if you want to implement metaprogramming, at least with Lisp. If you want to implement metaprogramming, it's just a matter of making sure you have a plug-in infrastructure and you're able to modify the abstract syntax tree before it hits the compiler or the interpreter. It sounds complicated, but it's not, because if you've been working with things like IoC containers, the plug-in infrastructure is already there. You can use whatever framework you want; at this point it's just a matter of having an interface that modifies the AST, and once you've modified the AST you just keep going and add whatever functionality you want.
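That plug-in seam can be as small as a single interface; a sketch of one possible shape (the IAstRewriter name and the pipeline helper are my assumptions):

```csharp
using System.Collections.Generic;
using System.Linq;

// A metaprogramming hook: each plug-in gets a chance to rewrite the (immutable) AST
// before it reaches the compiler or the interpreter.
public interface IAstRewriter
{
    IAstNode Rewrite(IAstNode root);
}

public static class AstPipeline
{
    public static IAstNode ApplyRewriters(IAstNode root, IEnumerable<IAstRewriter> rewriters)
        => rewriters.Aggregate(root, (current, rewriter) => rewriter.Rewrite(current));
}
```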
That being said, at this point it's fairly straightforward and very easy to get started, but one thing I just want to remind everybody is that no matter how simple your language is, or whatever your ambitions might be, it's worth noting that C# was really not built in a day; it actually took 15 years to get it right. So when you're doing this stuff, whether or not you want to build a language, I'm just going to leave you with this particular diagram. Aside from building a compiler, a lot of the more interesting stuff we work with on a day-to-day basis, and don't even think about, has everything to do with these abstract syntax trees. You have a compiler, which is what we've been talking about, but there are other things you could do, like translating between languages,
which is what Roslyn does when it goes from VB to C#. At the same time, you can also have the input language be the same as the output language. Why in the world would I want to do that? Well, everybody here uses ReSharper, right? ReSharper is essentially a parser that parses your C# or your VB code, sticks it in an AST, refactors it, and then spits the code back out. These are the kinds of tools you can make with this technology, and as you can see, it's not that complex at all. It's actually quite simple, I'd even say simplistic. That being said, it really comes down to practice. Personally, I haven't even begun to create the language of my dreams, but what I can say is that I won't give up, and I'll just keep going until I get it right.
I'm not getting paid for this in any sense; it's just something that I love to do. If you really want to get into this, it's just a matter of practice. That being said, since I've butchered compilers for you here, you could look into the books themselves, which explain how to do this in a more in-depth manner; I know we only had one hour. The first two books you see here are written by Terence Parr, the author of ANTLR, and they're useful if you want to create any language in general. Since this is more of a CLR talk, you probably also want to check out Metaprogramming in .NET, which is good because Jason Bock talks about all the different back ends you can work with: you can do Reflection.Emit, and you can also do expression trees, which are probably the simplest and are the ones I used today. If you start with these particular books, you should be well on your way to creating whatever language you want. I'll just leave this up here so you can take a picture of it.
That being said, does anybody have any questions? Yes. The question was which back ends I would use, and why and when. I would use expression trees if I just wanted the speed but didn't actually want to serialize it down to a DLL or an EXE. If I wanted to actually ship this, and make it so I could just xcopy it into some directory and run it, that's when you would use something like Reflection.Emit or Cecil. Cecil is good if you just want to read assemblies into memory, modify them, and save them back. CodeDOM is really old-school, but the idea there is that if you want to generate your super duper awesome DAL layer from 2003, then you can do that; you never know. That being said, anybody else? Yeah, you could pretty much make ANTLR do anything; I think they even have a Brainfuck grammar.
Yeah, it's an amazing tool, and the best part is it's free. It really abstracts away a lot of the things you would otherwise have to worry about, because when I was just trying to learn parsing, I got lost trying to work through bottom-up parsers versus top-down. The thing about that is that you get so lost in the not-invented-here syndrome with a parser that you lose the plot; you lose the idea that you want to create this great language, and you get stuck in the parser. That's really the point of the talk: it's just to get you started, so that you can do whatever you need to do, focus on the semantics, and implement the language that you want. Anything else, anybody? All right, so that's pretty much it for me. Thanks for attending my talk.