
Roll your own compiler with LLVM


Formal Metadata

Title
Roll your own compiler with LLVM
Subtitle
Easy IR generation
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
A close look at IR code generation inside a Modula-2 compiler. With LLVM, one of the main tasks of a compiler engineer is to design the IR code generation from an abstract syntax tree (AST). How easy is it? Although I have worked with and on LLVM for several years now, I had not yet written a compiler from scratch. So I began to work on my own compiler for a simple yet powerful language: Modula-2. In this talk I introduce the architecture of the m2lang compiler. My main focus is the generation of basic blocks and IR code from the AST. I show which general patterns emerged and which areas of the language caused trouble. I also give hints on when it is useful to introduce yet another intermediate language between the AST and the IR.
Transcript: English (auto-generated)
Okay. Yeah. Hello everybody. My name is Kai. I'm a very casual contributor to LLVM. I haven't done anything with LLVM in the last year, but I did in the years before.
I also was the maintainer of LDC, the LLVM-based D compiler, so I have some experience working with D. Yeah, and today I will talk about my latest project, which is my own compiler.
And for sure, compiler construction is a very, very big topic. If you know, for example, the Dragon Book, it's, I think, a thousand and more pages long. So, in 30 minutes, I don't have time to tell everything about compiler construction.
So, I concentrate on the topic which interests me most, and this is the generation of the intermediate representation for LLVM, because that's the main interface to LLVM. Okay. And the big question is, if you look at the output of compilers,
for example, Clang or LDC or the Rust compiler, the generated intermediate representation always looks a bit different, and sometimes it looks more elegant. Sometimes there are constructs in it which seem to be
redundant and which could be optimized away. And my question was: why does this happen? Why do some compilers generate lovely, beautiful intermediate representation, and why do others generate some junk within it? By the way, it does not really matter,
because LLVM optimizes this all away. But my question is, why does this happen? And to study this question, it's obviously good to have a compiler. And yeah, it should be for a real language,
a compiler for a real language, but also not too complicated. Because I'm the only person working on it, it must be somewhat manageable by one person, and this rules out C++, for example. And I looked a bit around,
and I also looked at programming languages I have experience with, and I came up with a very old language for my compiler: I chose Modula-2. Okay. So, that's indeed very old.
The first implementation of Modula-2 was in 1979. It was done by Niklaus Wirth. I think everybody knows him as one of the main contributors to structured programming. It was developed in Switzerland, in Zurich.
So, it's a real European language, and it has some beautiful characteristics. The first one is that it has a very carefully designed syntax. For example, if you ever tried parsing C++ or C,
you remember the dangling-else problem: if you look at the else, you don't know to which if the else belongs. That's not a problem with Modula-2. The syntax is designed so that it's very clear to which if the else belongs.
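To make this concrete, here is the classic ambiguity as it appears in C or C++ (my own minimal illustration, not an example from the talk's slides):

void doThen();
void doElse();

// Dangling else in C/C++: the grammar alone does not say which `if`
// owns the `else`; the language resolves it by decree (it binds to the
// nearest unmatched `if`).
void f(bool a, bool b) {
  if (a)
    if (b)
      doThen();
    else        // binds to `if (b)`, despite the indentation
      doElse();
}

// Modula-2 avoids the ambiguity because every IF is closed by END:
//   IF a THEN
//     IF b THEN DoThen() ELSE DoElse() END
//   END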
There's also something revolutionary: Modula-2 has a module concept. For example, compare this with C++ today. It's a working module concept, and it dates even back to a language called Modula, from which it was taken.
Yeah, and it's a systems programming language, so there are low-level facilities integrated. You also have procedure types, which are basically pointers to procedures. Yeah, it was quite popular in the 80s and the 90s.
What they did in Zurich first was write the operating system in this language. So, this proves that it's a real systems programming language; it was done for the Lilith workstation. Never seen it, only know it from the literature.
There were some big toolkits constructed with Modula-2. I know the compiler toolbox from the Gesellschaft für Mathematik und Datenverarbeitung in Germany, also called Cocktail. It was once open source; I think today they license it.
It was also quite popular for Windows programming. If you search on the Internet, you find a lot of Windows programs written in Modula-2. So, it's a real programming language, but it's also simple enough to create a compiler for it on my own.
I concentrate on the, let's say, original language definition, which was defined by Wirth in a book called Programming in Modula-2, the fourth edition, also abbreviated as PIM4.
Later, there was an extension: ISO created a working group which worked on the standardization of Modula-2, and they also defined some new features for the language, for example, exception handling or an object-oriented facility with classes.
But I'm just looking first at the original definition. For those of you who have never heard of Modula-2 or have never seen it, I just cut an example from the book. It's a simple calculation of
the greatest common divisor and the least common multiple. Nothing fancy, but this gives you a flavor of the language. So, we have modules, we can import from modules, and we have the standard things like IF, WHILE, and so on.
So, it's a very nice language. Okay. So, my first approach: how to deal with the compiler. Hopefully, everybody knows that a normal compiler has different phases or stages.
So, I start with a source file in Modula-2, and the first part is the parsing, which generates an abstract syntax tree. For the parsing, I currently use ANTLR, which is a compiler construction tool
that is very easy to use. It's written in Java, but it can output code for Java, C++, Python, and a bunch of languages. For my purpose, it has a very nice property:
when you do parsing, you get a parse tree, and with ANTLR, you can define some rewriting rules so that you can also define your abstract syntax tree. In parsing, for example, you see the keyword IF, then comes an expression, and then comes the keyword THEN,
and the THEN is completely redundant in the in-memory representation. With the abstract syntax tree, you just get rid of this. This is very easily done with ANTLR; it's a nice tool for doing this.
Then, I have a semantic phase where I do type resolving and other checks. That's currently very crappy. What I do is construct a hand-coded abstract syntax tree
for the purpose of the code generation, because my main goal is to get rid of the ANTLR tool. It's an additional dependency, and I do not want to have this dependency. And then again, Modula-2 is a very, very easy programming language:
you can code a recursive descent parser for it by hand. It's not very complicated. But just for speed, I currently use ANTLR, because I want to study the generation of the LLVM IR. So, the parts before that
are currently very, very crappy. I haven't published the source for it yet. There's already a GitHub repository, and as soon as the code is in a state where I think someone can look at it,
I will publish it under this URL. But for now, it's very, very crappy. I'm not proud of this code, so be patient. Okay, so my main focus is the code generation.
I have a decorated abstract syntax tree, decorated because I have already done type resolution, scope resolution, and so on, so I have everything in place, and from this, I now want to generate LLVM intermediate representation. And the building blocks
for this intermediate representation are the basic blocks. Every time I generate an instruction, for example an add, a return, or a branch, it goes into a basic block. That's more or less the container for the instructions, but with a very, very sharp restriction on it,
because every basic block is single entry and single exit. So, it's just a linear flow of instructions. There can be several jumps into such a basic block, and at the end, there's a terminating instruction,
some kind of branch, or also an invoke of a method, and there the basic block is left. Inside the basic block, there's just one single flow. So, this is a simple example of how it looks.
In this case, it's taken from Clang, with some tweaks in it, but that's the look of a basic block. It's a building block because most optimizations are done on basic blocks.
So, if you look inside LLVM, there's a lot of work done on optimization, and the main focus is always on the basic block. And there's another property you can also see with LLVM:
a procedure or function consists of several or a lot of basic blocks, and these blocks form a so-called control flow graph. If you have used the LLVM tools, there are some command line switches (for example, the -dot-cfg pass in opt) with which you can generate a graphical view
of this control flow graph using the DOT program. That's very nice. If you do some debugging, it's also worth looking at these graphs to better understand what is going on. Yeah, so now it's clear where the instructions have to go,
and, well, I'm a very simple person, and I like to do things simply. So, my first approach for how to generate the code:
I generate a basic block, I have the pointer to this basic block, and then I generate the instructions into it. How does it go? Okay, some code here. I have a node of my abstract syntax tree;
here, it's for a procedure. It's very, very simplistic: I have a name, I have the inner block, that's the list of statements, and I can accept a visitor. And what is my visitor? Okay, the visitor has to create an LLVM function type.
Then it creates the LLVM function. When I have this, I can create my basic block, and it's always nice to use the IRBuilder because it makes things a little bit easier. Then I have my basic block, and what I do now is,
with the visitor, just generate the code into this basic block. Okay, and as I already said, a basic block is always single entry, single exit, so this approach means that sometimes during the generation I have to create new basic blocks and switch the pointer in the visitor, but that should be no real problem, hopefully.
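A minimal sketch of this first step against the LLVM C++ API; this is my reconstruction for a parameterless procedure, not m2lang's actual visitor code:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Create the LLVM function type and function for a procedure, create the
// first basic block, and point the IRBuilder at it; statements are then
// emitted through the builder into the current block.
llvm::Function *emitProcedure(llvm::Module &M, llvm::StringRef Name) {
  llvm::LLVMContext &Ctx = M.getContext();
  llvm::IRBuilder<> Builder(Ctx);

  llvm::FunctionType *FTy = llvm::FunctionType::get(
      llvm::Type::getVoidTy(Ctx), /*isVarArg=*/false);
  llvm::Function *Fn = llvm::Function::Create(
      FTy, llvm::GlobalValue::ExternalLinkage, Name, &M);

  llvm::BasicBlock *Entry = llvm::BasicBlock::Create(Ctx, "entry", Fn);
  Builder.SetInsertPoint(Entry);

  // ... visit the statement list here, emitting into the current block ...

  Builder.CreateRetVoid(); // every basic block needs a terminator
  return Fn;
}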
but that should be no real problem, hopefully. Okay, and hopefully, what happens, this approach works very well if I generate code for something I plus one for an assignment,
Okay. This approach works very well if I generate code for something like i := i + 1 in an assignment, but as soon as I start nesting structures, for example an IF and ELSE nested together, I get into trouble. And why do I get into trouble? Let's have a look at the inner IF.
There's the B > C condition. Okay, that's an expression; I generate the code for it. Then comes the IF, so I have to generate a compare, and then comes the branch, and this branch ends my current basic block, and I have to generate more or less
three new basic blocks: one for the THEN part, one for the ELSE part, and one for everything which comes after the END. So, the condition is one basic block, this statement is one basic block, that statement is one basic block,
and then I have to continue after it, so I need a basic block there. And okay, I can do this in the visitor, but when I do this, I get basically empty blocks, because for this END
I generate a basic block. After the THEN part is finished, I have to jump to this basic block. This one is empty, and I have to put something into it, and what I have to do is jump after the outer END.
So, with this approach, what I generate is basically a basic block with just a label and a jump to another label, which is useless, looks ugly, and is exactly the thing I do not want to generate.
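Here is a hypothetical sketch of that naive IF lowering (the helper and its signature are mine, not m2lang's). When the statement after the inner END is the outer END, the inner continuation block receives nothing but a branch, which is exactly the ugly pattern described above:

#include "llvm/IR/IRBuilder.h"
#include <functional>

// Naive IF lowering: create then/else/continuation blocks and emit the
// branches. For a nested IF, the inner "if.end" block may end up holding
// only a single unconditional branch to the outer continuation.
void emitIf(llvm::IRBuilder<> &Builder, llvm::Value *Cond,
            const std::function<void()> &GenThen,
            const std::function<void()> &GenElse) {
  llvm::Function *Fn = Builder.GetInsertBlock()->getParent();
  llvm::LLVMContext &Ctx = Fn->getContext();

  llvm::BasicBlock *ThenBB = llvm::BasicBlock::Create(Ctx, "if.then", Fn);
  llvm::BasicBlock *ElseBB = llvm::BasicBlock::Create(Ctx, "if.else", Fn);
  llvm::BasicBlock *ContBB = llvm::BasicBlock::Create(Ctx, "if.end", Fn);

  Builder.CreateCondBr(Cond, ThenBB, ElseBB); // ends the current block

  Builder.SetInsertPoint(ThenBB);
  GenThen();
  Builder.CreateBr(ContBB);

  Builder.SetInsertPoint(ElseBB);
  GenElse();
  Builder.CreateBr(ContBB);

  // Continue emitting here. If nothing follows but the outer END, this
  // block will contain only a branch to the outer continuation.
  Builder.SetInsertPoint(ContBB);
}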
From my memory, Clang does not generate such code, but for LDC, I know instances where we generate such code, and that was my motivation: where does it come from? This is a very simple example
of where the need to generate such code arises. A bit more insight about why this happens: if you think a bit about it,
then it becomes very clear. The abstract syntax tree is a representation of the syntax of the text file. It's not the text file itself; it's abstracted, because I have looked for the keywords and separated all this,
but it's still very close to the textual representation. On the other hand, the basic blocks form the control flow graph. Each basic block, as I explained, is a single flow of instructions, and then there are jumps, or, if you draw the graph,
there are connectors to the next basic blocks. It does not look as easy or as simple as I have put it here; it's much more complicated. Just think of a switch, for example. But it's very clear from the graphical representation:
there's a mismatch between both. If I start on the left side to generate the right side, it will not be as easy as I have tried with my visitor. So, my next question is, can I fix this?
Yeah. The first answer is: LLVM is a very good tool. So, one approach is to just configure the optimization pipeline in LLVM to do some jump resolution.
It runs even under the -O0 option, and then the empty blocks are gone. So, let LLVM do the work. That's very simple. But my personal opinion is that it's still ugly. I do not want this.
Therefore, I came up with another solution. The main problem is that whenever I generate a jump, I need a target, and I somehow need to know what the target is.
In the case of the nested IF-THEN-END construct, I don't know what the real target for my jump is; therefore, I generate an empty basic block. What I try to do is superimpose
the control flow graph on my abstract syntax tree, which simply means that I add a new pointer to the AST nodes which represent a statement; I call this the exit pointer. This is a pointer to the statement
which starts the next basic block. It can be filled in with a recursive visitor very quickly; that's okay for Modula-2. With this, I can solve my problem.
By the way, there's another construct, the LOOP and EXIT construct, which is similar to a while and break construct in C++. There you need to do some scope resolution: if you see the EXIT, you also have to know which LOOP it ends,
and you need to be a bit careful about this. But basically, I can construct this with a visitor. I then have additional annotations in my abstract syntax tree, and with this information, I can get rid of the empty basic blocks.
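A sketch of the annotation, with hypothetical names (the AST classes and the field name are mine; the talk only says it is a pointer to the statement starting the next basic block):

#include <cstddef>
#include <vector>

// Every statement node carries a pointer to the statement that starts
// the next basic block, filled in by a recursive pass before codegen.
struct Statement {
  Statement *ExitTarget = nullptr; // where control continues afterwards
  // ... kind, operands, nested statement lists ...
};

// Link each statement in a list to its successor; the last statement
// inherits the successor of the enclosing construct. Nested bodies
// (IF branches, LOOP bodies) are handled by recursing with the right
// successor; an EXIT must resolve to the statement after its LOOP.
void linkExits(const std::vector<Statement *> &Stmts, Statement *After) {
  for (std::size_t I = 0; I < Stmts.size(); ++I) {
    Statement *Next = (I + 1 < Stmts.size()) ? Stmts[I + 1] : After;
    Stmts[I]->ExitTarget = Next;
    // recurse into Stmts[I]'s child statement lists here
  }
}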
Okay. That's a very restricted implementation. What you can also do is go all the way to a full-blown control flow graph. It can be constructed from the abstract syntax tree;
there are algorithms for it. Then I have a control flow graph, and I can generate the LLVM intermediate representation from it. That's very, very costly if you do it only to get rid of some empty basic blocks and other ugly generated instructions.
But nevertheless, it can be good to do. So, what I would like to do is take my abstract syntax tree
and transform it into a kind of high-level flow graph. How to do it? The first step is to get rid of some higher-level constructs. Every programming language has some
syntactic sugar, as it is called, which just means that you have a statement which can be made up of lower-level statements. For example, we have the WHILE loop, and the WHILE loop can be replaced with a LOOP,
which is an endless loop, and an EXIT. So, for example, here on the right side, there's the WHILE with a condition, and this can be replaced with a LOOP and an IF with an EXIT inside. This is a lower-level construct, and in Modula-2 I can do this
for the FOR loop, the WHILE loop, and the REPEAT loop. Also, the Boolean operators AND and OR have short-circuit evaluation, which is defined in the language as being an IF-THEN-ELSE construct, so I can also replace them with a nested IF-THEN-ELSE.
Then I have a lower-level, in this case still syntax, tree, and then I can go and replace all the implicit jumps with explicit labels and gotos.
When I do this, I have basically constructed a control flow graph. There are two things which happen here. The first one is: if you take this route, think about your debug metadata,
because in the source there's the WHILE statement, and then you do something on your syntax tree, and suddenly there's a LOOP statement. You still have to remember that there was the WHILE and there was the condition, so this can be very tricky. Yeah, okay, and when you do this,
what you have done is more or less create your own intermediate representation. It's higher-level than the intermediate representation of LLVM, but a bit lower than your abstract syntax tree.
And I think it was two years ago now that there was a big announcement from the Rust compiler people. They made a big statement: they had created their own intermediate representation.
It's called MIR, and it was a very big news hype; at least I saw it as a news hype. I was always interested in what they had really done, and they have done exactly this: they created a control flow graph with some lower-level constructs
and called it their own intermediate representation. It's a very valid approach. Yeah, and when do I want to do this?
This is very helpful, for example, if I have to do some type checking or scope checking, so if I'm still working on the semantics of the language. It can also be helpful if I generate a lot of synthetic code, for example, cleanup handlers for exception handling or some other fancy stuff which might happen.
In this case, it's very good if I don't have to work on the syntax tree, because there I always have to use the constructs of the source language; with the control flow graph, I can introduce my own instructions, for example the goto. This makes these things much, much easier,
and this was, for the Rust people, the reason to do it: they had to do some type checking, and it was very, very difficult to do on the abstract syntax tree, but with this control flow graph it was much, much easier. Therefore, they introduced it.
When I look at my problem, my Modula-2 compiler, this currently seems to be overkill. With my approach, with the additional pointer to the next basic block, I'm currently fine.
This might change when I look at ISO standard Modula-2, which introduced exception handling, because then there's an additional out-arc of a basic block (you can jump to the exception handler), and so this can change.
But currently, for the fourth edition of Programming in Modula-2, it's really not required. So, from my point of view, my recommendation is: be careful and check if you really need to do this complicated stuff, because what does LLVM do?
I generate the intermediate representation and give it to LLVM. One of the first passes of LLVM constructs the control flow graph from this representation, because all the optimizations work on this data structure. And so it might also be worthwhile, if you think
you have to add something like this, to instead add a new optimization pass to LLVM to do this stuff. That's also possible. For example, in LDC we do this: when we create scopes, we have to allocate memory,
and we try to deallocate this memory or put it on the stack, and there's an extra LLVM pass to do this. It could also be done at a higher level, but it can be done within LLVM; just use the framework you have.
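For reference, a skeleton of that route under LLVM's new pass manager; a generic sketch, not LDC's actual pass:

#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

// Skeleton of a custom function pass. The body is where a frontend-
// specific cleanup, such as promoting allocations to the stack, would
// inspect and rewrite the IR.
struct MyCleanupPass : llvm::PassInfoMixin<MyCleanupPass> {
  llvm::PreservedAnalyses run(llvm::Function &F,
                              llvm::FunctionAnalysisManager &FAM) {
    bool Changed = false;
    // ... walk F's basic blocks and rewrite instructions here ...
    return Changed ? llvm::PreservedAnalyses::none()
                   : llvm::PreservedAnalyses::all();
  }
};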
Yeah, that's it for me. So, thanks for listening, and when there are questions, feel free to ask. [Audience question:] Will there be Modula-2 LLVM bindings, so that you can make your compiler self-hosting?
So, the question is if there will be some Modula-2 bindings for LLVM, and I say no, because that's a lot of work. What I currently do is, obviously, write it in C++.
I haven't said it, but the source code is C++, and that's really the easiest way, in my opinion, to work with LLVM. I had a look at the C interface, and if I were to do a Modula-2 binding, I obviously would use the C interface,
but it's not that complete and not that easy to use. So, I really prefer the C++ interface. So, no, there will not be bindings for LLVM. [Audience question:] So, is that why you didn't use D? I don't use D in this case
because I just wanted to have very few dependencies. My, let's say, internal goal is to have only LLVM as a dependency. That's also the reason I want to get rid of ANTLR. The tool is very cool,
and I have used it for several other projects, for example, writing an SQL parser, and I can only recommend this tool, but my goal is to reduce dependencies, and D would be another dependency: a new compiler, a new library.
And therefore, I currently stick to C++; I also want to train my C++ coding a bit. [Audience question:] Do you have an opinion on whether to create allocas early on, or have mem2reg or something generate those in LLVM?
Because that has gone back and forth. So, the question is whether I have an opinion on where to generate the alloca instructions in LLVM. I currently do it first, but I don't have a real opinion on it.
My main concern about the generation is that I do not want to generate useless instructions. The allocas I have to create somewhere; you can debate where to do it, but you have to do it.
But the empty basic blocks, that's the thing I have a very strong opinion about: I don't want to have them. So, this was my main focus. The other thing is debatable, and I really have no opinion about it.
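One data point on that question: the LLVM Kaleidoscope tutorial's convention is to place all allocas in the function's entry block, because the mem2reg pass only considers allocas in the entry block for promotion to SSA registers. A sketch of the usual helper:

#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"

// Create an alloca in the entry block of the function, regardless of
// where the builder currently is, so mem2reg can promote it later.
llvm::AllocaInst *createEntryAlloca(llvm::Function *Fn, llvm::Type *Ty,
                                    llvm::StringRef Name) {
  llvm::IRBuilder<> TmpB(&Fn->getEntryBlock(),
                         Fn->getEntryBlock().begin());
  return TmpB.CreateAlloca(Ty, /*ArraySize=*/nullptr, Name);
}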
Yeah, anything else? [Audience question:] Was there anything from the Modula-2 language that didn't map well onto LLVM IR? I haven't found one yet. For me, the most difficult thing was to understand the scoping,
and yes, where the scopes are. Because they really thought about scaling: if you create a language, you have to think about how to scale it for large programs, and they had these same thoughts,
and they created inner modules, which are basically like a namespace, but with a module name. I had some problems with the scope resolution, but that had nothing to do with LLVM. What I have not yet tried,
and I don't know how it fits, because I really haven't looked at it: Modula-2 also defines a module for coroutines, and there are some new instructions for coroutines in LLVM, but I haven't looked at this. I'm still building the compiler core, but it would be very interesting to see how this fits.
Yes. [Audience question about undefined behavior.] The question is if there is some undefined behavior in Modula-2
which can be exploited for optimizations. I'm not really sure. There are areas which are, in my opinion, not clearly defined. For example, what happens if you have nested LOOPs with WHILE loops and then an EXIT in them? What is really the scope resolution there?
I'm not sure if this can be exploited. There are also hints: if you have a FOR loop, like in C++, you have a variable for this loop, and the report states that after the loop the value of the loop variable is undefined,
but clearly it has the last value. So I'm thinking it's worth generating a warning about this, because it says: okay, you're now exploiting something which is undefined. But again, I'm not sure if it's really that exploitable.
So, the language itself is very cleanly defined. I really appreciate this. [Audience question.] Okay, could you please repeat?
Ah, okay. So, other candidates I have considered: I had looked at some subset of C. That is also from my experience; once upon a time, there was a compiler called Small-C
which had a very restricted set of C constructs, and I also looked at this, but yeah, it's also an ancient language. I also thought about having some subset of C++, but I ruled this out.
So, C and C++ I ruled out because of the preprocessor. I wanted to have something where I have one compiler program and don't have to think about an additional language layer. So, with this, I looked at Small-C, but I ruled it out because of the preprocessor, and that's basically all I have looked at,
because then I thought about Modula-2 and found that it's very nice: it's a simple language, but still useful. So, that was the main aspect for me. [Audience question:] How are you going to get rid of ANTLR?
The way it is done with Clang. The syntax of Modula-2 is basically LL(1), I think with one exception, and therefore I will create a recursive descent
parser by hand. That's not very fancy stuff, I agree; a compiler construction or parser construction toolkit is much better there, but it's also very easy to do, and the syntax is not as complicated as C++.
So, it will be a handwritten recursive descent parser. [Audience question:] How do you test that the outcome is what you are expecting? Yeah, yeah. So, the question is how do I test that the outcome is what I expect.
Currently, there are no tests, but what I intend to do is use the LLVM tooling; there's the lit testing tool, and I will try to use this.
And obviously, I will try to create some kind of unit test framework in Modula-2 to run things. But yes, it's a shame: there are currently no tests.
[Audience question about coroutines.] Coroutines, I haven't looked at; as I said before, I haven't looked at the coroutines module yet. I'm still looking at the core of the language, but coroutines are very interesting,
because there are some additional LLVM instructions for coroutines, and for sure, when I come to this point, I would like to exploit these instructions. But currently, I'm just looking at the core. [Host:] Thank you very much. We're running out of time for this session. Thank you, Kai.
Thank you. Thank you.