
Generating a Code Generator


Formal Metadata

Title: Generating a Code Generator
Alternative Title: Pattern Based Code Generation for GPUs
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Automatic, pattern-based code generation for Mesa's compiler infrastructure has been a long-standing dream. Nearly a decade ago experiments were conducted using systems like BURS and lburg. Each of these attempts encountered various insurmountable roadblocks. In the intervening years, both software and GPU architectures have changed significantly. These changes have enabled a code-generator generator to become a reality. The design and implementation of one system will be presented. In addition to the successes, various difficulties and rough edges will be detailed.

Transcript (English, auto-generated)
So, in the autumn of 2008, I started implementing Mesa's GLSL compiler completely from scratch. We had a GLSL compiler, but for a whole bunch of reasons that I think I've talked
about in other talks at other conferences, it had to go. I'd never written a compiler before; I'd taken a compilers class at university, but that was it. So I did a couple of months of research, and I started on the project armed with two books. One was Steven Muchnick's seminal work, Advanced Compiler Design and Implementation.
Even when I started this project, that book was already around 10 years old, and it was starting to feel a little bit dated. It had a very small section about SSA, presented as this new thing that people were starting to play around with. But everything else in it was really detailed explanations of battle-tested algorithms,
and all of the algorithms and data structures in it were related back to actual production compilers that people were shipping, and the problems that people had with them, and things that they had to do to actually make stuff work.
So it was really, really solid, practical advice. And in fact, there's a JIT project started by some folks at Red Hat that's using the middle IR from that book as its basis. So people are still using this, it's still a great resource.
The other book I had in the other hand was A Retargetable C Compiler: Design and Implementation, written by a couple of university researchers and primarily about the lcc compiler. The draw of this book was lburg, a code-generator generator system, and
BURS, the bottom-up rewrite system they used to implement code generation in that compiler. So lcc has a tool that's part of it, called lburg, that basically implements a domain-specific
language for generating code generators, where you have a set of statements where you have a pattern of IR to match, a blob of something that looks a bit like assembly code to generate, and then an estimate of how much that blob of assembly costs.
This set of rules is then compiled down into some data and code that gets built into the compiler and used by BURS, the bottom-up rewrite system, to actually implement
the code generator in the compiler. So BURS uses two passes over the IR trees. It does a bottom-up pass, where it looks at progressively larger and larger trees to come up with the least-cost way to implement each little bit,
and then kind of grow that up, and it decorates the tree as it goes up, and then it does a second pass going top-down to actually emit the code. So this has two orthogonal benefits.
One is that open-coded pattern matching code is really, really tedious to write and is super error-prone. So for example, on later Intel GPUs, we have a flag on the sources of logical operations that says, take the bitwise inverse of this source, so you can combine
NOTs with your sources. So we can use that: if you see an IR pattern of NOT of (A AND B), you can just invert A and B and emit an OR instruction instead. So instead of emitting two instructions, you can emit one.
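To make the identity behind that trick concrete, here is a quick check in plain Python; this is just illustrating the bitwise algebra, not Mesa code:

```python
# De Morgan: NOT(a AND b) == (NOT a) OR (NOT b).
# The per-source "invert" flag on the GPU makes the right-hand side a single
# OR instruction, which is exactly what the optimization exploits.
MASK = 0xF  # pretend we have 4-bit registers
for a in range(16):
    for b in range(16):
        assert (~(a & b)) & MASK == ((~a) | (~b)) & MASK
print("De Morgan holds for all 4-bit values")
```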
To do that hand-coded in the compiler, you end up with this lovely eye chart. In order to fit the code that actually implements this conceptually simple thing on a slide, I had to use a six-point font, gut out all the comments, and turn it into this
disaster. Without the comments, if you saw this code in the compiler, it's not even really obvious what it's doing. Because of that, we have very, very few clever bits of code generation like this in
our compiler. I think when I implemented this, it helped a few thousand shaders in our shader database by hundreds of instructions, a percent or two in them. So when you have things that are hard to implement and hard to maintain, they have
to have a lot of impact, or there's no return on the investment. So we don't have a lot of that, but we could. So the other thing that attracted us to having some sort of automatic code generation is being able to have the decorations of how much you think a particular instruction
sequence is going to cost, so that you can try to do more, take a little bit more of a global view and do an optimal code output.
So this is definitely orthogonal to the first benefit, and some other work has been done on this that doesn't use a code-generator generator. Eric Anholt has, I think, one of the oldest still-open MRs in Mesa's GitLab since we switched
over to GitLab, for a thing called NOLTIS in the driver he was working on that tries to apply some of this. So it's orthogonal, but the two things play together really well, because if you're going to do this sort of global optimization, it helps to have a lot of different patterns
so that it can try to find different routes through emitting a large expression tree. If you have a bunch of different ways of implementing that, it gives the optimizer more freedom to pick better things.
So once we started digging into this years ago, the dream quickly turned into a nightmare. There were so many both practical and technical problems that we ran into.
The first big problem was we'd have to implement our own domain-specific language to do the code generator generator. In order to implement our compiler, we had to implement another compiler. And that was a huge problem in Mesa, because until very recently, we didn't have a way
to be able to build something during the Mesa build that would run on your build system to generate other build dependencies, just because of the way, the disaster that was Mesa's build system. So it took almost 12 years before we could have even done this.
And that ended up being a problem for some other things, like how we wanted to implement the shading language built-in functions, but that's another story. So it also turns out that source modifiers are a big deal on GPUs.
We can almost for free be able to say that a source in an instruction should be negated, or have its absolute value taken, or both. But that's not really a thing on CPUs, so none of the research or any of the published
work about how to actually implement a generator generator even considers these things. And it turns out they're kind of a huge hassle. But it's super important for generating optimal code on GPUs.
Most of the time, especially on the GPUs we cared about when we started this project, things like negation are free, but not always. On i965 GPUs, for example, trig functions and some other complex math can't have
any kind of source modifiers, so there's a lot that has to be taken into consideration about whether or not you have to resolve those source modifiers. And it wasn't even universal which things were free.
On i965, absolute value was free. But on i915, which was a thing we still cared about 12 years ago, and on R200 and R300, to kind of date when this started, you had to use separate instructions for absolute value.
They didn't have an absolute value source modifier. And if we were going to have a bunch of patterns to implement all the different possible instructions, we didn't want to have to go through and decorate every single place with the combinatorial explosion of, yes, you can have absolute
value and negation on both sources of an add, and so on. It would have just made it way too verbose and would have kind of defeated the purpose. Going along with source modifiers, there's also the possibility of destination
modifiers, which is definitely a thing that doesn't exist on general purpose CPUs. You can have a saturate modifier that will clamp the result of an operation to the range 0 and 1. And on some NVIDIA GPUs, there's a different saturate modifier that clamps to plus or minus 1.
And we've kind of come a long way in our compiler design in the compiler stack in Mesa since then. And it's kind of dubious whether or not having in the IR the destination modifier as a special thing is actually that helpful.
And I think probably in the next year we're going to end up removing that. But it's still a thing that has to be considered, because being able to get that... Clamping happens a lot, and being able to get it for free on instructions is a big deal.
So then on the GPU architectures of the day, they were predominantly vector-based for at least part of their pipeline. So swizzling was hugely important. And that's not a thing that even exists on SIMD architectures.
SIMD architectures are just generally parallel: they do the same thing on all the channels, in order. There's no reordering of channels except with some very special instructions dedicated just to reordering. Swizzles end up being worse than source modifiers, because with source modifiers there are only
four combinations to worry about: none, negate, abs, and negative abs. But with swizzles on a vec4 architecture, where four-element vectors are your base type, there are either 256 or 1296 possible combinations of swizzles.
Some architectures can not only reorder X, Y, Z, W arbitrarily per component, but can also select the constant values 0 and 1 for a component.
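Those counts are just simple combinatorics: each of the four output components independently picks one of four sources, or one of six once the constants 0 and 1 can be selected. A quick check:

```python
# Each of the 4 output components independently picks a source, so the count
# is (number of selectable sources) ** 4.
plain_swizzles = 4 ** 4   # x, y, z, w
with_constants = 6 ** 4   # x, y, z, w, constant 0, constant 1
print(plain_swizzles, with_constants)  # 256 1296
```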
And that doesn't even consider having to deal with write masks. So the whole thing was just so far out into doing basically compiler research to figure out how to do these things, that it was just completely impractical. But that dream of being able to write these concise descriptions of, when you see
this, generate this code, was so compelling that we tried several times. Both myself and Eric Anholt, each of us tried at least twice to come up with something that would do this, but every time ran into a roadblock and just gave up because we had
to ship products and not do compiler research. As much fun as that might be. So they say time heals all wounds, and I don't know if that's true, but I do know that if
you ignore technical problems long enough, they will probably become irrelevant. So what wounds did 10 years heal? Well, first of all, we don't have to write a DSL anymore. Pretty much everything that lburg or any of the other code-generator generators that
people had ever come up with, pretty much anything that those languages needed, we can handle with just Python data. We can make data structures that encapsulate the match this, do that.
We can do that really easily with just Python data. And in fact, nir_opt_algebraic, which is already part of Mesa, is an existence proof that that kind of architecture works really, really well. There are a couple of things that nir_opt_algebraic doesn't support that might be nice in a generator
generator. It doesn't support any kind of regular expressions or more complex pattern matching. You just do a static match this exact pattern and do this exact thing.
And it's very limited in what you can put in the 'do that' part. It's also a bit lacking in being able to put constraints on the sources. Over the years, we've added a lot of things to how it can do constraint matching on parts
of the patterns, but it can't do something like, if you have a match expression that has, say, two constants in it, and you want to say, match this pattern if this constant is less than the other constant. There's no way to do something like that, and there's a bunch of cases where that
would be handy. And then also, vector architectures are dead. That whole swizzling, write-masking, all that nonsense is just irrelevant because all those architectures died off. The last Intel architecture where we had to use a vector-oriented thing was Haswell,
which was years ago at this point. Broadwell supported it but didn't require the use of it, and we never used that in the driver that we shipped.
And there are still a few architectures that have some kind of more traditional SIMD or SIMD within a register kinds of things, like there's a handful of 4x8 instructions on VC4 that do operations on four parallel bytes within a single 32-bit register.
But because those look like traditional SIMD, they don't have the craziness of the swizzling and write masking and all that nonsense. So we don't have to worry about those either. They're not a problem. So looking at nir_opt_algebraic as a DSL: the first field is a simple pattern to match,
and those can be arbitrarily nested trees. The next field is a replacement, which is also a tree.
Each of them has variables that are just text; we have some convenience names for them, a, b, c, d, et cetera, so you can rewrite the expressions. And then there's sort of a global condition flag that's optional. So in this case, this will do lowering.
The IR has a subtract instruction, but most GPUs don't have a subtract instruction; you just do an add of a negation. So this does that sort of lowering pass. The 'do that' part of the nir_opt_algebraic DSL is always: replace the old tree with this new tree.
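For anyone who hasn't read nir_opt_algebraic.py, the rule for that lowering looks roughly like the sketch below; the exact opcode spellings and the condition flag name are from memory, so treat them as approximate rather than a quote from the file:

```python
# A rule is (search-tree, replacement-tree, optional condition string).
# Pattern variables are just strings.
a, b = 'a', 'b'

optimizations = [
    # fsub(a, b) -> fadd(a, fneg(b)), guarded by a backend option
    # (the flag name here is illustrative).
    (('fsub', a, b), ('fadd', a, ('fneg', b)), 'options->lower_sub'),
]
```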
And 'replacement' is a little bit of a misnomer there, because nothing gets removed from the input. What happens is it just adds some more stuff.
You have a new result value, and all the users of the old result get modified to use the new result. And then we expect that later optimization passes like dead code elimination will clear out any of the bits that are no longer needed.
They'll do the actual deleting part. The other subtle point here is that parts of the input might not match any patterns that exist in your giant pattern file, which is actually a good thing. And eventually, as you go through the optimization passes, you'll get to a point where nothing
matches. Otherwise you'd get stuck in an infinite optimization loop, which is a bug that happens occasionally if you have a series of patterns that sort of fight with each other. So if you had this pattern and had another pattern that converted add of neg to subtract,
and they weren't decorated with this condition, they would just keep undoing each other. And so each pass through, they'd both activate and say, yes, their optimizations happen, so keep going through the loop, and you just get stuck in an infinite loop. It is a bug that happens from time to time.
So especially after hearing me talk up nir_opt_algebraic, it's probably not hard to guess where this is going. We can use this existing infrastructure to implement a new match-this, do-that kind
of DSL. But there are still some sticking points. The algebraic optimizations happen while the IR is still in a very pure SSA form, but when we do code generation we've come out of SSA form, and so there's a lot
of assumptions made by the opt_algebraic framework that don't hold anymore. For example, opt_algebraic can't handle source modifiers in the matching.
If it encounters a source modifier at any point while it's trying to match a pattern against the actual IR, it just says, I don't know, and gives up. What seems to be sufficient for code generation purposes is, if you encounter a value
and that same value but with some source modifiers, to just treat them as completely unrelated variables. It's not a value and its negation, it's a value and a different value, and you just match
the patterns as though they had nothing to do with each other. It also can't handle the destination saturate modifier. So far, I think it is sufficient to only be able to have the destination saturate
modifier at the base of the match tree. So with the subtract, well, I guess that isn't a very good example because there isn't much of a tree there. Only the root of the tree can have a saturate modifier, and if you encounter one during the match farther down in the tree, you say, okay, it doesn't match.
So far, I haven't found any cases during code generation where it would actually be beneficial to detect saturate modifiers at other points in the tree. For the couple of cases I came across where maybe something clever could be done, it turned
out it was actually better to just implement an algebraic optimization to change what the tree was going to look like before the IR gets into the state where those are modifiers, back when saturate is still just an instruction, and do the optimizations earlier
on; then later on you don't see that stuff during code generation, and it ends up being much simpler. That may change as we come up with more clever code generation patterns, and if it gets to that, we may have to just stop using the destination modifiers, which we
may end up doing anyway. It's still an open question. So, opt_algebraic can't handle non-SSA destinations.
So after we've come out of SSA form, if the shader has control flow, there are going to be places where writes happen to registers instead of to SSA values. And that's fine. It's completely safe to match the final thing as writing to a non-SSA value.
The patch I referenced there just deletes an assertion out of the matching code, is all. So that's pretty easy. But going along with that, it also can't handle sources that are non-SSA values, and
we have to be able to match those because we have to be able to generate code from them. So far, I think that it is safe to match a non-SSA value in an expression tree if that non-SSA value only appears once.
The problem is that SSA values, by definition, don't change. With non-SSA values, if you see two reads of them that are separated at all, those reads could
be reading different versions of that value. So they might be different values. So if you had the same thing appearing multiple times in an expression tree, each of those might be different. And so the tree doesn't really match, and you might be generating bad code.
So, for example, if you had a pattern that matched A plus A and replaced it with A times 2, well, if those two As were non-SSA values, they might be different values. So replacing that with one of them multiplied by 2 is going to give you an incorrect result.
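The hazard is easy to show with a toy register machine; this is only an illustration of the reasoning above, not anything from Mesa:

```python
# Toy non-SSA register file: r0 is redefined between the two reads that make
# up the "a + a" tree, so rewriting it as "a * 2" computes the wrong thing.
regs = {}

regs['r0'] = 3            # r0 := 3
first_read = regs['r0']   # the add's first source is read here...
regs['r0'] = 5            # ...but r0 is redefined before the second read
second_read = regs['r0']

correct = first_read + second_read   # 8: what the program actually computes
rewritten = first_read * 2           # 6: what the a + a -> a * 2 rewrite gives
assert correct != rewritten
```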
So I think that it's safe, and I have some results later on that help convince me that it's safe to, if a non-SSA value appears only once, that it's safe to just match
it and go on with life. The need for being able to have non-SSA sources in the trees poses another problem for the existing matching engine. But I'll get to that again in a couple of slides.
So, the existing infrastructure has really, really limited type tracking. The values themselves in NIR don't have any types; they're just a blob of bits. They only have a type when they're used by an instruction, and the instruction dictates
what the type is. So we have separate floating point add and integer add instructions. So if you have the same values as sources to both those instructions, in one of those cases that value is interpreted as floating point, and in the other case it's interpreted
as integer, even though it's the exact same sequence of bits. So, where this becomes a problem is, for example, on Intel GPUs we don't have a specific
instruction to convert integer to float. What we have is a move instruction, and you mark on the move instruction that its source is integer and the destination is float. So when you get into the do that part of the patterns, you need to know: I'm just emitting
a move instruction, but for this source I need to know what type it was being interpreted as in the pattern that got matched. So there's some additional tracking that needs to happen so that the values that are used in the do that part can get back connected to where they appeared in the original matching
tree. So like I mentioned before, opt_algebraic doesn't require that everything in the input matches something in the patterns, but if you're generating code, everything that exists
in the IR had better end up as some kind of instructions in the output, or you're just like dropping stuff on the floor. So it isn't a problem per se, but it's something that when you're writing the code
generator, the machine description for the code generator, and when you're implementing the other infrastructure for the code generator, that you have to be aware of so that you can detect, I sent some IR into the matching engine and it didn't match anything. You need to generate some kind of an error or do something because you're about to do
something bad. And then the other thing that's kind of a quirk, it's not really a problem but is again something that you have to be aware of is in opt algebraic the input language
and the output language are the same. But the whole point of doing code generation is you're translating from one thing to a different thing. So what this means in a practical sense is about half of the infrastructure that exists
in opt algebraic is completely unusable for a code generator. The only part that is actually reusable is the matching part. We found a match, now let's do something isn't any good to you because we're doing something different. And it turns out that about 45% of the work of this project was completely rewriting
all of the infrastructure for doing the do that part of the process. So the other difficulty with non-SSA values that I alluded to before has to do with
the optimized matching engine in opt_algebraic. The algebraic optimization happens a huge number of times during the main optimization
loop, and we found that a big chunk of the time was spent just doing all of the matching. So Connor Abbott wrote a really, really optimized matching engine that uses a modified finite
state machine to very quickly do the matching. But the way that it implements this is it uses the SSA value numbers, the index on each SSA value, as the main key for the entire engine.
So if you have to match things that don't have SSA value indices, you're stuck. The whole infrastructure just completely falls apart. So we can't use any of the super optimized matching engine. That's okay because while algebraic optimizations might happen hundreds of times while compiling
a shader, code generation happens once. So if we have to use a slower path for that, it's a thing that happens once, I don't expect that it's going to be a big deal. Okay, so the output language.
There were a few goals and requirements as I was implementing this. The first was I really wanted the thing that a human authors to look as much like
the actual GPU assembly as possible. That's being implemented as some kind of Python object, but I wanted to make it so that when you looked at it, it looked like you were looking at GPU assembly because the people who are going to author these things and try to debug them and maintain them are going
to be familiar with that. That's the output you see when you're debugging shader compilation issues. It's the format that's in the hardware documentation. It's what we get when we look at output from shader-db runs. That's the language we work with. Let's make this other thing look like that, just to reduce the cognitive burden of having
to understand these patterns. This can be challenging because a lot of the instructions and the way that register accesses are encoded and a bunch of other things about the GPU architectures are weird
and complicated. Getting those to look civilized as a bunch of Python objects is tricky. The other thing that's nice is you really, really want to be able to represent as much
of the weird, quirky bits of the architecture as possible. You don't necessarily have to get every weird quirk, but it turns out that those weird corners of how a particular GPU works, those are the cases where you're most likely to find opportunities
to generate smarter code. The way that you can access sub-parts of registers or mix types and instructions or do other kinds of weird things, that's where you can generate clever stuff.
So it's beneficial if you can just represent that. As I'd said before, in Python you have to be able to also represent these things as a list or tree or some kind of collection of just Python objects. The whole point of the exercise of not implementing our own DSL is we can just reuse Python.
And then in C, this is going to somehow generate out C code. We want it to be represented as either data or code or some combination thereof.
And I think, ideally, we can generate a set of compact data tables to represent the transformations and the patterns. When the code generator is hand coded, like the existing one, a single generator
covers all the different GPU generations, and that's very practical in that case. But when, instead of a hand-crafted code generator, you have a machine description
file essentially that describes translation of patterns from the IR to a set of assembly instructions, it's more natural to have a bunch of separate files that just cover a single architecture. And so kind of the thing that naturally flows from that is to compile each of those machine
descriptions into its own blob of data. So instead of having one big chunk of C code, you're going to have a bunch of pieces of data. So having those five or six or seven giant piles of data be as compact as possible is beneficial.
So here's how the output language looks as a list of Python objects. This implements the fsign instruction for 64-bit floats.
And I think that this covers... I picked this one because it hits almost everything that the output language can support. So the first thing it has is it declares a temporary register because in this sequence of instructions it's going to generate a temporary value.
And then it has an instruction that's going to load that temporary register with a double precision float immediate zero value. It does a comparison that sets the not zero conditional modifier flag.
Does another move. It does a bitwise AND instruction that only uses the upper 32 bits of the 64-bit destination and the upper 32 bits of the 64-bit source. And then it has an OR instruction that is predicated so it only executes if the compare
instruction actually set the not zero flag. So to pretty much anyone who's familiar with the assembly language of this GPU, what this is doing looks pretty obvious.
If you don't know this GPU assembly language, it's just a pile of gook, but you're probably also not authoring or maintaining one of these machine description files. But it covers pretty much everything and it does it in a compact, concise way. And it's just a bunch of Python objects.
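As a rough sketch, with invented class and method names rather than the real ones from the MR, and with the masking constants filled in from the standard fsign recipe rather than quoted from the slide, a description like that looks something like this:

```python
# Hypothetical sketch only: Instruction, imm, and friends are invented names,
# not the actual Python objects from the Mesa MR.  The point is that a whole
# pattern is just an ordinary list of Python objects that reads like assembly.
class Instruction:
    def __init__(self, opcode, dst, *srcs):
        self.opcode, self.dst, self.srcs = opcode, dst, list(srcs)
        self.pred = None

    def predicate(self, flag):
        # Mark the instruction as predicated on a flag set earlier.
        self.pred = flag
        return self

def imm(value, bits=64):
    return ('imm', value, bits)

tmp = ('tmp', 64)  # declare a 64-bit temporary register

fsign_f64 = [
    Instruction('mov', tmp, imm(0.0)),                     # tmp := 0.0 (double)
    Instruction('cmp.nz', None, 'src0', tmp),              # compare, set not-zero flag
    Instruction('mov', 'dst', imm(0.0)),
    # AND that only touches the upper 32 bits: keep the sign bit of the source.
    Instruction('and', ('dst', 'hi32'), ('src0', 'hi32'), imm(0x80000000, bits=32)),
    # OR in the exponent bits of 1.0, but only if the compare saw a non-zero input.
    Instruction('or', ('dst', 'hi32'), ('dst', 'hi32'),
                imm(0x3FF00000, bits=32)).predicate('nz'),
]
```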
The instruction generates an object, and then the predicate method modifies that object. It does things in a pretty clean way, I think. So what I ended up doing with this is, when you run the Python scripts
to convert the machine description into C, they generate a couple of tables of byte code. There are two separate tables.
One is a table of immediate values, because those could be up to 64 bits for each immediate value, and the other is a sequence of instructions, where each instruction is much smaller. I don't remember if I managed to get it packed into 16 bits or 32 bits.
And so the byte code, each instruction in the byte code does a very simple thing like declares an opcode, and then there's a separate instruction that declares the destination with all of the flags and modifiers that can be on a destination, and another one for each of the sources that also has all of the flags.
And the C code that then converts that byte code into actual machine code ended up only being 203 lines, including the license header. So maybe 150 lines of actual C code. So it's very small and very concise.
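As a rough sketch of that two-table idea, with an invented byte-code layout rather than the real generator's encoding: wide immediates are stored once in a side table, and each instruction word only carries an opcode tag plus small indices.

```python
# Hypothetical encoding sketch, not the layout used by the real generator:
# 64-bit immediates go in their own table, so each instruction word only
# needs an opcode tag and a few small indices and stays compact.
def encode(seq):
    immediates, words = [], []
    for opcode, dst, src, imm in seq:
        imm_index = 0xFF                      # sentinel: no immediate
        if imm is not None:
            imm_index = len(immediates)
            immediates.append(imm)            # full 64-bit value, out of line
        # Pack opcode, destination, source, and immediate index into 32 bits.
        words.append((opcode << 24) | (dst << 16) | (src << 8) | imm_index)
    return immediates, words

imms, code = encode([
    (0x01, 2, 0, 0x3FF0000000000000),  # mov r2, #1.0 (raw 64-bit pattern)
    (0x02, 3, 2, None),                # an instruction with no immediate
])
print([hex(w) for w in code], [hex(i) for i in imms])
```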
I ended up creating six different machine description files to cover all the various generations of Intel's GPU architectures, and the files are organized so that it's
really easy to just do a diff of one generation's machine description and the next generation's machine description, and you get a very compact output of, oh, here's all the stuff that changed from Gen 7 to Gen 8. It also makes it really easy when you need to add the next generation.
You just copy the old one and go to town making edits on it to add new features or remove things that use features that got removed or instructions that don't exist or whatever. Once I got the first machine description written, cranking out all the rest of them
was pretty easy. The first one is always the most work to make. So remember back a whole bunch of slides, there was the eye chart?
So we can now look at what it looks like to do all of that same kind of optimization but using the machine description. It's still a lot of stuff, but nowhere near as much. I fit it on a slide in a 10-point font instead of a six-point, and I didn't have
to delete any of the comments. But I want to just look at one of the sub-patterns here. So this says if we have a NOT of a logical instruction and sources that don't have any source modifiers, we're going to emit a single instruction that's the other logical
instruction. So if we had an AND, we're going to emit an OR with all the same sources, but we're going to set the negate flag. On this particular architecture, on a logical instruction, if you have a negate source modifier, it instead does a logical NOT.
And then all the rest of this is just to handle one sort of annoying quirk that came out of this, which is that all of the patterns have to match specific bit sizes. So there's a lot of redundant patterns that ended up existing in the machine descriptions.
So in this case, I have to generate separate patterns for 8, 16, 32, and 64 bits, because I had to know what types to decorate on the sources.
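Since the descriptions are plain Python, one possible mitigation, sketched here with a made-up helper rather than anything from the MR, is to generate the per-bit-size variants in a loop instead of writing them out four times:

```python
# Hypothetical sketch: stamp out the NOT-of-logical pattern once per bit size.
# make_pattern() and the dict on the emit side are invented for illustration.
def make_pattern(op, inverse_op, bits):
    # Match inot(op(a, b)) at this bit size; emit the inverse logical op with
    # the negate (logical NOT) source modifier set on both sources.
    return ((f'inot@{bits}', (op, 'a', 'b')),
            {'emit': inverse_op, 'negate_sources': True, 'bits': bits})

patterns = [
    make_pattern(op, inverse_op, bits)
    for op, inverse_op in (('iand', 'ior'), ('ior', 'iand'))
    for bits in (8, 16, 32, 64)
]
```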
So it was a little bit of a hassle, and so that's why this ends up being a little bit bigger than it should have to be. There may be a solution to that, I don't know. It didn't end up affecting that many instructions that I wanted to generate, so it wasn't too
bad, but there's always potentially room for improvement. So we're talking about replacing a pretty major subsystem in the compiler, and whenever you're replacing a big subsystem with something new, there's one important goal that you really
should have in mind that people don't talk about a lot, and that's minimizing disruption. If you're going to replace some existing code that works with a new, shiny, better
version that's going to make some things faster, you'd better make sure that it doesn't substantially hurt other things, right? And if some things are hurt, you have to at least make sure that it's minimal damage and that the pros dramatically outweigh the cons.
So when I ran just the straight machine descriptions as I authored them without adding any new optimizations across our shader database, which has about 50,000 shaders from real applications,
there was no change in the generated code. So that felt pretty cool. The impact on the Mesa source was about 1,400 lines of deletions and 4,800 insertions, and a huge chunk of that was just the machine description files, because
there were six of them, and each one averaged about 500 lines. So there's some debate about whether or not we should extract common bits from the machine
just gets included. I think it's better keeping them standalone, where each one is entirely independent, because as the architectures change, less and less is truly shared across a large chunk of the
architectures, so it kind of has diminishing returns as time goes on. But I don't know, it's still a debate to be had. And then towards the end of the project, I added a couple really simple optimizations
that would have been a hassle to add in the hand-coded code generator, and those helped a few hundred shaders by a couple percent. But at this point, almost no work has been done on finding patterns, new optimization
patterns and clever tricks. But I have a big list of things that I want to try, and I know that just from having looked at lots of shader assembly over the years, I know there's a lot of stuff that we can do. So, future work—well, wait, no, finishing current work—there's still a couple of
the patches that are marked work in progress. The big one is, like I had mentioned before, about deciding what we want to do about the machine description files, keeping them standalone or trying to extract out common parts.
And there's a couple of patches in the MR that show what some of that would look like if we pulled out the common parts. The other thing is, there's still a couple of the patches that I think could really
benefit from some unit tests, in particular the one that enables non-SSA sources. I think that that's safe, but I would like to spend some time trying to come up with some counterexamples and try to encode those as unit tests.
I think I'm running out of time, so I'm going to try to go through the actual future work kind of quickly. One thing—some instructions can't have source modifiers, some instructions can't
have immediate values as sources, and there's a handful of other limitations like that that are currently handled by adding a bunch of extra patterns and putting in a bunch of complex stuff. So for example, on the Intel GPUs where the negation source modifier on a logical instruction
means take the logical NOT, we have to—if you have a source that has the negation source modifier on it, we have to emit an extra move instruction to sort of resolve that negation. I think that it would be better to encode that a different way by having a set of patterns
that say if you have an immediate value, you can convert that into something else by having a move, or if you have source modifiers, you can eliminate those source modifiers by adding a move, and then simply decorating some of those patterns as can't be an immediate
or can't have a source modifier, and then having the pattern matching engine first match the resolve kind of pattern and then match the other pattern, I think that would end up making things a lot cleaner and would help support some of the other future work
that I think needs to be done. So integrate code generation with CSE, so as you're emitting instructions, do the common sub-expression elimination at the same time so you don't re-emit the same instruction in each basic block.
Right now we have a separate CSE pass that will just go in and clean that up, but we could just integrate it directly with the code generation. That by itself doesn't buy you very much, but it does make it much easier to do dynamic
cost estimation, where if you have a sequence that's going to match and you know what this output pattern is, you want to be able to estimate what the cost is going to be. If you know that, well, I don't need to emit these three instructions because they already exist in this basic block, I can just reuse those values, that's going to give you a completely different cost estimate than if you had to emit all of the instructions
in the block, and it's really hard. The reason that I specifically say dynamic cost estimation, a lot of the existing research about code generator generators used static cost estimates, but GPUs are complicated.
It's really hard. If you have more than two instructions in your replacement pattern, it's hard without any context to be able to estimate how much that's actually going to cost. So the cost estimation function, it has to be dynamic or it's just lies, and at that
point, why are you doing it? And then finally we get to the great success: we have cost estimates, so then we can integrate it with something like NOLTIS or some other BURS or existing optimization
scheme that tries to pick the best sequence of instructions for a given set of expressions. So then I've got a bunch of links to all the MRs and the work at Red Hat that I mentioned.
Yeah, there's an extra MR in there that contains some prerequisite patches for the actual code-generator generator MR, and then Eric's very old NOLTIS patch series. And I think I finished within time. Yay. Are there any questions?
Yes. Is the architecture such that you could, instead of having common code, have it for an older version, and then have a newer file with only the things that have changed in that version, where the newer file overrides what's in the older version? Right.
So what I did, and there is a patch in the MR that shows this, okay, so the question was: is there a different way to architect the machine descriptions so that you basically have common stuff, and then in the machine descriptions for newer architectures,
you just override other things. So what I did is common parts that I pulled out, I would pull out a common part and I would make a separate Python list for the common stuff and I would give that list some name.
And then in each generation that was going to reuse that list, I would just concatenate it in with the other list. So I would have sort of a concatenation of lists to pull in the shared, the things that could be shared, and then stuff that couldn't be shared was just listed out in each machine description.
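In other words, something like this, with placeholder pattern names:

```python
# Sketch of the shared-list approach described above; the pattern entries are
# just placeholders.
common_patterns = [
    'pattern: fadd',
    'pattern: fmul',             # stuff shared across several generations
]

gen8_patterns = common_patterns + [
    'pattern: gen8-only trick',  # generation-specific additions
]

gen9_patterns = common_patterns + [
    'pattern: gen9-only trick',
]
```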
That seemed like the cleanest way to do it, as just Python objects, as just a list of Python objects. I don't think that there was a good way to override things that were in the list, but I'm definitely not, I'm not the Python ninja on the team, so there may be a way.
Do I have time for another? One minute left. Anyone? Yes, in the back. So the question is: if you have multiple different non-SSA sources,
and a large enough tree with multiple of those sources, there might not be a point in the program flow where both those values actually exist.
Maybe that's true. And that's why I want to, I mean that's one of the things that I want to sit down and spend more time and try to like find some actual concrete examples of, sort of counter examples.
There were a couple other, I mean the thing that bothered me about this is, you know, I implemented that bit and I had that nagging voice in the back of my head of, are you sure that that's safe? It seems like there's a bunch of holes here you might fall in. I'm like, all right, well I have this giant database of thousands of shaders
and some of them are really huge. Like there's some compute shaders in some games that end up generating like 10,000 instructions. And they all just compiled and didn't explode. So I can't find a real workload that disproves it, so it's hard to, yeah. And I'm being told that my time is up, so I'm going to stop so I don't get in trouble.