Three Years Experience with a Tree-like Shader IR


Formal Metadata

Title: Three Years Experience with a Tree-like Shader IR
Number of Parts: 199
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
Three years ago a small team at Intel took on the task of rewriting the OpenGL Shading Language compiler in Mesa. One of the most fundamental design choices in any compiler is the intermediate representation (IR) used for programs. The IR is the internal data structure used for all program transformations including optimization and code generation. At the time the compiler was designed, a number of alternatives were investigated. In the end, a tree-like IR was selected. With hindsight being 20/20, this talk will present the tree-like IR that was chosen and the issues that have been found with that IR in the interim.
Transcript (English, auto-generated)
Is the recording started? Okay. Alright. So, my name is Ian Romanick. I'm with the Open Source Technology Center at Intel. I'm one of the main developers on the GLSL compiler in Mesa, and I'm gonna talk about the experience that we've had over the last three years with a tree-like IR
for the OpenGL shading language. So, I'm gonna spend a bit of time kind of going over a little bit of the background of the current compiler architecture and how we ended up with this IR, talk about a bunch of the problems that we've discovered with the IR architecture over the last three
years, and then talk a bit about ways to evolve the existing IR into something better that doesn't have a bunch of those problems. So, we started this project in around 2010, and at that time, the GLSL compiler was kind
of a disaster. Technically, it supported GLSL 1.20, but there were a lot of real-world shaders that just wouldn't compile or generated completely awful code. It was written using a custom parser generator: the guy who wrote it, Michal Krol, basically wrote his own version of YACC to write the compiler. To put this in all fairness, he wrote this as an undergrad thesis project at his university, so the fact that he got anything working by himself that could actually generate code and compile anything was a pretty spectacular accomplishment. But as something that you want to be able to run real applications on, it just wasn't good enough, and it wasn't really an architecture that anyone else could understand or maintain or that we could make improvements on.
In addition to the custom parser generator, the IR that it used was based on a register stack, basically like the old x87/387 FPU. Yeah, there were a bunch of really weird choices in it.
We set out to rewrite it, and at that time, none of the people involved with the project, it was primarily myself and one other guy at Intel who were going to be working on this, we were graphics people and not compiler people, so we set out to find out what sort
of the state of the art was in production compilers, and we ended up looking primarily at two sources at the time. One was Advanced Compiler Design and Implementation by Steven Muchnick, which is a really, really excellent book. If you're interested in compilers, it is a really good source, but it kind of has a
disadvantage that it was published right just on the cusp of when SSA was becoming mainstream. So there's a chapter about SSA in it that he kind of prefaces with, well, there's this kind of new thing that people are trying, and we'll see if it takes off.
The other source that we looked at a lot was David Hanson's classic text, A Retargetable C Compiler, which, if you've ever heard of the LCC compiler, is basically the book about that compiler. One of the interesting things about that book is he developed a fairly sophisticated system called lburg for writing machine description languages, and so we sort of had this vision of, well, we have all these different weird GPU backends that we want to be able to support. Wouldn't it be great if we could just write a description of, when you see this kind of sequence of things, you can generate these instructions to do it, and it has this kind of cost. His system had a really sophisticated dynamic programming architecture so that it would, kind of like an AI search algorithm, try to find the least-cost instruction sequence for a particular set of IR. We kind of had these two things as what we were building from. Particularly because of the way the lburg system works, it wants to know a lot about how partial values generated in expressions are later used. Unfortunately, it also really wants to be integrated with your CSE pass and register allocation, and that's where, us not being compiler experts, we sort of missed
that bit, but I'll talk a little bit more about that later. What we ended up with is a tree-like expression IR where we have a base class of rvalues, and then all the different sorts of rvalues derive from that, the most important one being ir_expression, where we have an enum for the operation, whether it's an add or a multiply or whatever, and then the set of operands that that expression takes. We get trees like this, where you've got a multiply, an add, and then a min and a max expression. We can evaluate this tree in a fairly natural tree-visiting way and figure out that this is... actually, I should say, a mad with saturation.
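In rough C++, the shape of those classes looks something like this. This is a simplified sketch modeled on Mesa's src/glsl/ir.h; the names match the real classes, but the real ones carry much more state:

    // Simplified sketch of the GLSL IR rvalue classes; modeled on Mesa's
    // ir.h but trimmed down to the essentials.
    struct glsl_type;   // every rvalue knows its GLSL type

    enum ir_expression_operation {
       ir_binop_add, ir_binop_mul, ir_binop_min, ir_binop_max, /* ... */
    };

    class ir_rvalue {                      // base of anything producing a value
    public:
       virtual ~ir_rvalue() {}
       const glsl_type *type;
    };

    class ir_expression : public ir_rvalue {
    public:
       ir_expression_operation operation;  // add, mul, min, max, ...
       ir_rvalue *operands[4];             // the child subtrees
    };

    // The slide's tree for a saturated mad, min(max(a * b + c, 0.0), 1.0),
    // is an ir_binop_min node over an ir_binop_max node over an add over a mul.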
We can take this whole tree and figure out that it should be a single instruction. We have a few code generation bits in the compiler that work like this, converting these really common min-max trees to generate saturate modifiers, and add-of-multiply trees to generate mad. There are, I think, six different kinds of trees that we can figure out are actually a lerp instruction. There's also a bunch of algebraic optimization passes that exploit this tree-like nature,
so we can identify that if you have a pow instruction that's two to some power, we can actually generate the EX2 instruction instead. It turns out that GPUs tend to have really regular instruction sets with primarily very,
very simple instructions, and they don't have complex addressing modes. Primarily the benefit that you get out of being able to do sophisticated analysis of these trees is if you have CPUs that have crazy instructions or crazy addressing
modes, like a scaled-plus-offset addressing mode or post-increment addressing modes or things like that, so that you can identify those in the trees and automatically generate those instructions. GPUs don't have any of that. We discovered, when we were quite deep into it, that we weren't really going to be able to realize a lot of the potential of these trees. Plus, these trees can have various rotations and permutations, so if you want to be able to generate a mad instruction, you have to identify this kind of tree and that kind
of tree, and all other kinds of weird rotations, including transposing where the min and the max are to be able to generate the saturates. So being able to evaluate these trees turned out to be in practice a lot more complex
than we had anticipated or had hoped. We also had difficulties with code generation for it. This particular tree ends up being a single instruction, but there's a bit of difficulty with the code generator if you have a sequence that's going to generate multiple instructions, because then, as the code generator recurses back up through the tree, it has to keep track of where it has stored these partial values. The next layer up in the tree has to know that the multiply stored its value in register 26, so that when you generate the add instruction, it knows that its source has to come from register 26. That ended up making the code generation kind of a nightmare. It also makes doing any kind of CSE pass exceptionally difficult, because you have these crazy, crazy big trees and you have to do your CSE on sub-parts of the trees. We also have difficulty identifying these kinds of sequences if, say, in the middle there's a swizzle operation, so that the result of the multiply gets swizzled, and being able to identify that when trying to generate the mad instruction. It ended up being a lot more complex than we had anticipated. Plus, you have the difficulty that what the developer has written may be some piece
of code like this, where your multiply and your add don't naturally end up in the same tree, and the tree processor will completely miss: oh hey, if x isn't used anywhere else, I can generate the value of y using a single mad instruction. To overcome this, we had to write a really awful pass in the compiler: if you look in the Mesa source code, there's a file called opt_tree_grafting.cpp that will try to find these kinds of places and merge those trees together.
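Roughly, the grafting idea is something like this. This is a self-contained sketch; the types and helpers are simplified stand-ins, not the real pass's bookkeeping:

    // If a variable is written exactly once and read exactly once, hand back
    // the tree that defined it so the caller can splice that tree in over
    // the use.  The real pass (src/glsl/opt_tree_grafting.cpp) also has to
    // prove that nothing the defining tree reads gets clobbered between the
    // definition and the use, which is what makes it so horrible.
    struct ir_rvalue;
    struct ir_variable_info {
       int num_defs;
       int num_uses;
       ir_rvalue *single_def_rhs;   // RHS tree of the lone assignment, if any
    };

    ir_rvalue *try_graft(const ir_variable_info &var) {
       if (var.num_defs != 1 || var.num_uses != 1)
          return nullptr;            // not safe to fold the tree into its use
       return var.single_def_rhs;
    }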
So it'll take the tree that generates x here and put it in place of the dereference of x in that expression, and it's a horrible piece of code. I feel very sorry that Eric had to write that code.
I've had to apologize to him several times for it. In addition to expressions, we also have another kind of rvalue is our dereferences.
The simple of which is the simple variable dereference, and there are other more complex dereferences. There's a separate one of these subclasses each for array dereferences and for structure dereferences. If you have some complex thing like you have an array of structures
and that structure contains an array of structures that contains an array of structures, etc., etc., and you want to get at some actual data member in that, you end up with a whole tree to dereference all the way down through that array of structures,
etc., to get down to that actual piece of variable that you want. The end result is you have a huge subtree for this with a lot of pointers that you have to walk through while you're doing this recursive process just to figure out that what you actually wanted was register 48.
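As a sketch (class names follow Mesa's ir.h, fields trimmed), an access like s[i].inner[j].x turns into this chain of heap-allocated nodes:

    struct glsl_type;
    struct ir_rvalue      { const glsl_type *type; };
    struct ir_variable    { const char *name; };
    struct ir_dereference : ir_rvalue {};

    struct ir_dereference_variable : ir_dereference { ir_variable *var; };
    struct ir_dereference_array    : ir_dereference { ir_rvalue *array;
                                                      ir_rvalue *array_index; };
    struct ir_dereference_record   : ir_dereference { ir_rvalue *record;
                                                      const char *field; };

    // s[i].inner[j].x becomes, from the outside in:
    //   ir_dereference_record ".x"
    //     ir_dereference_array [j]
    //       ir_dereference_record ".inner"
    //         ir_dereference_array [i]
    //           ir_dereference_variable "s"
    // Five nodes and four pointer hops just to name one scalar.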
This makes shaders, especially complex shaders, take up huge amounts of memory in the IR. There are a couple of games available on Steam right now that, on a 32-bit machine, have difficulty compiling all of their shaders because they exhaust virtual memory. And it's not their fault, right? It's entirely our fault, because we have this extremely verbose set of data structures
for expressing these things. We have three main kinds of rvalues: expressions, these trees of variable dereferences, and constants. Dereferences have the exact same structure regardless of what is being dereferenced. Temporaries that the application has explicitly declared, which only exist during the lifetime of the shader, temporaries that are automatically generated by the compiler, uniforms, and shader inputs and outputs all end up looking the same in the compiler, even though the code generation backend may treat them differently.
These end up being plunked into the assignment instruction, which has a dereference for the left-hand side, some rvalue tree for the right-hand side, and a condition modifier on it. We put that condition in there with the intention that any assignment could be a conditional assignment.
Many GPUs have either predicated execution or have explicit condition codes on all of the assignments so that the instruction may not write its value depending on the value of the condition code.
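The assignment node itself looks roughly like this (simplified from Mesa's ir_assignment; the real class has more fields):

    struct ir_rvalue;
    struct ir_dereference;

    struct ir_assignment {
       ir_dereference *lhs;        // where the value lands
       ir_rvalue      *rhs;        // an arbitrarily deep expression tree
       ir_rvalue      *condition;  // if non-null, the write only happens
                                   // when this evaluates to true
       unsigned        write_mask; // which vector components get written
    };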
So we had this thinking that we would use this automatically as we're generating code and flattening out if statements and things like that, and be able to generate things compactly and make it easier on the code generation backend. But the problem is that the right-hand side isn't just add x and y; it's potentially a giant tree of gunk that's going to generate 46 intermediate values. Maybe all of those need to be conditional, maybe they don't. It kind of depends on the backend. So we have this thing in there that should have made things easier, but in practice it actually made things quite a bit more complicated, because now all of the optimization passes, when they're trying to do value tracking or dead code elimination or things like that, have to be aware that an assignment may or may not occur
because they're all conditional. It made things more difficult instead of making them easier. So we had all these great ideas and all these things that really seemed like a good idea at the time,
and most of it turned out to just be painful and make everyone's life much more difficult. The only thing out of all of this that panned out was that, when we created the tree structure, we created a set of helper classes that make it easy to traverse the trees and execute operations only at certain kinds of nodes, which makes writing independent optimization passes a lot easier.
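The flavor is roughly this, modeled loosely on Mesa's ir_hierarchical_visitor; the exact API differs:

    struct ir_expression;   // as sketched earlier

    // The helper base class walks the whole tree and calls into the subclass
    // at each node; a new optimization pass only has to pattern-match.
    struct ir_hierarchical_visitor {
       virtual void visit_leave(ir_expression *) {}
       virtual ~ir_hierarchical_visitor() {}
    };

    struct opt_pow_to_exp2 : ir_hierarchical_visitor {
       bool progress = false;
       void visit_leave(ir_expression *expr) override {
          // If expr is ir_binop_pow with a constant 2.0 as its first operand,
          // rewrite it in place into ir_unop_exp2 of the second operand.
          // (Matching and rewriting elided; progress is set only if it fires.)
          (void) expr;
       }
    };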
And that has worked out, but it's been kind of a double-edged sword. On the one hand, if you want to write a new optimization pass that identifies a certain kind of sequence and replaces it with a different kind of sequence, it's really easy to write those.
You can sit down and probably write two of them before I'm done with this talk. The flip side of that is we've got about 27 of those, and so now we have all these different optimization passes that operate completely independently, and most of them, once they've done some modifications on a set of instruction trees,
have to stop because now they've changed information that they later need and basically start over. So through a whole optimization sequence, we end up running through all of these trees hundreds of times.
So we're walking through these trees, through all these pointers, through these extremely verbose dereference structures. It's extremely... I'm not even going to say it's cache unfriendly. It's downright cache mean, and the performance of it is fairly awful.
So we've made this trade-off that it's easy to do this thing, so we've done the hell out of this thing, and it's ending up being fairly costly. So I think the first thing that we need to do is attack the wordiness of these IR trees
and try to get some memory back. So one of the things that we should have done in the first place is basically have two different sorts of ways that we keep track of values.
Anything that is or has an array, just pretend that we're on a CPU and give it some memory locations. Fortunately now, we actually have a lot of infrastructure for doing these kinds of things in GPU programs thanks to uniform buffer objects, where you can create a region of memory
that has a bunch of variables in it that you can directly access from the shader. So we have a bunch of this infrastructure now, and transitioning to, at least during the initial compilation phases, just putting everything that's an array into a fake UBO
and then assuming that we'll optimize it out later would be pretty easy. And then everything else, just give it a fake register out of a pool of infinite registers. So just have a 32-bit register set.
And then keep a mapping of register number to IR variables so that we can still be able to tell that this register is actually a vertex shader input if we need to generate different code for that. As a follow-on to this, we could also include swizzle information on the register usage.
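A sketch of that scheme (the names here are mine, not committed code):

    #include <cstdint>
    #include <unordered_map>

    struct ir_variable;

    // Every value gets a number from an effectively infinite register pool,
    // and each *use* carries its own swizzle instead of needing a separate
    // swizzle expression node.
    struct virt_reg {
       uint32_t reg;         // index into the infinite register pool
       uint8_t  swizzle[4];  // component selection for this use; .xyzw = 0,1,2,3
    };

    // One side-table entry per shader-visible variable -- not one node per
    // use.  Compiler-generated temporaries get no entry at all, and after
    // linking the map can shrink to just inputs, outputs, and uniforms.
    std::unordered_map<uint32_t, ir_variable *> reg_to_variable;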
So even though we would have this extra mapping to say that register 238 is actually variable foo, it would end up being a lot smaller than the current ir_dereference system for a few reasons. The biggest one is that there's one mapping per variable, instead of there being an ir_dereference for every use of the variable. In addition, there are all of the compiler-generated temporaries; through a bunch of our optimization passes we generate a pretty good number of temporaries.
None of those actually need IR variables anymore. You can just say, okay, I need a temporary. It's the next fake register. And not even have this mapping because you don't need it. That variable isn't visible to anything.
And then also, after the shader has been fully linked, you can just throw away the map. Either completely throw it away, or at least just reduce it down to only including things that are visible external to the shader, like the shader inputs or the set of uniforms. So that ends up being a much smaller set of things to keep track of.
You would also need a similar kind of mapping for arrays so that you know that a particular base address is mapped to the base of a particular variable.
In order to make this work, we'd need a couple of additional optimization passes that we don't have now, but that we really should have anyway for supporting UBOs. One is a pass that kills redundant loads and redundant stores of arrays. So if you can tell, okay, I already loaded index 6 of this array into register 85, then the next time someone accesses index 6, you just pull from that same register instead of reloading it.
And likewise with stores. We don't have stores on UBOs yet, but we will pretty soon with... I can't remember what the extension is called. There's some GL4 extension that adds that capability.
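A sketch of what the load-killing side of that pass could look like (a hypothetical shape, not existing Mesa code):

    #include <cstdint>
    #include <map>
    #include <utility>

    // Within a region where an array isn't written, remember which element
    // already lives in which virtual register: (array base, index) -> register.
    using array_elem = std::pair<uint32_t, uint32_t>;
    std::map<array_elem, uint32_t> loaded;

    // On a load of arr[idx]: reuse the earlier register if we have one.
    uint32_t visit_load(array_elem elem, uint32_t fresh_reg) {
       auto it = loaded.find(elem);
       if (it != loaded.end())
          return it->second;     // redundant load killed
       loaded[elem] = fresh_reg; // first load: remember where it landed
       return fresh_reg;
    }
    // A store to arr[idx] would update (or, for an unknown index, invalidate)
    // the entries; if every load and store of an array gets killed, the
    // array's memory mapping can be dropped entirely.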
So then, if you've managed to kill all of the redundant loads and stores of an entire array, you can kill its entire mapping to memory and just keep that array in registers. Most GPUs can't do non-constant indexing of certain kinds of arrays; quite a few of them can't do dynamic indexing of the arrays output from the vertex shader to the fragment shader, for example. We already have a lowering pass that converts that kind of dynamic indexing
to some really ugly, awful code that makes it look like a bunch of constant accesses. With those optimization passes and the existing lowering pass, we would end up with parity with what we have today.
So then, instead of the current tree of gunk, we'd end up with a bunch of UBO loads and something that looks like this for our rvalues. Because it still derives from... I think my laser died... from our rvalue base class, it still has type information, so we haven't lost that. On LP64 systems this structure has the same size as the existing ir_dereference_variable, and it's larger by 4 bytes on LP32 systems, but you don't have the giant tree of these anymore for any kind of array-of-structure-of-array access, and you also no longer need to have swizzle expressions as rvalues.
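Something like this; the field names are mine, but it's roughly what the slide showed:

    #include <cstdint>

    struct glsl_type;

    struct ir_register {          // still an rvalue, so it still carries a type
       const glsl_type *type;
       uint32_t reg;              // virtual register number
       uint8_t  swizzle[4];       // component selection baked into the use
    };
    // One node, no children: the pointer-chasing dereference trees and the
    // separate swizzle expressions both go away.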
On LP32 we've increased the size of an individual node, but we've reduced the depth of the trees, so it still is a winning trade-off. This right here would dramatically simplify code generation backends and would dramatically reduce our memory usage. It's also worth noting that after linking and dead code elimination, if you make a quick pass over the instructions and compact out holes in the used registers, you can keep track of the highest register number that's been used. Then if the code generation backend sees, oh, the highest fake register number used in this program is 48, and I have 128 registers, the backend doesn't even have to bother doing its own register allocation pass. For simple shaders, this would actually make code generation quite a bit faster, too, because you don't even have to bother with register allocation, which, at least on the i965 driver, is the single most expensive part of the code generator.
So that seems like winning. The next place that we want to go after having reduced our memory usage is we want to get to what I'm going to call flatland. Get out of this forest of trees and get to a flat-looking IR.
I think that would look something like this, where you've kind of smashed a single ir_expression into an ir_assignment. So you have the register of the left-hand side, the operation that's going on, the registers that are the right-hand side, and the write mask.
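Something like this, as a sketch with my own names:

    #include <cstdint>

    enum flat_opcode { OP_MOV, OP_ADD, OP_MUL, OP_MIN, OP_MAX, OP_MAD /* ... */ };

    struct flat_inst {
       uint32_t    dst;         // virtual register on the left-hand side
       flat_opcode op;          // exactly one operation per instruction
       uint32_t    src[3];      // virtual registers on the right-hand side
       uint8_t     write_mask;  // which components of dst get written
    };
    // A shader body becomes a flat list of these instead of a forest of trees.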
Already from this, where we're giving everything its own fake register and we've flattened things out, you can almost smell SSA coming. It's almost there already, but there are a couple of additional things that need to happen to actually make that work out. There are a couple of other rvalue-like things that need some treatment. First, calls: we have an additional instruction for function calls
that passes the list of parameters. Calls already kind of look like this, where the call has a location where it's going to assign the return value, then the function that's being called, and the list of parameters.
It would get refactored into a new kind of class hierarchy. Also, texture instructions are kind of a mess in the current IR because GPUs keep adding new texture instructions that have additional parameters, so we kind of keep growing this horrible instruction node in the IR.
I think what we want to do with that is just throw that away and make texture instructions look like function calls in the IR and kind of take a page out of the LLVM playbook
and use what they call intrinsics and just have specialized intrinsics for texture instructions instead of having them be specific opcodes.
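Sketched out, a texture fetch would be just a call to a specially named builtin. The names and shapes here are illustrative, not an existing Mesa API:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct tex_call {
       uint32_t dst;                 // register receiving the texel
       std::string intrinsic;        // e.g. "__intrinsic_texture_lod"
       std::vector<uint32_t> args;   // sampler, coordinate, lod/bias/offset...
    };

    // Adding a new texture instruction is then just a new intrinsic name;
    // passes that don't care keep treating it as an opaque call.
    tex_call make_tex_lod(uint32_t dst, uint32_t sampler,
                          uint32_t coord, uint32_t lod) {
       return { dst, "__intrinsic_texture_lod", { sampler, coord, lod } };
    }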
So if we flatten things out like this, we've lost some information. We no longer know how this value that's generated by a single instruction is later used. So if we have a multiply instruction, we only have that multiply instruction in isolation.
We have no idea when examining that that the only thing that ever uses this multiply instruction is an add instruction, and that add instruction is only used by a min, and that's only used by a max. So we don't have any way to generate the mad sat instruction anymore, at least not directly.
Also, some of the optimization passes that want to take different kinds of sequences of instructions and replace them with a different single kind of instruction become more difficult.
So a lot of those instructions we could probably immediately generate as what I'm calling IR calculation, which is, I think, a fairly terrible name, but I couldn't think of anything better to call it when I was writing these slides. But even if we generated the mad-sat when going directly from the AST to IR, almost all of the optimization passes that we currently have would create new opportunities to generate these complex instructions later.
This is especially true for the lerp instruction. We had a lerp pass that operates on the high-level IR, and then we found that, because of optimizations that happen on the lower-level IR that the i965 backend uses, even the backend was creating more opportunities to generate lerp instructions.
So we had to create a pass that operated on that IR, so we have multiple things that are trying to generate lerp instructions. So we still need to be able to recognize these things that were really sort of tree-like flows of expressions
to be able to generate more complicated things. So we could go backwards from this, back to a tree, run a tree grafting pass on the forest of trees, then use that to go back into this form to get those complex instructions.
But I felt myself getting a little bit sick just saying that, much less having to write that code. And I think if I make Eric write that code again, he'll cut me. So the way that compilers that don't have tree-like IRs traditionally do this,
and that by the way is pretty much all of them, is using UD chains, or at least a simplified form of UD chains. And I know at least one person in the audience is already thinking, but you don't need UD chains with SSA.
No, that's not entirely true. So even referring back to one of the original papers about SSA, they address the issue of UD chains and that you don't need UD chains for as many different things with SSA as you did without SSA.
So for example, you don't need UD chains for doing CSE or for doing dead code elimination or a bunch of other things. But you do still need them for doing instruction generation and a very small set of optimization passes. The advantage that you get with SSA is that you get a whole bunch
of simplifying assumptions about your UD chains, and they're a lot easier to generate and keep up to date, and you can represent them more compactly. And to maintain code generation parity with what we have today, we can get away with a really, really simplified form of UD chains, one that roughly matches what we would still need with SSA, so that we won't have to throw away any code when we make that transition, and the two can live side by side. Primarily what we need to know is: within a single basic block, when is a value used by only one other thing, and, vice versa, when does a value that's consumed have only a single reaching definition? In any other case we don't need any information at all; just, okay, don't even bother tracking it. So we can get away with a really, really simplified form that we would generate directly from the tree and be able to keep up to date in a really easy way.
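Per register, that simplified tracking could be as small as this (my shape, not committed code):

    struct flat_inst;   // as sketched earlier

    // Track only the easy, useful cases; anything more complicated is
    // recorded as "don't know" and simply not used.
    struct simple_ud_info {
       flat_inst *single_use;  // the one instruction reading this value, or
                               // nullptr if there are zero or many users
       flat_inst *single_def;  // the one reaching definition in this basic
                               // block, or nullptr if it isn't unique
    };
    // That's enough to re-discover mul -> add -> min -> max shapes in the
    // flat IR: start at the max and follow single_def links backwards while
    // each producer is single-use.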
One other thing that we can't do today, and for which we would need UD chains, is detecting cases like this, where we have two different... I believe we would only get this after going to SSA; I wouldn't want to try to do it before. Where we have two different reaching definitions of x, but if we just pushed that clamp back up into each of those basic blocks,
we could generate a dot product with saturate in both of those blocks and cut an instruction out of what we generate today, which is two dot products and then an extra, useless move with saturate. Actually, I think in the i965 backend we have a pass that tries to eliminate those moves with saturate; I don't know if other backends have a similar thing.
There's a couple other bits of dirty laundry in the current compiler architecture that I wanted to bring up. They're sort of orthogonal to the mistakes that we made with the IR.
We don't explicitly track basic blocks. We kind of implicitly have them because of the way that if instructions and some of the loop instructions are structured, but there isn't an explicit structure that's here as a basic block,
and that ought to get fixed. I think that will sort of get fixed when we do the transition to flatland. It was really difficult to keep basic blocks before; the tree structure made it a lot harder.
As soon as you tried to move things around in the trees, it would invalidate your basic blocks, and you would basically have to build them from scratch, so we just didn't bother. We also don't have an explicit IR for switch statements. At AST level, when a switch statement is encountered,
it's immediately translated, through some really horrific code, into a sequence of if statements, which most of the time is what your GPU is going to do anyway. However, if the switch value is uniform or constant, almost every GPU would be better off using a jump table, and we have absolutely no way to generate that. We haven't cared about it too much because, if you look at shader-db, there are exactly zero occurrences of switch.
Part of that, though, is an artifact of there not being a lot of games for Linux, and especially not a lot of games for Linux that aren't games that were also targeting consoles. PlayStation 3 and Xbox 360 couldn't really do switch statements,
so nobody wrote them in their shaders. Now that there's a new generation of consoles, there's a pretty good probability that we're going to start seeing some of those. The other thing that needs to die in a fire is the explicit generation of the AST.
When we first started on the compiler project, because of supporting Windows, any files that were generated as part of the build process were tracked in source control.
If you had a YACC grammar file, the C file that gets generated from it was committed to source control. It meant that every time you modified the grammar, you had to remember to commit this other file too. People were really horrible about that,
and so it was very common that the committed .c file was 12 versions behind the grammar file. I wanted there to be as few changes as possible to the grammar file.
What I did when writing that was have it explicitly create an AST and then process that AST in a C file into the actual IR. There's this explicit step there.
Now that we don't do that ridiculous thing of committing the generated C file to source control, it would make a lot more sense to do a bunch more of that work actually in the YACC file and skip that step of generating this other explicit AST
since the grammar itself generates it implicitly. We ought to be able to cut some transient memory usage and probably get a little bit more performance out of the compiler frontend that way.
We didn't care about it too much because shader compilation time wasn't showing up that much in applications that existed for Linux in 2010, but now the same applications that run out of VM on 32-bit systems
take quite a bit of time to compile everything. I think the one that hurts us the most is Dota 2. If I remember correctly, when you run it without any extra options, at startup it tries to compile 11,000 shaders.
Yeah, that would be why it runs out of VM. And yeah, it takes a while. Anything that we can do to speed up initial compilation would be a tremendous help to anyone wanting to run that game.
We already have some other folks doing work on a shader cache, so that on the second run of the game, when you basically hit the cache for the shaders, it starts up really quick. But on the first run you've still got time to go get a sandwich. By go get, I mean drive to the next city over to your favorite sandwich shop that's there
and then come back and yeah. Okay, so I believe that's everything that I have. Any questions? We have a mic.
Well, the Dota 2 thing still sounds better than what happens in Wine. You play the game, somebody jumps around the corner, tries to shoot you, and then you can go get a sandwich. Right. And that's actually a separate issue, usually.
So there's a bunch of games. The one that I know of that's the worst about this in its native client is Serious Sam 3, where they know at level load time what all their shaders are,
but they don't necessarily know how those shaders are going to get used together until they encounter things maybe much later in the level. So what they'll do is compile everything up front, and then when you get, say, to a boss and know that this shader needs to get used with that shader,
they'll actually call glLinkProgram to link them. And that's how you know that the boss is right around the corner, because the game pauses there while it's relinking the shader. And, you know, it seems like a silly architecture, but it turns out that's sort of the natural way to do things in DirectX,
where everything is separate and you construct your shaders so that the outputs of one and the inputs of the other are going to have the same layout, and then at any time you can arbitrarily say, use these two together. That functionality is available in OpenGL through separate shader objects,
and as soon as I get back to Portland from this conference, I'm going to get back to working on that. So a bunch of those apps will start to have a much better experience once that code gets there. DirectX apps also assume that compiling the shaders is cheap. So we had this one game which times the first frame and then decides how fast to animate some lips and facial expressions, and it took ten seconds to compile everything, so the game decided, yeah, I'm going to run that entire intro sequence in three frames, better animate this real quick. Excellent. And so then, of course, after that first frame that took ten seconds, it was hitting sixty and it looked ridiculous. Just a different question, if I may. I wrote a really hacky HLSL compiler as an undergrad project five years ago. Mattia was the one who cleaned it up later on, so sorry. But back in that time, the shiny object was to use LLVM for shader optimization.
What happened to that? LLVM is kind of a pain in the behind.
There are kind of two big difficulties with it. The first one is that their story on ABI is: what's an ABI? And that's a big problem if you want to be able to, say, ship an updated driver on multiple different versions of distros
that maybe have different versions of LLVM installed. So lots of closed-source drivers, especially OpenCL implementations, use LLVM, and what they do is pick some version of LLVM that they like,
import all of its code into their project and statically build it in. The LLVM license lets you do that, but I'm not going to import LLVM into Mesa. And so if we were to start using it, we'd basically be in the same situation that the Radeon drivers are in,
which is, you can't get an updated Radeon driver on the previous version of Fedora because of the LLVM mismatch. And I can't do that. And then also, doing code generation out of LLVM
if your code generator isn't in the upstream LLVM tree is kind of a nightmare. So it's a project interface issue rather than some technical limitation? Yeah. I mean, that's kind of the first deal breaker.
And that sort of architecture that they have for LLVM meets the goals of that project, right? So I totally understand why they want to do that. And it had been proposed a few times
that Mesa should have a standard interface between sort of the core part and the driver part. And we pitched a fit about that suggestion, sorry, Luc, for kind of the same reasons, right? That no, we don't want to have this ABI, because we want to have the flexibility to be able to take stuff that's garbage
and throw it out and do something that's better and not have to do like we do with the kernel and say, no, we have to keep dragging this old horrible interface around for years and years because it was there once.
We don't have much time left. Yeah, I see. I'm running out. Howdy. This is a higher-level question. You were quite certain we're going to start seeing shaders that use switch statements in a significant way. Could you... I don't know for sure that we will. ...or just name a graphical effect that would make use of switch statements? So one thing that's happening is people are looking for ways to avoid the overhead of switching shaders.
And one of the ways that people do that is with, like, uber-shaders. Right. And one of the ways that you could do that, to be able to switch in a fairly coarse way
between the different behaviors in a shader would be with a switch statement that selects on a uniform, for example. So now that the consoles can do this in a fairly credible way, it's at least possible. There's a couple other ways that people could do that.
Another one that would be similar is using... There's a piece of GL4 functionality called shader subroutines, which are roughly like function pointers in GLSL that you set as uniforms. There's not a lot of use of those either for, I think,
mostly the same reasons: the previous generation of consoles couldn't do them at all. So I think we're kind of at a transitional phase here, where the things that people were previously doing because they worked on PS3 and Xbox 360,
they're going to start giving up a bunch of those techniques because now they can do different things on PS4 and Xbox One. And so we're going to start seeing a bunch of shaders that, you know, next year we'll see shaders that look really different from the ones that we saw last year.
And trying to predict exactly what those things will look like is hard. Okay, I think it's time. Thank you.