GPU Userspace - kernel interface & Radeon kernel modesetting status

Video in TIB AV-Portal: GPU Userspace - kernel interface & Radeon kernel modesetting status

Formal Metadata

GPU Userspace - kernel interface & Radeon kernel modesetting status
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Content Metadata

Subject Area
The GPU is one of the most complex pieces of hardware in a modern computer. With kernel modesetting, more parts of the driver move from userspace to the kernel, allowing cleaner support for suspend/resume and other GPU-specific handling. The complexity of the OpenGL driver, and also of drivers for new APIs such as OpenCL, is in userspace and will more than likely stay there. This presentation will look at the unique problem of the GPU kernel API to userspace: how userspace can interface with the kernel to submit GPU commands as efficiently as possible. A brief review of what has been done and what is done now for various GPUs, and insight into what might be a better solution in the future, will be given. The last part of the presentation will be devoted to the status of Radeon kernel modesetting, which is now the largest driver inside the Linux kernel, with more than 70,000 lines of code and supporting more than 7 different GPU families.
who made this possible, for example [unclear], who is apparently still outside doing this organization thing. A few words about myself: I'm a PhD student by day, and by night I occasionally do some shader compiler hacking, and that's what I'm going to talk about. Okay, so here's the plan for my part.
First of all, very quickly, what GLSL is; probably most people know that, so just very briefly. Then I'm going to talk about the Radeon hardware, R300 to R500, from a compiler writer's point of view. Then the main part will be an overview of what the compiler looks like right now and some thoughts on how we got there. There will be a part on what is missing for GLSL and how we can get there, and some final thoughts as well. Okay, so would you please all raise your hands for me for a moment? Okay, please, do raise your hands. Now those of you who have worked with GLSL or the TGSI assembly, please put your hand down. Okay, so most people actually have, but there are some who haven't. That's a nice trick, you know, to get everybody to participate.
Okay, so here is a very rough overview of what the OpenGL pipeline looks like. You have vertex fetching from vertex arrays, you have transformations that are applied, then in the latest versions you have a geometry shader which can again modify that output, but the hardware I'm going to talk about doesn't have it, so forget about it. Then primitives are assembled and rasterized, the resulting pixels or fragments are again shaded, and it all ends up in the framebuffer eventually.
The point is that these yellow boxes are programmable. As you see on the right-hand side, there is a C-like language that you can use to modify the functionality of these yellow blocks. What we need to do as driver writers is to get from this textual representation of C code, well, C-like code, to filling in a structure like the one here, which just contains the binary machine code that the hardware understands. This compilation step is what I'm going to talk about.
Once this is compiled, every time we do rendering using that shader we just use these stored binary values and send them off to the hardware.
Actually, for GLSL the first step of compilation is in Mesa and entirely independent of the hardware. It generates an intermediate assembly language, which is also used for fixed-function old-school OpenGL and for the ARB assembly extensions. In the case of Gallium, if you have some other state tracker, like maybe the Xorg state tracker which is being hacked on, then this also ends up in this assembly language form. So what we need to do is take this assembly language and turn it into machine code. The thing is that we also need to do some optimization steps, because each piece of hardware is a little different, and the assembly that is generated by the compiler may not be optimal for what we need to do. And here's just an example of what
this assembly language looks like. It's very self-explanatory: you have some instructions like move, subtract, multiply and so on. By the way, if there are any questions, of course feel free to interrupt me. Okay, so about the hardware: I'm going to talk about R300 to R500, which are supported in Mesa by a single driver, or actually there is a single classic driver and there is a single Gallium driver. Here are the marketing terms that this roughly corresponds to, in case you haven't seen them yet. Let me just say that the newer chips, Radeon HD and onwards, are very different in terms of programmability, and I'm only going to mention them briefly at the very end.
Okay, so we have a programmable vertex shader, that's the first yellow box you've seen, and the hardware there is very close to this assembly that we use as an intermediate form, which is quite nice. Another nice thing is that there aren't many differences across the hardware from R300 to R500, and the differences that are there are all in terms of new features, so they're backwards compatible, which makes life easy for us.
Let me give you an idea of what a PVS instruction looks like. First of all you have a bunch of register files, indicated here on the right-hand side. Most of them are pretty standard, except for this strange alternate temporary register file. This is just a second register file in which you can store temporary values and which has somewhat different restrictions than the other temporary register file. Instructions then go basically in three steps. The first step is to select the up to three operands for the instruction that you want to use; most instructions have two, but multiply-and-add has three. We first select which register we want to use, we can then take absolute values, we can do swizzling, which means exchanging components or replacing components by 0 or 1, and then you can do component-wise negation. This is very nice because it's very flexible; in fact, the extended swizzle instruction that we have in the assembly you get for free. Then the instruction is executed, and then you have some post-processing as well. The R500 is a bit more powerful here, to support flow control (if, else, endif) in a nicer way than we can do it on R300. And then things are stored, of course, in the registers. Okay, the machine code is just a bunch of bit fields, and basically each box that I've drawn on the left-hand side corresponds to one bit field in this machine code, so this really corresponds to our view of what the hardware does.
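
As a rough illustration of the operand steps described above (select a register, optionally take absolute values, swizzle with the constants 0 and 1, then negate per component), here is a minimal C sketch. The type names, the field layout and the `fetch_operand` helper are hypothetical and chosen only for this example; the real hardware packs these choices into bit fields whose exact encoding is not shown in the talk.

```c
#include <assert.h>

/* A four-component register value (x, y, z, w). */
struct vec4 { float c[4]; };

/* Hypothetical swizzle selectors: a source component or a constant. */
enum swizzle { SWZ_X, SWZ_Y, SWZ_Z, SWZ_W, SWZ_ZERO, SWZ_ONE };

/* Hypothetical description of one source operand. */
struct src_operand {
    int reg;                 /* index into the selected register file */
    int absolute;            /* nonzero: take |value| first */
    enum swizzle swizzle[4]; /* one selector per result component */
    unsigned negate_mask;    /* bit i set: negate result component i */
};

/* Apply the steps in the order described: abs, swizzle, negate. */
static struct vec4 fetch_operand(const struct vec4 *regs,
                                 const struct src_operand *op)
{
    assert(op->reg >= 0);
    struct vec4 in = regs[op->reg];
    struct vec4 out;

    for (int i = 0; i < 4; i++) {
        if (op->absolute && in.c[i] < 0.0f)
            in.c[i] = -in.c[i];
    }
    for (int i = 0; i < 4; i++) {
        float value;
        switch (op->swizzle[i]) {
        case SWZ_ZERO: value = 0.0f; break;
        case SWZ_ONE:  value = 1.0f; break;
        default:       value = in.c[op->swizzle[i]]; break;
        }
        out.c[i] = (op->negate_mask & (1u << i)) ? -value : value;
    }
    return out;
}
```

Because every operand goes through these steps before the instruction itself runs, an assembly-level extended swizzle needs no separate work: it is just a move with the right operand settings.
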
Okay, so from my point of view, what's good about it? I've already said it's very flexible. Most instructions that we want to implement are supported natively by this hardware, which is not the case for fragment programs, especially on older hardware, where you have to emulate instructions like sine, cosine and so on. There are some not-so-nice things. The worst thing is that there are some operand restrictions.
If I go back one slide: you have up to three operands, but you can only use one input register at a time, you can only use one constant register at a time, and if you want to use more, then you have to use some kind of spilling moves. You can also only use two temporaries at a time, which means that if you have a multiply-add with three temporary operands, it would be cool if we could put two of the operands into the temporary file and the other one into the alternate temporary file, because then we could do it in one cycle instead of using a macro instruction that takes two cycles. But this is an optimization that we don't do yet because of, well, lack of manpower, basically. Another nice feature of this processor is that, under certain limitations, you can combine a vector instruction with a scalar instruction, but the limitations are kind of nasty, and again because of the lack of manpower nobody has bothered to really make use of that so far. The fragment processor: it's called US in the AMD documentation, for some reason I'm not entirely sure of. The weirdest thing about this piece is that the arithmetic unit is split into a three-component vector part for the RGB components and one scalar part for the alpha component. What's a bit, well, tricky, but we got used to it, is that there are many changes, especially going to R500, in terms of how texture instructions are scheduled, additional features, and flow control. But the nice thing is that the ALU, let's say, philosophy of having this RGB and alpha split has stayed pretty much the same, which makes it easier for us to share a lot of code. And again I'll show you a similar picture to the one before. No, wait,
there's another thing first. Texture instruction scheduling is an interesting problem as well, because on R300 you don't have a sequence of texture and ALU instructions that are intermixed. Instead you have one set of registers into which you can write texture instructions, one set of registers into which you can write arithmetic instructions, and then there are additional bit fields that tell the hardware: okay, please execute first the first four texture instructions, and then please execute the first ten arithmetic instructions, and so on. As I try to visualize here, you have one block of texture instructions, then one block of arithmetic instructions, and they alternate. The problematic thing is that you have a very limited number of blocks. On the R300 there are only four blocks of texture instructions and four blocks of arithmetic instructions that you can use, so you have to be careful to try to group texture instructions so that they run at the same time; otherwise you might not be able to support even rather simple shaders. I think there was once a bug about a Compiz plugin that used five rectangle textures. The thing is that for rectangle textures the texture coordinates need to be scaled with an arithmetic instruction, and then you have arithmetic, tex, arithmetic, tex, and you run out of blocks. So what the optimization had to do was to move all these texture instructions together. The R500 is nicer in that respect: there you really have a normal sequence of instructions, and the format for that is unified, five words per instruction, very nice. There are some potential optimizations in doing manual synchronization between texture and arithmetic, which should be rather simple, but nobody has bothered so far. Yeah, that's an interesting question, and I suspect that it does matter, because you have a synchronization flag in the arithmetic instructions which tells the processor: please wait until all the texture instructions are finished. So if you can do some clever
grouping, and maybe move the arithmetic instructions that need to use the texture result as far down as possible, then you could maybe have better throughput. So yeah, it's a good point; it probably still matters somehow, but we haven't done that. Okay, now here's what the
instructions look like. The most important message is that you have this big vertical split between the three-component vector part over here and a scalar part over there. Even the register files you can think of as completely separate. It's similar to before; well, this time you only have a constant and a pixel stack register file, because the pixel stack contains the temporary variables and is also initialized with the inputs, so that's a minor difference. You have slightly more flexibility in how you control your sources and operands. What you first do is select source fields where registers are loaded, and then you have the ability to do swizzling across all the units, which in theory allows for some nice hacks, because you could have an operand here that uses the R component of register 0 and the alpha component of register 10. In theory; I don't know if that's particularly useful. And you can do the usual modifications. Then you have the instructions, which are in principle separate, except that some stuff like dot product needs some cross-linking. Also, you have the ability to take the output of the scalar instruction and replicate it over there if you want to do that, which means you can't use the RGB instruction slot for that instruction. And, well, you have the usual output modifications, and then you can write it to the framebuffer, or rather not directly to the framebuffer but to the output, which is then put into blending, or you go back to the temporary registers.
Okay, so some challenges here. I've mentioned this quickly before: there are many instructions that need to be emulated, but this is pretty simple to do and works well. And there is this split, which is a challenge in terms of instruction scheduling. We have some code that does it, and I think it actually does it fairly well, except for one problem. I've seen a lot of shaders that do something like compute the reciprocal of a scalar that is in the X component of a register and write the output again to the X component. The problem with that is that the RGB unit can't do reciprocals, so what we have to do is load the X component into the alpha unit and then replicate the result to RGB, which wastes the RGB vector slot in that instruction. There's a question of whether we can move these components around in a clever way, but that's a more difficult subject, I guess, which we're not doing; again, limited manpower. On the older chips you have to do some swizzle emulation; that has been pretty stable for two years now. And of course there are some little bonus features that we would like to use optimally.
What I didn't explain is this presubtract thing. It allows you to do something like subtract source 0 from source 1 before doing the actual instruction. This allows you to do something like linear interpolation in a single instruction, instead of using a multiplication and a multiply-add. It would be nice to have. There are some limitations there, because you are less flexible with swizzling when you want to do that, which is the main reason why I've been lazy so far about supporting that.
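
To make the arithmetic behind that concrete, here is a small C sketch. It models a single ALU instruction whose presubtract stage computes source 1 minus source 0 and feeds the difference into a multiply-add. The function names are hypothetical, and the model is a simplification for illustration rather than a description of the actual instruction encoding.

```c
#include <assert.h>

/* Reference formula: lerp(a, b, t) = a * (1 - t) + b * t. */
static float lerp_reference(float a, float b, float t)
{
    return a * (1.0f - t) + b * t;
}

/*
 * Hypothetical model of one instruction with a presubtract stage:
 * the presubtract computes (src1 - src0), and the multiply-add then
 * computes presub * src2 + src0, which equals lerp(src0, src1, src2).
 */
static float mad_with_presubtract(float src0, float src1, float src2)
{
    float presub = src1 - src0;  /* presubtract: source 1 minus source 0 */
    return presub * src2 + src0; /* the actual multiply-add */
}
```

Algebraically, a(1 - t) + bt = a + t(b - a), so the subtraction that would otherwise need its own instruction is absorbed into the operand stage.
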
Okay, here's a picture about flow control. The hardware operates on many pixels in parallel, so you have the following issue. When all pixels want to jump at a branch instruction, that's fine. If none of them want to jump, it's also fine. If some want to jump and some don't, then you actually have to fiddle with deactivating some pixels temporarily and execute both branches, and so on. But I'm not going to elaborate on that too much.
The nice thing about the flow control support in the R500 is that it's very flexible, and it's actually very easy to map GLSL to the hardware. There are some other challenges, which I'm going to mention later. There are lots of possibilities for optimization there, but we can think about that later.
Okay, so much for the hardware details. Now I want to give you a, well, high-level overview of how the compiler works right now and how we got there. Okay,
so in the beginning we were young and needed a driver, and we didn't know too much; we had no documentation. So what we did was just loop over all the instructions and try to convert them into machine code as well as we could. Then, as we learned more about how the hardware really works and so on, we wanted to use new features and we wanted to fix bugs, and that caused an increase in complexity, because you have interactions between emulating instructions and doing this swizzle emulation on the older chips, for example. There was also the issue that initially we did the R300 and R500 fragment programs entirely separately, which was not a good way to live, so we wanted to share code there. What ended up happening, from a very high-level point of view, is that often there was a decision to take a single pass in this compiler and split it into multiple simpler passes that communicate using some representation, which changed over time. I guess that's actually the main philosophical change, one that took me personally quite some time to embrace: to really embrace multiple passes. Also, since last year, when the Gallium driver started to pick up speed, there was a decision to share the compiler between the two and to try to make it as independent as possible from the other things.
So I talked about multipass, and there is an explosion going on. This is the one pass we had initially, this is roughly what we had at the end of 2008, and this is more or less what it looks like in master right now. You see that first a single pass got split up. First we emulate instructions, just replacing them by native instructions in the assembly format. Then there was a strange rename, not-quite static single assignment, and dead code elimination pass, which also took care of doing swizzle emulation. There's a tricky thing about swizzle emulation: the way Mesa generates assembly, if you use only two components in an instruction you tend to get swizzles like XYYY. That's not a native swizzle on R300, but actually you don't need to use the third and fourth component; you can just ignore them. So what this pass did for the first time was to analyze which components of the input operands are actually used, then mark the unused ones and take care of that in the swizzle selection. Then there was a separate scheduling pass in the fragment program, scheduling these pairs of RGB and alpha as well as we could, and then emit. At that stage, actually only the final emit was different between R300 and R500; all the rest is pretty much shared, except for some instruction emulation details. Then again it got split up to make things slightly easier, and there is even a new pass up here. Up here we use pretty much this assembly format as the intermediate representation; down here there is a new instruction format which is modeled after what the hardware does, so this split is really represented in the instruction structure.
Well, what are the trade-offs of single pass versus multipass? In principle, multipass can be slower, because there might be some information that you have to recompute several times. However, the advantages are really overwhelming, because it's easier to wrap your head around one pass that does only a single thing instead of trying to do everything at once, so it's hopefully a lot more understandable and maintainable now. It's easier to share code, because if you have a single pass that does one thing, then maybe it applies to some other hardware as well. And the compilation time doesn't matter that much, because we usually only compile shaders at the start of an application. Of course this slows down the start of the application and we shouldn't completely ignore it, but it may be worth it, because we just don't have enough people working on this thing, and having it easily maintainable is just much more important.
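
The multipass structure can be sketched in a few lines of C. Everything here is a toy stand-in invented for illustration: the `program` layout, the three opcodes and the two passes are assumptions and do not mirror the actual Mesa code. The point is only that each pass does one thing to a shared representation and that a pass list is plain data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction set; OP_SUB stands in for a non-native opcode. */
enum opcode { OP_ADD, OP_SUB, OP_NOP };

struct instruction {
    enum opcode op;
    int negate_src1; /* nonzero: second operand is negated */
};

struct program {
    struct instruction inst[8];
    size_t count;
};

typedef void (*pass_fn)(struct program *);

struct pass {
    const char *name;
    pass_fn run;
};

/* Pass 1: emulate SUB as ADD with a negated second operand. */
static void pass_emulate_sub(struct program *p)
{
    for (size_t i = 0; i < p->count; i++) {
        if (p->inst[i].op == OP_SUB) {
            p->inst[i].op = OP_ADD;
            p->inst[i].negate_src1 = !p->inst[i].negate_src1;
        }
    }
}

/* Pass 2: drop instructions that do nothing. */
static void pass_remove_nops(struct program *p)
{
    size_t kept = 0;
    for (size_t i = 0; i < p->count; i++) {
        if (p->inst[i].op != OP_NOP)
            p->inst[kept++] = p->inst[i];
    }
    p->count = kept;
}

/* Run each pass in order over the shared representation. */
static void run_passes(struct program *p, const struct pass *passes,
                       size_t num_passes)
{
    for (size_t i = 0; i < num_passes; i++) {
        assert(passes[i].run != NULL);
        passes[i].run(p);
    }
}

/* A pass list is just data, so two back ends can reuse entries. */
static const struct pass fragment_passes[] = {
    { "emulate SUB", pass_emulate_sub },
    { "remove NOPs", pass_remove_nops },
};
```

A vertex program pipeline could list the same `pass_remove_nops` entry next to its own hardware-specific passes, which is the kind of sharing discussed in the talk.
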
Okay, here's an example of how we share passes right now between fragment program compilation and vertex program compilation. Of course the final emit can't be shared, but dead code elimination is shared. This is a pass that is only important for vertex programs, so there's nothing to share. This, register allocation, is something that we should share, but we don't do it right now, which is a bit sad; we'll get there. And instruction emulation: everything that can be shared there is shared. I mean, neither is a subset of the other, so it can't be shared entirely, but whatever we can share, we share. Well, that's very nice from a maintenance point of view.
Now, I think that to understand some program, the best way to go is to try to understand the data structures, and the most important data structure here is how we represent the programs in the intermediate steps. That's actually a very simple representation: it's just a doubly linked list of instruction structures. The instruction formats come in two flavors: there's the assembly style and the one that sits closer to the fragment program hardware, as I've already said. We also maintain a list of the constants used, because we need to add constants when we emulate things like sine and cosine. But that's it about this intermediate representation.
I personally really like the doubly linked list, because it's very easy to insert, modify and remove instructions, which is something that we do a lot. It's also easily understandable, I think. I really don't like TGSI for this kind of stuff. There is one downside, which is that to really do optimizations like peephole or whatever, we want to look at an instruction and say: okay, this instruction writes some value, now we want to know which other instructions use this written value. This is a query that, with this representation, in the worst case has to look at the entire program, which is slow, unfortunately. I did experiment a little with trying to do some more clever data structures here which can, you know, do this more efficiently, which would be very nice. The problem is that making sure that all the invariants that you want to have are maintained is tricky and can easily lead to bugs, because, well, in theory you can do all the abstractions you want in C, right, but somehow it's not very nice to express them. This is something where C++ tempts me, because there you can express some of these abstractions more easily. Yeah, it modifies in place. Yes, that's the problem. There are several different approaches you could try to fix it, but that's what happens right now. I thought about this. The problem is that your registers are vectors, right, and there are many instructions that don't actually replace the whole vector, but kind of mix the original value with some new component that gets overwritten, and I didn't really find a good way to deal with that. I don't know if there is some literature on this kind of stuff. I did look at LLVM; it didn't seem like it was really... I mean, it was rather geared towards what you have in a usual CPU, so it seemed rather problematic, although for some of the newer GPUs it might be worth looking at it again.
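
The representation and the costly query can be sketched as follows. This is a simplified model under stated assumptions: each instruction has one destination and one source register, each with a component mask, the list uses a sentinel node, and all names are hypothetical rather than taken from the driver. The `count_readers` scan also shows the partial-write problem: a later instruction that overwrites only some components leaves the others still carrying the earlier value.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction with a single destination and a single source. */
struct instruction {
    struct instruction *prev, *next;
    int dst_reg;         /* register written */
    unsigned write_mask; /* components written (bit 0 = x ... bit 3 = w) */
    int src_reg;         /* register read, or -1 for none */
    unsigned read_mask;  /* components read */
};

/* The program is a circular doubly linked list with a sentinel. */
struct program { struct instruction head; };

static void program_init(struct program *p)
{
    p->head.prev = p->head.next = &p->head;
}

/* Constant-time insertion before a given instruction. */
static void insert_before(struct instruction *where, struct instruction *inst)
{
    inst->prev = where->prev;
    inst->next = where;
    where->prev->next = inst;
    where->prev = inst;
}

/* Constant-time removal. */
static void remove_instruction(struct instruction *inst)
{
    inst->prev->next = inst->next;
    inst->next->prev = inst->prev;
    inst->prev = inst->next = NULL;
}

/*
 * Count later instructions that read components written by `writer`.
 * Straight-line code only. In the worst case this walks the rest of
 * the program, which is the cost mentioned in the talk.
 */
static int count_readers(const struct program *p,
                         const struct instruction *writer)
{
    unsigned live = writer->write_mask; /* components not yet overwritten */
    int readers = 0;

    for (const struct instruction *i = writer->next;
         i != &p->head && live; i = i->next) {
        if (i->src_reg == writer->dst_reg && (i->read_mask & live))
            readers++;
        if (i->dst_reg == writer->dst_reg)
            live &= ~i->write_mask; /* partial overwrite */
    }
    return readers;
}
```

Insertion and removal are constant time, which suits passes that rewrite instructions constantly, while the reader query stays linear unless extra use-def information is maintained alongside the list.
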
I'm not sure if there is [unclear] stuff like that. Oh well, I guess in SSE you have these vector operations, but on the other hand you don't have some niceties like [unclear]. So yeah, I don't know. Maybe one can talk about this later, about how you deal with this kind of thing in SSA; that would be interesting. Okay, another little detail
is that at some point we wanted to do dynamic allocation, so that you don't need to think about fixed-size arrays, and maintaining that by hand is a bit painful. So what we do is we have a memory pool structure which only has an allocation function; we allocate whatever we need, and at the end of compilation everything is thrown away. Okay.
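
A memory pool of this kind can be sketched in C as below. This is an illustrative implementation under assumptions, not the driver's actual code: the names, the 4096-byte block size and the alignment policy are all choices made for this example. There is deliberately no per-object free; `pool_destroy` releases everything at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header of one malloc'ed block; payload follows the header. */
struct pool_block {
    struct pool_block *next;
    size_t used; /* payload bytes handed out so far */
    size_t size; /* payload capacity in bytes */
};

struct memory_pool { struct pool_block *blocks; };

#define POOL_ALIGN sizeof(max_align_t)
#define POOL_BLOCK_SIZE 4096

static size_t pool_round_up(size_t n)
{
    return (n + POOL_ALIGN - 1) / POOL_ALIGN * POOL_ALIGN;
}

static void pool_init(struct memory_pool *pool)
{
    pool->blocks = NULL;
}

/* The only allocation entry point; individual objects are never freed. */
static void *pool_alloc(struct memory_pool *pool, size_t bytes)
{
    const size_t header = pool_round_up(sizeof(struct pool_block));
    struct pool_block *block = pool->blocks;

    bytes = pool_round_up(bytes ? bytes : 1);
    if (!block || block->size - block->used < bytes) {
        size_t size = bytes > POOL_BLOCK_SIZE ? bytes : POOL_BLOCK_SIZE;
        block = malloc(header + size);
        if (!block)
            return NULL;
        block->next = pool->blocks;
        block->used = 0;
        block->size = size;
        pool->blocks = block;
    }
    void *result = (unsigned char *)block + header + block->used;
    block->used += bytes;
    return result;
}

/* Called once at the end of compilation: throw everything away. */
static void pool_destroy(struct memory_pool *pool)
{
    while (pool->blocks) {
        struct pool_block *next = pool->blocks->next;
        free(pool->blocks);
        pool->blocks = next;
    }
}
```

Since a shader compile is short-lived and its intermediate objects all die together, giving up individual frees removes a whole class of lifetime bookkeeping from the passes.
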
Yeah, this is already kind of all of my overview of the compiler, and now I want to talk a little bit about what we need to do for GLSL and what the remaining things are to get really good support. What is worth mentioning explicitly at this point is that most GLSL shaders today actually work just fine. There are some features which are missing, which is why it's a bit dodgy to claim to support GLSL: flow control support in vertex programs isn't there, support for loops isn't there yet, and there are some additional instructions, I think, that we would still need to emulate, but that's a small thing. The other two things are a bit bigger. And of course it would be nice to have more optimizations.
Okay, how could we go about implementing loop support? Mapping the instructions onto the hardware machine code is actually pretty simple. The real problem, again, comes down to this data flow stuff, because if you compile a program which has loops right now, then stuff like dead code elimination just doesn't understand that if you write a register here at the end of the loop and then read from it again at the top of the loop, there is a dependency which goes backwards. There is code to support branches, so that works fine, I think, but loops aren't supported yet. This is the harder part, because some of the code is rather subtle and you have to be careful about what to modify where, but I hope to get around to that soon.
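
The backward dependency can be demonstrated with a toy dead code analysis in C. This sketch is an assumption-laden simplification: it tracks whole registers instead of components, handles a single non-nested loop given by instruction indices, assumes register numbers below 32, and is not the algorithm used in the driver. With `follow_back_edge` set to 0 it behaves like a straight-line pass and misses the dependency; with 1 it feeds the registers live at the loop top back into the loop end and repeats until nothing changes.

```c
#include <assert.h>

/* Toy instruction: one destination, up to two sources (-1 = unused). */
struct inst { int dst, src0, src1; };

static unsigned reg_bit(int reg)
{
    return reg < 0 ? 0u : 1u << reg;
}

/*
 * Backward scan that sets needed[i] = 1 for every instruction whose
 * result is used. The loop body spans [loop_begin, loop_end]. The
 * caller must zero-initialize needed[].
 */
static void mark_needed(const struct inst *prog, int count, unsigned outputs,
                        int loop_begin, int loop_end, int follow_back_edge,
                        int *needed)
{
    unsigned live_at_loop_top = 0;

    assert(loop_begin <= loop_end && loop_end < count);
    for (;;) {
        unsigned live = outputs;
        unsigned new_top = 0;

        for (int i = count - 1; i >= 0; i--) {
            /* Back edge: what the loop top reads is live at the loop end. */
            if (follow_back_edge && i == loop_end)
                live |= live_at_loop_top;
            if (live & reg_bit(prog[i].dst)) {
                needed[i] = 1;
                live &= ~reg_bit(prog[i].dst);
                live |= reg_bit(prog[i].src0) | reg_bit(prog[i].src1);
            }
            if (i == loop_begin)
                new_top = live;
        }
        if (!follow_back_edge || new_top == live_at_loop_top)
            break; /* fixed point reached */
        live_at_loop_top = new_top;
    }
}
```

For a program where the last loop instruction writes a register that only the first loop instruction reads, the straight-line variant marks that write as dead, while the variant that follows the back edge keeps it.
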
optimizations are also an interesting problem so here is this glsl program that are shown at the very beginning and here is the assembly that the Mesa glsl compiler produces which has 32 instructions if you do some very clever transformations you can get it down to eight a bit more realistic goal which could still be manageable I think would be to go at least down to 16 or something here is an interesting kind of philosophical problem about how how you do structure things on a high level because yeah it would be nice to if the the hardware independent glsl compiler already did some optimizations there are there are some optimizations that it could just do like that the problem is that actually well you don't know about the final hardware that this compiler doesn't know about the final hardware especially in gallium which is a bit not nice so we can't well the thing is that for example as far as I understand be the Intel hardware they are is probably actually quite happy about about this kind of stuff but we are less happy because scalars are placed pretty much randomly ignoring this rgba versus alpha split so so what do you do I mean do you go into the glsl compiler do some optimizations which would be nice to us but then piss off maybe some other hardware I don't know about move over there huh buildings like or Intel so right now we try to do everything in the Indy driver which doesn't have the original gle so it just has the assembly and tries to understand as well as it can one question yeah for example when it comes to actually having that individual home instead of having to analyze it the second time I guess it would be nice but it mean we already do this unused component analysis which is not very complicated yeah provide you that I mean everybody will need that information that's that's true yeah I guess they're there is a there would be value in pushing this unused marker into what Mesa does and also about TGS I does yeah is anyone looking at that hmm is 
anyone working on something like that? I don't think so. I don't think anybody really has this really high-level view. [Audience] But still, even in the translation there is a lot of extra stuff, these intermediate passes, but if we could have a viable... Yeah, I agree, lifting some of this optimization stuff into Mesa and then maybe augmenting the TGSI representation would be a very, very useful thing. Here are just some examples of the kind of thing you can do. Something that is a bit magic maybe, but actually works on R500, is here: you have something that first multiplies two scalars and then subtracts them. That's this kind of, if you go back to the GLSL, that's the
cross function here, which does a kind of, well, it's a modified dot product really, and maybe we can recognize that
and do some magic which actually works in hardware on R500 to save some instructions. This, by the way, is an example of why doing some of these optimizations in the device-independent code is maybe not a good idea, because if I get code like this on the R300 fragment program, then I'll be pissed off, because then I have to worry about all this swizzling here, which is not supported on R300, but R500 can do it. Yes, that's true, that's true. Yeah, I think the question is: do you want to do this before the Gallium state tracker produces TGSI or after? And you can still share that. I mean, there are some optimizations that might be easier to do before the state tracker gets its hands on it, because you might still have more information about where the code comes from, from the GLSL. I don't know how feasible that is. I mean, the GLSL compiler is... I'm afraid of that compiler, yeah.
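The unused component analysis mentioned in this discussion can be sketched roughly as follows. This is an illustration only, not the driver's implementation: the tuple layout `(op, dst, writemask, sources)`, the register names, and the function `trim_writemasks` are all hypothetical. It also assumes straight-line code and purely component-wise instructions (such as MOV, MUL, ADD), where written component `c` reads the component that each source swizzle selects in position `c`.

```python
COMPS = "xyzw"

def trim_writemasks(prog, outputs):
    """Backward pass that shrinks each write mask to the components read later.

    prog:    list of (op, dst, writemask, srcs), srcs = ((reg, swizzle), ...)
    outputs: dict mapping output register -> components needed at the end
    """
    needed = {reg: set(mask) for reg, mask in outputs.items()}
    result = []
    for op, dst, mask, srcs in reversed(prog):
        used = [c for c in mask if c in needed.get(dst, set())]
        if not used:
            continue  # no written component is ever read: drop the instruction
        # These components are (re)defined here, so earlier writes to them
        # are not needed on behalf of later readers.
        needed[dst] -= set(used)
        # Only the source components feeding the surviving writes are needed.
        for reg, swizzle in srcs:
            for c in used:
                needed.setdefault(reg, set()).add(swizzle[COMPS.index(c)])
        result.append((op, dst, "".join(used), srcs))
    result.reverse()
    return result

EXAMPLE = [
    ("MUL", "t0", "xyzw", (("a", "xyzw"), ("b", "xyzw"))),
    ("ADD", "t1", "xyzw", (("t0", "xxyy"), ("c", "xyzw"))),
    ("MOV", "out", "xy", (("t1", "zwzw"),)),
]
```

Calling `trim_writemasks(EXAMPLE, {"out": "xy"})` narrows the `ADD` to write only `zw` and the `MUL` to write only `y`. Instructions that mix components, such as dot products, would need their own rule for which source components they read.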
Then there is some stuff that you can do with constant folding. Like here, this is the greater-or-equal comparison from the GLSL, which has a zero constant there, and we can do that more efficiently in the R500 fragment program by using some of the flow control features that it has. Yeah. How do we implement such
data flow optimizations? Well, one approach that wouldn't change this intermediate representation would be to have helper functions that help you figure out where values are used and where they come from, and then add the optimization just as an additional compiler pass. That has one advantage: if you have some miscompilation, you can just disable that compiler pass and see if it helps, which is useful for debugging. And then hopefully, with the helper functions in place, doing the actual optimization is not too tricky. If we had an SSA-based representation, then I guess this would look different, but this is something that would work in the current intermediate representation. Okay, and with that I go to the last part, about code sharing, R600, and some other stuff. As far as
code sharing is concerned, well, we've already seen a lot of examples that are rather hardware-specific, that you just cannot share. But I think there are still many things that could be shared, and it would be nice if you could share them. The real problem, which I think also already appeared in the discussion, is that to be able to share code we need to share data structures, and we have to somehow agree on something that works well there. That's maybe for some future discussions. R600 is interesting because it has the same processor for vertex, fragment and geometry shaders. There already is an assembler that I think works fairly well; it doesn't do any optimizations, however. The processor is quite interesting because it has four separate ALUs that are for, well, the vector instructions, but you can actually do different instructions on each component, and then there is an additional fifth unit that can also support reciprocal, sine, cosine, these more esoteric instructions. I think that GLSL programs that use a lot of scalars map very well onto this model, but there are some problematic operand selection restrictions. If you really want to use the hardware to its full potential, you have again the question: how do you do the instruction scheduling exactly? Do you maybe move some components from the X to the Y, or something? Of course we can't reuse anything that we did for R300, because the split is just too different, but again, optimization passes would be nice to share. Okay, now there is
one slide on how to get involved in shader compilation stuff. It is a bit scary, I have to admit, and here's what you need to have before: you do need to have some understanding of GLSL and of these assembly instructions, otherwise there's no way to really wrap your head around this. The best way to get this, I think, is to just write some toy applications, or maybe if you want to have some new Compiz plug-in or whatever that you want to work on, that would be a nice way to learn it. You definitely don't need to be a 3D expert; you just need to understand how GLSL works, and the assembly. And of course, as for all open source projects, pick something small as a first project. Maybe something nice would be if you take some shader that is really used, from some open source game or from Compiz, and you just look at what the compilation result looks like right now. There are some debug flags that you can toggle to enable this output. Then you could look at the assembly that it generates, and maybe you notice something that doesn't look good, that could be easily optimized, and try to optimize that. And of course I think it's a learning-by-doing thing, because there's really no book on the subject, I think. I mean, there are some books on general compiler design, of course, but I don't think there's anything on shader compilers specifically. Okay, and one more thing, about maybe
thinking about how we improve the way that we work, because if you have better tools for your development, then of course you don't have to worry about the small stuff as much. One important thing is to keep the source code maintainable. I mean, everybody knows this, and I'm preaching to the choir here probably, but I think the things that we did in the compiler, by going to multi-pass and so on, helped a lot in that respect. There is a question of maybe programming at a higher level. I know C++ is a touchy subject, but sometimes I feel like it would be nice to have. I've heard that some compilers use some pattern-based optimization stuff, where, you know, if you have a multiply followed by an add, then just combine it into one instruction, and lots of patterns like that. And maybe instead of writing specific C code for each of these replacements, we could find something higher-level that just describes these patterns and these transformations in some very high-level language, and then use some generator that produces code to do that. That's some theory. Then, when you modify something in the compiler, it's very easy to break stuff, especially some subtle swizzling combinations and so on, so it's good to have automated testing. So test, test, test. A nice heuristic is that if there are no piglit regressions after you changed something, then probably you're fine. I mean, it's no guarantee, of course, but I think the test suite right now covers a lot of things that are typical bugs that are reintroduced again and again when you work on the compiler. A kind of crazy idea here to make the thing even more robust: maybe we could, you know, generate shaders randomly and then just render using them and compare it to some software rasterizer result. Maybe that would be an approach that helps us find more compilation bugs. I haven't tried it yet; maybe something to hack on. And I mean, if you have some ideas, of course it's always nice to share these insights.
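Two of the ideas above, combining a multiply followed by an add into one instruction and checking the compiler with randomly generated programs, can be sketched together. Everything here is an assumption made for illustration: the scalar `(op, dst, srcs)` instruction form, the register names, and the functions `try_fuse`, `fuse_mad` and `count_mismatches` are hypothetical, and a small interpreter stands in for both the hardware and the software reference. Integer arithmetic is used so that the fused and unfused results must match exactly.

```python
import random

def run(prog, env):
    """Reference interpreter for scalar MUL / ADD / MAD programs."""
    regs = dict(env)
    for op, dst, srcs in prog:
        v = [regs[s] for s in srcs]
        if op == "MUL":
            regs[dst] = v[0] * v[1]
        elif op == "ADD":
            regs[dst] = v[0] + v[1]
        elif op == "MAD":
            regs[dst] = v[0] * v[1] + v[2]
        else:
            raise ValueError(op)
    return regs

def try_fuse(prog, i, outputs):
    """Return a MAD replacing prog[i] and prog[i+1], or None if unsafe."""
    if i + 1 >= len(prog):
        return None
    (op1, t, s1), (op2, d, s2) = prog[i], prog[i + 1]
    if op1 != "MUL" or op2 != "ADD":
        return None
    if s2.count(t) != 1:
        return None  # the ADD must read the product exactly once
    c = s2[1] if s2[0] == t else s2[0]
    if d != t:
        # The product register is no longer written after fusing, so it
        # must not be an output or be read by any later instruction.
        read_later = any(t in srcs for _, _, srcs in prog[i + 2:])
        if t in outputs or read_later:
            return None
    return ("MAD", d, (s1[0], s1[1], c))

def fuse_mad(prog, outputs):
    """Peephole pass: rewrite adjacent MUL + ADD pairs into MAD."""
    result, i = [], 0
    while i < len(prog):
        fused = try_fuse(prog, i, outputs)
        if fused is not None:
            result.append(fused)
            i += 2
        else:
            result.append(prog[i])
            i += 1
    return result

def random_program(rng, regs, length):
    return [(rng.choice(["MUL", "ADD"]), rng.choice(regs),
             (rng.choice(regs), rng.choice(regs))) for _ in range(length)]

def count_mismatches(seed=0, trials=500):
    """Differential test: original versus optimized program on random input."""
    rng = random.Random(seed)
    regs, outputs = ["r0", "r1", "r2", "r3"], {"r0"}
    bad = 0
    for _ in range(trials):
        prog = random_program(rng, regs, 6)
        env = {r: rng.randint(-5, 5) for r in regs}
        before, after = run(prog, env), run(fuse_mad(prog, outputs), env)
        if any(before[r] != after[r] for r in outputs):
            bad += 1
    return bad
```

Running `count_mismatches()` compares each random program against its optimized form on the output register. A pattern description language, as suggested in the talk, would generate something like `try_fuse` from a declarative rule instead of having it written by hand.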
And yeah, I think that was faster than I thought I would be, and I'm done, so thank you for your attention. So this means that there is now time for questions, if there are some that

