
Nouveau Status update


Formal Metadata

Title: Nouveau Status update
Subtitle: The overdue Nouveau status update talk.
Number of Parts: 490
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: I will talk about:
* features
* ongoing work and necessary reworks
* big and important technical issues
* overall state of the Nouveau project
This is a replacement for Manasi Navare's "Enabling 8K displays" talk, which was cancelled after Intel rejected her travel request again.
Transcript: English (auto-generated)
Okay, hi. Is the sound okay? Okay, cool. So, I want to give a status update on Nouveau. We kind of missed last year, so we are really sorry about that.
Luckily, it worked out this year, even though we missed the CFP and everything. So, yeah, let me just start.
That's still our goal. I mean, we can't really promise the performance expectations users might have of a normal open source OpenGL driver, but at least we are trying hard to make it reliable. We know there are a lot of places where we still have issues, especially around reliability,
but at least that's where we want to be in the future, and hopefully we get there at some point. So, but what actually happened since the last time we gave this update,
and I guess the most important feature for newer systems is that we finally support mode setting on Turing GPUs. So, if you have a recent GPU, there's a high chance that you can use your displays with the Nouveau driver.
I think it's still missing for really new GPUs, like some chips released only a few months ago, but at least for most of the Turing GPUs it should be fine now.
There was also some atomic mode setting support, which is essentially for user space. If you want to change the resolution but something happens, like the cable not being reliable, the kernel can automatically detect certain problems, so you don't end up in a state with a black screen. That's quite important. I think there are a lot of X compositors which are buggy, so we keep it disabled on the kernel side,
but hopefully we get there as well. Lyude was working on a lot of reverse PRIME improvements. Looking back two years or so, there were a lot of issues with reverse PRIME: the HDMI audio was not working, or the GPU wasn't powered on anymore,
so there were a lot of random failures. Lyude did a good job working on all of those and getting it to a state where I think we can say it's good enough now, and there shouldn't be many issues left.
If you know of some, just... Yes? What is reverse PRIME? Reverse PRIME is if you have a laptop where your desktop is usually rendered on the Intel GPU, but some laptops have external display connectors, like HDMI or DisplayPort, wired to the NVIDIA GPU.
So, reverse PRIME is kind of the X term for it: you have to display on the NVIDIA GPU and copy the buffers and all that kind of stuff. I don't know the details, but that's roughly the thing.
Yeah, Optimus is NVIDIA's marketing term for it. I don't know what it's called on Radeon GPUs, but that's the idea. I was also working on a NIR backend for Nouveau,
so now, compared to the old path where we used TGSI for shader compilation, we can also use NIR. It's still turned off by default, and I hope we can turn it on by default once it's reliable enough.
I know that there are some regressions, but supporting NIR is required, not technically, but with how we do things in Mesa right now it's kind of required for OpenCL, for Vulkan, and for OpenGL 4.6. I think that's also one of the reasons why RadeonSI moved to NIR by default,
because it gives them SPIR-V support, which is required for OpenGL 4.6. It would be cool if people would start testing it, running it with games or their normal desktop, and we have an environment variable for that: you set it and then NIR is used.
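Trying the NIR path comes down to setting one environment variable before launching an application. A minimal sketch, assuming the variable is Mesa's `NV50_PROG_USE_NIR` (the name documented in Mesa's environment-variable docs around this time; treat it as an assumption for your Mesa version):

```python
import os
import subprocess

# Copy the current environment and enable nouveau's NIR shader path.
# NV50_PROG_USE_NIR is assumed to be the Mesa toggle for this; check
# your Mesa version's docs if it has no effect.
env = dict(os.environ, NV50_PROG_USE_NIR="1")

def run_with_nir(cmd):
    """Launch `cmd` (e.g. a game or glxgears) with the NIR path enabled."""
    return subprocess.run(cmd, env=env)

# Example (uncomment on a machine with a nouveau-driven GPU):
# run_with_nir(["glxgears"])
```

The same pattern works from a shell by exporting the variable before starting the program.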
For the testing I've done, I usually saw higher performance than with the TGSI path, but I didn't spend any time on performance optimizations. So if you test it and see that one application is much slower than with the TGSI path,
then I would be really interested in knowing about it. Also, on OpenCL: I was working on that quite a lot last year.
The initial support was merged quite recently, and right now it's, I think, kind of supported on Fermi and newer. I only really tested it on Pascal, so there might be a lot of issues on other chips. I think Pierre is working on enabling it on Tesla as well,
which is the generation before Fermi. If somebody really wants to play around with that, we also have an environment variable to enable it. I'm sure that none of the applications work right now,
because the kernel compilation isn't really finished at this point. But if anybody's interested in it and thinks, yeah, maybe I want to do some OpenCL driver stuff, there's this nice OpenCL conformance test suite, which is testing most of the OpenCL features.
I would say it's much easier to fix those issues than, for example, jumping into an OpenGL driver and figuring out what's wrong there, because the runtime is much simpler
and it's easier to follow what's actually going on. I called this slide staffing, because as an open source project we don't really do staffing, but we have a few paid developers.
Besides myself, there's also Ben Skeggs working for Red Hat on the Nouveau driver full-time. And we also have Lyude, who is part of the same team inside Red Hat, doing a lot of work fixing mostly display-related Nouveau issues.
I'm also, how do I phrase it, in the same team, but I can't really spend that much time on the actual issues I would like to work on. But, I mean, it's still paid developers.
We also have an intern at Red Hat who is currently working on the Nouveau shader cache, which helps especially with loading times of games. In some testing we saw speedups of three to four times
on shader compilation, because we can skip a lot of it. I don't know how relevant that is for most games, but I hope that at least games which needed four or five minutes to load won't take as long anymore.
Yes? Is anybody at NVIDIA working on Nouveau? I will come to that later. Oh yeah, the question was whether there's anybody at NVIDIA working on Nouveau. NVIDIA will be a topic for later.
Yeah, so we also have community members working on the Nouveau driver. I think the most present one is Ilia, who is doing quite some OpenGL work: implementing new extensions,
fixing random issues, and also working on some display-related stuff. Big thanks to him, because he's still working on the Nouveau driver.
We had quite a few developers two or three years ago, but most of them have essentially moved on, either by being hired by other companies and working on other drivers now,
or maybe by just losing interest. That makes it really hard for us right now, because we have a lot of issues to work on and, for example, also want to implement a Vulkan driver, but we really don't have time for anything big anymore.
I mean, there are random contributions from random developers doing some stuff, but it's never something where I would say, oh, yeah, this really stands out, or there is this new guy putting a lot of work or a lot of time into it.
Yeah, it would be nice if I could get more people interested in the project. I know that it's difficult to work on, and I know that a lot of people are kind of afraid of doing hardware stuff,
but I think if one is interested enough, they are able to work on it. Myself, before I actually started to work on Nouveau, I was a Java backend developer, so I did something completely different. I had no experience in hardware programming at all,
and I kind of just jumped into the project because I was thinking, yeah, I have this NVIDIA GPU, and I really want to have an open source driver for that, and there are issues, so I just started to work on that.
Yeah? Yeah, I mean, yes, Martin is right.
There are a lot of low-hanging fruits. I have a slide later on where I list some tasks people could work on, but I want to talk about this later. What's also interesting is
working together with NVIDIA. Negative stuff first, and I think the most annoying thing for us right now is getting firmware, because it has to be signed, otherwise the hardware doesn't execute it, and we need to access some state on the GPU
that we are not allowed to touch with unsigned firmware. It's required, for example, for OpenGL, and what we would like to have is that when new hardware is released, we get the firmware on the same day
for using those GPUs. I don't really have numbers on how long it takes on average, but I think we had a situation where it took around one and a half years for some generation. Might be less, might be more.
I really didn't look it up, so I don't want to make a concrete statement there. What's also annoying, and it's true for power management as well: we also need special firmware there. The most annoying thing is that
on the second generation of Maxwell we could re-clock the GPU, and we know enough to do it, but what needs signed firmware is controlling the fans. You can have high clock speeds, but the GPU would overheat because we can't make the fans spin faster.
That's annoying because we are so close, but we still can't do it. It's even worse on newer generations, where even bits of the re-clocking itself, like changing the voltage, are essentially locked unless you have signed firmware.
There were thoughts about doing the same thing we do for video acceleration, where the firmware is essentially extracted from the official driver. But that's really annoying, because it would mean distributions or users have to manually execute a script
every time they install their distribution, then the firmware has to be extracted, and then we have to reverse engineer the interfaces, and those change with essentially every driver version. So it would be super annoying, and a lot of work we also don't have time for.
There are also good things in regard to NVIDIA: overall, at least myself, I get the feeling that it's improving over time. It might not be that obvious to others. I have to be a little bit careful
about what I'm talking about, because we have some partnership with NVIDIA, so I know a little bit more, but I can't really tell much. But at least what we're working on with NVIDIA is getting documentation out,
and they actually have a Git repository on GitHub where you can see the commit history, and there is some useful stuff going in there, especially documentation for the display interfaces, how we program the GPU to drive displays,
and how to program the MMU as well for doing memory-related stuff. So that's super helpful, and I hope that there will be more useful documentation, also for different areas, but we just have to wait and see what's happening there.
Thierry is employed by NVIDIA, and he mostly works on Tegra code. He's upstreaming a lot of Tegra bits for the core kernel, for just using the Tegra devices,
and that's not that much Nouveau-related, but there are also some Nouveau contributions, fixing random issues or the Tegra Gallium driver. So that's also cool, and it's really good to see that there are at least some people at NVIDIA really dedicated to this.
Also, if you scroll through the mailing list or search for people with an NVIDIA email address, there are some patches from NVIDIA people, which is also quite cool, and I hope there will be more in the future.
Some stuff we are currently working on: the biggest thing is most likely getting OpenGL 4.4 and 4.5 ready. There's this requirement to pass the official Khronos conformance test suite.
I think on some GPUs we are at a stage where we pass every test, but there are random failures if we run the full thing, because the full thing does, I don't know, 30 iterations of the tests with different parameters. Last time I did this, the first failure was after 10 hours of running it,
and it's super painful to debug; it's probably some random issue like uninitialized memory. We have no idea. Also, I'm not really in the mood of having to wait 10 hours to hit a bug just to debug it,
so that's a little bit annoying, but we are getting there. Ilia was also fixing a lot of issues, and myself as well. Hopefully soon we get official OpenGL 4.5 support. We also would like to improve the performance.
There are sometimes random shader optimizations landing in the tree. For example, the NIR work could also lead to improved performance, but sadly we really can't do really big reworks to improve performance in the way
the Intel or Radeon guys are doing, where they essentially use different interfaces of Mesa or rework certain other areas. We are also building a CI system: we would like to have something wired up to the freedesktop.org GitLab CI system.
I don't know if all of you heard about this, but right now we have a CI pipeline on the Git repository: whenever somebody opens an MR or pushes to the master branch, there is this software and hardware pipeline
testing whether commits break something. We do test with software renderers, but there are also instances really testing on hardware, and I would want to have the same for Nouveau as well. I have to see how much time it consumes, because maintaining such a system
can be very time consuming, but let's see how that goes. OpenCL support is also what we are still working on, and I hope I will be done with it soonish.
There are a few kernel compilation things that we still have to figure out. And what we are also working on is
getting OpenCL support for Volta and Turing. Volta isn't really relevant to many users, because I don't think anybody has this super expensive Volta GPU, but it's kind of similar to Turing, so it's essentially the same work. And once we get the firmware
for hardware acceleration for those as well, we want to have this done inside Mesa, so users can finally use OpenCL on Turing GPUs as well.
Important things we really want to fix: I think the most prominent issue is the runtime power management issue. There are a lot of laptops where you have an NVIDIA GPU, and when we turn it off, it fails to turn back on.
Usually it leads to the system crashing, or people not being able to boot the Linux installer, and then they have to disable the runtime power management stuff. What's annoying about this issue is that I have no idea what's wrong there,
and nobody else was able to help either. I was talking with upstream developers about this issue, and there was no real conclusion there either. We also have no idea if it's a driver bug inside Nouveau,
a hardware bug, or maybe a kernel bug. What's interesting is that it only happens with a certain bridge controller and not with any other. So maybe it's a hardware bug.
The biggest problem is that the firmware code involved in turning off the GPU accesses undocumented registers, so we don't even know whether what we are doing is correct, or whether we have to do something beforehand. There's no public documentation on that, so it's super annoying to fix.
Is there any help from NVIDIA in that direction? Excuse me? Is there some serious help in that direction? So the question is whether there's any help from NVIDIA in that direction.
Kind of. Kind of? I can't really... No real information here. No, there's nothing useful coming out. What I kind of hoped would happen: the official NVIDIA driver now supports this on the latest driver for Turing GPUs as well, so they do have support for turning the GPU off and on,
and if we do it on Turing with Nouveau, it works. So I can't really say. I would like to reverse engineer it on older GPUs, but if their driver doesn't support it there, I can't. I tried to request information from NVIDIA on this,
but they... don't get it. I really don't know. I didn't get anything useful, let's put it this way. Does a similar issue happen if you pass the hardware through to a virtual machine, for example?
So the question was whether a similar thing happens with device passthrough as well, and whether it's related or not.
No, it's a different thing. I mean, it's totally unrelated. The runtime power management is usually something which is only implemented in laptop firmware, so it's really a firmware-level feature of cutting the power to the PCIe device. I remember a few issues with device passthrough,
but no, it's a different thing. What might also be relevant in the future: right now devices are not hot-unpluggable. This is mostly relevant for eGPU cases, where users have their enclosure and unplug the device, and the kernel just crashes.
I noticed that on a few generations it doesn't crash, but user space is still screwed up. It's one of the bigger reworks we would like to do, because it essentially touches all of the driver: at any point the device can just vanish,
and you have to deal with that in the kernel driver. Kernel drivers for GPUs are quite huge, and if you lose the assumption that your device is always there, there are a lot of things you can't rely on anymore,
because right now, if we want to access device memory, we just do it. There's no check for whether the device is still there or not. That's a somewhat bigger rework,
but if somebody is interested in fixing this, the biggest advantage is that there's no hardware knowledge required at all. It's essentially just unplugging it, and you see this kernel crash, and then you try to figure out how to fix it. It's kind of straightforward, but it's a lot of work.
What we also really would like to fix or work on in user space is multi-threading. That's mostly an issue which comes up with Chromium, for example, because they're using multiple OpenGL contexts in different threads.
If you run Chromium on Nouveau today, there's no hardware acceleration; if you force it on, you get problems on various websites. That's one of the reasons. There are also other reliability issues,
because rendering can fail or there can be random corruptions, but I think what really drove the decision from Google was that the multi-threading thing really causes the GPU to just crash.
There are also a lot of Chromium-based applications which have to maintain their own blacklist, I think, so there are other applications having the same issue and essentially just crashing. There are also a few games, I think, doing multi-threaded rendering with OpenGL,
and the core issue is that you corrupt the application state or the Mesa state, so we send invalid commands to the GPU, and then the GPU might just crash. Sometimes just Mesa crashes. It's really annoying,
but we would like to work on fixing that. We also would like to have a Vulkan driver. I think right now the main reason why we can't do that is that we would like to have a new kernel interface for the Nouveau driver in order to properly implement Vulkan,
which would also require us to rework the Mesa driver at some point, but it would lead to a more reliable driver. Context recovery is also something a lot of drivers are implementing: sometimes it can happen that a GPU context
or the GPU just crashes, and we would like to recover from that. In the past, what I saw and some users as well is that it could happen that the GPU context crashes, user space never knows about this, and X just freezes.
So the user is not able to do anything. They can't even switch to a TTY to restart X, so their machine just sits there, nothing happens anymore, and they essentially have to force a reboot. There are some improvements with the 5.6 kernel,
and hopefully most applications will now just crash if that happens. But if any of you hit this case with an updated kernel, where you still get a freeze and your system just doesn't do anything anymore, then contact us and we will see what the reason is.
But hopefully that doesn't happen anymore. Question? Do you have any kind of guesstimate of how many years away a reliable Vulkan driver would be? So the question was
how many years we are away from a reliable Vulkan driver. I don't know. A Vulkan driver is usually less work than an OpenGL driver, but there are also a lot of additions to Vulkan happening. I really don't know. Hopefully it's not that far away.
Combine forces with Intel and AMD, who also want Vulkan drivers? Yes, but we already share things
with Intel and AMD in Mesa. There are some common things, especially the SPIR-V compiler stuff, because Vulkan requires SPIR-V, and we have the SPIR-V to NIR pass inside Mesa, which is shared by all the Vulkan drivers. So at least in this area, we can just make use of what's already there.
There's also the dispatching stuff: if you call a Vulkan function, the runtime has to check whether it's actually implemented by the driver or not. So there is also shared code in this area. I think there are other bits as well. But yeah, there is stuff like this,
and I think the main goal is to get to a point where drivers really only have to implement their hardware-specific bits. What we also would like to work on is mainly debugging features.
Right now, if we encounter a bug, it's usually always a lot of work to figure out what's actually happening there. For example, we can't debug a shader, which is super annoying, so we can't see, oh, what's the value of this register? Or why did this shader loop forever or something?
It's really painful to debug such issues right now, because it usually means adjusting the GLSL code to figure out why something happens. That usually takes more time than, say, just firing up a debugger and seeing what the shader is doing. We can't do that yet; we would like to.
There was some reverse engineering in this direction done many years ago, but it's not implemented yet. If anybody is interested in helping us out with that
and has NVIDIA GPUs and hits these issues, it's always a good idea to try to look into it yourself. I mean, that's how I started to work on Nouveau. I just had issues related to re-clocking,
and I was looking into them and figuring out what's wrong. If there are issues which really annoy you and you would like to get them fixed, you could try to look into them as well. We can always help, so if you're interested,
just try to do this, please. Being interested and motivated is of course a huge thing. If you are not motivated, then nothing will come out of it. Especially for students who also want to earn some money while doing that,
we would be happy to do GSoC or EVoC programs, where you essentially get paid to work on an open source project, kind of like an internship. It's not formally an internship, but there is some evaluation going on,
and we say, okay, this person is skilled enough to work on that, and then they get paid; I think GSoC is three months. I don't know the details of EVoC. So there's money, and you're working on open source projects.
A good entry-level task is also compiler optimizations. Some of you might think, oh, that sounds super complicated, but a lot of optimizations are just simple math. There are special patterns in shaders
which you can reduce to a simpler set of instructions just by applying some math. There are also more complicated compiler optimizations, but the more trivial ones are where you just see, oh, maybe there's a shift
to the left followed by a shift to the right; if it's by the same number of bits, you can just replace the pair with an and. What would also be a fun project is playing with the GPIOs.
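The shift-pair rewrite described above can be sketched in a few lines. This is a toy Python model of the peephole rule, not Nouveau's actual compiler code: for a 32-bit register and logical (unsigned) shifts, shifting left by n and then right by n is the same as masking off the top n bits with a single and.

```python
BITS = 32  # model a 32-bit GPU register

def shl_then_shr(x, n):
    """The literal two-instruction pattern: logical shift left by n,
    truncated to the register width, then logical shift right by n."""
    return ((x << n) & ((1 << BITS) - 1)) >> n

def optimized_and(x, n):
    """The single-instruction replacement: mask off the top n bits."""
    return x & ((1 << (BITS - n)) - 1)
```

For example, `shl_then_shr(0xDEADBEEF, 8)` and `optimized_and(0xDEADBEEF, 8)` both yield `0xADBEEF`. Note the equivalence only holds for logical shifts; an arithmetic (sign-extending) right shift would behave differently.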
There are several of them on NVIDIA GPUs. Some of them just control the LEDs on the GPU, for example. Some of them get triggered when the external power isn't connected to the GPU, and there is a lot of random stuff like this.
So if someone is more inclined to work on hardware-level stuff, that's also something cool to work on. And we also have some fan-control issues for which we actually got documentation, and it's still a thing on our to-do list,
but if somebody else has a GPU where the fan control is really odd, like the fans getting slower under more load, which is something that shouldn't happen, I have a GPU like that myself. If you find such odd issues with your GPU as well,
maybe that would also be a cool thing to look into. I think I'm done. There are a few links you can always follow up on as well.
Mostly we are on the IRC channel, talking with users if they have complaints or bugs or something. And we also have the Trello board where a lot of tasks are listed.
Maybe I can open it. But I think that's essentially it. Yeah, so I don't have anything anymore. Any questions or comments or maybe even suggestions?
I thought it's on the wiki. Yeah, I'm sure it's there, but if it's not, then yeah.
The question was if we have to do anything special for virtual GPUs. No, we don't.
It's essentially the same GPU. I think there are some virtual cards which have certain hardware features enabled, like double-precision floating point performance. Usually on GeForce cards you have really slow double-precision performance,
but it doesn't matter because games are not using it. But besides that, I'm not really aware of anything. I mean, what I was actually working on is that, in the past, with the NVIDIA driver you only got the power consumption readings on such cards, and I've implemented it for GeForce cards as well,
because it was just a software check in the NVIDIA stack to turn it on or off. But yeah, generally it's the same. Yes? So I was curious about the version of OpenCL
that you were using. So the question was what OpenCL version we are targeting. Right now, the OpenCL runtime in Mesa is implemented at the 1.1 level,
and at least that's what we are trying to match. But I was also working on some OpenCL 2.0 features, and Pierre was working on consuming SPIR-V, which is an OpenCL 2.1 feature.
So yeah, I mean, there are interesting features in newer OpenCL versions which would be cool to implement. But right now we are trying to get it to work first, and then look into implementing more features.
But I never managed to... is there still development for those old cards?
So the question was if there's still development for old cards. Yes, if somebody has time to look into it. I mean, kind of the bad and the good thing is that we are not a company, so we can just take care
of issues on older hardware, because we don't have the pressure of only supporting the newest thing in order to sell hardware. But on the other hand, we also don't have enough people to look into all the issues. If there's a bug on a 10-year-old GPU and somebody fixes it, then we would also like to merge it, yes.
Yes? So I saw that you have some paid core developers. Who pays for their work, and why? What is the interest of these companies? So Red Hat is paying for all of them, and the reason for doing so is just because of RHEL,
Red Hat Enterprise Linux. I mean, it supports the desktop, and because you want to make use of the GPU as well, you need the Nouveau driver, because Red Hat doesn't want to ship any proprietary blobs, so Nouveau is the only option.
You said you use the firmware from NVIDIA for the reclocking and the fan control. Does that require any additional work in Nouveau itself? The question was if using the firmware provided by NVIDIA causes additional work.
Yes, the firmware is executed on coprocessors on the GPU, and you also need a communication channel to those chips, to those processors. So that's one thing we have to work on. Also, the signature stuff,
because they are signed, requires a booting process which gets more complicated with each generation. So you have a core security chip which has to get booted first, and then it boots the other processors with other firmware, where the signature stuff is different again,
and you have to allocate secure memory so the driver can't tamper with it while the boot process is ongoing, and stuff like that. So that's kind of the thing. The advantage of getting the firmware is we don't have to implement the firmware ourselves.
That's what we've done on older generations, where we have our own firmware for power management, especially useful on Tesla and Kepler because we do the memory reclocking on the PMU, which is one of those coprocessors,
but we also have firmware for context switching. So if you have multiple hardware contexts on a GPU, they have to get switched at some point, and on NVIDIA GPUs that's also done on those coprocessors. Yes?
Now that Red Hat belongs to IBM, could they put more weight behind that? I mean... Can you repeat the question? Yeah. The question is if, now that Red Hat got bought by IBM, they are able to put more money into it.
I mean, they certainly could, because it's a bigger company with more money than Red Hat, but even if I knew about plans, I wouldn't be allowed to talk about them. So, yeah. I mean, the most obvious thing is yes, they could, because they have more money.
I don't know. Yes? When you talked about debugging, you mentioned that there are memory corruption problems going on. Is it possible to use AddressSanitizer or something like that when building the kernel module?
No. The question was that, because I was talking about debugging, if it's memory corruptions, why can't we use AddressSanitizer? I was talking about GPU debugging. All the sanitizers don't work with GPU memory. So...
So is it memory corruption from the GPU or not? Yeah. Yes. I mean, for example, the multi-threading issues we can catch with the sanitizers. The problem is just that the cause is much more complicated than just throwing in some logs. Because if you throw in too many logs,
then the performance goes down, and then there's no benefit of doing multi-threading. Sure, we could, but also kind of the architecture of how we're doing things makes it really annoying to do so. Yeah. Could you keep a log of what's sent to the GPU,
and then, when it dies, have at least a backtrace of what happened before? So you cannot read the state anymore when it's crashed, but then you can at least see what the last thing was that happened. So the question was... It would slow things down, of course, but...
So the question was if we could keep a log so we know what happened last on the GPU before it crashed. Yes, we have that. With 5.6, Ben added a feature... no, it's a kernel feature where,
if you send commands to the GPU, you can do it in a synchronous way. And if the GPU crashes, we know which command submission was done last. It helps a bit, but a command submission can also be like 100 or 1,000 commands.
But yeah, I mean, we could also make the command submissions smaller and then figure out what happened. So yeah, we have something like this now.
Okay, then I think I'm done.