
Status of GPU offloading on Wayland


Formal Metadata

Title: Status of GPU offloading on Wayland
Number of Parts: 199
Author: Axel Davy
Language: English
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This talk will be about the principles of GPU offloading, how it is handled with X DRI2, and how we decided to handle it on Wayland. It has been about 3-4 years since the first experiments with GPU offloading on X. How is GPU offloading handled with X? And how does the different design of Wayland influence the way it is handled? In this talk, I'll present:
* the technical difficulties involved in GPU offloading
* how X DRI2 handles GPU offloading
* the choices we made when designing Wayland's GPU offloading support
* the work that has been done, and what remains to do for a better user experience
Transcript: English (auto-generated, edited)
OK, so I'm going to talk to you about GPU offloading on Wayland. My name is Axel Davy; I'm a student at the École Normale Supérieure in Paris. First, I will tell you what the different technologies involved in GPU offloading are and how we can handle them. Next, I will explain how GPU offloading is handled with X DRI2. Then I will show you how we decided to make it work with Wayland. And to finish, I will tell you how it can work with XWayland.
So first, the different technologies involved in GPU offloading. The first thing to have in mind is rendering: to be able to render, we have to be authenticated. The traditional way works like this: there is a master, a DRM master, for example the X server. If I am a client, I open the device and generate a magic number, which I send to the server; the server then uses this number to authenticate me, and after that I am able to render. But there is another way, introduced recently: render nodes. Render nodes are device paths that I can open and render through without any need for authentication. But since I don't authenticate, I won't be able to use some of the driver's functionality.
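As a rough sketch of the two paths with libdrm (the device paths are typical but system-dependent, and send_magic_to_server is a hypothetical stand-in for the display-protocol round trip; error handling omitted):

```c
#include <fcntl.h>
#include <xf86drm.h>

extern void send_magic_to_server(drm_magic_t magic); /* hypothetical helper */

/* Legacy path: open the primary node, then authenticate through the
 * DRM master (e.g. the X server). */
int open_with_auth(void)
{
    int fd = open("/dev/dri/card0", O_RDWR); /* primary node */
    drm_magic_t magic;

    drmGetMagic(fd, &magic);
    /* The client sends the magic to the display server; the server,
     * as DRM master, calls drmAuthMagic(master_fd, magic) for it. */
    send_magic_to_server(magic);
    return fd; /* now allowed to render */
}

/* Render-node path: no authentication needed at all. */
int open_render_node(void)
{
    return open("/dev/dri/renderD128", O_RDWR);
}
```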
One thing render nodes forbid is GEM names. So what are GEM names? To share buffers between different contexts or devices, there are several ways. First, you may know that a dedicated card has its own memory, its VRAM, which is only for that device; but in RAM you can share memory between all the graphics cards, because all the graphics cards have access to RAM. To share buffers, you have to give a meaning to them. When you manipulate buffers in your driver, you manipulate handles, which are numbers, and such a number only has a meaning for you: for you, it names a buffer. Mesa, for example, uses handles. But if you want another context to use your handle, you can generate a GEM name. A GEM name is a global handle for the device, which means that anyone who has this number can access the buffer. That is insecure: if you guess the right number, you can access the buffer, and since the numbers start from one and are not very random, it's easy to grab somebody else's buffers. DRI2 uses GEM names anyway, because they were the only technology available at the time to share buffers.
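In terms of the kernel interface, GEM names are just the flink/open ioctl pair; a minimal sketch (error handling omitted):

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/drm.h>

/* Export: turn a driver-local GEM handle into a global GEM name. */
uint32_t gem_flink(int fd, uint32_t handle)
{
    struct drm_gem_flink flink = { .handle = handle };

    ioctl(fd, DRM_IOCTL_GEM_FLINK, &flink);
    return flink.name; /* small, guessable integer */
}

/* Import: any authenticated client on the same device can open the
 * buffer just by knowing (or guessing) the name. */
uint32_t gem_open(int fd, uint32_t name)
{
    struct drm_gem_open op = { .name = name };

    ioctl(fd, DRM_IOCTL_GEM_OPEN, &op);
    return op.handle; /* local handle for the same buffer */
}
```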
But now we also have PRIME DMA-BUF file descriptors. A DMA-BUF file descriptor describes what the buffer is and where it lies in memory, and holding it gives you the right to access the buffer. It is secure, because you can't guess this information, and opening a file requires specific rights. We will use DMA-BUF by default with Wayland, and DRI3 uses only DMA-BUF.
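With libdrm, the PRIME export/import pair looks roughly like this (error handling omitted):

```c
#include <stdint.h>
#include <xf86drm.h>

/* Export a buffer from the rendering device as a DMA-BUF fd... */
int export_buffer(int render_fd, uint32_t handle)
{
    int prime_fd = -1;

    drmPrimeHandleToFD(render_fd, handle, DRM_CLOEXEC, &prime_fd);
    return prime_fd; /* passed to the other process over a UNIX socket */
}

/* ...and import it on the display device as a local handle. */
uint32_t import_buffer(int display_fd, int prime_fd)
{
    uint32_t handle = 0;

    drmPrimeFDToHandle(display_fd, prime_fd, &handle);
    return handle;
}
```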
Now, about memory speed. Rendering to RAM is fast if you have a direct channel to it: for example, with DDR3 at a frequency of 900 megahertz and a 128-bit channel, you have a high read/write capacity, about 14 gigabytes per second. That's fast. But it's different if you have to go through a bus to access that memory, for example the PCI Express bus. I took the example of the bus in my computer: eight lanes of a 500-megabyte-per-second bus, which means I can transfer four gigabytes per second over it. With an external device on Thunderbolt, it would be only one gigabyte per second. So that's much slower than direct access to RAM. What does that mean for GPU offloading? Well, a fullscreen full-HD buffer is about eight megabytes, and if I have to share it at 60 frames per second, that's about 500 megabytes per second.
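For reference, the figures above follow from simple arithmetic, assuming an effective 900 million transfers per second on the 128-bit RAM channel and 4 bytes per pixel:

```
RAM channel:    900e6 transfers/s x (128/8) bytes  ~ 14.4 GB/s
PCIe bus:       8 lanes x 500 MB/s                 =  4.0 GB/s
Full-HD frame:  1920 x 1080 x 4 bytes              ~  8.3 MB
At 60 fps:      8.3 MB x 60 frames/s               ~ 500 MB/s
```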
So let's do a test on my system, which has an Intel card and an AMD card. Both have their own memory, but the AMD card has to go through the PCI Express bus to talk to RAM. I ran a very, very light test which doesn't give any indication of the performance of the graphics card, only of the speed of the memory. If I render this test to RAM, the Intel card, which has fast access to RAM, reaches about 10 gigabytes per second. That's near the 14 gigabytes per second we saw before as the maximum; there is a bit of rendering overhead, which is why we don't reach the maximum. But with the AMD card we see about five times fewer frames, because it is limited by the PCI Express bus. And since your real application will be not light but heavy, it will use the PCI Express bus to transfer a lot of data. So with GPU offloading, when you use such a card to render, you shouldn't expect high FPS numbers like these. Now about tiling; you may have heard of it.
Tiling is a way of reordering the pixels in memory so that your graphics card renders faster. It's very good for performance: for example, on my Intel card, where tiling has a big performance impact, I get three times more frames per second with tiling. That's a huge difference, although for some applications it makes no difference at all. But the big problem with tiling is that we can't just give a tiled buffer to another card and say "read it", because the other card won't understand anything: the tiling depends on the model and the settings of the card. For example, here gears is rendered on my AMD card and displayed by the Intel card. If I tell the AMD card not to use tiling, the Intel card understands the buffer fine. And here is what happens when I allow the AMD card to use tiling: you can see, for example, that the green pixels are laid out much more in lines than before. You can see why it's faster for the card to manipulate pixels that way, but we would like the image to be shown properly. So when we have to deal with tiling, what we want is to combine the high performance of tiling with the fact that we need to share a linear buffer with the other card. So what we want to do is render to a tiled buffer, and then do a copy to a linear buffer; when I say linear, I mean no tiling.
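To make the idea concrete, here is a naive sketch of what such a de-tiling copy amounts to, for a made-up layout of 8x8-pixel tiles (real GPU tiling formats are more involved, and the real copy is done by the GPU, not the CPU):

```c
#include <stdint.h>

#define TILE 8 /* hypothetical 8x8-pixel tiles */

/* Naive de-tiling copy: the tiled image stores whole tiles one after
 * another, row-major inside each tile; the linear image is plain
 * row-major. Assumes width and height are multiples of TILE. */
void detile(const uint32_t *tiled, uint32_t *linear, int width, int height)
{
    int tiles_per_row = width / TILE;

    for (int ty = 0; ty < height; ty += TILE)
        for (int tx = 0; tx < width; tx += TILE) {
            const uint32_t *tile = tiled +
                ((ty / TILE) * tiles_per_row + tx / TILE) * TILE * TILE;
            for (int y = 0; y < TILE; y++)
                for (int x = 0; x < TILE; x++)
                    linear[(ty + y) * width + tx + x] = tile[y * TILE + x];
        }
}
```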
And to finish, DMA-BUF fences. When you share a DMA-BUF, which is what we do to share the buffer between the two cards, there are no fences shared between the two devices. Internally, each card has its own fences to say "I have not finished writing to this buffer yet, wait a bit before reading it", and all of this works pretty well on one device; but two different devices won't share these fences, so they will show garbage. Fortunately, a new solution is coming, written by Maarten Lankhorst: DMA-BUF fences, which are fences shared by all the devices. This will remove glitches, because the two cards will share these fences and order their accesses correctly. An extra feature, which has nothing to do with GPU offloading, is that we can poll a DMA-BUF to know whether its fences have signaled or not, which means that in user space we can know whether rendering has finished. That's good.
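Once that lands, user space can wait for completion with an ordinary poll() on the DMA-BUF file descriptor; a minimal sketch, assuming poll-for-writable means "all fences signaled":

```c
#include <poll.h>

/* Block (up to timeout_ms) until the fences attached to a DMA-BUF
 * have signaled, i.e. until the GPU has finished its accesses.
 * The exact event bits depend on the final kernel implementation. */
int wait_for_fences(int dmabuf_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };

    return poll(&pfd, 1, timeout_ms); /* > 0 means the fences signaled */
}
```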
Now, how does GPU offloading work with X DRI2? The basic mechanism to render with DRI2 is this: I'm a client, I tell the X server "hey, I want to render", and there is a special bit to say "I want to render on a specific device, not the main one". The X server gives back a device path. We open the file at this path, which is the device, and then we have to authenticate: generate a magic number and send it to the X server. Then, in the DRI2 scheme, I have to ask the X server for the buffers I will use to render. So I ask for a buffer, and the X server answers with GEM names; remember that those are insecure. The client uses these GEM names to get the buffers and render to them. When it has finished rendering, it tells the X server it has finished, it swaps, and X copies the buffer to the right location. If you have heard about the additional copy with compositing, this is where it happens: when you use compositing, the X server is not allowed to copy directly to the screen content. Instead, the X server has to copy to an intermediate location, and the compositor then copies what it sees at this intermediate location to the screen. So that's two copies where we could use only one; Wayland and DRI3 improve on this, with only one copy.
OK, so how does this work with GPU offloading? First, you have some configuration to do. You have to provide a specific driver for each device: the DDX is the X driver specific to your device, and both DDXes have to be loaded at start. You can either write in the xorg.conf configuration file which DDX should be loaded for each device, or it's automatic if you don't have any xorg.conf at all. Next, once the computer has started, you use XRandR to say what you want to do with these devices. There are two modes. The one used, for example, by the NVIDIA proprietary driver is to have one GPU for display, for example the Intel card, and one GPU for rendering, for example the NVIDIA card; so that's not exactly GPU offloading. The second mode is GPU offloading: one GPU does both display and rendering, and the other one can be used on demand by asking for it explicitly. This is the mode you have to configure here. Each device gets a provider number, and you can specify which provider is for displaying and which is for offloading. Then with DRI_PRIME you can say "I want to use provider one for rendering"; usually zero is the displaying provider and one is the second card. That's why you always use DRI_PRIME=1 when you want to use the secondary device.
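On the command line, that configuration looks roughly like this (provider numbers come from the listing; provider 0 is assumed here to be the displaying one):

```
$ xrandr --listproviders                  # see the provider numbers
$ xrandr --setprovideroffloadsink 1 0     # provider 1 offloads to provider 0
$ DRI_PRIME=1 glxgears                    # run this client on the secondary GPU
```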
So internally, how does it work? The X server has loaded the correct DDX for you and sends you correct GEM names for your device; but when you commit the buffer, the location the X server copies to has to be shared between the two cards. For that it uses the PRIME DMA-BUF API, and there is special code in the DDXes to handle that and to create a linear buffer, with no tiling, shared between the two cards. So this requires special code in the DDXes and the X server. When you don't have compositing and the client is fullscreen, the target is directly the fullscreen scanout buffer; there is an exchange so that this is possible. But if you don't have compositing and your application is not fullscreen, you see nothing, just a black surface, because nobody takes care of the copy from the shared linear buffer to the screen pixmap. When you have compositing, that copy is done, so it works. And to remove glitches, every time we are told that a part of the buffer has been redrawn, we say "no, the whole buffer has to be copied", and it works.
But with tearing. Tearing is: I'm displaying a frame and I write over it while it is being displayed, so I have new and old content in the same frame. It looks very strange and not nice. There is no synchronization at all with the screen refresh, but at least the content is correct, because you always copy to a buffer whose previous content was correct, just old; so even if you read before writing has finished, you will see correct content. Because, as you remember, there is no fence shared between the two devices, one will write to the buffer while the other reads it at the same time. That's also the root cause of the tearing; we can't avoid tearing in this situation.
So how can we make this work with Wayland? In Wayland, the main mechanism is that the client is aware of the device path of the compositor. The compositor says: "I use this card and you can connect to it; I can authenticate you to it if you want." But the client can also open another device, in particular the render node of a device, and then it doesn't need any authentication from the compositor. The rendering is a bit different too, because the client doesn't ask the server for buffers; as with DRI3, it has its own set of buffers. It makes the server aware of these buffers, so the server imports them and knows about them. The client chooses one, renders to it, sends it to the compositor, and the compositor uses it. The client then renders to another buffer, and the compositor, when it gets a newer buffer, eventually releases the old one and says "you can reuse this buffer again".
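A minimal sketch of that client-side buffer loop with the wayland-client API (buffer allocation through EGL or GBM is omitted; the busy flag is a hypothetical way for the client to track which buffers the compositor still holds):

```c
#include <stdint.h>
#include <wayland-client.h>

/* The compositor signals "release" when it is done reading a buffer,
 * so the client knows it may render into it again. */
static void handle_release(void *data, struct wl_buffer *buffer)
{
    int *busy = data;

    *busy = 0;
}

static const struct wl_buffer_listener buffer_listener = {
    .release = handle_release,
};

/* Register the release handler once, when the buffer is created. */
void watch_buffer(struct wl_buffer *buffer, int *busy)
{
    wl_buffer_add_listener(buffer, &buffer_listener, busy);
}

/* Hand a freshly rendered buffer to the compositor. */
void submit_frame(struct wl_surface *surface, struct wl_buffer *buffer,
                  int *busy)
{
    *busy = 1;
    wl_surface_attach(surface, buffer, 0, 0);
    wl_surface_damage(surface, 0, 0, INT32_MAX, INT32_MAX);
    wl_surface_commit(surface);
}
```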
OK, so how can we do GPU offloading with that? Obviously, we can choose the device we want to render to. The first thing we want to improve over DRI2 is to remove the tearing; that would be cool, right? So we want some synchronization, and that's possible with Wayland, because we have a way to synchronize to the refreshes of the compositor without the GPU having to be involved: frame callbacks.
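A minimal sketch of those frame callbacks with the wayland-client API (throttling logic omitted):

```c
#include <stdint.h>
#include <wayland-client.h>

static void frame_done(void *data, struct wl_callback *cb, uint32_t time_ms)
{
    wl_callback_destroy(cb);
    /* The compositor has displayed the frame: now is a good time to
     * start rendering the next one. */
}

static const struct wl_callback_listener frame_listener = {
    .done = frame_done,
};

/* Ask the compositor to tell us when to draw the next frame; the
 * request takes effect on the next wl_surface_commit(). */
void request_frame_callback(struct wl_surface *surface)
{
    struct wl_callback *cb = wl_surface_frame(surface);

    wl_callback_add_listener(cb, &frame_listener, NULL);
}
```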
We would also like to have the least possible code in the server; everything should be client-side. That's possible: since the client knows the device of the compositor, it can decide to give the compositor a buffer the compositor can understand. So that makes sense. And we would like some sort of hotplug support, and that's all. So the very first scheme that came to mind, which is not very good, is that the server is the DRM master of all the cards and advertises all of them: "I can authenticate you to any of them." The client sees that it has this whole set of devices to choose from, picks one, and then sends a buffer the compositor can read, with no tiling. But there are still issues with this approach: we still need server code to be DRM master of all the devices. We wanted to simplify that, so we decided to rely only on render nodes.
The compositor doesn't do anything, doesn't know about the other cards; we change nothing in the compositor. Everything is client-side, using render nodes, with no need for extra code. But remember, for X DRI2 we had a way to specify which device to use: a provider number, which was mostly constant, because zero was the displaying one and one the other card. Here we have to find something else. The device path is not constant across reboots or across updates, because it depends on how fast each driver comes up at boot. So we decided to rely on a tag filled in by udev, which doesn't change across reboots and which looks like this: this is the ID_PATH_TAG for my dedicated card, and when I want to use this card to render, for example, GLMark2, I use this command.
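The command on the slide has roughly this shape; the tag value below only illustrates udev's ID_PATH_TAG format and is not the actual tag of my card:

```
$ DRI_PRIME=pci-0000_01_00_0 glmark2
```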
But since all that is a bit tricky to remember, we decided that the value 1 would have a special meaning: "I want another card than the compositor's". So when you have only two cards, it picks the other one, and it works. But in Wayland you can launch nested compositors, and you can launch clients inside such a compositor. And there we would like to be able to launch clients inside that compositor on the dedicated card if we want. If I use DRI_PRIME=1 there and launch a nested compositor on the dedicated card, all its clients will use a device different from their compositor's, so we are not using the device we want: they would all be using the Intel card instead of the AMD card here. So the tag is the way to be sure which device is used, and 1 means "a different card than the compositor's". OK, so hotplug can be supported, because there is no server-side code needed to detect that there is a new device and do something with it. That's cool. But as we said, we are going to share a linear buffer with the compositor, since we don't use the same card as the compositor.
So it isn't optimal to render to this buffer directly; as we said, we want to render to a tiled buffer and do a copy to a linear buffer. How can we do that? There are two possible ways. The first one is to say "compositing is light, we can composite to a linear buffer" and launch a nested compositor on the dedicated card. All the clients inside it only know that their compositor uses the dedicated card, and they use tiling, because they know they can share tiled buffers with their compositor. The problem with that approach is that it introduces some lag, because there is an extra layer, and more CPU consumption. So it's not that cool.
And remember, we said that with DRI2 it was cool that the content was correct even though fences didn't work. Here we have a problem, because we give fully rendered buffers to the compositor, and they are not copies of already-correct buffers; so if you show such a buffer directly, it may be half black, or not show the correct content. You have to accept a small lag so that the rendering can finish. Right now, only this solution gives correct content, because the second way is to do the copy in Mesa and send the copied buffer. With that approach we get glitches, because we send a buffer that has not finished rendering, and the compositor may just read black content. When we do the copy, we can instead wait, before committing, until the rendering has finished. That's equivalent to glFinish, but it is very slow. It severely impacts performance, because between finishing the rendering and sending the buffer a long time passes before you feed new rendering commands to the graphics card, so the card is not kept busy. That's why performance decreases.
But the good thing is that in both cases you can run a full desktop on the card you want. For example, if I launch a fullscreen nested compositor, it just works: there is a small lag, but all my desktop is rendered on the dedicated card. It is a fullscreen buffer, so the Intel card has nothing to do other than display it. Also, neither approach has tearing, even if they can have glitches, and the synchronization to the screen refresh works. So obviously the second way, the copy in Mesa, is what we want to do for most applications, but it will have to wait for DMA-BUF fences to be fully workable. OK, now I want you to think about another case: I have two cards, but also two displays, and each display is connected to one of the cards. How do we want to handle that? It can happen on some laptops, where the VGA connector is wired to the NVIDIA card. On Windows, in this case, what they do is switch so that all the applications are rendered on the most powerful card, which is the dedicated card. But we don't know yet how to switch GPUs on the fly with our technology, in Mesa or in the kernel.
Maybe one day, but I think it's far away. So what we want, I think, is a compromise: each client created on one display is on the device connected to that display. That's possible. So I have two displays, and each client uses one of the graphics cards; they can use tiling, et cetera, and have good performance. But if I want to move an application to the other screen, since it's on the other card, I have to do something. This has to be handled by the compositor: the compositor will have two Wayland connections, one for each display, one connected to the dedicated card and one to the integrated card. And when it knows a client wants to change screens, it will do a copy. So it's not perfect, but at least if you launch the client on the display you want, you get the best performance possible, because you can use tiling, page flips, et cetera, without extra copies.
OK, so let me repeat a bit and summarize. For X DRI2: the server controls the devices; it is the DRM master of them; it has special code to handle them, and there is a DDX for each device. The copy from the tiled buffer to the linear buffer is handled in the X server; the client doesn't do anything, and there is special logic for that in the DDX. The clients always authenticate to the server, and the DDX has special code to handle all that. For Wayland: we have no code in the server, except for the last example I told you about, and that isn't something that works with DRI2 anyway. We rely only on render nodes. The client knows it uses a different device than the compositor and adapts itself: it has the special code to handle that, to render to a tiled buffer and copy to a linear buffer. And it can all go through a nested compositor, eventually.
So, right now, what has been done towards full GPU offloading support? Render nodes are here, and they are going to be enabled by default soon; right now, you have to pass a special kernel command-line parameter (drm.rnodes=1) to have render nodes. The patches for using DRI_PRIME in Mesa to indicate which device to use, with the ID_PATH_TAG, are working. Also, a cool feature of recent kernels is that the GPU you don't use can shut down when not needed, so you don't have to do it manually; that's good. What needs to be done: DMA-BUF fences, as I said, are the most wanted feature. We also need to handle the copy in Mesa, but once DMA-BUF fences land, I think we will add it pretty soon. Also, instead of using DRI_PRIME to indicate which device a program should use, we don't want scripts and the like; ideally, we want something automatic, where I can say "this program should always use device A", for example. driconf, which is already used to customize the parameters of the graphics drivers per program, can be used for that; there has been some work on it, but it's not finished. Also, some remaining applications still rely on GEM names: VA-API, for example, still uses an old Wayland interface from when there was no DMA-BUF support yet, so it has to be rewritten, but it's not much work. And obviously, the compositor has to do some special things to handle multiple displays
connected to different devices, but I think it will be very cool. Now, how can this work with XWayland? I don't know if you know about XWayland: basically, it's an X server connected to the Wayland compositor; it handles the X clients that come up, and XWayland sends their buffers to the compositor correctly. Glamor is what we want to use for GPU acceleration: it's a library based on OpenGL that handles the X rendering, because X can do some acceleration for its clients that Wayland does not. And since it's OpenGL-based, we don't need GPU-specific code for each card; it's all in Mesa. And if you think about it, we don't really need to support X GPU offloading the way DRI2 supports it, because we can already launch a nested compositor on the card we want, and inside it all X applications run on the card of the compositor. So we can already have GPU offloading without supporting X GPU offloading and having several DDXes. What we want is to have only one DDX loaded for XWayland. But there is a problem: DRI2 doesn't work with render nodes, obviously, because it uses GEM names, which are forbidden on render nodes. So we have to support DRI3 for this use case. DRI3 works with render nodes, because the difference is that the server doesn't give you a device path to open; you ask the server for an already-opened file descriptor for the device. So the server can do whatever it wants and use render nodes, et cetera. And DRI3 uses only DMA-BUF file descriptors, which are supported by render nodes and are secure. But DRI3 is not entirely ready; it's still being tested, though I think it will be ready for production soon.
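For context, the descriptor passing that DRI3 and Wayland rely on is ordinary SCM_RIGHTS ancillary data over a UNIX domain socket. A generic sketch of the sending side:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one open file descriptor over a UNIX domain socket. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* the message must carry at least one data byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0);
}
```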
So, do you have any questions? Yes. Maybe someone can bring a microphone for the questions. Yes.
OK. So the question is about Mac OS: there, with two displays connected to different GPUs, an application on one display connects to the right GPU, and when you move the application to the other display, a notification is sent to it so that it can change device. Can we have similar support? I was told there is an extension, but I think it's GLX-only, I'm not sure, with which we can tell the application that its OpenGL state is lost and it has to redo everything from the start. It could be used for this, but I was told that no application cares about it. So it would obviously need some application support: since they would have to do all the initialization work again, clients would need special code to handle that.
If it's not an OpenGL extension, it could eventually be added to the Wayland protocol, or somewhere else; we can find a way, but maybe later. Are there any other questions?
Here. So your question is: when we do the copy from the tiled buffer to the linear buffer, do we have enough information to convert the tiling during the copy, to avoid the extra step, that is, to have the other card understand the tiling directly? Unfortunately, that would need support for this tiling in the card itself, so it's not possible. We need a copy for that; there is no other way.
Other questions? Don't be shy. Thank you. Thank you.