Defining a multi-architecture interface for SYCL in LLVM Clang
Formal Metadata

Title: Defining a multi-architecture interface for SYCL in LLVM Clang
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61407 (DOI)
FOSDEM 2023, 368 / 542
Transcript: English (auto-generated)
00:05
It's a lightning talk on DPC++, which comes from SYCL, as far as I'm aware. Yeah, exactly. Okay, good afternoon. So I'm going to be talking about compiler intrinsics in SYCL, in DPC++ specifically. This is Intel's open
00:22
source SYCL implementation; this is what I work on. So hopefully I'll be able to say something useful in ten minutes. I work for Codeplay. We had the first SYCL implementation, ComputeCpp. We were acquired by Intel, so now we work on the Intel SYCL
00:44
implementation, DPC++. That's what I work on. We have lots of partners, hardware companies, that kind of thing; whoever needs an OpenCL implementation, a SYCL implementation, and so on comes to us. So SYCL is a single-source heterogeneous programming API: you can write single-source
01:02
code that can run on NVIDIA, Intel, and AMD GPUs. It's great for someone developing scientific applications to be able to write single-source code that runs on whatever GPU the
01:22
implementation enables, such as CUDA for NVIDIA, Level Zero for Intel, AMD GPUs, and so on. This is a really good thing. I work specifically on the NVIDIA (CUDA) and HIP (AMD) backends for DPC++. So I want to talk a little bit about compiler intrinsics, how math
01:42
function calls work in SYCL and DPC++ at the moment, and how we can hopefully improve them so that we're contributing upstream. So what happens to sycl::cos? Essentially, you have your sycl::cos call in your source code. This is redirected to __spirv_ocl_cosf. You compile to SPIR-V and make a SPIR-V module; this is a symbol within the SPIR-V module, and then
02:03
the implementation is provided by an OpenCL, Level Zero, or Vulkan driver. As I said, I don't work on the SPIR-V backend at all; I work on the PTX (CUDA) and AMDGPU backends. So what do we do with these symbols so that we get to the native implementations? We're not trying to
02:22
reinvent the wheel; we're not trying to do anything that the people who make the GPUs aren't doing already, we're just trying to redirect to that. So how do we go from this to that, and then compile to our PTX module, our AMDGPU module, our HSA module, and so on? So, how do we go from __spirv_ocl_cosf
02:44
to __nv_cosf? Use a shim library, easy. You just redirect it, compile it to bitcode, link it at compilation time, and you get to the native bitcode implementation. This is great. So we use libclc for this. libclc is written in OpenCL. OpenCL does
03:04
lots of stuff that SYCL doesn't expose as easily, like address spaces, that kind of thing. So we write it in OpenCL; this makes our lives really easy. But before we get into this: why do we want to use a bitcode library in the first place? Why don't we use a .so?
03:22
Why don't we just resolve to some symbol that a runtime call then handles, so we don't have to care about it? Because on a GPU the overhead of a function call is really high: we lose information about, say, address spaces, and the GPU memory hierarchy is a bit more complex than a CPU's, so we really need to worry about this. We want to
03:42
inline everything so we don't lose any information about our memory hierarchies. We also allow compile-time branching of code based on the architecture, based on the backend, that kind of thing. We don't want these checks at runtime; we want high performance, that's the name of the game. This gives us greater optimization opportunities as
04:02
well. You can do lots of dead-code elimination, lots of fun stuff in the middle end, because you're doing all these checks at the IR level. So this is roughly what it looks like: we have __spirv_ocl_cosf, and we return __nv_cosf. Great, that's easy. And then there's the implementation, which is provided by
04:24
NVIDIA. This is in bitcode; we link it, and then it's just inlined into our IR, which is great. So we're linking the SYCL code with libclc, then we link that with the vendor-provided BC library, so we're
04:41
linking and linking until we get to the implementation; it's all inlined, it's all great, we love it. So this works well, but here is a bit of code from libclc. Because we're dealing in OpenCL C, we could have chosen something else, we could have written native IR, but we find that OpenCL is actually easier to use
05:02
and easier to maintain than writing native IR. Still, we end up with some funny problems with mangling and that kind of thing. This isn't nice: sometimes we need manual mangling, which has to do with namespaces being misinterpreted by the OpenCL mangling, unfortunately.
05:21
Sometimes, as well, OpenCL isn't as good as we want it to be, so we need to write native IR too; it ends up as a mix of LLVM IR and libclc OpenCL, which is a bit messy, not great. We're also exposing some compiler internals here. This is the NVVM reflect pass, which essentially takes your function call to __nvvm_reflect,
05:45
replaces it with a numeric value. This is done entirely at the IR level, so you can branch at the IR level: if this is a newer, higher-compute architecture, use this new implementation, this new built-in; there is a path for older architectures as well. For things like
06:01
rounding modes this pass is also used. We're exposing this in source code through hacks; it's not really kosher, but it works. Okay, but consider the new proposal to add FP accuracy attributes to math built-ins. This is where we have, say, an FP built-in
06:23
cos, and we specify the accuracy in ULP that we want it computed to. At the moment this is totally lost on us. This is what it would look like: you have this attribute, fp-max-error. This is really needed in SYCL, because SYCL targets lots and lots of different platforms, and all
06:43
these platforms have different numerical accuracy guarantees, so we really need this. But we don't use LLVM intrinsics at all, so we need to get to a point where we can start using this compiler infrastructure; we're not using it as much as we want to. So we
07:02
could do this using a libclc compiler-hack workaround: add another pass, check some compiler precision value, and if it's set do a precise square root, otherwise do some approximate thing. We could do that. The problem with libclc and this stuff is that it's not upstreamable; it's a
07:21
collection of hacks. Not totally hacks, but it's a little bit messy: it's not written in a single API, it's OpenCL C plus LLVM IR. If we could upstream this, we could all benefit from it. So the pro of adding another hack
07:44
to the bunch is that it's easy to do: we can keep going with our libclc implementation, it's pretty straightforward, we've been doing it the whole time, and we don't need to worry about the broader LLVM concerns. However, we miss out on LLVM community collaboration, which is why we're here. And how many of these workarounds do we
08:03
actually need in order to keep up with the latest developments? libclc, as messy as it can be now, would just degenerate into an absolute mess, and we don't want that. So we think the answer is to try to redirect, to actually have the code calling the compiler intrinsic. We
08:22
want to use compiler intrinsics, and then have some generic behavior of these intrinsics for offload targets. This would be used by, say, OpenMP, by CUDA Clang, and so on, all these different targets. But we don't have this transformation; we're not comfortable with this connection from an intrinsic to a vendor-provided BC built-in. Why is that?
08:45
Essentially, this needs to happen as early as possible at the IR level, so we're adding an external dependency to the LLVM pipeline: we need to link this BC library early on in our
09:01
pipeline. We don't do this today; we're not comfortable with doing it, and we need to figure out a way that people will be happy with. Obviously we're used to things resolving to external symbols, but that's a runtime thing, not a compile-time thing. This needs to be inlined; we need to do lots and lots of work on it at the IR level.
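As a rough sketch of the transformation under discussion (the libdevice function name is an assumption for illustration, not the actual patch), a target-independent intrinsic would be resolved to the vendor's bitcode implementation early at the IR level, after which it can be inlined:

```llvm
; Before: a portable intrinsic call in the offload module
%y = call float @llvm.cos.f32(float %x)

; After linking the vendor BC library (e.g. libdevice) early in the
; pipeline, the call resolves to the vendor's bitcode definition,
; which the inliner can then fold into the kernel body:
%y = call float @__nv_cosf(float %x)
```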
09:23
So there will still be cases where we need libclc; it's hopefully not just going to disappear from our SYCL implementation. But we need to start pushing towards better resolution, better use of these intrinsics in LLVM for offload in general. So why? Shared
09:44
infrastructure; staying on the cutting edge of new developments; fewer compiler hacks; and we make SYCL compilation eventually work upstream. It doesn't at the moment, but eventually we want it to, of course. We're trying to upstream as much as possible, but libclc is not upstreamable, and that's a
10:05
discussion about making the intrinsics work for offload. So, time's up: we need to have this link step at the IR level, early on in the IR pipeline. That is problematic for some people, but we need to talk about it, so please join in the discussion; this concerns NVPTX codegen for LLVM
10:26
and friends. If you have any opinions on this, please weigh in. Sorry, I ran over a little bit, but are there any questions? I was wondering, would it make sense to try to get rid of the mess by going
10:41
to an MLIR type of approach? What are the benefits or downsides of going to MLIR? So, I'm not an expert; the question was, are there benefits, can we avoid this by going to MLIR? I think it's more fundamental than MLIR. I'm not an expert on MLIR, but I think we need basic resolution of intrinsics.
11:05
Presumably with MLIR you'll have other MLIR intrinsics that will need the same kind of treatment, and we'll have the same questions there. So this is the first case study, the simplest case. We're not trying to implement the new built-ins with the accuracy attribute yet; we're just
11:21
trying to decide how we make this dependency on an external BC library work, and do it in a very confined sort of way. Thank you. Two questions. For PTX from LLVM IR, there is a whole section about linking with the bitcode library from NVIDIA, so what's the difference with this?
11:44
And the second question is: you mentioned NVVM, which is the closed-source PTX generator from NVIDIA, and there is also the LLVM NVPTX backend. Are we reaching speed parity with the closed-source one?
12:01
It depends on the application. Taking the second question first, is there still a big performance gap between the native NVCC compiler and LLVM Clang: in terms of DPC++, which is a fork of LLVM, we're attaining roughly comparable performance
12:23
whether you're using SYCL or you're using CUDA with NVCC, and any improvements we make to the compiler are shared by Clang CUDA as well. The first question again was how this is different from
12:44
linking the bitcode yourself. So essentially, when you're linking bitcode like that, you're not using any LLVM intrinsics; you're just redirecting things yourself. Since you're not using intrinsics, you need to do everything explicitly: you
13:00
need to either have a specific driver path that will do this for you, or you need to specifically say, I want to link this in at this time. So it's more manual; it's not happening automatically, it's not really happening within the compiler, it's happening at link time, LLVM link time.
13:20
Alright, thank you, Hugh.