Defining a multi-architecture interface for SYCL in LLVM Clang
Formal Metadata
Title: Defining a multi-architecture interface for SYCL in LLVM Clang
Title of Series: FOSDEM 2023
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/61412 (DOI)
Transcript: English (auto-generated)
00:05
This is a lightning talk on DPC++, or rather SYCL, as far as I'm aware. Yeah, exactly. Good afternoon. I'm going to be talking about compiler intrinsics in SYCL, and in DPC++ specifically.
00:21
DPC++ is Intel's open source SYCL implementation, and it's what I work on. Hopefully I'll be able to say something useful in ten minutes. I work for Codeplay. We had the first SYCL implementation, ComputeCpp. We were acquired by Intel, so now we work on the Intel SYCL implementation, DPC++.
00:45
That's what I work on. We have lots of partners: hardware companies, that kind of thing. Whoever needs an OpenCL implementation, a SYCL implementation, and so on, comes to us. SYCL is a single-source heterogeneous programming API, so you can write single-source code
01:02
that can run on NVIDIA, Intel, and AMD GPUs. It's great for someone developing scientific applications to be able to write single-source code that runs on whatever GPU the implementation enables,
01:23
such as CUDA, Level Zero for Intel, AMD GPUs, and so on. This is a really good thing. I work specifically on the NVIDIA (CUDA) and HIP (AMD) backends for DPC++. So I want to talk a little bit about compiler intrinsics
01:40
and how math function calls work in SYCL and DPC++ at the moment, and how we can hopefully improve them by contributing upstream. So what happens to sycl::cos? Essentially, you have a sycl::cos call in your source code. This is redirected to the SPIR-V builtin __spirv_ocl_cos, and you compile to SPIR-V, producing a SPIR-V module; a minimal sketch of the source side follows.
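As a rough illustration (a minimal sketch, not from the talk's slides; the exact builtin spelling varies with mangling):

    #include <sycl/sycl.hpp>

    int main() {
      sycl::queue q;
      float *out = sycl::malloc_shared<float>(1, q);
      q.single_task([=] {
        // In the device code, sycl::cos resolves to a SPIR-V builtin
        // symbol (__spirv_ocl_cos) inside the resulting SPIR-V module.
        out[0] = sycl::cos(0.5f);
      }).wait();
      sycl::free(out, q);
      return 0;
    }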
02:01
That call is a symbol within the SPIR-V module, and the implementation is then provided by an OpenCL, Level Zero, or Vulkan driver. As I said, I don't work on the SPIR-V backend at all; I work on the PTX (CUDA) and AMDGPU backends. So what do we do with these symbols so that we get to the native implementations?
02:21
We're not trying to reinvent the wheel. We're not trying to do anything that the people who are making the GPUs aren't doing already. We're just trying to redirect to that. So how do we go from this to that and then compile to our PTX module, our AMD GPU module, HSA module, and so on?
02:42
So how do we go from __spirv_ocl_cos to __nv_cosf? We use a shim library, easy peasy, that's fine. You just redirect it: you compile the shim to bitcode, link it at compilation time, and you get to the native bitcode implementation. This is great, and we use libclc for this.
03:01
libclc is written in OpenCL C. OpenCL exposes lots of things that SYCL doesn't expose as easily, like address spaces, that kind of thing, so we write the shims in OpenCL. This makes our lives really easy. But before we get into this: why do we want to use a bitcode library in the first place?
03:21
Why don't we use a .so? Why don't we just resolve to some symbol that a runtime then provides and not care about it? Because on a GPU the overhead of a function call is really high, partly because we lose information about things like address spaces across the call. The GPU memory hierarchy is a bit more complex than a CPU's,
03:40
so we really, really need to worry about this. We want to inline everything so we don't lose any information about our memory hierarchies. We also allow compile-time branching of code based on the architecture, based on the backend, that kind of thing. We don't want to have these checks at runtime. We want high performance. That's the name of the game for what we're doing. This gives us greater optimization opportunities as well.
04:02
You can do lots of dead-code elimination and lots of fun stuff in the middle end, because you're doing all these checks at the IR level. So this is roughly what it looks like: we have __spirv_ocl_cos, and we return __nv_cosf. Great, that's easy; a sketch of such a shim follows.
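A minimal sketch of the shim (simplified; libclc's real code handles overloading and mangling differently):

    // __nv_cosf is NVIDIA's libdevice implementation, shipped as LLVM bitcode.
    extern "C" float __nv_cosf(float);

    // The shim just forwards the SPIR-V builtin to the vendor function.
    // Once libdevice is linked in, the whole chain is inlined into the IR.
    extern "C" float __spirv_ocl_cosf(float x) { return __nv_cosf(x); }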
04:21
The implementation is the one provided by NVIDIA, in bitcode. We link it, and it's simply inlined into our IR. This is great. So we link the SYCL code with libclc, then we link that with the vendor-provided bitcode library.
04:41
Linking and linking again, we get to the implementation; it's all inlined, it's all great, we love it. So this works well, but... here's a bit of code from libclc. Because we're dealing in OpenCL C, we could have chosen something else; we could write native IR. We find that OpenCL is actually easier to use
05:01
and easier to maintain than writing native IR. But we end up with some funny problems with mangling and that kind of thing, which isn't nice. Sometimes we need manual mangling; this has to do with how namespaces are interpreted by the OpenCL mangler, unfortunately. A toy illustration follows.
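A toy illustration of the general idea (not libclc's actual source; _Z15__spirv_ocl_cosf is simply the Itanium mangling of __spirv_ocl_cos(float)):

    // Declared as a normal C++ overload, the shim mangles automatically:
    float __spirv_ocl_cos(float);        // becomes _Z15__spirv_ocl_cosf

    // When the expected spelling differs from what the OpenCL mangler
    // produces (the namespace cases mentioned above), the symbol name
    // ends up spelled out by hand instead:
    extern "C" float _Z15__spirv_ocl_cosf(float);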
05:21
Sometimes, too, OpenCL isn't as good as we want it to be, so we do need to write some pieces in native IR as well. So it's a mix of LLVM IR and libclc OpenCL; it's a bit messy, not great. We're also exposing some compiler internals here. This is the NVVM reflect pass,
05:42
which essentially takes your call to __nvvm_reflect and replaces it with a numeric value. This happens entirely at the IR level, so you can branch at the IR level: if this is a newer architecture, use the new implementation, the new builtin; if it's an older architecture, use the fallback. The pass is also used for things like rounding modes. The general pattern is sketched below.
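Roughly, the source-level pattern looks like this (a sketch: the two cos variants are hypothetical stand-ins, not real libdevice entry points):

    #include <cmath>

    // Resolved to a constant by LLVM's NVVMReflect pass; for "__CUDA_ARCH"
    // it folds to the SM version times ten (e.g. 800 for sm_80).
    extern "C" int __nvvm_reflect(const char *);

    // Hypothetical stand-ins for an arch-specific builtin and a fallback.
    static float cos_new_builtin(float x) { return std::cos(x); }
    static float cos_fallback(float x) { return std::cos(x); }

    float device_cos(float x) {
      // After NVVMReflect folds the condition, dead-code elimination
      // removes the untaken branch entirely in the middle end.
      if (__nvvm_reflect("__CUDA_ARCH") >= 800)
        return cos_new_builtin(x);
      return cos_fallback(x);
    }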
06:03
We're exposing this compiler internal in source code through hacks. It's not exactly kosher, but it works, who cares. Okay, but consider the new proposal to add FP accuracy attributes to math builtins.
06:21
This is where we have, say, an FP builtin for cos, and we specify the accuracy, in ULP, that we want it to be computed to. Right now this is totally lost on us. So this is what it would look like: you have this attribute, the FP max-error attribute. This is really needed in SYCL, because SYCL is targeting lots and lots of different platforms.
06:43
All these platforms have different numerical accuracy guarantees, so we really need this. But we don't use the builtins at all; sorry, we don't use LLVM intrinsics at all. So we need to get to a point where we can start using this compiler infrastructure; we're not using it as much as we want to.
07:02
We could do this using another libclc compiler-hack workaround: add another reflect-style pass. You query some compiler precision value; if precision is requested, do a precise square root, and if not, do some approximate thing. We could do that; a sketch of the idea follows.
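A sketch of what that workaround might look like (everything here is hypothetical: the __compiler_precision_val hook and both sqrt paths are invented for illustration, mirroring the __nvvm_reflect pattern above):

    #include <cmath>

    // Hypothetical hook that a custom reflect-style pass would fold
    // to a constant at the IR level.
    extern "C" int __compiler_precision_val(const char *);

    static float sqrt_precise(float x) { return std::sqrt(x); } // stand-in
    static float sqrt_approx(float x)  { return std::sqrt(x); } // stand-in

    float device_sqrt(float x) {
      // Only one branch survives into the final module.
      if (__compiler_precision_val("sqrt-max-ulp") <= 1)
        return sqrt_precise(x);
      return sqrt_approx(x);
    }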
07:20
The problem with libclc and this approach is that it's not upstreamable. It's a collection of hacks; not entirely hacks, but it's a little messy, and it's not written in a single language: it's OpenCL and it's LLVM IR. We can't upstream it, so we can't all benefit from it. So the pro of
07:41
adding another hack to the bunch is that it's easy to do. We can keep going with our libclc implementation; it's pretty straightforward, we've been doing it the whole time, and we don't need to worry about the broader LLVM concerns. However, we miss out on LLVM community collaboration, which is why we're here.
08:01
And how many of these workarounds do we actually need in order to keep up with the latest developments? libclc, whatever state it's in now, would just degenerate into an absolute mess, and we don't want that. So we think the answer is to redirect differently: to actually have the call lower to the compiler intrinsic.
08:22
We want to use compiler intrinsics and then have some generic lowering of these intrinsics for offload targets. This would be used by, say, OpenMP, by Clang CUDA, and so on, all these different targets. But we don't have this transformation; we're not comfortable with this connection from an intrinsic to a vendor-provided bitcode builtin. The desired flow is sketched below.
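A sketch of the desired flow (aspirational, as described in the talk, not current upstream behaviour; the AMD symbol follows ROCm device-library naming):

    // Portable source: call the builtin. With math errno disabled
    // (-fno-math-errno), this lowers to the llvm.cos.f32 intrinsic in IR,
    // with no shim library involved.
    float device_cos(float x) { return __builtin_cosf(x); }

    // Desired generic offload lowering, early in the per-target IR pipeline:
    //   llvm.cos.f32 -> __nv_cosf       (NVIDIA libdevice bitcode)
    //   llvm.cos.f32 -> __ocml_cos_f32  (AMD ROCm device-library bitcode)
    // with the vendor bitcode linked early enough for full inlining.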
08:44
Why is that? Essentially, this lowering needs to happen as early as possible at the IR level, so we're adding an external dependency to the LLVM pipeline: we need to link this bitcode library early on in the pipeline.
09:02
We don't do this today, and we're not comfortable doing it; we need to figure out an approach that people will be happy with. Obviously we're used to things resolving to external symbols, but that's a run-time mechanism, not a compile-time one. This needs to be inlined, and we need to do lots of work on it at the IR level.
09:23
So there will still be cases where we need libclc, potentially; it's not going to just disappear from our SYCL implementation. But we need to start pushing towards better resolution, better use of these intrinsics in LLVM for offload in general.
09:43
Why? Shared infrastructure, staying on the cutting edge of new developments, fewer compiler hacks, and making SYCL compilation eventually work upstream. It doesn't at the moment, but eventually we want it to, of course. We're trying to upstream as much as possible, but libclc is not upstreamable, and that's a problem.
10:03
So the first step is to have this discussion about making the intrinsics work for offload. Okay, time's up. We need this link step at the IR level, early in the pipeline. That is problematic for some people, but we need to talk about it.
10:20
So please join in the discussion here; this is the "NVPTX codegen for llvm.sin (and friends)" thread, if you have any opinions on this. Sorry, I ran over a little bit, but yeah, any questions? Would it make sense to try to get rid of the mess
10:40
by going to an MLIR type of approach? What are the benefits or downsides of going to MLIR? So, I'm not an expert. The question was: are there benefits, can we avoid this by going to MLIR? I think the issue is more fundamental than MLIR. I'm not an expert on MLIR,
11:02
but I think we need basic resolution of intrinsics first. Presumably with MLIR you'll have other MLIR intrinsics that need the same kind of treatment, so we'd have the same questions there. This is the first case study, the simplest case. We're not yet trying to implement the new FP builtins with the accuracy attributes.
11:20
We're just trying to decide how to make this dependency on an external bitcode library work, and to do it in a very confined sort of way. Thank you. Two questions. The first: in the tutorial on generating NVPTX from LLVM IR, there is a whole section about linking with the bitcode library from NVIDIA.
11:42
So what's the difference from that? And the second question: you mentioned NVVM, which is the closed-source PTX generator from NVIDIA, but there is also the LLVM NVPTX backend. Are we reaching speed parity with the closed-source one?
12:02
It depends on the application. Taking the second question first: is there still a big performance gap between the native NVCC compiler and LLVM Clang? In terms of DPC++, which is a fork of LLVM, we're attaining roughly comparable performance
12:23
whether you're using SYCL or you're using CUDA with NVCC. And any improvements that we make to the compiler are shared by Clang CUDA as well. The first question again was: how is this different from the tutorial's approach of linking bitcode with LLVM?
12:44
Essentially, when you link bitcode that way, you're not using any LLVM intrinsics; you're redirecting things yourself, so you need to do everything explicitly.
13:00
You either need a specific driver path that does this for you, or you need to say explicitly, "I want to link this in at this point." So it's more manual; it's not happening automatically, and not really within the compiler. It's happening at link time, llvm-link time. Cool. All right. Thank you, Hugh.
13:22
Thank you.