
Defining a multi-architecture interface for SYCL in LLVM Clang


Formal Metadata

Title: Defining a multi-architecture interface for SYCL in LLVM Clang
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We have been working to bring multi-architecture support to the LLVM Clang project using SYCL. Our original approach was to implement a "Plugin Interface" to add support for a PTX back-end, and subsequently we have also added support for GCN, enabling NVIDIA and AMD GPUs. This short presentation will outline our approach to designing this multi-architecture back-end and recent work to formalise the interface in the SYCL specification. This work is enabling researchers using the pre-exascale Perlmutter and Polaris supercomputers and the Summit supercomputer to write code using open-standard SYCL and deploy on these machines.
Transcript: English (auto-generated)
It's a lightning talk on DPC++, which is SYCL, as far as I'm aware. Yeah, exactly. Good afternoon. I'm going to be talking about compiler intrinsics in SYCL, in DPC++ specifically.
This is Intel's open-source SYCL implementation. This is what I work on. Hopefully I'll be able to say something without saying too much in ten minutes. Yeah, so Codeplay, I work for Codeplay. We had the first SYCL implementation, ComputeCpp. We were acquired by Intel, so now we work on the Intel SYCL implementation, DPC++.
That's what I work on. We have lots of partners, you know, hardware companies, that kind of thing. Whoever needs an OpenCL implementation, a SYCL implementation, and so on, come to us. Yes, so SYCL is a single-source heterogeneous programming API, so you can write single-source code
that can run on NVIDIA, Intel, and AMD GPUs. Closer to the mic, okay. Voice up. Yeah, so it's great for someone who's developing scientific applications to be able to write single-source code that runs on whatever GPU the implementation enables,
such as CUDA, Level Zero for Intel, AMD GPUs, and so on. Yeah, this is a really good thing. So I work specifically on the NVIDIA and the HIP, the AMD, backends for DPC++.
and how math function calls work in SYCL and DPC++ at the moment and how we can hopefully improve them so that we're contributing upstream. So what happens to SYCL cause? So essentially you get your SYCL cause in your source code. This is redirected to SPIR-V OpenCL cause F. You compile to SPIR-V, you make a SPIR-V module.
This is a symbol within the SPIR-V module, and then the implementation is provided by an OpenCL, Level Zero, or Vulkan driver. Okay, as I said, I don't work on the SPIR-V backend at all. I work on the PTX, the CUDA, or the AMD GPU backend. So what do we do with these symbols so that we get to the native implementations?
We're not trying to reinvent the wheel. We're not trying to do anything that the people who are making the GPUs aren't doing already. We're just trying to redirect to that. So how do we go from this to that and then compile to our PTX module, our AMD GPU module, HSA module, and so on?
So yeah, how do we go from __spirv_ocl_cos to __nv_cosf? So use a shim library, easy peasy, that's fine. Okay, you just redirect it, you compile to bitcode, you link it at compilation time, and you get to this native bitcode implementation. This is great. Okay, so we use libclc for this.
So libclc is written in OpenCL. Okay, OpenCL does lots of stuff that SYCL doesn't expose as easily, like address spaces, that kind of thing. So we write it in OpenCL. This is great. This makes our lives really, really easy. We can do it. So before we get into this, just why do we want to use a bitcode (BC) library in the first place?
Why don't we use a .so? Why don't we just resolve to some symbol that a runtime call then handles, and not care about it? So on a GPU, the overhead of a function call is really high. Okay, it's because we lose information about, say, address spaces, that kind of thing. The GPU memory hierarchy is a bit more complex than, say, for CPUs,
so we really, really need to worry about this. We want to inline everything so we don't lose any information about our memory hierarchies. We also allow compile-time branching of code based on the architecture, based on the backend, that kind of thing. We don't want to have these checks at runtime. We want high performance. That's the name of the game for what we're doing. This gives us greater optimization opportunities as well.
You can do lots of dead-code elimination, lots of fun stuff in the middle end, because you're doing all these checks at the IR level. Okay, so this is just kind of what it looks like. So we just have __spirv_ocl_cosf, and we return __nv_cosf. Great. Amazing. That's so easy.
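A minimal sketch of such a shim (real libclc writes this in OpenCL C with its _CLC_OVERLOAD/_CLC_DEF macros, and the real symbol names are mangled; plain C-style names here for readability):

```cpp
// Shim compiled once to LLVM bitcode and linked at compilation time.
extern "C" float __nv_cosf(float); // NVIDIA's implementation, from libdevice.bc

extern "C" float __spirv_ocl_cosf(float x) {
  return __nv_cosf(x); // fully inlined once libdevice.bc is linked in
}
```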
And then this is the implementation, which is provided by NVIDIA. This is in bitcode. We link this, and then this is just inlined into our IR. This is great. Yeah, so we're linking the SYCL code with libclc. Then we link that with the vendor-provided BC library.
So we're linking, linking. We get to the implementation. It's all inlined. It's all great. We love it. Okay, so this works well, but... So this is a bit of code from libclc. Because we're dealing in OpenCL C, we could choose something else. We could write in native IR. We find that OpenCL is actually easier to use
and easier to maintain than writing native IR. So we end up with some funny kinds of problems with mangling and all this kind of thing. This isn't nice. Sometimes we need manual mangling. This has got to do with namespaces when they're interpreted by the OpenCL mangler, unfortunately.
Yeah, and sometimes OpenCL isn't as good as we want it to be, so we need to actually write in native IR as well. So it's a mix of LLVM IR and libclc OpenCL. It's a bit messy. It's not great. Yeah, so also we're exposing some compiler internals here. This is the NVVM reflect pass,
which essentially takes your function call to __nvvm_reflect and replaces it with a numeric value. This is totally done at the IR level, so you can branch at the IR level: this is a newer architecture, do this new implementation, this new built-in; this is an older architecture, do the old one. This pass is also used for things like rounding modes.
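From source, the reflect machinery is used roughly like this (a sketch: __nvvm_reflect is the real magic function, but the wrapper and implementation names are invented for illustration):

```cpp
// NVVMReflect rewrites the call into a constant at the IR level (for
// "__CUDA_ARCH", ten times the SM version), so the branch below folds
// away and the dead side is eliminated in the middle end.
extern "C" int __nvvm_reflect(const char *);

extern "C" float sm80_impl(float);     // hypothetical newer built-in
extern "C" float fallback_impl(float); // hypothetical older path

extern "C" float my_cos(float x) {
  if (__nvvm_reflect("__CUDA_ARCH") >= 800)
    return sm80_impl(x);
  return fallback_impl(x);
}
```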
We're exposing this in source code through hacks. This isn't really, you know, it's not kosher. But it works, who cares. Okay, but consider the new proposal to add FP accuracy attributes to math built-ins.
This is where we have, say, an FP built-in cos, and we specify the accuracy in ULPs that we want it to be computed to. Okay, this is totally lost on us. Okay, so this is what it would look like. Yeah, you have this attribute here. You have fp max error. This is really, really needed in SYCL, because SYCL is targeting lots and lots of different platforms.
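In IR, the proposal looks roughly like this (my reconstruction of the in-flight RFC; the intrinsic and attribute spellings may differ in the final version):

```llvm
; A math call carrying the required accuracy, in ULPs, as an attribute.
%r = call float @llvm.fpbuiltin.cos.f32(float %x) #0

attributes #0 = { "fpbuiltin-max-error"="4.0" }
```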
All these platforms have different numerical accuracy guarantees. We really, really need this. We don't use built-ins at all. Sorry, we don't use LLVM intrinsics at all. So we need to get to a point where we can start using this compiler infrastructure. We're not using it as much as we want to.
So we could do this using a libclc compiler-hack kind of workaround. We do another, you know, pass. You just say: compiler precision val. If it's that, do some precise square root; if it's not, do some approximate thing. Yeah, we could do that, okay? The problem with libclc and this stuff: it's not upstreamable, okay?
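Such a workaround might look like this (sticking with cos for continuity; the reflect-style magic function is an invented name, not shipped code, and I'm assuming libdevice's fast variant as the low-accuracy path):

```cpp
// Hypothetical: a second reflect-like pass replaces the magic call
// with the precision the user requested, so the branch folds away at
// the IR level exactly as with __nvvm_reflect.
extern "C" int __clc_precision_val(void); // invented name, replaced by a pass
extern "C" float __nv_cosf(float);        // precise libdevice cos
extern "C" float __nv_fast_cosf(float);   // faster, lower-accuracy variant

extern "C" float __spirv_ocl_cosf(float x) {
  if (__clc_precision_val() <= 4) // tight ULP bound requested
    return __nv_cosf(x);
  return __nv_fast_cosf(x);
}
```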
It's a collection of hacks. It's not totally hacks, but it's a little bit messy. It's not written in a single API: it's OpenCL, and it's LLVM IR. It's messy. We can't upstream this, so we can't all benefit from it. Okay, so the pro of
adding another hack to the bunch is that it's easy to do, okay? We can do this, and we can keep going with our libclc implementation. It's pretty straightforward. We've been doing this the whole time. Yeah, fine. We don't need to worry about the broader LLVM concerns. However, we miss out on LLVM community collaboration, which is why we're here.
And then, how many of these workarounds do we actually need in order to keep up with the latest trends? And then libclc, as messy as it can be now, just degenerates into an absolute mess, and we don't want that. Okay, so we think the answer for this is to try and redirect, to try and actually have it calling the compiler intrinsic.
We want to use compiler intrinsics and then have some generic behaviour of these intrinsics for offload targets, okay? And this would be used by, say, OpenMP, by Clang CUDA and so on, all these different targets. But we don't have this transformation. We're not comfortable with this connection, okay, from an intrinsic to a vendor-provided BC built-in, okay?
Why is that? Essentially, this needs to happen as early as possible at the IR level, so we're adding an external dependency in our LLVM kind of, you know, pipeline. We need to link this BC library early on in our pipeline.
We don't do this. We're not comfortable with doing this. We need to figure out a way that people will be happy with us doing this, okay? Obviously, we're used to things resolving to external symbols, but then that's a run-time thing, not a compile-time thing, okay? This needs to be inlined. We need to do lots and lots of stuff with this at the IR level, okay?
So there will still be cases where we need libclc, potentially. It's not going to, you know, just disappear from our SYCL implementation, hopefully. But we need to start pushing towards better kind of resolution, better use of these intrinsics in LLVM for offload in general, okay?
So why? Shared infrastructure; keeping on the cutting edge of new developments; fewer compiler hacks; and we make SYCL compilation eventually work upstream. It doesn't at the moment, but eventually we want it to, of course. We're trying to upstream as much as possible, but libclc is not upstreamable, and that's a problem.
Okay, so the first step, try and have this discussion about making the intrinsics work for offload, okay? So, okay, time's up. So we need to have this link step at the IR level early on in the IR kind of pipeline. That is problematic for some people, but we need to talk about this.
So please join in the discussion here. This is "NVPTX codegen for llvm.sin (and friends)", if you have any opinions on this. Sorry, I kind of ran over a little bit, but yeah, any questions? Would it make sense to try to get rid of the mess
by going to an MLIR type of approach, or what are the benefits or downsides to going to MLIR? So I'm not an expert. So the question was, are there benefits? Can we avoid this by going to MLIR? I think it's more fundamental than MLIR. I'm not an expert on MLIR,
but I think we need basic resolution of intrinsics. Presumably with MLIR you'll have other MLIR intrinsics that will need the same kind of treatment. We'll have the same questions there. So this is the first case study. This is the simplest case. We're not trying to implement the new FP built-ins with the accuracy thing.
We're just trying to decide how we make this dependency on this external BC lib work, and do it in a very, very confined sort of way. Thank you. Two questions. First one: in the tutorial on generating NVVM or PTX from LLVM IR, there is a whole section about linking with the bitcode library from NVIDIA.
So what's the difference between this? And the second question is, you mentioned NVVM, which is the closed-source PTX generator from NVIDIA. There is also the LLVM NVPTX backend. Are we reaching speed parity with the closed-source one?
It depends on the application. We find that with, so the second question first, is there still a big performance gap between the native, say, NVCC compiler and LLVM Clang? So in terms of DPC++, which is a fork of LLVM, we're attaining, say, roughly comparable performance,
whether you're using SYCL or you're using CUDA with NVCC. And then any improvements that we make to the compiler or whatever are shared by Clang CUDA as well. So the first question again was, how is this different from... the tutorial on linking bitcode with LLVM?
So essentially, when you're linking bitcode or whatever, you're not using any LLVM intrinsics. You're just redirecting things yourself. You're not using intrinsics. So you need to do everything explicitly.
You need to either have a specific kind of driver path that will do this for you, or you need to specifically say, I want to link this in at this time or whatever. And so it's more manual. It's not happening automatically. It's not happening really within the compiler. It's happening at link time, LLVM link time. Cool. All right. Thank you, Hugh.
Thank you.