
Defining a multi-architecture interface for SYCL in LLVM Clang

Formal Metadata

Title
Defining a multi-architecture interface for SYCL in LLVM Clang
Number of Parts
542
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We have been working to bring multi-architecture support to the LLVM Clang project using SYCL. Our original approach was to implement a "Plugin Interface" to add support for a PTX back-end; we have subsequently added support for GCN, enabling both NVIDIA and AMD GPUs. This short presentation will outline our approach to designing this multi-architecture back-end and recent work to formalise the interface in the SYCL specification. This work is enabling researchers using the pre-exascale Perlmutter and Polaris supercomputers and the exascale Summit supercomputer to write code in open-standard SYCL and deploy it on these machines.
Transcript: English (auto-generated)
This is a lightning talk on DPC++. Good afternoon. I'm going to be talking about compiler intrinsics in SYCL, specifically in DPC++, Intel's open-source SYCL implementation, which is what I work on. Hopefully I can say something useful without saying too much in ten minutes.

I work for Codeplay. We had the first SYCL implementation, ComputeCpp. We were acquired by Intel, so now we work on the Intel SYCL implementation, DPC++. We have lots of partners, hardware companies and that kind of thing; whoever needs an OpenCL or SYCL implementation and so on comes to us.

SYCL is a single-source heterogeneous programming API: you can write single-source code that can run on NVIDIA, Intel and AMD GPUs. That's great for someone developing scientific applications, being able to write single-source code that runs on whatever GPU the implementation enables, through CUDA, Level Zero for Intel, AMD GPUs and so on. I work specifically on the NVIDIA and HIP (AMD) back-ends for DPC++.

I want to talk a little about compiler intrinsics and how math function calls work in SYCL and DPC++ at the moment, and how we can hopefully improve them by contributing upstream. So what happens to sycl::cos? Essentially, the sycl::cos call in your source code is redirected to __spirv_ocl_cos. You compile to SPIR-V and produce a SPIR-V module; the built-in is a symbol within that module, and its implementation is provided by an OpenCL, Level Zero or Vulkan driver. As I said, I don't work on the SPIR-V back-end at all; I work on the PTX (CUDA) and AMDGPU back-ends. So what do we do with these symbols to get to the native implementations?
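As a rough illustration (my own sketch, not the talk's slide, and with approximate mangling), the device IR at this point contains an unresolved call to the mangled built-in:

```llvm
; hypothetical, reduced device IR for a SYCL kernel calling sycl::cos;
; the symbol is the Itanium mangling of __spirv_ocl_cos(float) and is
; left unresolved until a definition is linked in
declare spir_func float @_Z15__spirv_ocl_cosf(float)

define spir_kernel void @square_wave(ptr addrspace(1) %out, float %x) {
entry:
  %r = call spir_func float @_Z15__spirv_ocl_cosf(float %x)
  store float %r, ptr addrspace(1) %out, align 4
  ret void
}
```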
We're not trying to reinvent the wheel; we're not trying to do anything that the people making the GPUs aren't doing already. We're just trying to redirect to that, and then compile to our PTX module, our AMDGPU (HSA) module and so on. So how do we go from __spirv_ocl_cos to __nv_cosf? Use a shim library, easy-peasy: you just redirect it, compile the shim to bitcode, link it at compilation time, and you get to the native bitcode implementation. We use libclc for this. libclc is written in OpenCL, and OpenCL does lots of stuff that SYCL doesn't expose as easily, like address spaces, so writing it in OpenCL makes our lives really easy.

Before we get into this: why do we want to use a bitcode library in the first place? Why don't we just resolve to some symbol that a runtime then calls, and not care about it? Because on a GPU the overhead of a function call is really high, partly because we lose information about things like address spaces. The GPU memory hierarchy is more complex than a CPU's, so we really need to worry about this: we want to inline everything so we don't lose any information about our memory hierarchies. A bitcode library also allows compile-time branching of code based on the architecture or the back-end. We don't want those checks at runtime; high performance is the name of the game. It also gives us greater optimization opportunities: you can do lots of dead-code elimination and other fun stuff in the middle end, because all these checks happen at the IR level.

So this is roughly what it looks like: we have __spirv_ocl_cos, and we just return __nv_cosf. Great, that's easy.
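A minimal sketch of that shim, written here directly as the LLVM IR a libclc-style definition lowers to (the names and attributes are my approximation, not the actual libclc source):

```llvm
; redirect the SPIR-V built-in to NVIDIA's libdevice implementation;
; marked alwaysinline so no call overhead survives linking
declare float @__nv_cosf(float)

define float @_Z15__spirv_ocl_cosf(float %x) alwaysinline {
entry:
  %r = tail call float @__nv_cosf(float %x)
  ret float %r
}
```

Once the vendor bitcode is linked, the inliner collapses this whole chain into straight-line code inside the kernel.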
And the implementation provided by NVIDIA is in bitcode; we link it, and it's just inlined into our IR, which is great. So we're linking the SYCL code with libclc, then linking that with the vendor-provided bitcode library; linking and linking, we get to the implementation, it's all inlined, it's all great, we love it.

So this works well, but here is a bit of code from libclc. Because we're dealing in OpenCL C, we end up with some funny problems with mangling and that kind of thing. We could have chosen something else and written native IR, but we find OpenCL easier to use and maintain than native IR. Still, it isn't nice: sometimes we need manual mangling, which has to do with namespaces and how they're interpreted by the OpenCL mangler, unfortunately.
Sometimes OpenCL also isn't as good as we want it to be, so we need to write some parts in native IR as well; so it's a mix of LLVM IR and OpenCL, which is a bit messy and not great. We're also exposing some compiler internals here, namely the NVVM reflect pass, which essentially takes your call to __nvvm_reflect and replaces it with a numeric value. This is done entirely at the IR level, so you can branch at the IR level: if this is a newer, higher-compute architecture, use the new implementation or the new built-in; there is a path for older architectures as well. The pass is also used for things like rounding modes. We're exposing this in source code through hacks; it's not really kosher, but it works.
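A sketch of the mechanism (my reconstruction, not libclc source; the dispatch between a fast and a default libdevice routine is a hypothetical example):

```llvm
; NVVMReflect folds the __nvvm_reflect("__CUDA_ARCH") call to a constant
; (ten times the SM version, e.g. 800 for sm_80), so the branch below is
; resolved and the dead arm deleted entirely at the IR level
@.str.arch = private unnamed_addr constant [12 x i8] c"__CUDA_ARCH\00"

declare i32 @__nvvm_reflect(ptr)
declare float @__nv_cosf(float)
declare float @__nv_fast_cosf(float)

define float @cos_dispatch(float %x) {
entry:
  %arch = call i32 @__nvvm_reflect(ptr @.str.arch)
  %newer = icmp sge i32 %arch, 800
  br i1 %newer, label %sm80_or_newer, label %older
sm80_or_newer:
  %a = call float @__nv_fast_cosf(float %x)
  ret float %a
older:
  %b = call float @__nv_cosf(float %x)
  ret float %b
}
```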
But now consider the new proposal to add FP accuracy attributes to math built-ins, where we have, say, an FP built-in for cos and we specify the accuracy, in ULP, that we want it computed to. This is totally lost on us. This is what it would look like: you have an attribute carrying the maximum error, fpbuiltin-max-error. This is really, really needed in SYCL, because SYCL targets lots and lots of different platforms, and all these platforms have different numerical accuracy guarantees. We really need this, but we don't use LLVM intrinsics at all; we need to get to a point where we can start using this compiler infrastructure, because we're not using it as much as we want to.
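A hedged reconstruction of what such a call site could look like (the intrinsic and attribute spellings follow the proposal as presented in the talk and are not in upstream LLVM):

```llvm
; sketch of the proposed FP-accuracy attribute on a math built-in;
; this intrinsic does not exist in upstream LLVM today
declare float @llvm.fpbuiltin.cos.f32(float)

define float @cos_4ulp(float %x) {
entry:
  ; e.g. 4.0 ULP, the OpenCL single-precision requirement for cos
  %r = call float @llvm.fpbuiltin.cos.f32(float %x) #0
  ret float %r
}

attributes #0 = { "fpbuiltin-max-error"="4.0" }
```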
We could handle this with another libclc-style compiler workaround: another pass, with some compiler precision value; if it's set one way, do a precise square root, otherwise do some approximate thing. We could do that. The problem with libclc and this kind of workaround is that it's not upstreamable. It's a collection of hacks; not totally hacks, but it's a little messy, and it's not written in one API: it's OpenCL plus LLVM IR. If we could upstream this instead, we could all benefit from it.

The pro of adding yet another hack to the bunch is that it's easy to do. We can keep going with our libclc implementation; it's pretty straightforward, we've been doing it the whole time, and we don't need to worry about broader LLVM concerns. However, we miss out on LLVM community collaboration, which is why we're here; and how many of these workarounds do we actually need in order to keep up with the latest developments? libclc, as bad as it may be now, would just degenerate into an absolute mess, and we don't want that.
So we think the answer is to redirect to the compiler intrinsics: actually call the intrinsics, and give them some generic behavior for offload targets. This would be used by, say, OpenMP, by CUDA Clang and so on, all these different targets. But we don't have this transformation; we're not comfortable with this connection from an intrinsic to a vendor-provided bitcode built-in. Why is that? Essentially, it needs to happen as early as possible at the IR level, so we would be adding an external dependency to the LLVM pipeline: we would need to link this bitcode library early on. We don't do this today, and we're not comfortable doing it, so we need to figure out an approach people will be happy with. Obviously we're used to things resolving to external symbols, but that's a runtime thing, not a compile-time thing; this needs to be inlined, and we need to do lots of work on it at the IR level.

There will still be cases where we need libclc; it's not going to just disappear from our SYCL implementation. But we need to start pushing towards better resolution, better use of these intrinsics in LLVM for offload in general. Why? Shared infrastructure; staying on the cutting edge of new developments rather than maintaining compiler hacks; and eventually making SYCL compilation work upstream. It doesn't at the moment, but eventually we want it to, of course. We're trying to upstream as much as possible, but libclc is not upstreamable, and that means a discussion about making the intrinsics work for offload.

Time's up, so: we need this link step early in the IR pipeline. That is problematic for some people, but we need to talk about it, so please join the discussion, the thread on NVPTX codegen for llvm.sin and friends, if you have any opinions on this. Sorry, I ran over a little, but are there any questions?

I was wondering, would it make sense to try to get rid of the mess by going to an MLIR type of approach? What are the benefits or downsides of going to MLIR?
So, the question was: are there benefits, can we avoid this, by going to MLIR? I'm not an expert on MLIR, but I think this is more fundamental than MLIR: we need basic resolution of intrinsics, and presumably with MLIR you'll have other MLIR intrinsics that need the same kind of treatment, so the same questions will arise there. This is the first case study, the simplest case. We're not yet trying to implement the new built-ins with the accuracy attribute; we're just trying to decide how to make this dependency on an external bitcode library work, and to do it in a very confined sort of way. Yeah, thank you.

Two questions. The guide on generating PTX from LLVM IR has a whole section about linking with the bitcode library from NVIDIA, so what's the difference with this?
And the second question: you mentioned NVVM, which is the closed-source PTX generator from NVIDIA, and there is also the LLVM NVPTX back-end. Are we reaching speed parity with the closed-source one?

It depends on the application. Taking the second question first, whether there is still a big performance gap between the native NVCC compiler and LLVM Clang: with DPC++, which is a fork of LLVM, we attain roughly comparable performance whether you're using SYCL or using CUDA with NVCC, and any improvements we make to the compiler are shared by Clang CUDA as well. The first question was how this differs from linking the bitcode library as described there: essentially, when you link the bitcode yourself, you're not using any LLVM intrinsics; you're redirecting things yourself, so you need to do everything explicitly. You either need a specific driver path that does it for you, or you need to say explicitly that you want to link this in at this point. So it's more manual; it isn't happening automatically within the compiler, it's happening at link time, LLVM link time.

Alright, thank you, Hugh.