
Utilizing AMD GPUs: Tuning, programming models, and roadmap


Formal Metadata

Title
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Series title
Number of parts
287
Author
Contributors
License
CC Attribution 2.0 Belgium:
You may use, modify, and reproduce, distribute, and make publicly available the work or content, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication year
Language

Content Metadata

Subject area
Genre
Abstract
At FOSDEM 2021 we presented the LUMI supercomputer and discussed AMD's open software platform for GPU-accelerated computing (ROCm), how to port CUDA codes to the Heterogeneous Interface for Portability (HIP), and some performance results obtained on the NVIDIA V100 GPU. In this talk we assume the audience is familiar with that presentation. One year later, we have run many codes on the AMD MI100 GPU, tuned the performance of various codes and benchmarks, and used and tuned several programming models, such as HIP, OpenMP offloading, Kokkos, and hipSYCL, on the MI100, comparing their performance additionally with the NVIDIA V100 and NVIDIA A100 (including CUDA). Furthermore, AMD has released a new open-source tool called GPUFort, which ports Fortran+CUDA and Fortran+OpenACC codes to Fortran+HIP for AMD GPUs. In this talk we present what we learned through this experience, how we tune codes for the MI100 and how we expect to tune them in the future for the LUMI GPU, the AMD MI250X; we compare the aforementioned programming models on some kernels across the GPUs, present a performance comparison for a single-precision benchmark, discuss the updated software roadmap, and give a brief update on the porting workflow.
Transcript: English (automatically generated)
Hello everybody, my name is George Markomanolis. I'm a Lead HPC Scientist at CSC - IT Center for Science, and I'm excited to be here to present some work on utilizing AMD GPUs: tuning, programming models, and roadmap.
Just before I start, let me mention really quickly what LUMI is: the supercomputer that is going to be installed in Finland. For more details, see my last year's presentation; here I'll just say that the motivation for this talk is the AMD GPUs, as LUMI will get almost half an exaflop from AMD Instinct GPUs. The CPU partition has already been installed and is running some pilot programs, it is opening to the public soon, and there is a storage system as well, but let me continue to the next slide.
Now, this is a mock-up of the architecture of the MI100. This is not the GPU of LUMI; this is the GPU that we have access to. You can see here the asynchronous compute engines, the hardware scheduler, and 8 shader engines (1, 2, 3, 4 and another 4), each with 16 compute units, of which 8 in total are disabled, and each compute unit has 64 stream processors. It also has the fabric links and all those things, but I'm not going into more detail; the key number to keep in mind is the 120 compute units.
Really one slide, just an introduction to HIP; you can find more details in last year's presentation. HIP is the Heterogeneous Interface for Portability, developed by AMD, and it can be executed on both platforms, AMD and NVIDIA. Many well-known libraries are supported in HIP, there is minimal overhead, and new projects, or ports from CUDA, can be developed directly in HIP. The supported CUDA API calls get a hip prefix, so cudaMalloc becomes hipMalloc, etc., and you can find HIP at this link.
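To make the prefix renaming concrete, here is a tiny HIP program of my own (an illustration, not code from the talk). The kernel syntax and the launch are the same as in CUDA; only the runtime calls are renamed:

```cpp
// Compile with hipcc; runs on AMD GPUs and, via the CUDA back end, on NVIDIA.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // same kernel syntax as CUDA
  if (i < n) x[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* d_x = nullptr;
  hipMalloc(&d_x, n * sizeof(float));    // cudaMalloc  -> hipMalloc
  hipMemset(d_x, 0, n * sizeof(float));  // cudaMemset  -> hipMemset
  scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);
  hipDeviceSynchronize();                // cudaDeviceSynchronize -> hip...
  hipFree(d_x);                          // cudaFree    -> hipFree
  printf("done\n");
  return 0;
}
```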
Now, a benchmark on matrix multiplication. The code uses cuBLAS and was converted to hipBLAS; about the conversion, again, see last year's presentation. In this small example it's a matrix multiplication of size 2000 in single precision; all the CUDA calls were converted and it was linked against hipBLAS. What you see here on the y-axis is gigaflops: the V100 achieved around 12-13 teraflops in this case, while the MI100 achieved close to 22 teraflops, so the MI100 came closer to its theoretical peak than the V100,
and it was quite straightforward and efficient. In the next benchmark we have an N-body simulation; I give you where I found the code. All the CUDA calls were ported, and we have almost 33,000 particles and 2,000 time steps.
Now, what's interesting here: the y-axis is seconds. The V100 was close to 70 seconds, and when I ported the code and ran it on the MI100, it was close to 95 seconds, something like that. So the MI100 had worse performance than the V100. Then I checked the code and, of course, the ROCm changelog: in ROCm 4.1 they decided to use 1000 threads per block by default instead of 256. So I changed this back manually to 256 threads per block.
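This is roughly what pinning the launch configuration looks like in HIP; the kernel name, body, and arguments are hypothetical stand-ins for the benchmark's own code, not the actual N-body source:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical stand-in for the benchmark's force kernel.
__global__ void bodyForce(float4* pos, float4* vel, float dt, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) pos[i].x += vel[i].x * dt;  // placeholder work
}

void launchTuned(float4* d_pos, float4* d_vel, float dt, int nBodies) {
  const int threadsPerBlock = 256;  // the tuned value, not the new default
  const int blocks = (nBodies + threadsPerBlock - 1) / threadsPerBlock;
  bodyForce<<<blocks, threadsPerBlock>>>(d_pos, d_vel, dt, nBodies);
  hipDeviceSynchronize();
}
```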
With that change the MI100 achieved faster performance than the V100, at around 50-something seconds. It's not a huge speed-up, but it shows that if we only compare default configurations, without knowing how to tune, we might say, oh, the MI100 is worse than the V100, and so on.
So, some tuning is required. BabelStream is a memory-bandwidth benchmark from the University of Bristol; it's well known for benchmarking memory. It has five kernels (add, multiply, copy, triad, and dot), and you can see the formulas on the slide. Dot is interesting because it is basically a reduction, and triad performs similarly to the other kernels; I will mention some results later. These are the benchmarks I mainly use in some cases.
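For reference, here are the five kernels as plain loops, in my paraphrase of the standard definitions (not the benchmark source):

```cpp
// a, b, c are arrays of length N; `scalar` is a constant.
void babelstream_kernels(int N, double scalar,
                         double* a, double* b, double* c, double& sum) {
  for (int i = 0; i < N; i++) c[i] = a[i];                  // copy
  for (int i = 0; i < N; i++) b[i] = scalar * c[i];         // mul
  for (int i = 0; i < N; i++) c[i] = a[i] + b[i];           // add
  for (int i = 0; i < N; i++) a[i] = b[i] + scalar * c[i];  // triad
  sum = 0.0;
  for (int i = 0; i < N; i++) sum += a[i] * b[i];           // dot (a reduction)
}
```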
Now, about improving the OpenMP offloading performance of BabelStream on the MI100. The original code uses omp target teams distribute parallel for simd: target to reach the GPU, teams to create the teams, distribute to distribute the workload, and parallel for to make it parallel. Now, the simd clause: if you use AMD's LLVM compiler, up to at least LLVM 12, it doesn't do anything; whether you have it or not doesn't matter. But with the Cray compiler it's mandatory: you will see a warning that the loop will not be parallelized correctly, so you need to keep the simd. Now, since I know I want 256 threads per block, and twice the number of compute units as teams in order to cover the hardware, I add thread_limit(256) and num_teams(240). For the dot kernel it was trial and error: we achieved the best performance with 720 teams.
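Here is a hedged reconstruction of what those tuned directives look like, in the style of the BabelStream triad and dot kernels rather than the exact source:

```cpp
// Triad: 256 threads per block, 240 teams (2 x 120 compute units on MI100).
#pragma omp target teams distribute parallel for simd \
        thread_limit(256) num_teams(240)
for (int i = 0; i < N; i++)
  a[i] = b[i] + scalar * c[i];

// Dot: a reduction; 720 teams was found best by trial and error.
#pragma omp target teams distribute parallel for simd \
        thread_limit(256) num_teams(720) map(tofrom: sum) reduction(+: sum)
for (int i = 0; i < N; i++)
  sum += a[i] * b[i];
```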
So, this is how you can manually tune the performance of OpenMP offloading on the MI100, usually per kernel. Now, let's discuss mixbench. It's a benchmark tool to evaluate the performance bounds of GPUs
on mixed operational intensity kernels. The executed kernel is customized over a range of operational intensity values; the supported programming models are CUDA, HIP, OpenCL, and SYCL, and we use CUDA and HIP. We run three types of experiments, combined with global memory accesses: single precision, double precision, and half precision, with multiply and addition, and the results are presented for peak performance only. The source of the benchmark is on GitHub. So, the y-axis is gigaflops, and we see here the NVIDIA V100:
double precision, single precision, and half precision. From double precision at around 7.5 teraflops, single precision increases to roughly double, 14-15 teraflops, and then at half precision it goes a bit down, to around 10-11 teraflops. So, what happens next? We compare the A100, which is close to 9.5 teraflops in double precision, goes to more than 19 for single precision,
and, if I remember correctly, close to 54-55 for half precision: a significant improvement at half, and of course for the other precisions too. Then we compare with the MI100, which, I repeat, is one generation before the LUMI GPU. What we see here: double precision increases to just above 10 teraflops, and single precision is close to 22 teraflops, again better than the A100, but half precision is lower, close to 43-44 teraflops, so lower than the A100. Half-precision operations did not perform as well compared to the A100. Now, programming models: we have used with success at least the following programming models
on the MI100: HIP, of course, OpenMP offloading, hipSYCL, Kokkos, and Alpaka. I have run benchmarks with all of them, and I will present results for all except Alpaka. Now, let's talk about SYCL. We're using hipSYCL to target the AMD GPUs.
SYCL is a C++ single-source heterogeneous programming model for acceleration offload, with generic programming via templates and lambda functions. It has big momentum lately; for example, NERSC, Argonne, and Codeplay have a contract to support it on upcoming systems.
The SYCL 2020 specification was announced last year. There is a lot of terminology: unified shared memory; buffers, with accessors to access the buffers and drive data movement; queues, to send requests to the GPUs; etc. hipSYCL supports CPUs, AMD and NVIDIA GPUs, and Intel GPUs experimentally. Now, something important to mention: the NVIDIA GPU support in the framework has changed, so code that is written with hipSYCL can be executed on an NVIDIA system
without having HIP installed. So, despite what the name suggests, hipSYCL doesn't really have a dependency on the HIP framework anymore; that's just how the name started. What does this mean? On the EuroHPC systems, something you wrote on LUMI with hipSYCL to run on AMD GPUs will run on Leonardo without really having to request any software installation from Leonardo's support. Of course, this depends on external libraries and so on, but not on SYCL itself.
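Here is a minimal SYCL sketch of my own (an illustration, not code from the talk) showing the queue/buffer/accessor terminology on a triad kernel; it compiles with hipSYCL (syclcc) as well as other SYCL implementations:

```cpp
#include <CL/sycl.hpp>
#include <vector>
namespace sycl = cl::sycl;

int main() {
  const size_t n = 1 << 20;
  const float scalar = 0.4f;
  std::vector<float> a(n, 0.0f), b(n, 1.0f), c(n, 2.0f);

  sycl::queue q;  // requests go to the default device (a GPU if available)
  {
    sycl::buffer<float> A(a.data(), sycl::range<1>(n));
    sycl::buffer<float> B(b.data(), sycl::range<1>(n));
    sycl::buffer<float> C(c.data(), sycl::range<1>(n));
    q.submit([&](sycl::handler& h) {
      auto ra = A.get_access<sycl::access::mode::write>(h);
      auto rb = B.get_access<sycl::access::mode::read>(h);
      auto rc = C.get_access<sycl::access::mode::read>(h);
      h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        ra[i] = rb[i] + scalar * rc[i];  // triad
      });
    });
  }  // buffers go out of scope: results synchronize back to the host vectors
  return 0;
}
```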
Kokkos implements a programming model in C++ for writing performance-portable applications targeting major HPC platforms. It provides abstractions for both the parallel execution of code and data management. The funding comes from the ECP and the NNSA. There is a lot of terminology: views; execution spaces, such as serial, threads, OpenMP, GPU, etc.;
memory spaces, such as DRAM, NVRAM, and others; patterns, such as the parallel_for pattern, the reduction, the scan, etc.; and execution policies, with static or dynamic scheduling, thread teams, and so on. It supports CPUs, AMD and NVIDIA GPUs, Intel KNL, etc.
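Again a minimal sketch of my own to ground the terminology (views, a parallel_for pattern, and a parallel_reduce), not code from the talk:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    const double scalar = 0.4;
    // Views allocate in the default execution space's memory
    // (GPU memory when built with the HIP back end).
    Kokkos::View<double*> a("a", n), b("b", n), c("c", n);

    Kokkos::parallel_for("triad", n, KOKKOS_LAMBDA(const int i) {
      a(i) = b(i) + scalar * c(i);
    });

    double sum = 0.0;  // dot product expressed as a reduction pattern
    Kokkos::parallel_reduce("dot", n,
        KOKKOS_LAMBDA(const int i, double& local) { local += a(i) * b(i); },
        sum);
    printf("dot = %f\n", sum);
  }
  Kokkos::finalize();
  return 0;
}
```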
The Kokkos link is well known, with many tutorials online. So, let's discuss Alpaka. It's an abstraction library for parallel kernel acceleration: a header-only C++14 abstraction library for accelerator development, developed by HZDR in Germany. The terminology is similar to CUDA (grid, block, thread), plus an element level that goes, let's say, even deeper than a thread, so you can do vectorization; you can control everything. The platform is decided at compile time, so it is a single-source interface. It's easy to port CUDA codes through cupla; the interface is really similar, quite easy I would say. Some terminology here as well: queues, blocking or non-blocking; buffers; and work division, where you define a few things. It supports HIP, CUDA, TBB, OpenMP on CPU and GPU, etc.,
and there is a GitHub link. We have also collaborated with the developers and ported BabelStream to Alpaka, but those results are in a paper still to be presented. Now, let's discuss the BabelStream results. Here, the left three groups of bars
are for the triad kernel, and the right three are for the dot kernel. The colors are CUDA or HIP (depending on the device), hipSYCL, Kokkos, and OpenMP offloading; the platforms are the V100, A100, and AMD MI100. What you see is that, for NVIDIA, the V100 is close to 800 gigabytes per second, the A100 is close to 1.4 terabytes per second, and the MI100 is close to 1 terabyte per second, a bit less.
I would say that the A100 is really close to its peak, and similarly the V100. Now, I should mention that I used the Kokkos version from BabelStream, and it's not really optimized for the GPU; I would have had to change it to a two-level reduction for the dot kernel. So, Kokkos here is not bad, but it's not optimized,
whereas the others I could tune manually. For the V100, we see that the reduction performs similarly to the triad, except for Kokkos, but Kokkos is not optimized. For the A100 we see the same again, with hipSYCL and Kokkos a bit lower. For AMD we see the same trend, hipSYCL and Kokkos a bit lower, but OpenMP drops to quite low performance. This is a known issue with OpenMP offloading: the reduction is not that efficient. We keep watching the performance over time, across the new versions, but the reduction is something OpenMP does not yet handle efficiently.
So, if your code has a lot of reductions, maybe OpenMP is not the right approach. Now, the AMD Instinct MI250X. Here we are with the LUMI GPU card,
and there are many similarities and differences. It has two graphics compute dies (GCDs); I will show in the next slide what I mean by that. Each GCD has 64 gigabytes of HBM2e memory, 128 gigabytes in total, so we have something like two GPUs in one GPU. The peak performance is 26.5 teraflops per GCD, which means the peak performance per device, if I may put it like that, is around 53 teraflops.
The memory bandwidth is 1.6 terabytes per second, again per GCD, so per GPU it's 3.2 terabytes per second. We have 110 compute units per GCD, 220 in total, and the GCDs are interconnected at 200 gigabytes per second per direction, so there is quite high bandwidth between them. Also, the network interconnect is attached to the GPU, not to the CPU. So, on LUMI, if you do ping-pong tests from the CPU, the traffic will go through the GPU, and you may have to pay some overhead; I have not tested this, we don't have access to these devices yet, so this is an assumption. What you have to understand is that on LUMI the interconnect is on the GPU, so your MPI send buffers and so on should be allocated in GPU memory to be more efficient.
Now, the MI250X; this is again a mock-up designed by myself. What I want to show you: there are now four shader engines per die (one, two, three, four), where before the MI250X there were eight. One GCD is this thing here; the second is this one. Each GCD has 110 compute units, and there is the 200-gigabytes-per-second-per-direction interconnect between them. There is Infinity Fabric around them that goes out of the GPU, and there are memory controllers for the HBM2e memory. There are two of these dies, and this is one GPU: you don't see two GPUs here, it's one. Now, someone could say this is two GPUs, but in terms of terminology it's one device with, let's say, two GPUs inside, if you want to put it like that; GCD is the correct term, I would say. So, how do you use this device, and what do we miss here?
I mean, if I have one MPI process, what happens? How do you use these things? Utilize Cray MPICH with GPU support, and you have to export an environment variable to enable the GPU support. I'm not sure whether this will change, so don't take it as a given. Use one MPI process per GCD, so two MPI processes per GPU, and eight MPI processes per node if you plan to utilize all four GPUs. What do I mean? If your program is a single MPI process, okay, there is no other solution: you will use one GCD, and with one GCD you have 110 compute units and 1.6 terabytes per second of peak bandwidth, not 3.2. That's why you use two MPI processes per GPU, one per GCD, in order to use all the bandwidth and all the compute; see the sketch below.
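A hedged sketch (my illustration, not LUMI documentation) of how each MPI rank can pick its own GCD; every GCD appears as a separate HIP device, so binding one rank per device gives one rank per GCD:

```cpp
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // Ranks sharing a node get consecutive local ranks 0..k-1.
  MPI_Comm local;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &local);
  int localRank = 0;
  MPI_Comm_rank(local, &localRank);

  int deviceCount = 0;
  hipGetDeviceCount(&deviceCount);        // e.g. 8 GCDs on a 4-GPU node
  hipSetDevice(localRank % deviceCount);  // one rank drives one GCD

  printf("local rank %d -> HIP device %d of %d\n",
         localRank, localRank % deviceCount, deviceCount);

  MPI_Comm_free(&local);
  MPI_Finalize();
  return 0;
}
```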
And if you want to use more GPUs on the system, you can go from two MPI processes to four, six, or eight; usually six doesn't sound right, but one to four GPUs is typical, and it depends on your application, of course. The MI250X can have multiple contexts on the same GPU, so it supports many MPI processes per GPU by default; this is what on NVIDIA we would call CUDA MPS. It's active by default, so you can run many MPI processes on the same GPU, but be careful about contention and all these things. It's not a free ride, okay? So, you have to be careful with this.
Now, if the application requires a different number of MPI processes, I suppose you will be free to use that; we have not tested it yet, but we hope it works efficiently as it is. Now, OpenACC. GCC will provide OpenACC through the Mentor Graphics contract (now Siemens EDA), but it is focused on functionality. What does that mean? The performance will not really be comparable to the other compilers. It supports OpenACC version 2.6, maybe 2.7, for Fortran. This is quite an old OpenACC version, but maybe it's enough,
and they have announced that they will not support OpenACC for C++. What does that mean? If you have C++, don't use OpenACC for LUMI; that's the general message. Now, there's Clacc from ORNL. This is OpenACC for LLVM, currently only for C,
with Fortran and C++ probably coming in the future. It translates OpenACC code to OpenMP offloading, and if the code is in Fortran we could also use GPUFort, which I will mention later. So here is Clacc from ORNL: I use the clang that I got from ORNL on a Jacobi file, with some options that don't matter for the moment, to convert the OpenACC and print the resulting OpenMP. The original code is this one; you see acc parallel loop, reduction, private, etc. The new code, where I did nothing except run this clang, uses omp target teams with map(tofrom:), firstprivate, reduction, and distribute, although it misses a parallel for here, plus the loop.
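Here is a hedged reconstruction of that kind of rewrite for one Jacobi update loop; this is my paraphrase, not Clacc's literal output (note the missing parallel for that I just mentioned):

```cpp
double jacobi_step(const double* A, double* Anew, int n) {
  double err = 0.0;
  // Original OpenACC: #pragma acc parallel loop reduction(max:err)
  #pragma omp target teams distribute map(to: A[0:n*n]) \
          map(from: Anew[0:n*n]) map(tofrom: err) firstprivate(n) \
          reduction(max: err)
  for (int i = 1; i < n - 1; i++)
    for (int j = 1; j < n - 1; j++) {
      Anew[i*n + j] = 0.25 * (A[i*n + (j+1)] + A[i*n + (j-1)]
                            + A[(i-1)*n + j] + A[(i+1)*n + j]);
      double d = Anew[i*n + j] - A[i*n + j];
      if (d < 0.0) d = -d;       // |Anew - A|
      if (d > err) err = d;      // body of the max reduction
    }
  return err;
}
```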
So, it converted it. Maybe it's not the most efficient, but it's still under development. Now, I want to show you some results; I'm sorry, but this is for the V100 only.
What I want to show you here is OpenACC from PGI (still called PGI at the time), Clang 12, and BabelStream with the five classic kernels, plus OpenACC with GCC 10 and with OG10. OG10 is the development branch of version 10, with the latest OpenACC/OpenMP development work and other improvements, and you can see how OpenACC improves when I use the latest GCC from OG10; the crucial part was the GCC support. The dot performance in particular was significantly improved, more than all the other kernels, through this newer GCC. So, GCC first did functionality only; now they are making some performance improvements, but you can see they are still behind the other compilers.
Now, GPUFort is a new tool from AMD with which you can basically convert CUDA Fortran codes or OpenACC Fortran codes to Fortran plus OpenMP offloading or HIP. Some things that you would otherwise have to do manually for Fortran it does automatically for you: it creates the interfaces to call the kernels from C++ files, et cetera. It's quite a complicated workflow, but it works for some examples, and it's still under development; it's developed, as you can see here, by Dominic and colleagues at AMD. Afterwards you can use AMD's OpenMP compiler for the OpenMP
offloading and all these situations. Just to show you a simple example: on the left here I have a code with OpenACC; you see in the box the main OpenACC part, and it creates the right-hand part, with an ifdef around the original file.
What it says is: ifdef GPUFORT, then start calling some GPUFort routines, the ACC initialization, copies and other aspects, and how to launch the kernel and so on; the else branch is the original code, exactly the same thing. So, this path is used only if you build with GPUFort. But it's not only this.
Automatically, it also creates this part, the extern C routine that includes the launcher for the kernel. It has many, many options that maybe you don't use in the original code, but it defines them either way, and here is the kernel. So, all of this code was created from the code on the previous slide,
and here you can see how the kernel was created automatically; all of this is automatic. Now, I showed you the porting diagram last year,
and here is what has changed. If you have OpenACC, you have Cray for Fortran, Clacc or Flacc if they reach production, and GCC; and if the code is Fortran, you can use GPUFort. Basically, as I also said already, HPE supports OpenACC only for Fortran, and the Clacc and Flacc OpenACC compilers are research projects; we have no contract with them, so I don't know what will be released. GCC still lacks performance, but they're still in the game here.
If you use GPUFort and the performance is good, perfect; if not, you can profile and tune the OpenMP and HIP calls, improve data transfers, and see what else you can achieve. So, tuning. Use multiple wavefronts per compute unit; it's important for hiding latency and for instruction throughput. Tune the number of threads per block and, as I said, the number of teams for OpenMP offloading; other programming models support similar controls. Memory coalescing increases bandwidth; unrolling loops allows the compiler to prefetch data; very small kernels can cause launch-latency overhead, depending on the workload. Use the local data share (LDS), a small memory that is really close to the compute units and has really high bandwidth; see the sketch below. And profile, which can be a bit difficult without proper tools.
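As an illustration of the LDS point, here is a hedged sketch of a dot-product kernel in HIP (my own, not from the talk): each block stages partial sums in LDS, declared with __shared__, and writes one partial result for the host or a second kernel to finish:

```cpp
#include <hip/hip_runtime.h>

__global__ void dotKernel(const float* a, const float* b,
                          float* partial, int n) {
  __shared__ float lds[256];  // lives in the compute unit's LDS
  const int tid = threadIdx.x;
  float sum = 0.0f;
  // Grid-strided loop keeps the global loads coalesced.
  for (int i = blockIdx.x * blockDim.x + tid; i < n;
       i += gridDim.x * blockDim.x)
    sum += a[i] * b[i];
  lds[tid] = sum;
  __syncthreads();
  // Tree reduction within the block, entirely in LDS.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) lds[tid] += lds[tid + s];
    __syncthreads();
  }
  if (tid == 0) partial[blockIdx.x] = lds[0];
}
```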
Conclusions and future work. A code written in C++ with MPI plus OpenMP is a bit easier to port to OpenMP offloading compared to other approaches. hipSYCL, Kokkos, and Alpaka could be good options, considering that the code is in C++. There can be challenges, depending on the code, on which GPU functionalities are integrated into an application, and on how new they are. You will be required to tune the code for high occupancy and to track performance across new compiler versions; this is what I'm doing, and you also have to do it whenever you use a new compiler, because it can be worse for some things. The same goes for OpenACC and OpenMP offloading. For AMD GPUs, the tools can be tricky to install and I have hit many issues; I will keep tracking how the profiling tools work on AMD GPUs and will try to test rocprof, TAU, Score-P, and HPCToolkit. Also, we have an accepted paper, "Evaluating GPU Programming Models for the LUMI Supercomputer"; it will be presented at Supercomputing Asia in March and will show more results, with Alpaka as well.
So that's it for me. Thanks a lot for any questions. Thank you.
Less than a minute for live Q&A. Maybe very quickly, the question by Chris: do you have any experience of the time and effort involved in porting existing codes to AMD GPUs? So, basically, it totally depends on the programming model used. For example, I had a case where I ported GROMACS for testing, which is a huge code. It took me almost one and a half days while I didn't know the errors exactly, but once I knew the procedure, in less than five hours I was able to port it and have the binary; not the performance, but the binary.
Okay, yeah.