Enable AVX-512 instructions in Valgrind
Formal Metadata
Number of Parts | 287
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/57127 (DOI)
Transcript: English (auto-generated)
00:05
Hello, my name is Tatiana Volnina, used to be Mineeva. I work for Intel on enabling AVX-512 assembly instructions in Valgrind. This work, started in 2017, has been on a few hiatuses, but finally seems to be getting close to completion.
00:21
So, I will briefly talk about the AVX-512 assembly instructions, describe how they have been implemented in Valgrind, give a summary, and leave 15 minutes for questions and answers. So, what are AVX-512 instructions and why do we need them in Valgrind? AVX-512 is a set of vector assembly instructions, that is, instructions that operate on multiple values at once, and they operate on 512 bits of data.
00:49
At the moment, if Valgrind encounters one of these instructions in user code or in called library code, it stops completely. These instructions are an expansion of the AVX2 instruction set: AVX-512 provides twice as many
01:07
vector registers, each twice as long, and it introduces a few new features. For example, instruction masking. In AVX-512, we no longer have to execute a vector
01:23
instruction on all elements; we can select the elements on which to execute it. The selection is done in the form of a bitmask, specified in a special mask register (there are 8 of them), and elements that are masked away are not accessed, so they should not cause any memory access violations.
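The masking behaviour described above can be modelled in a few lines. This is only an illustrative sketch, not Valgrind code: Python lists stand in for a 512-bit register holding sixteen 32-bit lanes, and the function name `masked_add` and the `zeroing` flag are made up for the example. Masked-off lanes either keep the old destination value (merge masking) or are cleared (zero masking), and their inputs are never read.

```python
# Illustrative model only: lists stand in for 512-bit registers.
LANES = 16  # sixteen 32-bit elements fill one 512-bit register


def masked_add(dst, a, b, mask, zeroing=False):
    """Add a and b lane by lane, but only where the mask bit is set.

    Masked-off lanes keep the old destination value (merge masking)
    or become 0 (zero masking). Their inputs are never read.
    """
    out = []
    for i in range(LANES):
        if (mask >> i) & 1:
            out.append((a[i] + b[i]) & 0xFFFFFFFF)  # wrap to 32 bits
        else:
            out.append(0 if zeroing else dst[i])
    return out


dst = [0xAAAA] * LANES
a = list(range(LANES))
b = [100] * LANES

# Only lanes 0 and 2 are selected by the mask.
result = masked_add(dst, a, b, mask=0b101)
zeroed = masked_add(dst, a, b, mask=0b101, zeroing=True)
```

Because the loop skips masked-off lanes entirely, an emulator built this way would also never touch the memory behind those lanes, which is the property that avoids spurious access violations.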
01:48
Other new features are a new instruction prefix that can specify the rounding mode for that specific instruction, and embedded broadcast, meaning that the instruction, instead of using multiple values,
02:00
broadcasts one value and uses it as its operand. AVX-512 also provides a new encoding of memory displacement, and some of these instructions can work at 8-bit or 16-bit granularity. Also, AVX-512 is not a uniform set; it consists of multiple subsets. The most important are Foundation,
02:27
which should be available on any machine that supports AVX-512 and consists of about 400 assembly instructions, and Vector Length, which provides shorter versions of AVX-512 instructions, 256 bits and 128 bits wide.
02:50
So, these two are the biggest sets, and at the moment only Knights Landing and Skylake instructions are enabled fully, but the remaining subsets are relatively small, and it should not take long to add them.
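To make the widths and granularities mentioned above concrete, here is a small arithmetic sketch (the helper name `lanes` is made up for illustration). It computes how many elements fit in a register of each width; since masking uses one mask bit per element, the same number is also the count of mask bits an instruction needs.

```python
def lanes(vector_bits, element_bits):
    """Number of elements of a given size that fit in one vector register."""
    return vector_bits // element_bits


# One row per register width: 128 and 256 bits (Vector Length) and 512 bits.
table = {
    width: {elem: lanes(width, elem) for elem in (8, 16, 32, 64)}
    for width in (128, 256, 512)
}

for width, row in table.items():
    print(width, row)
```

For example, a 512-bit register holds sixteen 32-bit elements, while at 8-bit granularity it holds 64, so a full 64-bit mask is needed.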
03:07
So, how have these instructions been implemented in Valgrind? The first principle has been to separate AVX-512 code from AVX2 code as much as possible, so it can be easily disabled.
03:21
Second, to reuse existing IRs as much as possible, to reduce the amount of work Valgrind analysis tools have to do with AVX-512 code, and to have as little impact on non-AVX-512 runs as possible. This has been done with one minor exception: if Valgrind
03:43
is compiled with AVX-512 support, it maintains structures for AVX-512 vector registers. So it should add a bit of overhead. It should not be much, but it hasn't been measured yet, so these tests are yet to be done.
04:02
Very briefly, this is how Valgrind deals with instructions. It parses the instruction (at the moment, up to AVX2), translates it into intermediate representation (at the moment, again, up to the 256 bits of AVX2), and passes it to the Valgrind analysis tools. The analysis tools analyze the code and generate instrumentation if
04:24
it is required. Then the Valgrind core generates assembly for all this code, and it can generate assembly up to AVX2. If we introduce AVX-512 into this scheme, well, we have to parse every AVX-512 instruction; that has to be done.
04:41
We need to translate it into intermediate representation, and here not all instructions can reuse AVX2 IRs. For example, vector permutation needs access to the entire vector, for example to swap the elements at its two ends. So we need 512-bit IRs, which means Valgrind tools must also support 512-bit IRs,
05:02
and the Valgrind core should be able to generate the right assembly for them. As for Valgrind tools, Memcheck has been manually enabled; the other tools don't support AVX-512 yet. For the assembly, AVX2 once encountered a similar issue, again with
05:25
permutation, and the AVX2 code calls a handwritten C function that does the permutation. So I tried the same approach for AVX-512: just generate calls to handwritten functions. It worked, but it did not deal well with corner cases, and it has been hard to cover corner cases with regression tests.
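The reason permutation cannot simply be split into two 256-bit operations can be shown with a toy model. This is a sketch under stated assumptions, not the Valgrind implementation: a 16-element list stands in for a 512-bit register of 32-bit lanes, and both function names are made up. The full-width version can move any lane anywhere; the split version only sees the eight source lanes of its own half, so it cannot swap the first and last elements.

```python
def permute(src, idx):
    """Full-width permutation: result[i] = src[idx[i]]."""
    return [src[j] for j in idx]


def permute_halves(src, idx):
    """Same request done as two independent half-width permutations.

    Each half can only read its own source lanes, so indices that
    point into the other half are wrapped and pick the wrong element.
    """
    half = len(src) // 2
    low = [src[:half][j % half] for j in idx[:half]]
    high = [src[half:][j % half] for j in idx[half:]]
    return low + high


src = list(range(16))                    # models one 512-bit register
idx = [15] + list(range(1, 15)) + [0]    # swap the first and last lanes

full = permute(src, idx)
split = permute_halves(src, idx)
```

Here `full` has the two end elements exchanged, while `split` does not, which is why a 512-bit IR (or a helper call that sees the whole vector) is needed for this class of instructions.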
05:48
So I tried using a reference library instead of my handwritten code. It worked much better and had fewer errors, but unfortunately the library is proprietary.
06:01
So, for now I have resorted to using AVX-512 intrinsics, because they must generate the right result, and they are the simplest way to get things done. But this limits AVX-512 Valgrind to AVX-512 machines, and it requires GCC version 5.0.
06:23
Now, AVX-512 instructions are parsed slightly differently compared to AVX2. AVX2 instructions are parsed into IR and into assembly in switch cases. But adding hundreds of new instructions to switch cases in 10 different files turned out to be error-prone.
06:42
So, instead I described the assembly instructions in a text form: name, operation code, prefix parameters, operands, how to translate it into IR, how to translate it into assembly. I use this file to generate C arrays listing the necessary information.
07:06
And then the Valgrind core finds the right row in this array and uses the parameters, IRs or assembly listed there. This file also allowed easier generation of regression tests.
07:21
They still had to be updated manually, because some instructions need specific inputs, but the majority of tests have been generated automatically. I also tried to use this structure in Memcheck, but it did not work, because Memcheck has to understand exactly how the instruction propagates undefined values, so it needs a much better understanding than can be described in a text file.
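The text description, generated C array and row lookup described above could look roughly like the following sketch. Everything about the format is an assumption made for illustration: the field layout, the `InsnInfo` struct name, the `insn_table` array name and the IR names are hypothetical and are not taken from the actual patch. Only the three mnemonics and their opcode bytes are real.

```python
# Hypothetical description format: name | opcode | prefix | operands | IR name
DESCRIPTIONS = """
VPADDD | 0xFE | 66.0F | dst,src1,src2 | ADD_32x16
VPSUBD | 0xFA | 66.0F | dst,src1,src2 | SUB_32x16
VPXORD | 0xEF | 66.0F | dst,src1,src2 | XOR_512
"""


def parse(text):
    """Turn the text description into a list of row dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        name, opcode, prefix, operands, ir = [f.strip() for f in line.split("|")]
        rows.append({
            "name": name,
            "opcode": int(opcode, 16),
            "prefix": prefix,
            "operands": operands.split(","),
            "ir": ir,
        })
    return rows


def to_c_array(rows):
    """Emit the rows as C source text (struct and array names are made up)."""
    lines = ["static const InsnInfo insn_table[] = {"]
    for r in rows:
        lines.append('   { "%s", 0x%02X, "%s", "%s" },'
                     % (r["name"], r["opcode"], r["prefix"], r["ir"]))
    lines.append("};")
    return "\n".join(lines)


def lookup(rows, prefix, opcode):
    """Find the row for a decoded prefix and opcode, or None if unknown."""
    for r in rows:
        if r["prefix"] == prefix and r["opcode"] == opcode:
            return r
    return None


rows = parse(DESCRIPTIONS)
c_source = to_c_array(rows)
hit = lookup(rows, "66.0F", 0xFA)
```

The same parsed rows could drive a test generator, which is one way to read the remark that the file made regression tests easier to produce. A flat row like this has no room for a per-instruction rule about undefined values, which fits the observation that the approach did not carry over to Memcheck.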
07:48
Now, this patch has been tested on the NAS Parallel Benchmarks and multiple HPC workloads. There is one failure, and it comes from an existing limitation in the Valgrind core: Valgrind ignores the rounding mode of floating-point calculations.
08:07
And this benchmark specifies the rounding mode in an AVX-512 prefix. So this limitation might appear more often in AVX-512 code. I ran some Memcheck and Helgrind tests on specific applications for memory and threading testing.
08:25
The AVX2 and AVX-512 reports matched, which was my purpose. It has not always been the absolutely correct result, but it matched AVX2, which was the goal.
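Going back to the rounding-mode limitation mentioned earlier, a small sketch shows why ignoring a requested rounding mode changes results. Python cannot switch the hardware rounding mode of its floats, so this uses the standard `decimal` module with a 5-digit precision as a stand-in; the function name `divide` is made up, and the numbers only illustrate the effect rather than reproducing the benchmark failure.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_FLOOR


def divide(a, b, rounding):
    """Divide with 5 significant digits under the given rounding mode."""
    ctx = Context(prec=5, rounding=rounding)
    return ctx.divide(Decimal(a), Decimal(b))


# What the program asked for: round toward negative infinity.
requested = divide(2, 3, ROUND_FLOOR)

# What an emulator that ignores the request and always rounds to
# nearest would compute instead.
emulated = divide(2, 3, ROUND_HALF_EVEN)

differs = requested != emulated
```

The two results differ in the last digit, and in a long numerical computation such differences can accumulate until a benchmark's verification step fails.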
08:41
And there is one more topic I would like to briefly cover. It was very difficult to detect when Valgrind emulation has a bug. Say I implemented some instruction incorrectly: it is emulated incorrectly, it generates the wrong value, but it can go unnoticed in the code for a long time.
09:02
So, my manager Julia Fedorova suggested dumping register values under Valgrind and in some other application run, and comparing them to find the very first discrepancy, fix it, and hopefully be done; if not, we can find the second discrepancy. So I wrote logging tools for Valgrind and for Pin, the Intel binary analysis tool framework.
09:24
The tools are not user-friendly, but they have been invaluable for detecting emulation errors. They can work at user-specified granularity: they can compare after each instruction, or they can compare once in a while and then narrow down to per-instruction granularity.
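The comparison idea can be sketched as follows. This is not the actual logging tool: the register logs are modelled as Python lists of dictionaries, one snapshot per instruction, and the function names are made up. The coarse pass assumes that once the two runs diverge they stay different at later checkpoints, which is an assumption of this sketch rather than something the talk states.

```python
def first_discrepancy(log_a, log_b):
    """Index of the first snapshot where the two logs differ, or None."""
    for i, (a, b) in enumerate(zip(log_a, log_b)):
        if a != b:
            return i
    return None


def coarse_then_fine(log_a, log_b, step):
    """Compare every `step` snapshots, then rescan the suspicious window."""
    total = min(len(log_a), len(log_b))
    if total == 0:
        return None
    prev = 0
    checkpoints = list(range(step, total, step)) + [total - 1]
    for i in checkpoints:
        if log_a[i] != log_b[i]:
            # The divergence started somewhere between the last matching
            # checkpoint and this one: go back to per-instruction checks.
            for j in range(prev, i + 1):
                if log_a[j] != log_b[j]:
                    return j
        prev = i
    return None


# Reference run, and an emulated run with a bug appearing at instruction 42.
reference = [{"zmm0": i, "k1": 0xFFFF} for i in range(100)]
emulated = [dict(s) for s in reference]
for snapshot in emulated[42:]:
    snapshot["zmm0"] += 1

exact = first_discrepancy(reference, emulated)
narrowed = coarse_then_fine(reference, emulated, step=16)
```

In a real setup the coarse pass would avoid dumping registers after every instruction, which is presumably the point of being able to compare only once in a while.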
09:41
If anyone is interested, please let me know. I haven't posted this code yet; I'm not sure if it's available. So, the next steps are to hopefully upstream this patch, test it on user applications, and figure out what the user requirements for this version of Valgrind are: which CPUs, analysis tools and applications they need.
10:04
So, that's it, thank you very much.