Adding Power ISA 3.1 instruction support to Valgrind
Formal Metadata
Number of Parts | 287
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/57130 (DOI)
FOSDEM 2022 (209 / 287)
Transcript: English (auto-generated)
00:05
Hello, my name is Carl Love. I'm one of the Valgrind developers. I've been working on Valgrind for about 10 years now. I've done the Power8, Power9, and Power10 support in Valgrind. Today's talk is on adding Power ISA 3.1 instruction support to Valgrind.
00:23
ISA 3.1 is implemented in the Power10 processor. The goals for today's talk are to briefly look at the new features in ISA 3.1 and, primarily, to help developers understand the methods that are used for adding new instruction support
00:44
into Valgrind. There are basically four methods that you can use. The first one is to implement the instruction using existing Iops. The second one is to create a new Iop to map the new instruction to. You can also implement it with a clean helper or with a dirty helper.
01:02
First, we're going to look at ISA 3.1. It added about 259 new instructions to the previous version of the instruction set specification. We now have both 32-bit and 64-bit instruction lengths. We've got 128-bit integer operations, and there's a
01:23
large number of instructions for Matrix-Multiply Assist (MMA). Basically, you can do matrix multiplies on various sizes of data (4-bit, 8-bit, 16-bit, etc.), both integer and floating point, and you can do masking on them. There's a whole class of these instructions for doing matrix multiplies.
01:42
Those instructions use the ACC register file, which we'll talk about in a minute. Additionally, we've got this 128-bit, or quadword, data size, so various instructions have been added to operate on that data size. The implementation of a new instruction using existing Iops is
02:03
the preferred method. It allows the most granularity for the various Valgrind tools to insert code into the user's program to instrument it as necessary. Basically, you take the instruction and break it into a sequence of operations that you can implement with the existing Iops supported on that host platform.
02:25
For example, the brw instruction just reverses the order of the bytes in each word. It takes 64-bit wide data, which is two 32-bit words. So, the input order is as follows: you've got word 0, bytes 0, 1, 2, 3, then word 1, bytes 0, 1, 2, 3,
02:42
and the result is going to be word 0, bytes 3, 2, 1, 0, and word 1, bytes 3, 2, 1, 0. The word order isn't changed, just the order of the bytes. So, here's the implementation in Valgrind. Four IR temps are created. You then use an AND Iop to mask off the first byte, the second byte, the third byte,
03:05
the fourth byte, storing each into a temporary. You then use a shift Iop to move the bytes to their new locations. You OR everything back together, and the result is assigned, in this case, to the variable result, which is then stored into our register.
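The mask, shift, and OR sequence described above can be sketched in plain C. This is only a model of the arithmetic, not the VEX IR code from Valgrind; the function name `brw_model` is made up for illustration. It uses four ANDs, four shifts, and three ORs, which matches the count of 11 operations mentioned in the talk.

```c
#include <stdint.h>

/* Hypothetical model of brw: reverse the bytes inside each 32-bit
   word of a 64-bit value. The word order itself is unchanged. */
uint64_t brw_model(uint64_t src)
{
    /* Four ANDs: isolate the same byte position in both words. */
    uint64_t t0 = src & 0xFF000000FF000000ULL;
    uint64_t t1 = src & 0x00FF000000FF0000ULL;
    uint64_t t2 = src & 0x0000FF000000FF00ULL;
    uint64_t t3 = src & 0x000000FF000000FFULL;

    /* Four shifts move each byte to its mirrored position, and
       three ORs merge the pieces back into one 64-bit result. */
    return (t0 >> 24) | (t1 >> 8) | (t2 << 8) | (t3 << 24);
}
```

For instance, `brw_model(0x0011223344556677)` yields `0x3322110077665544`: each word is byte-reversed but the two words stay in place.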
03:21
That's the typical pattern; most instructions are going to be something along those lines. In that case, you only use 11 Iops. The issue comes in when you start generating lots and lots of Iops for a particular, very complex instruction. If you end up with hundreds to thousands of Iops, you may, in fact, start overflowing the translation buffer in Valgrind,
03:43
and you're going to get an error message when you try to run your code. It's also hard to debug and would potentially result in rather slow execution for that instruction. A clean helper is a C function that replaces a long sequence of Iops that does a computation.
04:05
The result of the clean helper is placed back into an IR temp, just as if you'd done it with Iops, and then you store that IR temp into a register, as the case may be. The key thing here is that you're not touching the guest state directly.
04:23
For example, the dcffixqq instruction converts a 128-bit signed integer to DFP. DFP (decimal floating point), of course, is a format for storing base-10 floating point values,
04:40
as opposed to the IEEE binary floating point format. It basically results in less round-off error as you do computations on these floating point values. It's used a lot in banking applications. For this particular instruction, we were able to do the conversion using an existing Iop supported on Power. However, once you've done the conversion, you need to know how to set the condition code values,
05:04
and the condition code values are set based on the converted value. Is it a normal number, a subnormal number, a quiet NaN, infinity, et cetera? So, you have to go in and look at the DFP number that you've got,
05:22
and do a bunch of if statements to determine what the floating point type is. That requires a lot of nested if statements, and that ended up generating a lot of IR code, which is too expensive. So, we use a simple C clean helper to do it.
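To make the idea concrete, here is a minimal sketch of what such a classification helper might look like. It is not Valgrind's actual helper: the function name, the enum, and the return encoding are all invented for illustration and are not the Power FPRF codes. It relies only on the IEEE 754 decimal128 layout, where the five bits after the sign are 11111 for NaN (with the next bit separating signaling from quiet) and 11110 for infinity. Like a clean helper, it is a pure function of its integer argument, returns at most 64 bits, and touches no other state.

```c
#include <stdint.h>

/* Hypothetical class codes for illustration only. */
enum demo_dfp_class {
    DEMO_FINITE   = 0,
    DEMO_INFINITY = 1,
    DEMO_QNAN     = 2,
    DEMO_SNAN     = 3
};

/* Classify a decimal128 value from its upper 64 bits.
   Returns (sign << 2) | class. Finite values are not further
   split into zero, subnormal, and normal in this sketch. */
uint64_t demo_dfp128_class_helper(uint64_t hi)
{
    uint64_t sign     = hi >> 63;
    uint64_t comb     = (hi >> 58) & 0x1F;  /* top 5 combination bits */
    uint64_t snan_bit = (hi >> 57) & 1;     /* set for a signaling NaN */
    uint64_t cls;

    if (comb == 0x1F) {
        cls = snan_bit ? DEMO_SNAN : DEMO_QNAN;
    } else if (comb == 0x1E) {
        cls = DEMO_INFINITY;
    } else {
        cls = DEMO_FINITE;
    }
    return (sign << 2) | cls;
}
```

Expressed as IR, each branch here would become its own chain of compare and select operations, which is the expansion the talk describes as too expensive.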
05:41
So, the implementation looks like this. The first thing we've got here is the vB source. We reinterpret that as an I128, then convert from the I128 into a D128, and that gets assigned into the temp D128. So, this is the actual conversion of the input data into a DFP value.
06:03
Now, we've got to determine what type of floating point value that is. So, we call this function, generate_store_DFP_FPRF_value, and it's going to do all those comparisons to determine whether it's a NaN, a quiet NaN, subnormal, et cetera.
06:22
And it's going to generate a six-bit condition code value, which will then be stored into the condition code register. The function looks as follows. The generate store function is going to make that clean helper call.
06:42
And so, again, we see the assign statement here. We're going to be assigning into FPRF_val. We're going to be using the mkIRExprCCall function. The first argument is the return type of the result. There are zero regparms. We give it the string for the name of the clean helper,
07:04
and we give it the function pointer to the function. And then we have a vector that holds the six arguments for the clean helper. And so again, that's going to go off and determine what the type is. And again, the result is put into an IR temp,
07:22
and then back into a register as it normally would be. The next way of doing this would be using a dirty helper. A dirty helper is very similar to a clean helper in that you're using C code to do a complex calculation. The difference is that with a dirty helper, you're able to reach into the guest state
07:41
and read, write, or modify entries in the guest state. This is all happening kind of under the covers. And unless we tell Valgrind exactly what we've done, which entries we've touched and how we've touched them, we're going to screw up Valgrind, because it's not going to know what happened. Everything's going to magically change underneath it,
08:01
and it's going to get lost. And it's not going to be able to report the proper statistics on the user code. We typically only use this as a last resort. We prefer to use a clean helper if at all possible. So again, the key thing here is we're going to directly access the register file in the guest state, and we have to tell Valgrind exactly what we did.
08:22
For example, the xvi4ger8 instruction is one of those MMA instructions. It's a 64-bit long instruction, and it's basically doing a matrix multiply using 4-bit data. ISA 3.1 added the ACC that we mentioned. It consists of eight 512-bit accumulator entries.
08:45
Each accumulator entry has four 128-bit rows in it, and each of those rows is mapped to a VSR register for moving data back and forth. The instruction basically does this matrix multiply, as I said.
09:00
The result is stored into one of those eight ACC entries, i.e. the result is a 512-bit value. The problem we have is we couldn't use a clean helper, because clean helpers only allow us to return at most 64 bits, and we need to return 512. So we're just a little too big for a clean helper. So it was done, again, like I said, with a dirty helper.
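The size constraint can be illustrated with a simplified C model. Everything here is an assumption for illustration: the guest-state struct, the helper name, and the nibble ordering are invented, and accumulate or saturating variants of the real instruction are not modelled. The point is only that a 4x4 grid of 32-bit sums is 512 bits, which cannot come back through a 64-bit return value, so the helper writes straight into the (model) guest state instead.

```c
#include <stdint.h>

/* Hypothetical guest state holding only the ACC file:
   8 entries, each a 4x4 grid of 32-bit values (512 bits). */
typedef struct {
    int32_t acc[8][4][4];
} DemoGuestState;

/* Sign-extend the low 4 bits of v to an int in [-8, 7]. */
static int sext4(uint32_t v)
{
    return (int)((v & 0xFu) ^ 0x8u) - 8;
}

/* Dirty-helper style model of a rank-8 update on 4-bit data:
   acc[at][i][j] = sum over k of a[i].nibble[k] * b[j].nibble[k].
   The 512-bit result is stored through the guest-state pointer
   because it does not fit in a C integer return value. */
void demo_ger4_dirty_helper(DemoGuestState *gst, unsigned at,
                            const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            int32_t sum = 0;
            for (int k = 0; k < 8; k++) {
                sum += sext4(a[i] >> (4 * k)) * sext4(b[j] >> (4 * k));
            }
            gst->acc[at][i][j] = sum;
        }
    }
}
```

Because the write happens behind the IR's back, the caller has to describe it to Valgrind separately, which is what the talk turns to next.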
09:22
So here's the implementation. In this case, we've identified the instruction. We're going to call this function, vsx_matrix_ger, which sets up all the arguments for our dirty helper call. That function looks like this. We have this structure, IRDirty. We're going to create a variable d.
09:40
We're going to set that equal to the structure generated by unsafeIRDirty_0_N. That call specifies zero regparms, the string name of the dirty helper, and a function pointer to the function that implements the dirty helper.
10:00
And again, we have a vector of the arguments for the dirty helper. In this case, we have seven arguments. The function setup_fxstate_struct is a routine that I wrote to take the d structure and explicitly state which entry in the ACC we're going to be accessing.
10:20
That's given by AT. What the operation is going to be (read, write, or modify) is given by AT_fx. And so this function is going to set up and tell Valgrind, via this data structure, exactly which parts of the guest state have been touched and how. So again, the key thing here with a dirty helper is that you have to tell Valgrind exactly what you've done,
10:42
and that's all set up in this d structure. So here's the function. The first thing here is nFxState. We're going to be touching four registers, i.e. four rows in the ACC. So fxState[0] is the first ACC row.
11:04
fx is the operation: read, write, or modify. We tell it the size of that entry in the guest state. We do that for all four of those rows in the ACC. And then we tell it exactly which entry in the guest state.
11:23
So fxState[0].offset is the first register of ACC 0, and the following entries are the second register in ACC 0, the third register in ACC 0, and so on. So we've actually told Valgrind exactly which entries in the state we touch, how big they are, and what we're doing to them.
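The bookkeeping just described can be modelled in a few lines of C. This is a sketch only, loosely shaped like the nFxState and fxState fields mentioned in the talk; the struct and enum names, the guest-state layout, and the function `demo_setup_fxstate` are all hypothetical and are not VEX's real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_ROW_BYTES = 16 };   /* one 128-bit ACC row */

/* Hypothetical effect kinds: what the helper does to an entry. */
enum demo_effect { DEMO_FX_READ, DEMO_FX_WRITE, DEMO_FX_MODIFY };

/* One annotation: which bytes of guest state, and how they are used. */
struct demo_fx_entry {
    enum demo_effect fx;
    size_t offset;   /* byte offset into the guest state */
    size_t size;     /* number of bytes touched */
};

struct demo_dirty {
    int n_fx_state;
    struct demo_fx_entry fx_state[4];
};

/* Hypothetical guest state layout, invented for this example. */
struct demo_guest_state {
    uint64_t gpr[32];
    uint8_t  acc[8][4][DEMO_ROW_BYTES];
};

/* Record that all four rows of ACC entry `at` are accessed with
   effect `at_fx`, one annotation per 128-bit row. */
void demo_setup_fxstate(struct demo_dirty *d, unsigned at,
                        enum demo_effect at_fx)
{
    assert(d != NULL && at < 8);
    d->n_fx_state = 4;
    for (unsigned row = 0; row < 4; row++) {
        d->fx_state[row].fx     = at_fx;
        d->fx_state[row].size   = DEMO_ROW_BYTES;
        d->fx_state[row].offset = offsetof(struct demo_guest_state, acc)
                                  + (at * 4 + row) * DEMO_ROW_BYTES;
    }
}
```

A tool reading these annotations knows exactly which 64 bytes of state changed, without having to understand what the helper computed.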
11:40
So the last method would be to add a new Iop. This is generally done if there's an extension of existing Iops. For example, we've got 32-bit and 64-bit divides, and now we've got a 128-bit divide operation. This is something that conceivably another architecture might add eventually. They might also add a 128-bit instruction.
12:02
So it makes sense to have this because it could be used by other architectures in the future. We generally try to avoid creating an Iop that is architecture specific. The ISA 3.1 support added new Iops for the various 128-bit instructions,
12:21
like divide, modulo, reinterpretation, multiply with carry-out, et cetera. And again, these are all things that it seems possible or probable that another architecture might add. Unfortunately, it's a little bit involved to add a new Iop. There are a lot of places you have to touch to do that.
12:40
So I can really only briefly highlight what you've got to do. Basically, you put the new Iop, in this case, say, D128 to I128S, into the libvex_ir.h file. This is the list of all the Iops that exist in Valgrind. Then, for that particular Iop, you have to go into the Power host_ppc_isel.c file.
13:07
There are various case statements in there under the function iselVecExpr_wrk. We have to add the entries in there to implement the new Iop, getting the arguments and putting them into the data structure. There's a case statement for printing the Iop name,
13:23
another that can be called to tell whether or not that Iop might trap, and one for the type of the Iop. In host_ppc_defs.c, you add the support for issuing the actual instruction on the host for the new Iop.
13:42
And in memcheck/mc_translate.c, you've got to, again, add the Iop so you can get all the shadow bits, or register bits, set, so that Valgrind can track which bits are valid and which aren't. So, basically, we had to touch a lot of different places in there.
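A toy model may help show why one new operation touches so many places. None of this is Valgrind code: the enum and the three functions are invented, and the trap and width values are assumptions of the toy. The pattern is what matters: each property of an operation lives in its own switch, so adding one enumerator means adding a case to every switch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical operation list; DEMO_OP_DIV128 is the "new" one. */
enum demo_op {
    DEMO_OP_DIV32,
    DEMO_OP_DIV64,
    DEMO_OP_DIV128
};

/* Place 1: printing the operation name. */
const char *demo_op_name(enum demo_op op)
{
    switch (op) {
    case DEMO_OP_DIV32:  return "Div32";
    case DEMO_OP_DIV64:  return "Div64";
    case DEMO_OP_DIV128: return "Div128";
    }
    return NULL;  /* unknown operation */
}

/* Place 2: whether the operation might trap. The toy assumes
   every integer divide can, for example on a zero divisor. */
bool demo_op_might_trap(enum demo_op op)
{
    switch (op) {
    case DEMO_OP_DIV32:
    case DEMO_OP_DIV64:
    case DEMO_OP_DIV128:
        return true;
    }
    return false;
}

/* Place 3: the width in bits of the operation's result type. */
int demo_op_result_bits(enum demo_op op)
{
    switch (op) {
    case DEMO_OP_DIV32:  return 32;
    case DEMO_OP_DIV64:  return 64;
    case DEMO_OP_DIV128: return 128;
    }
    return 0;
}
```

In the real code base, instruction selection, host code emission, and Memcheck's shadow-bit handling each add further places of the same kind.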
14:02
It's a little hard to go through all of those here. So, in summary, ISA 3.1 added a number of new instructions. Most of those were implemented using existing Iops. We used Iops to the extent that we possibly could and only used clean helpers in those places where we absolutely had to
14:23
replace the expensive Iop code expansions. We used dirty helpers for the MMA instructions simply because we had no way to return 512 bits from a clean helper. And, again, we did add new Iops for various instructions where they made sense, that is, for existing classes of Iops that other architectures might use.
14:48
So, with that, hopefully that gives you a basic idea of the various approaches you can use for adding support for new instructions and, basically, how it's done. Again, if you're adding a new Iop, those steps are going to vary a little bit depending on whether it's a binary, ternary, or unary Iop.
15:08
It also depends on whether it's integer or floating point. So, yeah, it gets a little trickier in the specific implementation there, and you have to work a little bit more at that. All right. Well, we're going to have a question and answer session at this point, and I appreciate your time and attention.
15:23
Thank you.
16:34
Hello. I think we've got the question and answer session now. I'm not sure why my picture's not coming up.
16:42
But the first question I had here was where the limitation on the return type comes from. Basically, the clean helper is a C function, so you've got your normal return value. I did look at maybe trying to return some sort of a pointer to a structure.
17:03
But looking into it, that looked like it was going to be very hard to do.
17:21
So, the second question here is: when we generate Iops, is there an optimizing pass?
17:47
I'm not aware of one specifically for that. I didn't actually implement anything specifically for an optimization pass. Perhaps that's possible.
18:03
The next question was that even 10 Iops is kind of a lot and might be a little bit worrisome in terms of the number of Iops being generated for a given instruction. Yeah, it's possible we could go in and try to implement a few Power-specific Iops that might help reduce that number.
18:22
Maybe implement those as clean helpers. We've certainly done things like that in the past. DFP comes to mind, where we implemented a couple of very generic DFP operations with clean helpers, and that was able to reduce the count. But again, we tried to do it such that they were operations that another architecture might use.
18:52
The next question is whether it would be possible to return a float or double, or even a pointer to a vectorized floating point register.
19:01
I think, again, you're talking about a clean helper here. The float or the double, that's okay because it's a C function. So, that's easy to do with a clean helper. Returning a pointer to something was a little trickier. It wasn't clear how well that would work when I looked into trying to do that for 128 or even 512 bits.
19:24
So, I didn't do that. Let's see. Julian says here, "I was too lazy to implement returning anything other than an integer register." So, yeah, that's part of the limitation there in the actual clean helper call.
19:45
The other thing you'd have to keep in mind is that when you're making the call to a clean helper, you've got that vector of input values. So, we would have to fix that up too if you want to be able to pass in something larger than 64 bits, say 128 bits or 512.
20:00
So, yeah, trying to extend past that 64 bits could be a little bit involved, and I chose not to do that. Again, I wasn't sure that was going to be something used by other architectures. So, I just didn't bother to do that. I went with a clean helper and then with a dirty helper. It made more sense.
20:54
I'm not seeing any other questions here. Please go ahead and post; we've got a few more minutes.
21:30
Well, I'm not seeing any more questions. So, I'm going to go ahead and thank you all for your time and let me know if you have further questions. You can just contact me as one of the Valgrind developers.
21:40
So, thank you for your time and I hope you enjoy the rest of the conference.