MUST: Compiler-aided MPI correctness checking with TypeART
Formal Metadata
Title: MUST: Compiler-aided MPI correctness checking with TypeART
Number of Parts: 542
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/62020 (DOI)
Transcript: English (auto-generated)
00:05
OK, we're good to get started with one more MPI talk, but I think a very different one compared to the others: compiler-aided MPI correctness checking. Thank you. So my name is Alexander Hück, and today I'm going to talk about basically the dynamic MPI
00:22
correctness tool, which is called MUST. And in particular, I'm going to talk about the compiler extension, which is called TypeART, which is supposed to help with MPI type correctness checking. And first of all, as we heard before, the message passing interface is the de facto standard
00:42
of distributed computations in the HPC world, right? And it defines a large set of communication routines and other stuff. And it's also designed for a heterogeneous cluster system where you have different platforms that communicate and compute something.
01:02
However, in that sense, it's also a very low-level interface where you have to specify a lot of stuff manually. And you can expect only a little error checking in general from the library itself. So the user is required for the simple MPI send operation
01:20
to specify the data, which is transferred as a typeless void buffer. The user has to specify the data length of the buffer and the type manually. And also, the message envelope, meaning the destination of the message, the communicator, and so on, has to be specified manually. So there's a lot of opportunity to commit a mistake,
01:43
basically. And this is quite a question to you guys. If you look at the small code, try to figure out how many errors you can spot in the small example. And just try to look at every corner, basically.
02:03
And while I'm talking, I can also spoil it for you: I'm going to show you every issue in this small example in a couple of seconds, so to speak. When I first looked at it, when my colleague Joachim showed me,
02:20
I couldn't find the most simple one. That was a bit crazy to me. Sometimes you don't see the forest for the trees. OK, so the most basic one: we don't call MPI_Init, right? Usually, in MPI applications,
02:41
that's the very first call you're supposed to do, where you initialize the MPI environment. And then likewise, if you look at the end of the program, we do not call MPI_Finalize. So those are two simple mistakes. But then, in total, we have eight issues. I don't know how many you found, and I'm also not
03:00
going to talk about each one of them. But if you look at each individual issue, it's quite easy to imagine that it could happen to you as well. And these are the pointers to where they are. In particular, I want to talk about the receive-receive deadlock, where,
03:22
for instance, two processes wait on each other without being able to continue. You can argue that all those issues, except maybe the deadlock, could be found by the MPI library itself. But typically on HPC systems, the library does not do any checking for performance reasons.
03:41
That's why many of these issues will maybe cause crashes for unknown reasons, or just produce strange results. Well, that's why the dynamic MPI correctness tool, MUST, was developed in the past, which
04:01
is a tool that checks for issues during runtime and produces reports such as this one when it finds any. And this is the report of the deadlock we have seen in the example code, where the message itself just says that there's a deadlock. And in the bottom left, you can see
04:22
a wait-for graph, which just shows you which rank waits for which other rank, causing the deadlock. This helps you see where the deadlock occurs and why it occurs. And also, MUST can produce so-called call-stack information,
04:42
where you can see the path from the program's main down to, basically, the origin of the deadlock. But I'll leave it at that for now. So, to facilitate correctness checking for MPI, MUST uses a so-called distributed agent-based
05:01
analysis, which means that you have your normal MPI application with four ranks, four processes that communicate as the user wrote it, as you would expect. But MUST will also create an analysis network, which helps you to do local analysis and to do distributed analysis. If you think about a deadlock, you
05:21
need information from more than one process to figure out that a deadlock occurred in your program. So MUST creates that network completely transparently to the user. So you would use MPI_COMM_WORLD and any other communicator as normal; MUST takes care of creating such a network. And also, what's maybe the focus of the talk today
05:44
is the local analysis, where we look at process-local checks. If you think about MPI type correctness of a send operation, you can do a lot of checks locally, or rather, should do a lot of checks locally. And this is the focus.
06:01
So, for MPI type correctness, we focus today basically on the buffer, the user-specified length, and the user-specified MPI data type. MUST can already detect mismatches in, for instance, a send and receive communication pair, where
06:21
MUST basically creates a so-called type map: it looks at the user-specified buffer size and the user-specified data type and compares it to the corresponding receive operation. And if there's a mismatch, obviously, there is going to be an issue, and MUST creates a report about that. This also, of course, works for collective communications,
06:43
where you can make sure that all ranks call, for instance, a broadcast operation with the same data type. However, since MUST only intercepts MPI communication calls, MPI calls in general, it cannot look behind them,
07:02
that is, it cannot look at what happens in user space. So we cannot reason about the type of the void buffer data. And this is why we were motivated to create the tool TypeART, which is something that helps
07:20
with basically figuring out what the memory location is that you put into your MPI calls. So if you look at the small example on the right side, completely process-locally: there might be some memory allocation in that example. It's a double buffer that was allocated by malloc, let's say.
07:41
And the question now becomes: how can we make sure that the data buffer, which is a void buffer, fits the user-specified buffer size? So, is it of length buffer size? And it also should be compatible with the MPI_FLOAT type. And of course, we can already see that between double
08:00
and MPI_FLOAT there's a type mismatch. But MUST cannot answer such a question without further tooling, because it just intercepts MPI calls. So, to show you that it's not an academic example, there are two well-known HPC benchmark codes which have some issues.
08:23
So one was reported in the past by others, where there's a broadcast operation. It uses a bigint, which is an alias for a 64-bit data type. However, the user specified MPI_INT, which is a 32-bit data type, for the broadcast operation. So there's an obvious mismatch.
08:41
That could likely be a problem. And also, for MILC, there's an allreduce operation where the user passed in a struct with two float members, and it's interpreted as a float array of size 2, which is benign, to be honest. But that could be a portability issue in the future, maybe.
09:02
Depending on the platform, maybe there's padding or whatnot; maybe it's an illegal operation. So this could also be an issue in the future. Well, from a high-level point of view, how does MUST work? Well, you have your MPI application. And during runtime, it intercepts all the MPI calls
09:20
and collects all the state that is needed for deadlock detection and so on. And we added TypeART, which looks at all those allocations that are passed to MPI calls, for the local analysis of buffers. It is a compiler extension based on LLVM, so you compile your code with our extension.
09:43
And the extension instruments all allocations, be it stack, be it heap, which are related to MPI calls. And we also provide a runtime, so during runtime we get callbacks from the target application for all allocations and all free operations. So we have a state of the allocated memory,
10:03
basically, in the target code. We also, of course, look at the allocations and parse out their type. So in simple cases, buffer a is a double type; more complex cases would be structs or classes. We pass the serialized type information to our runtime, which then enables,
10:21
of course, MUST to make queries. So for instance, for an MPI send operation, we give the TypeART runtime the buffer, the typeless buffer, and the runtime would return all the necessary type information to ensure type correctness of those buffer handles. This is the whole high-level process behind it.
10:41
And then, if you take a look at an example of the memory allocations: here's a small heap allocation of a float array. This all happens in LLVM IR; I'm just showing C-like code to make it easier to understand. We would add such a TypeART alloc callback,
11:04
where we need the data pointer, of course. And then we need a so-called type ID. It's just a representation of what we allocated. That is later used for type checking. And of course, we need the dynamic length of the allocated array to reason about where we are in the memory space, so to speak.
11:23
Likewise, we handle stack and global allocations. For stack allocations, of course, we have to respect the automatic scope-dependent lifetime properties. And for globals, we just register once, and then it exists in our runtime for the whole program duration. And of course, for performance reasons,
11:43
you can imagine that the fewer callbacks, the better. Hence, we try to filter out allocations where we can prove that they're never part of an MPI call, and we just never instrument those. This is basically possible on LLVM IR by data-flow analysis.
12:02
So in the function foo, we have two stack allocations, and then we try to follow the data flow, where we can see that a is passed to bar. And inside bar, there's never any MPI call. So we can just say: okay, we do not need to instrument this. It is discarded.
12:21
Likewise, for foobar, we can see that b is passed. If it's in another translation unit, we would need to have a whole-program view, which we support. But other tools have to create such a call graph with the required information.
12:42
Anyway, if we have this view, we can see that foobar also does not call MPI. So both stack allocations don't need to be instrumented, which helps a lot with the performance. Okay, so the type ID, which is passed to the runtime for identification,
13:03
works as follows. Built-in types are obviously known a priori, so we know the type layout: float is four bytes, double is eight bytes, depending on the platform, of course. For user-defined types, meaning structs, classes, and so on, we basically serialize the layout to a YAML file
13:23
with the corresponding type ID, of course, so we can match those during runtime. It records the extent, how many members there are, their byte offsets from the beginning of the struct, and also the subtypes, which can then be used for making type queries
13:43
about the layout and stuff like that. And then, of course, MUST needs to have some API to figure out type correctness, and this is provided by our runtime, which has quite a few API functions.
14:04
The most basic one would be this typeart_get_type, where you put in the MPI buffer handle, and what you get out is the type ID and the array length. And then you can use the type ID subsequently. For instance, in this call, where you put in the type ID
14:22
and you get out the struct layout I just mentioned earlier. This way, you can assemble some iterative type checking, which is what MUST does. And then, putting it all together: if you want to use our tooling,
14:45
you would first of all need to compile your program with our provided compiler wrapper, which is a bash script that does the bookkeeping required to introduce the instrumentation, the TypeART part.
15:01
So you exchange your compiler; that's the first step. It's optional: you don't have to do it if you don't need the local TypeART checking. And then you would also need to replace your mpiexec or mpirun, depending on the system, with mustrun, which also does some bookkeeping for MUST to execute the target code appropriately,
15:24
spawn the whole agent-based analysis network, and so on. And then the program runs as normal, and a MUST output file is generated with all issues found during the execution of your program.
15:40
And as a side note, maybe: as I said, MUST sets up this agent-based network, and in the most simple case, for the distributed analysis, there's an additional process needed for the deadlock detection and so on. So for SLURM or whatnot, you need to allocate an additional process.
16:01
However, you don't need to specify it in the mustrun invocation; it happens automatically in the background. All right, so that's it. If you look now at what the impact of our tooling is, well, that's quite dependent, as I kind of alluded to, on how many callbacks you have,
16:22
how many memory allocations you actually have to track, and how good we are at filtering them. So here are two examples, LULESH and Tachyon, which are, again, quite well-known HPC benchmarking codes. And LULESH is quite favorable for our presentation because there are not many callbacks,
16:42
and hence our runtime impact is quite non-existent, so to speak. You can see this compared to vanilla, without any instrumentation, without our tooling: TypeART alone almost has no impact, and then with the TypeART analysis enabled
17:02
it has, yeah, almost no additional impact. For Tachyon, the picture looks quite different, as you can see: there's an overhead factor of about three when you introduce TypeART. This is because there are a lot of stack allocations
17:20
that we cannot filter, so we track a lot of stack allocations, and the runtime impact is quite high. And this is reflected by these runtime and static instrumentation numbers. So first of all, the table above shows you what we instrument during compilation.
17:40
So you can see that there are some heap and free operations that we find and instrument. There are some stack allocations and globals that we instrument. Well, of course, those numbers do not represent the runtime numbers, because heap and free operations are sometimes written in a way
18:01
such that they are centralized in the program. That's why those numbers are not as high as you would expect. For stack allocations, we find 54, and out of those 54 we can filter, for LULESH at least, 21%. And globals are much easier to follow
18:21
along the data flow in LLVM IR, so we can filter much more, and much more effectively. Well, going to the runtime numbers, which are basically the number of callbacks that happen during our benchmarking, we can already see that the high overhead which we observed in Tachyon
18:45
is to be explained by the almost 80 million stack allocation callbacks, basically, that we have to track during runtime, which is a lot of context switching and so on, which is not good for the runtime.
19:04
All right, so this is already my conclusion. What we have achieved, basically, is that with TypeART, MUST can now check all phases of the MPI communication with respect to type correctness. So the first phase, which MUST could already check, is this one,
19:21
which is basically the message transfer that is checked. However, there's also the phase of message assembly, where you kind of go from the user process into the MPI process, and you have to check this. And of course, if you think about it, you would also have to check the message disassembly,
19:41
where you go from the received data to your user program again. So TypeART enables these kinds of local checks to ensure type correctness. Thank you very much.
20:05
Any questions? Getting my exercise. Yeah, so I really liked your talk.
20:25
I thought it was really interesting. So one thing I wanted to ask was: how does one get MUST? How does one install it? Is it available from distribution package managers, or is it more that you have to compile it yourself? Good question.
20:41
I think you have to compile it yourself on our HPC system. But it's not that tedious to compile, I think. Maybe I'm biased. But just go to the website, and there's a zip file. It includes every dependency that you need.
21:02
And I think the documentation is quite straightforward. You need, of course, maybe OpenMPI installed, but not much more, to be honest. And then you should be good to go. Yeah, I think it's CMake-based. I don't know if you have problems with that, but yeah, no, it should be straightforward to try it out.
21:22
Thank you. Okay, another question there on my way. So on the type analysis that you do,
21:43
I mean, if you look at malloc, and it has like a typecast, then you know what the type is. But if it doesn't have a typecast, if you malloc into a void pointer, and if the amount of bytes you're allocating comes from some constant or macro or some argument, how far do you follow? And if you can't see it, do you have a warning?
22:00
Do you crash? That's a good question. And that's basically a fundamental problem, right? We have to have some expectations of the program, right? So our expectation is that the malloc calls are typed. Otherwise, we would just track it as a chunk of bytes.
22:26
And I think our analysis is quite forgiving. So we would just look at, okay, this is a chunk of bytes. It fits the buffer, and this is fine.
22:40
Even if it's not at the beginning, but inside? And also, if it's, like, aligned to the beginning of an element, right? Yes, you kind of lose that, right? If you just know it's a chunk of bytes, then you kind of lose the alignment checks. Because, say you malloc a struct,
23:01
and then you do some pointer magic for your MPI buffer and point between members, into the padding area: only if TypeART knows about the malloc'd struct can it, of course, warn that you are doing some illegal memory operation. If we just see a void pointer due to the typeless malloc,
23:24
then we have lost, basically. Anyone else? Do you have any thoughts on using Rust, which is a way more memory safe language
23:41
than C and C++, have you looked at it? Not really, not yet. For now, we have so much to do in the C and C++ world to support typing better, to get more robustness and so on. So not yet, to be honest, yeah.
24:02
Maybe all that work becomes irrelevant if Rust gets popular enough. I think, in general (maybe I'm completely wrong, I'm not a Rust expert, a newbie when it comes to Rust), the MPI support itself is still in the works. No idea. I read some papers about generating bindings for MPI,
24:22
which are inherently type safe. Not sure how that goes, but yeah. I think everyone will be happy if Rust or some other type safe language becomes more used by people, and then this kind of work is irrelevant. But while people still use C++,
24:43
this is very relevant work, yes. That pays my bills, you know? Okay, thank you very much. Thank you.