Introduction to Scalasca: A Performance Analysis Toolset for Parallel Programs
Formal Metadata

Number of Parts: 199
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/32551 (DOI)
Language: English
Transcript: English (auto-generated)
00:04
Okay, welcome. We're going to talk about performance analysis.
00:32
Good. What we have here is that if we look back to the past, 10 or 15 years ago,
00:43
what we had were machines that were number one in the world, and they are probably about as fast as your notebook today. So, what does this mean for us? First, we know that single-core performance is pretty much done.
01:02
We have been increasing the number of cores on our machines, and we have different things: we have GPUs, we have accelerators, we have all those kinds of things. So, we should be taking parallel programming seriously right now. And when we have this HPC track here, HPC people are just the ones who find the problems first.
01:30
The same problems that HPC people find today are what people doing normal programming will find in 10 years. And besides that, it's not exactly getting easier.
01:43
Parallel programming is hard to begin with, and we have new things: we now have those accelerators, we have GPU programming. When we get to larger scales, we always find new problems. Also, with the increase in performance comes an increase in data,
02:04
and no matter how fast our machines are, I/O is still a problem. Besides that, it's hard to understand what we get when we measure the performance of parallel programs. For example, this is an old code called Sweep3D.
02:21
When we run it with 256,000 cores, what each core computes is pretty much the same; the rest is just waiting for communication. Of course, performance analysis is not something new. There are several tools, but when you start to get to the scales we have today, petaflops,
02:47
and exaflops eventually in the next five years or so, most performance tools stop working. For example, KOJAK is an old performance analysis tool. It's very old, it's not maintained anymore; it's like 16 years old.
03:06
My group has a performance tool called Scalasca, which started seven years ago. And what's the idea of it? The idea is to have performance analysis that scales itself.
03:22
As you get faster code, we have a tool that can actually keep up with the code. It's work from the research centre Jülich in Germany, which has one of the biggest supercomputers in Europe. And so it has to run there.
03:41
What about it? It's New BSD licensed, it's portable, it runs on pretty much everything: it runs on a PC with Linux, and it runs on the biggest supercomputers in the world. It supports Fortran and C/C++; it supports MPI, OpenMP, and both combined.
04:01
And what does it have that other performance tools don't? Our analyses are scalable; they really go to large scales. We can search automatically for wait states in your codes. And we show reports of parallel performance.
04:24
And more important than that, it's insightful: you can actually understand what's going on. What do I mean by insightful? Let's see an example of a code. This is a very simple code shown with another tool called Vampir, developed in Dresden, where one process sends a message to everybody, everybody answers, and this is easy to understand, right?
04:46
But then, if you have a real code, it looks more like this. Or this. Or this. So it's not actually that easy to understand.
05:01
So let me show you our workflow. What do we have? We have our source code. You instrument your source code, you get a binary, you run this binary. This binary will be measured, you get a summary report of it, and then you have our GUI tool called Cube that shows you what the problem is,
05:23
where in the code, and in which process of the execution. With that, you iterate: you optimize your measurement, and then you run a full trace of the thing to find all the problems in your code.
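To make that workflow concrete, here is a minimal sketch of what a session might look like for a toy MPI program. The program is hypothetical, and the scalasca convenience commands in the comments follow classic Scalasca usage, but exact names, flags, and experiment directory names vary between versions, so take this as an illustration rather than a reference.

/* ring.c - a hypothetical toy MPI program to measure.
 *
 * A possible Scalasca session (command names vary by version):
 *   scalasca -instrument mpicc -o ring ring.c    # build an instrumented binary
 *   scalasca -analyze    mpirun -np 4 ./ring     # run it, collect a summary report
 *   scalasca -examine    <experiment-directory>  # browse the report in the Cube GUI
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pass a token around a ring: simple point-to-point communication
       that shows up in the summary report. Run with at least 2 ranks. */
    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    printf("rank %d passed the token\n", rank);
    MPI_Finalize();
    return 0;
}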
05:40
As I mentioned before, and as I showed in the picture, we are not alone. There are several other tools. The tools are different, but they have the same needs: we all need to measure, we all need to instrument, we all need to do a lot of the same things. And this created a problem in our community, which was that we had a lot of people doing essentially the same thing.
06:05
And this is duplicated effort. Duplicated effort is something that we don't like; no one likes it. And another problem is that, with all those different tools, you end up measuring your code with several of them
06:23
just because this tool might give you one insight, and this other tool might give you something else. This is, of course, complicated. This is complicated for the sysadmin, this is complicated for the user. So what happens? For example, this was the state of some performance tools two years ago.
06:44
You have Paraver, you have Vampir, you have Scalasca, you have TAU. And then you have converters between those tools, you have instrumenters for those codes. So, this was very confusing for us. So what we did: we got all those groups together and created something called Score-P.
07:08
Score-P is essentially... actually, we have one of the original developers of it here. Thanks for coming. It's a common community project where we took all the common infrastructure of all those tools,
07:23
everything we need that is common to all of them, and we are developing this together. So, we share: we share the instrumentation system, we share the measurement system, we share the trace format and the report format. The idea is to unify effort,
07:42
and so every group can build their own unique tools on top of it. So who uses it right now? We do, in Jülich, in Germany. People from Dresden use it, people from Munich use it, and people from the United States, from Oregon, use it.
08:05
So, again, what happened? We had this mess, and this is what we came out with: what we have now is one instrumenter module, a single trace format, and just one converter tool.
08:21
We have profiles and GUIs. So, what do we measure? As I mentioned at the beginning, we measure MPI, OpenMP, and your own code's functions. We can create MPI profiles. This is very simple: you just link your code against our library, and this runs really fast. It doesn't disturb your execution that much.
08:45
This shows how many times a function was called, how much data was sent, and how much time each function took. It's like a profiler, but parallel. We can also create a call-path profile, which means you can actually see who calls whom and when.
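The link-only MPI profiling he describes works because the MPI standard provides the name-shifted PMPI interface. Here is a minimal sketch of that underlying technique, not Scalasca's actual code: the profiling library defines MPI_Send itself, records the event, and forwards to the real implementation through PMPI_Send.

/* A minimal sketch of PMPI interposition, the standard mechanism that
 * lets a profiling library measure MPI calls with no recompilation:
 * link this before the MPI library and MPI_Send is intercepted. */
#include <mpi.h>
#include <stdio.h>

static long send_calls = 0;    /* per-process counter for MPI_Send */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;                               /* record the event */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    printf("MPI_Send called %ld times on this rank\n", send_calls);
    return PMPI_Finalize();                     /* forward to the real call */
}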
09:05
This needs recompilation, because we have to instrument the function calls. And this creates, of course, some overhead. So, after running this, you might need to filter and only instrument the functions you actually want to know about,
09:22
especially in C++, for example. And then we have tracing. Tracing is when you actually record every single event in your code, and this can show you inefficiency patterns. But of course, this is heavy: heavy on data and heavy on time.
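One way to keep that overhead down is to mark by hand only the regions you care about instead of instrumenting every function. Here is a hedged sketch using the Score-P user-instrumentation macros; the macro names follow the Score-P user API, but check the documentation of your installed version:

/* Build with user instrumentation enabled, for example:
 *   scorep --user mpicc -c solver.c
 * (wrapper and flag spellings may differ between versions) */
#include <scorep/SCOREP_User.h>

void solve_step(double *grid, int n)
{
    /* Measure only this named region in detail, rather than
       every function call in the whole program. */
    SCOREP_USER_REGION_DEFINE(solve_region)
    SCOREP_USER_REGION_BEGIN(solve_region, "solve_step",
                             SCOREP_USER_REGION_TYPE_COMMON)

    for (int i = 1; i < n - 1; i++)
        grid[i] = 0.5 * (grid[i - 1] + grid[i + 1]);

    SCOREP_USER_REGION_END(solve_region)
}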
09:42
So, it's important to filter. Our tool is parallel in the data collection, in the reduction, and in the analysis. It imitates your code: if your code uses a million cores, our tool will use a million cores.
10:03
We have everything parallel, well, except the visualization, but it's scalable. So, what can you measure? These are classic examples of MPI patterns. The first one is the late sender, when a process took too long to send a message,
10:23
and the other process is already waiting. So, this delay, this waiting time, is measured. You have the late receiver, which is the opposite. You have messages in the wrong order, and you have barrier operations. And all those red arrows are delays. So, for example, this is how our tool shows a late sender.
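To make the late-sender pattern concrete, here is a toy illustration (not from the talk): rank 1 posts its receive immediately, while rank 0 computes for a while before sending, so the time rank 1 spends blocked in MPI_Recv is exactly what the analysis classifies as late-sender waiting time.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Run with at least 2 ranks. */
    if (rank == 0) {
        sleep(2);   /* stand-in for computation: the sender is late */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The receive is posted right away; the ~2 s spent blocked
           here is the late-sender wait state. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}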
10:47
It shows the problem, where it is in your code, and where it is in the machine that is running it, in this case a Blue Gene/P. This is another way of viewing things: creating a topology of your application that matches your functions.
11:08
So, for example, this code is a code for simulating sea ice. As you can see, there is a lot of waiting in the middle, away from the coast,
11:20
because, well, there is no ice in those places. What else? We measure direct wait time, meaning, when you have someone sending a message, the time the other process spends waiting for it,
11:43
we can measure that. We can also measure indirect wait time. So, what does it look like? Again, it's a matter of going to the GUI and seeing what we have as direct wait time. In this case of the sea-ice example, it's a code from other researchers in Jülich.
12:01
You see that for direct waiting time, you have a lot around the Arctic and Antarctic regions and on the coast, and you have little to no time on the open sea. On the other hand, as you are not computing any sea ice in the tropical part, you have a lot of indirect waiting time there.
12:22
What else? This was some very interesting research from a couple of years ago. Our tool is the only tool right now, as far as I am aware, that shows you where a delay was created. If you have a delay in your code, it might happen that this delay cascades and causes further delays.
12:43
So our tool shows you where it started. For example, in process A at the beginning, you have a delay, and this delays everything else; our tool shows this. Besides that, one of the machines we use a lot is a Blue Gene/Q.
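Here is a toy illustration of such a cascade (again not from the talk): in the pipeline below, only rank 0 creates a delay; rank 1 waits directly on rank 0, and every later rank waits on its predecessor even though those predecessors did nothing wrong. The root cause of the waiting observed on the last rank is the delay created on rank 0, which is what the delay analysis points at.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, token = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Run with at least 2 ranks. */
    if (rank == 0) {
        sleep(2);   /* the root cause: a delay created only on rank 0 */
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {
        /* Each rank waits for its predecessor, so rank 0's delay
           propagates down the pipeline as indirect waiting time. */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank + 1 < size)
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}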
13:04
The Blue Gene/Q has a network that is a 6D torus network. It's hard to visualize six dimensions. So what we do is fold those dimensions, and we can still get some insights out of it.
13:20
You can move them, you can select different ways of seeing them. And this is very interesting because, for example, the K computer in Japan also has a 6D network. What else? Well, so where are we heading now?
13:45
Energy, because my boss is getting tired of paying our electricity bill, which is around 10,000 euros an hour. So we have to produce code that is aware of energy. Another thing is to bring performance analysis of parallel applications to the community.
14:06
So what does that mean? As I mentioned at the beginning, the open source world is starting to need this. You have to program in parallel right away. So that's why we are here.
14:22
Seriously, we are here for that. And, well, essentially that's it. For any questions, we have a mailing list; there are a lot of very smart people wishing to answer your questions. And this is our website.
14:40
Actually, that's it. Any questions? Yes, there is some initial support for CUDA already. We can manage the CUDA events.
15:02
But it's something that is already happening. You can get CUDA events and the function calls that are dispatched. Also, we already have some initial support for the Xeon Phi, which is also an accelerator.
15:25
Actually, on that topic, there was supposed to be support for the Xeon Phi, but it's not there yet. It's not there yet; we are working on this. What's the timeline on this? It's ready when it's ready. I also have a question.
15:43
So you mentioned there are different performance tools out there, and now some of them are sharing a common base, but that's not everything yet. Do you still think it makes sense to have all these different performance tools? Is each of them a little bit better in some aspects? Actually, yes, because there is no single answer.
16:03
You might... There are different... Because performance analysis in itself today is more an art than a science. And so there are different things that you measure. So, for example, our tool gives you a very direct answer to those questions.
16:22
But if you want a more detailed answer, you might use a different tool. Or if you want a different granularity; all those things are different. So yes, there are different goals. Does it basically go from very user-friendly with not that much detail to very... Very specific, exactly. So if you try to do everything in one single tool, you end up with such a monster that...
16:44
Well, yes. Alright, thank you very much. Thank you.