HPC Node Performance and Power Simulation with Sniper
Formal Metadata

Title: HPC Node Performance and Power Simulation with Sniper
Number of Parts: 199
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose, as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/32550 (DOI)
Transcript: English (auto-generated)
00:00
This is just a little peek ahead into what we're going to talk about today. I'm a PhD student at Ghent University, and this is some of the research work that we've been doing in collaboration with many other people, of course. This is a visualization tool that is a little bit different from the talk we saw previously. The previous talk, on Scalasca, was looking at the MPI stack and at multiple nodes.
00:24
Here we're going to take a dive into a single node. And we're going to look at performance of cores and multi-core systems, because they're getting quite complicated these days. OK, great.
00:46
So what are the major goals of Sniper? Well, Sniper is the tool that we develop at Ghent University. The main goal for us is to figure out what performance is going to look like on next-generation machines.
01:01
So say there's this new Xeon Phi coming out. I've heard about it. I want to maybe try to change the settings on my computer and see how my HPC workload is going to perform on that new computer. So that's one type of thing you can do. Another thing you can do is do hardware software
01:21
co-design. This is something we were really lucky with, because we were working closely with Intel. The idea there was: can we change the software and the hardware at the same time, and do better than changing only one or the other alone? We'll go into this in a little bit
01:41
in some of our research. But the thing I'm targeting most today is: how is my application performing? And this goes back to the insight we heard about in the last talk. Insight into the application can be very difficult to get. Why is my application not performing
02:01
as well as I expected? So we're going to go into a little bit of detail on that. OK. What we do at our research group at Ghent University is try to design tomorrow's processors using today's hardware with a simulator. But today we're going to talk about optimizing tomorrow's software for tomorrow's processors.
02:23
And simulation, which is what Sniper does, is one promising solution. So now we can have detailed analysis of our application and of our hardware, and we can see how they interact. We can do architectural exploration. And we can do early software optimization before the hardware is available.
02:43
OK, but come on, Sniper and simulation, this doesn't make any sense. Why can't I just use performance counters? Come on, they're really great, right? Why can't I just use Cachegrind and see how my cache misses look? This shouldn't be a problem, right? Well, it turns out that using these methods
03:01
can be really good for software optimization, but difficult for hardware/software co-optimization when you're trying to look at performance. The problem is that not all cache misses are alike. This is basically computer architecture 101: sometimes you have long-latency loads, and sometimes the cache misses are not very important.
03:24
So a miss doesn't really affect the performance all the time. Modern out-of-order cores can overlap these misses, so you don't really know which ones are important and which ones are not. And both the core performance and the cache performance matter. So just because we have a cache simulator
03:40
doesn't mean that we'll understand how the core performance looks. And actually they're really tightly intertwined. So that's why we developed Sniper. Node complexity is also increasing. We have large changes happening in HPC nodes. We have large numbers of threads with Xeon Phi, for example. We have cache-coherent NUMA.
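As a side note on the point above that overlapping misses make raw miss counts misleading, here is a toy sketch of the effect; the latency, miss count, and overlap factors are invented for illustration and are not Sniper's actual model:

```python
# Toy illustration: the same number of cache misses costs very different
# amounts of time depending on how many can be overlapped by an
# out-of-order core. Numbers are made up for illustration only.

MISS_LATENCY = 200      # cycles for a load that misses to DRAM (assumed)
NUM_MISSES = 1_000_000

def stall_cycles(num_misses, overlap_factor):
    """Rough stall estimate when `overlap_factor` misses proceed in parallel."""
    return num_misses * MISS_LATENCY / overlap_factor

serialized = stall_cycles(NUM_MISSES, overlap_factor=1)   # dependent, pointer-chasing loads
overlapped = stall_cycles(NUM_MISSES, overlap_factor=8)   # independent, streaming loads

print(f"serialized misses: {serialized/1e6:.0f} M stall cycles")
print(f"overlapped misses: {overlapped/1e6:.0f} M stall cycles")
# A cache simulator alone reports the same miss count in both cases,
# but the performance impact differs by the overlap factor.
```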
04:01
And now what we wanna do is optimize for efficiency of our software. One of the trends that we notice as computer architects is that the number of cores per node keeps increasing. Back in the day, 2001 saw the first dual-core processor, the POWER4 from IBM.
04:21
And the first x86 dual-core was in 2005 from AMD. Then fast forward to 2011, we saw 10-core processors. And now we have 60-plus cores with Knights Corner. And then Intel just recently announced the Knights Landing processor. And my guess would be that we'll
04:42
see even more cores than what we saw in the first version of the Xeon Phi. But we also have many different architecture options. So this is a typical processor configuration that you would see in a multiprocessor node. This is a four-socket node.
05:01
And each socket has four cores sharing an L3 cache. So this is a typical configuration of a node that you'll see. But we also see things like this, which are much different from the typical processor architectures, and this is very similar to the Xeon Phi style. Another thing that we've seen,
05:21
and we've heard a lot about this today, is future systems being diverse. So we have varying processor speeds, varying microarchitectures. We've also heard a lot about failure rates. I won't talk much about that in this talk, but these are things that simulators allow you to explore much better than being able to do this
05:42
on the hardware itself. Because of NUMA effects, and I won't talk about NUMA too much, you now have memory that doesn't always look like it's at the same distance. So you can have some memory accesses that appear to be much further away; it takes much longer to get to that data.
06:02
So now what we need are solutions, in both the software space and the hardware space, to solve these challenges. Our work originated from analytical models. What this means is understanding your hardware, understanding your application, and coming up with a formula that represents the performance of this application.
06:23
But the current state of the art for analytical models does not provide the level of detail that we need for today's complex applications and complex microarchitectures. So we propose fast and accurate simulation.
06:41
And I mentioned pre-silicon software optimization, but today I'd like to focus on software optimization and software insight. Cycle-accurate simulation tends to be too slow for exploring this design space. There are a few types of simulation. Simulation basically means emulation:
07:02
We're pretending that we're a different type of machine that hasn't been invented yet. And you can do that at the cycle level, very precise, or you can take the abstraction up a little bit and you can do some approximations and you can get very close to the result of a cycle-level simulation. So what we did in Sniper was we sort of raised the level of abstraction to give us a very similar result,
07:21
very accurate result, but in a much faster way. Okay, so here's Sniper. It uses a hybrid simulation approach. We're a parallel simulator. We have a core model based on analytical models. We validate against hardware, and we're also Pin-based. I'm not sure if people here are familiar with Pin.
07:43
Pin is a dynamic instrumentation tool. What you can do is take Pin, wrap it around your application, and then gather statistics. And so what Sniper does is use Pin as one of its front ends to gather statistics about your application and model these new architectures.
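For readers who want to try this, launching a Pin-backed Sniper run from a script might look roughly like the sketch below; the run-sniper launcher name, the flags, and the gainestown configuration name are assumptions based on the publicly distributed Sniper package and may differ in the version you download:

```python
# Hypothetical sketch of launching a Sniper simulation from Python.
# Assumes Sniper is installed and its run-sniper launcher is available;
# flag names and the config name are illustrative, not guaranteed.
import subprocess

cmd = [
    "./run-sniper",
    "-n", "8",                 # number of simulated cores (assumed flag)
    "-c", "gainestown",        # example microarchitecture config (assumed name)
    "-d", "results/run1",      # output directory for stats/visualization (assumed flag)
    "--", "./my_hpc_app", "input.dat",
]
subprocess.run(cmd, check=True)
```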
08:03
And we scale with the number of cores, and you can download it right now at snipersim.org. Okay, we've got lots of fun features. We support MPI, we support 64-bit, we're parallel, and we're gonna go into some important things here like CPI stacks and interactive visualizations,
08:20
and that's what I showed earlier. There are some things here that are a little bit technical that I won't touch on today. But okay, I have to say, Sniper isn't perfect. It's not for everybody. We're user-level only, so this might not be the best match for workloads with significant OS involvement. That means databases.
08:41
That means web servers. This simulator would not be good if you're trying to simulate a web server application. I'll skip over this. We use a high abstraction core model. This means that if you want to understand the nitty gritty details of a processor, you probably don't wanna use our simulator.
09:02
But most people here wanna understand their application, so that's why our simulator will work for you. We're x86 only, but it turns out that all of these limitations are okay for the HPC space. And so that's why we developed Sniper, and that's what we're talking about. Okay, so this is a little bit of history of Sniper.
09:21
So we released our first version in 2011, and we've made many revisions since then, adding lots of fun features. And we've got around 700-plus downloads from researchers. Some people are searching for shoot-em-up games, but we discard those from the download count.
09:42
Okay, so now I wanna talk a little bit about what I feel is the main feature of Sniper for this community, and that's visualization: understanding the application and providing insight. We worked closely with a master's student a year or two ago, and
10:02
the first question he put to us was: come on, this is the kind of output that you get from Sniper, it's not very interesting. Who wants to read all these numbers? We've got to do something about this. So the very first type of visualization that we introduced with Sniper is called
10:21
the cycle stack, or CPI stack. Basically, the problem is that modern out-of-order microprocessors are very difficult to understand. Where are we losing cycles? Why is it slow? I don't understand what's going on. So a cycle stack, or a CPI stack, is one way to understand where our lost cycles are going.
10:42
And so what we have is we have different components that represent the different reasons why we lose performance. So there's a base component, which really represents the fastest speed of the microprocessor, and then we have the branch predictor, and we have some instruction caches and other caches, and then there could be other components as well.
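As a rough illustration of how such a cycle stack is put together, the sketch below attributes invented per-component cycle counts to their causes and normalizes by instruction count; the component names mirror the ones mentioned in the talk, but the numbers are made up:

```python
# Toy CPI-stack construction: attribute lost cycles to the component
# responsible for them and normalize by instruction count.
# All numbers are invented for illustration.

cycles_by_component = {
    "base":      40_000_000,   # issue-width-limited "ideal" execution
    "branch":     5_000_000,   # branch misprediction penalties
    "l1i / l1d":  8_000_000,   # first-level cache misses
    "l2":        30_000_000,   # L1 misses served by the L2
    "dram":      10_000_000,   # off-chip memory accesses
    "sync":       7_000_000,   # waiting on locks/barriers
}
instructions = 80_000_000

total_cycles = sum(cycles_by_component.values())
print(f"overall CPI: {total_cycles / instructions:.2f}")
for name, cycles in cycles_by_component.items():
    share = 100.0 * cycles / total_cycles
    print(f"  {name:10s} CPI {cycles / instructions:.2f}  ({share:4.1f}% of time)")
```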
11:04
So this is a really good start. This is a single thread here, and it shows the components. Here we can see there's quite a large component that is the L2 cache. That means that we hit in the L2 cache, and that hitting in the L2 cache causes us this much performance penalty,
11:22
and that's about 33% of the time. Okay, so the next step: why don't we look at it over multiple threads? So I have a lot of names here on the right, but basically we wanna focus on one component, which is these red bars right here in the middle.
11:44
So what we see is that these four threads on the left have a very small red component and a very big dark blue component. If we look over here, the red component is off-socket memory, and the blue is waiting in synchronization because of a barrier.
12:01
So it turns out that the data is residing on the first socket. So the simulator can tell you with these cycle stacks, ah, I can access the data quite easily, but because of NUMA effects, this socket, this one over here, is unable to get that data as fast, and that's where the performance hit is occurring.
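A back-of-the-envelope sketch of why that off-socket component grows: if a thread's misses are served by DRAM on another socket, the memory part of its cycle stack scales roughly with the latency ratio. The miss rate and latencies below are assumptions for illustration only:

```python
# Toy NUMA model: memory CPI contribution = misses-per-instruction * latency
# (ignoring miss overlap). Latencies and miss rate are invented for illustration.

MISSES_PER_KILO_INSN = 5           # last-level cache misses per 1000 instructions
LOCAL_LATENCY  = 150               # cycles to local DRAM (assumed)
REMOTE_LATENCY = 300               # cycles to DRAM on another socket (assumed)

def memory_cpi(latency):
    return MISSES_PER_KILO_INSN / 1000 * latency

print(f"memory CPI, data on local socket : {memory_cpi(LOCAL_LATENCY):.2f}")
print(f"memory CPI, data on remote socket: {memory_cpi(REMOTE_LATENCY):.2f}")
# In this toy model, threads whose data lives on another socket see roughly
# a 2x larger 'off-socket memory' component in their cycle stack, and spend
# the extra time either stalled or waiting at the next barrier.
```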
12:22
There are other interesting things that you can look at. For example, you can compare different input sets, and you can compare scaling from eight cores to 16 cores to see how much time you're spending in synchronization versus actually doing computation.
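To illustrate that kind of scaling comparison, here is a toy model of why the synchronization share of a cycle stack tends to grow when going from 8 to 16 cores on a fixed problem size; the cycle counts are invented:

```python
# Toy strong-scaling comparison: with a fixed problem size, compute time
# shrinks with the core count while synchronization overhead does not,
# so its share of the cycle stack grows. All numbers are invented.

COMPUTE_CYCLES_TOTAL = 64_000_000_000   # total work, split across cores (assumed)
SYNC_CYCLES_PER_CORE =    500_000_000   # barrier/imbalance cost per core (assumed)

for cores in (8, 16):
    compute = COMPUTE_CYCLES_TOTAL / cores
    total = compute + SYNC_CYCLES_PER_CORE
    print(f"{cores:2d} cores: {100 * SYNC_CYCLES_PER_CORE / total:.0f}% "
          f"of each core's time spent in synchronization")
```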
12:41
Okay, so what I want to do is go to our most advanced visualization feature in Sniper, which is a website that's automatically generated after you run your simulation, and it contains a few different things. And when we started this research work,
13:01
I was surprised at how difficult it was to get a simplified view. We started off, and we had all these buttons, and graphs, and charts, and things like that, but actually taking things away was much more difficult than I expected. So we have a few different components here. We have the main view in the middle, and this represents time on the x-axis,
13:21
and the CPI, or the lost cycles: where am I losing time? Why is my CPU slow? On the left here, we have some options. On the right, we have different components, and these components are color-coded in the most recent version, such that red components mean lost cycles due to the core.
13:41
Yellow is branch predictor. Green is because of the memory subsystem, or cache hierarchy, and blue is synchronization with other threads, or other cores. Down here, we just have the performance of the system in instructions per cycle.
14:00
So this is a metric that computer architects like to use. But what's also cool about Sniper is that we've got the performance; what about energy? And this was touched upon in the earlier presentation as well. What we're able to do is integrate Sniper with some other academic tools, in this case McPAT.
14:22
And McPAT is a tool that allows you to do some high-level exploration and understanding of your application and the microarchitecture, and see where your power is going. So in this case we have a stack, very similar to a performance stack, but now we're looking at power, and where the power is going
14:42
during each of these intervals, over time. And there was another interesting feature, which I thought was cool but doesn't have as much benefit as the previous two features, and that is a 3D view of the performance of your cores.
15:03
So now you can get a quick and easy view and see which cores here on the z-axis are doing better or worse, and in this case we have a pretty homogeneous application, so we don't see any dips. But it's possible that, for example, with the example we gave before, you'll see a dip in performance
15:21
because of off-chip accesses, and you'd see the dip here, where the y-axis dips down and you'd see much lower performance. So first we have performance analysis of the application, but we also have a view of the system.
15:41
What does the system that we're simulating look like? So this is an automatic topology generation. These are all the microarchitectural structures, the nitty-gritty details of your computer. So we have the different levels of cache, the L2 cache, the shared L3, and the DRAM. What happens is, if you mouse over one of these components,
16:01
it will show you a sparkline of the activity for that component over time. So now you can go in, you can look at the different components, and you can do comparisons and see how the application is using your microarchitecture. So then we have a little bit more experimental research
16:23
that we were working on, and the idea for this research is, how can we analyze the entire application in a more straightforward way so that we can understand what's going wrong in the application?
16:40
So what we have here on the x-axis is time, and on the y-axis is the number of instructions. Normally you would expect some sort of linear behavior to occur, where if you have more instructions, it'll just take a proportionally longer time to execute those instructions. But what happens is, there are some outliers,
17:00
and that means that you're spending more time in these functions compared to other functions. So maybe these are the functions that you would wanna spend some time investigating. I also just want to touch on the roofline model. This is a very interesting model that was developed in 2009 by
17:23
David Patterson and some other folks. What they came up with is very interesting for high-performance computing workloads. A roofline model consists of two components, and these bound the maximum attainable performance
17:40
that you can achieve on a node. It's basically the intersection of two lines: the peak memory bandwidth, which is this line, and the peak floating-point performance of your cores. And with the intersection of these two lines, we have this space here, and we have a few functions. Each of these dots represents a function.
18:02
And how close we get to the top of this performance or bandwidth roof shows how close we are to the theoretical maximum performance of this processor. So now we can have an understanding: okay, how well are we doing? Is there any room to grow here? And we can use this methodology to find out.
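A minimal sketch of the roofline idea: a function's attainable performance is bounded by the lower of the compute roof and the bandwidth roof scaled by its arithmetic intensity. The peak numbers and example functions below are placeholders, not measurements from the talk:

```python
# Minimal roofline: attainable GFLOP/s = min(peak_flops, peak_bw * intensity).
# Peak values and the example functions are placeholders.

PEAK_GFLOPS = 200.0      # compute roof of the node (assumed)
PEAK_BW_GBS = 50.0       # memory-bandwidth roof in GB/s (assumed)

def attainable(intensity_flops_per_byte):
    return min(PEAK_GFLOPS, PEAK_BW_GBS * intensity_flops_per_byte)

functions = {          # name -> (arithmetic intensity, achieved GFLOP/s), both invented
    "stencil_step": (0.5, 20.0),
    "dense_matmul": (8.0, 150.0),
}
for name, (ai, achieved) in functions.items():
    roof = attainable(ai)
    bound = "memory-bound" if roof < PEAK_GFLOPS else "compute-bound"
    print(f"{name:12s} roof {roof:6.1f} GFLOP/s ({bound}), "
          f"achieving {100 * achieved / roof:.0f}% of it")
```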
18:24
We have a few more minutes. So what I want to just touch on is some of our research on hardware/software co-optimization. The main idea is, if you have Xeon Phi-style cores, for example, is there a way
18:42
to do hardware/software co-optimization such that you get a better performance result? What I mean by that is, say you've got lots of options: small cores, stacked RAM, big cores. Which architecture do you choose? There are just a lot of options now: Xeon Phi, GPUs, although Sniper
19:03
doesn't have those GPUs yet. But basically what I'm saying is there's a large variety today, and it's getting even more complicated to understand. So what we do is we use Sniper to understand the complexity, and then to make it easy for us to understand what the right solution is. So for the HPC crowd, what we did was
19:22
we looked at a stencil computation. What this stencil computation does is heat transfer on a 2D mesh. Each of these points can transfer heat to neighboring points of the mesh. But maybe you wanna go ahead in time,
19:42
a few steps in time. And you want to compute, instead of just one iteration, maybe you wanna compute two or three or four. Okay, fine, you can do that. But what that means is you need extra data around the block that you're computing
20:00
in order to move forward in time without doing extra communication. So the point is: before, every time step you'd have to communicate with your neighbors. Now what we're saying is, let's not communicate with the neighbors; let's do extra computation instead. And with that, we're doing wasted computation, because my neighbor, these dark black dots,
20:23
my neighbor's actually also doing that work. So does it make sense for both of us to be doing that work? Well actually, it might. So here we have a roofline model again, with the floating-point performance and the bandwidth. And what we see here is that as we increase the arithmetic intensity,
20:41
which basically means doing more redundant computation, at some point we get diminishing returns: we're doing so much redundant computation that we're not making up for the lost performance. So basically around two, doing two time steps makes sense,
21:01
but any more doesn't make sense. Basically, to summarize: if you do co-optimization, you can do better than optimizing just software or just hardware alone. And I just wanna finish up by saying Sniper is available to download.
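As a rough back-of-the-envelope model of that trade-off (not the exact analysis from this work): doing t time steps per tile before communicating raises arithmetic intensity, but the halo region that must be recomputed grows with t, so the benefit flattens out:

```python
# Toy model of temporal blocking for a 5-point 2D stencil on a B x B tile.
# Doing t steps locally needs a (B + 2t)^2 input region; the halo work is
# redundant (a neighbour computes it too). All constants are illustrative.

FLOPS_PER_POINT = 5        # rough cost of one stencil update (assumed)
BYTES_PER_POINT = 8        # double precision

def model(tile, steps):
    # points updated across all local steps, including the shrinking halo
    computed = sum((tile + 2 * (steps - k)) ** 2 for k in range(1, steps + 1))
    useful = tile * tile * steps
    # data exchanged once per block of `steps`: read the enlarged tile, write the tile
    bytes_moved = ((tile + 2 * steps) ** 2 + tile * tile) * BYTES_PER_POINT
    intensity = computed * FLOPS_PER_POINT / bytes_moved
    redundancy = computed / useful - 1.0
    return intensity, redundancy

for steps in range(1, 6):
    ai, waste = model(tile=16, steps=steps)
    print(f"t={steps}: arithmetic intensity {ai:.2f} flop/byte, "
          f"redundant work {100 * waste:.0f}%")
# Intensity rises with t, but so does redundant work; once the roofline's
# compute roof is reached, extra redundancy no longer pays off (around
# t = 2 in the experiment described in the talk).
```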
21:25
You can download it today. We have a pretty easy way to get you started, and we have a pretty active mailing list. We try to respond fairly quickly. And if there are any questions, I'd love to answer them. Thank you.
21:46
Any questions? Do you correlate this information with specific functions or source code lines; is that possible at all? So the roofline model that we showed a few slides ago,
22:01
that is function-based. What we're looking into next is how you can go even deeper, maybe down to source lines. Now, this maybe goes back to the comment that was made in the previous talk: do we really want one platform to do everything? We already have tools that do analysis at the source-line level.
22:28
So the problem with that: we could do it, but I'm afraid that it might slow down the simulator too much, and therefore all the benefits of being a fast simulator, a parallel simulator, a higher-level simulator, we'd lose those benefits by moving in that direction.
22:42
So I think it's possible, yes; we've thought about it, but right now we haven't done it. Do you know if people are actually doing that, using different tools and correlating the data? Correlating, I don't follow. Using different tools, one profiling at the source line and the other being the simulator, and then correlating the results.
23:00
So they're correlating different applications, or? Yeah, you run your application with Sniper and you run it on a normal machine. So we've done a validation against hardware, if that's what you mean. We've taken the microarchitectural settings of a Nehalem machine, for example.
23:20
We plug that into Sniper, and then you see the results are very similar. And I don't have the results here, but the accuracy numbers are quite good. I think he means if you use different tools, how can you align the results, right? I think that's very difficult. Yeah, I think that's a broader problem, right?
23:41
Because different tools have different expertise and they make different trade-offs. Therefore, their accuracy will be different for different components. So then how do you compare the accuracies? That's, I think, maybe something we'll take offline. Someone else? Yeah, yeah. So you've validated the simulator
24:04
against existing hardware; are there any plans to validate the simulator against the Xeon Phi? Yeah, that would be really... right now there aren't plans to validate against the Xeon Phi,
24:22
but that's totally possible. Is it? Yeah. So no, we haven't validated against the Xeon Phi, but we do have configuration options that allow you to configure something like it. But yeah, validating against that would be a good thing.
24:45
You mentioned you have MPI support as well? Yeah, on-node MPI. So not off-node? Not off-node, because what we use are the shared-memory interfaces of MPI to do on-node MPI.