HPC Node Performance and Power Simulation with Sniper
Formal Metadata
Title 
HPC Node Performance and Power Simulation with Sniper

Title of Series  
Author 

License 
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2014

Language 
English

Content Metadata
Subject Area  
Abstract 
Sniper is a performance modeling simulator. The goal of Sniper is to provide software developers with an easy way to analyze their applications. We provide both performance and energy/power analysis, as well as advanced visualization support. This talk will cover the basics of how to download Sniper and get started quickly, but more importantly, it will show the benefits that simulating your application can provide. With per-function and detailed simulation analysis, CPI stacks over time, and energy stacks, software developers who want to optimize their applications can now do so easily. They also gain more insight than the performance counter metrics typically available on today's machines provide.
* Downloading Sniper
* Using Sniper
* Visualization and Power Overview
The intended audience is HPC and scientific software developers, but the material also applies to software optimization in general.

00:00
Yes, so, hello. I'm a PhD student at the university, and this is my research work. A large part of it is the visualization tooling [unintelligible]. It lets you dive into the performance of a single application on multicore systems. [unintelligible]
00:46
So what are simulators used for? Sniper is a simulator that we developed at the university. Its main goal is to figure out what the performance of a next-generation system will be [unintelligible], and how a given HPC workload is going to run on it. Another thing you can do is hardware/software co-design. We are really lucky to be able to do this, because we work with Intel. The idea is that we can change the software and the hardware at the same time, and do better than by changing only one or the other. [unintelligible] One of the points is that this is very difficult to do [unintelligible].
02:10
Our research group tries to design [future] processors using modeling and simulation. [unintelligible] With simulation, we can do a detailed analysis of the application on our hardware and see how it interacts with the [microarchitecture]. [unintelligible]
02:44
With standard simulators, what you typically look at are counters [unintelligible]. There is the [cache miss] rate, and you just use that and see how the rest behaves. It turns out that these methods give you really good information for optimization. However, they are difficult to use for hardware/software optimization when you are [tuning] for performance. The problem is that not all cache misses are alike. Sometimes a load miss stalls the processor, and sometimes the miss is not very important, because modern out-of-order processors overlap it with other work. Because of that overlap, you don't really know which misses matter and which do not, yet both of them count the same in the cache statistics. So just because you have a simulator doesn't mean that you will understand where the performance actually comes from, and that's why [unintelligible].
03:51
Complexity is also growing: large changes are happening, there is a lot of research [unintelligible], and now we want to optimize for efficiency.
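The point about cache misses can be made concrete with a toy Python model. This is an illustration, not Sniper code. It assumes that all misses are independent, and that a miss overlaps with any other miss issued within the next `rob_size` instructions while it is still outstanding. The function name, the 128-entry window and the 200-cycle latency are hypothetical values chosen for illustration.

```python
def miss_stall_cycles(miss_positions, rob_size, mem_latency):
    """Estimate stall cycles caused by long-latency load misses.

    Toy assumption: the misses are independent, and every miss issued within
    `rob_size` instructions of an outstanding miss overlaps with it, so each
    overlapping group costs one memory latency instead of one per miss.
    """
    positions = sorted(miss_positions)
    stalls = 0
    i = 0
    while i < len(positions):
        window_end = positions[i] + rob_size
        stalls += mem_latency  # the whole group pays the latency once
        while i < len(positions) and positions[i] < window_end:
            i += 1  # this miss is hidden behind the first miss of the group
    return stalls


# Same number of misses (4), very different cost.
clustered = [0, 10, 20, 30]     # fit in one window: latencies overlap
spread = [0, 1000, 2000, 3000]  # too far apart to overlap
print(miss_stall_cycles(clustered, rob_size=128, mem_latency=200))  # 200
print(miss_stall_cycles(spread, rob_size=128, mem_latency=200))     # 800
```

A plain miss counter reports four misses in both cases. Only a model of the overlap tells the two runs apart.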
04:07
So what we notice here is that the number of cores per node keeps increasing. In 2001 we had the first multicore parts [unintelligible], and the first x86 multicore followed in 2005.
04:27
Then, I guess, in 2011 we had 10-core processors, and now we have 60+ cores with Knights Corner. Intel just recently announced the Knights Landing processors, and my guess is that they will have more cores than the first version.
04:50
We also have many different architecture options. This is a typical processor configuration: a multiprocessor node where each socket has 4 cores sharing [a cache]. This is a typical configuration, but we also see things like this, which are very different from the typical processor.
05:11
[unintelligible] Another thing we see is that there is no [single standard design]: these huge systems are becoming diverse.
05:22
We have various processors with very different architectures. We also [hear] a lot about [unintelligible]. [unintelligible] I will also talk about NUMA. Basically, memory doesn't always look like it's at the same distance, so some memory accesses take much longer. So now we need solutions that [cover all of this]. Our work originated from analytical models: you understand your hardware, and you have a formula that represents its performance. But the current state-of-the-art models do not provide the level of detail that we need, because complex applications are too complex for them.
06:38
So we propose a fast, parallel simulator. [unintelligible] Today I'll just focus on software optimization. Cycle-accurate simulators are too slow for exploring this. There are two types of simulation [unintelligible]. Emulation means pretending that you are a different machine, and cycle-accurate simulation is very precise. If you raise the level of abstraction a little, you use some approximation to get the result. So what we did is raise the level of abstraction to give very similar, very accurate results, but not in a cycle-accurate way. OK, so here's what Sniper uses:
07:28
a hybrid simulation methodology [unintelligible]. It is based on analytical models [unintelligible], and it uses Pin, the instrumentation tool, so Sniper is sort of wrapped around your application [unintelligible]. It scales with the number of cores, and you can download it right now. It has lots of fun features.
08:09
It supports [unintelligible] parallel applications, and we're going to add more. You might have seen the visualization earlier [unintelligible].
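The talk describes the analytical starting point only at a high level, as "a formula that represents the performance". As a rough illustration of what such a first-order model can look like, here is a Python sketch in the style of interval analysis. In this published approach, a core dispatches at full width until a miss event occurs, and each event adds a penalty. This is not Sniper's implementation. The class names, parameters and values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CoreParams:
    """Hypothetical core parameters (illustrative values)."""
    dispatch_width: int = 4       # instructions per cycle between miss events
    frontend_refill: int = 5      # cycles to refill the front-end after a flush
    branch_resolution: int = 10   # cycles until a mispredicted branch resolves
    l2_latency: int = 12          # I-cache miss served by the L2
    mem_latency: int = 200        # long-latency load served by DRAM


@dataclass
class MissEvents:
    instructions: int
    branch_mispredicts: int
    icache_misses: int
    dram_miss_groups: int         # groups of overlapping long-latency loads


def interval_estimate(core: CoreParams, ev: MissEvents) -> dict:
    """First-order cycle estimate: ideal dispatch plus a penalty per miss event."""
    return {
        "base": ev.instructions / core.dispatch_width,
        "branch": ev.branch_mispredicts * (core.branch_resolution + core.frontend_refill),
        "icache": ev.icache_misses * core.l2_latency,
        "dram": ev.dram_miss_groups * core.mem_latency,
    }


cycles = interval_estimate(CoreParams(), MissEvents(1_000_000, 5_000, 2_000, 1_000))
total = sum(cycles.values())
print(f"estimated cycles: {total:.0f}, CPI: {total / 1_000_000:.3f}")
```

Keeping the penalties separate, instead of returning one number, is a deliberate choice. Each term is one reason why cycles were spent, which is the kind of breakdown a CPI stack displays.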
08:28
OK, I have to say Sniper is not perfect, and it is not for everybody. It is user-level only, so it might not be the best match for workloads with significant OS involvement, such as databases. It is also not good if you're trying to simulate a web server. We also use a higher level of abstraction, which means that you probably don't want to use it to understand the fine details of a processor. [unintelligible] It is also x86 only. But it turned out that all of these limitations are OK for [HPC], so that's why [unintelligible].
09:20
As for the history of Sniper: we released the first version in 2011, and there have been many releases since then, with a lot of features. It is now used by lots of researchers [unintelligible].
09:42
OK, so that was an overview of the main features. What I really want to talk about is visualization in Sniper and the understanding that it provides.
09:56
We worked closely with a master's student on this, and the first question we [asked] was [unintelligible].
10:17
The very first type of visualization we introduced was the CPI stack. The problem is that modern out-of-order processors are very difficult to understand, so you lose sight of why the application is slow and what is going on. The CPI stack is a way to understand where all the cycles go. It has different components that represent different reasons why cycles are spent. The base component represents the [useful work], and then we have the branch predictor, the caches, and so on. This is a really good start. This is a single thread here, and it shows the components. You can see there is quite a large component that is [unintelligible].
11:27
So the next step: why don't we look at all threads? I have a lot of these here on the right, but basically we will focus on one component.
11:39
In these bars we see four threads. [One] has a very small [red] component, while [for the others] it is very large. [unintelligible] Another [thread] shows the same thing in synchronization.
12:02
It turns out that they [unintelligible] on the first [thread]. This tells you that [the core] can access the data quite easily, but because of the [synchronization] effects it is unable to [make progress].
12:22
You can also use [the visualization] to compare runs. For example, you can compare different data sets and their scaling, to see how much time is spent in synchronization versus actually doing work.
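A CPI stack splits the total cycles per instruction into components, one for each reason why cycles were spent, so the components add up to the total CPI. The minimal Python sketch below assumes the per-component cycle counts are already available. The component names and numbers are hypothetical, and this is not Sniper's own CPI-stack tooling.

```python
def cpi_stack(component_cycles, instructions):
    """Turn per-component cycle counts into a CPI stack.

    Each component's CPI is its cycles divided by the number of committed
    instructions, so the components sum to the total CPI.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {name: cycles / instructions for name, cycles in component_cycles.items()}


# Hypothetical cycle attribution for one thread (illustrative numbers only).
cycles = {
    "base": 500_000,
    "branch": 50_000,
    "mem-l2": 100_000,
    "mem-dram": 300_000,
    "sync": 50_000,
}
stack = cpi_stack(cycles, instructions=1_000_000)
total_cpi = sum(stack.values())
for name, cpi in sorted(stack.items(), key=lambda kv: -kv[1]):
    print(f"{name:>9}: {cpi:.2f} CPI ({100 * cpi / total_cpi:.0f}%)")
print(f"{'total':>9}: {total_cpi:.2f} CPI")
```

Computing one such stack per thread, or per time interval, gives the multi-thread and over-time views described in the talk.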
12:41
OK, so now I want to go to our current, most advanced visualization features. This is just a website that is generated automatically, and it contains a few different things. When we started, I was surprised at how difficult it was to get a simplified view. We started off with a lot of graphs and charts and things like that, and the real work was in taking them away. So we have different components here. Time is on the x-axis and CPI is on the y-axis [unintelligible]. On the left you have some options, and on the right we have the different components with their colors. In the most recent version, [unintelligible] yellow [unintelligible], and [another color] is synchronization with other [threads]. Then you have the performance of the system, the instructions per cycle, a metric that [unintelligible].
14:05
So we've got the performance; what about energy? [unintelligible] We can integrate [a power model] [unintelligible], and it has a tool that [unintelligible], which is shown in the lower graph. The model lets you see where your power goes. In this case we have a stack very similar to the CPI stacks, but for power: it shows where the power is going [unintelligible].
14:51
We have another interesting feature [unintelligible]: a view of the performance [unintelligible] that shows which [parts] on the axis are doing better or worse [unintelligible]. For example, with the same [application] as before, you [lose] performance because of off-chip accesses, and you [can see that] on the y-axis.
15:34
So first we have all of the values for the application, but we also have a view of the system: what the system was doing. These are all the microarchitecture structures [unintelligible]: the L1 cache, the L2 cache, the shared [last-level] cache here. If you mouse over one of these components, it will show you [unintelligible] the activity for it. That way you can look at the different components for the whole application. So then we have some more experimental research work.
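The power view described above works like a CPI stack, but for power and energy. Here is a minimal Python sketch of the bookkeeping. It assumes that the average power of each structure is already known; in a real flow it would come from a power model. The component names and wattages are hypothetical. Energy per component is simply average power times runtime.

```python
def energy_stack(component_power_w, runtime_s):
    """Energy per component in joules: average power (W) times runtime (s)."""
    return {name: watts * runtime_s for name, watts in component_power_w.items()}


# Hypothetical average power per structure for one run (illustrative only).
power_w = {"core": 12.0, "L1-D": 1.5, "L2": 2.0, "L3": 4.5, "DRAM": 10.0}
runtime_s = 0.5

energy_j = energy_stack(power_w, runtime_s)
total_j = sum(energy_j.values())
for name, joules in sorted(energy_j.items(), key=lambda kv: -kv[1]):
    print(f"{name:>5}: {joules:5.2f} J ({100 * joules / total_j:4.1f}%)")
print(f"total: {total_j:.2f} J")
```

Because energy depends on runtime as well as power, an optimization that lowers power but runs longer can still cost more energy. A stack like this makes that visible per component.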
16:21
The idea of this research is to analyze the entire application in a more straightforward way and understand what is going wrong. [unintelligible] On the x-axis is time and on the y-axis [instructions]. Normally you would expect a roughly linear relationship: more instructions take more time. The first thing you notice is that there are some outliers. These are functions where you spend more time than the others, and that is where we want to spend our [optimization] effort. I also want to touch on the roofline, a very interesting model [unintelligible] that is [used] for high-performance computing. The roofline consists of two components that bound the maximum attainable performance that you can achieve on a node. We have plotted the intersection of two lines. The first is the memory bandwidth, which is this [sloped] line. The second is the peak floating-point performance, which is the flat line here. Now you [plot] your functions against that bound. The closer they get to the roof or to the bandwidth line, the closer you are to [the machine's limit]. So now you know how well you are doing and whether there is room to grow.
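The roofline bound described above is a single formula: attainable performance = min(peak compute, memory bandwidth × operational intensity), where operational intensity is FLOPs per byte moved. The Python sketch below applies it to per-function measurements. The node parameters, function names and measured values are all hypothetical.

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFLOP/s = min(peak compute, bandwidth x operational intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Operational intensity (FLOP/byte) where the two roofs intersect."""
    return peak_gflops / bandwidth_gbs


# Hypothetical node: 200 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK, BW = 200.0, 50.0
# Hypothetical per-function measurements: (achieved GFLOP/s, FLOP/byte).
functions = {"stencil": (30.0, 0.75), "dgemm": (150.0, 16.0), "reduce": (5.0, 0.25)}

print(f"ridge point: {ridge_point(PEAK, BW):.1f} FLOP/byte")
for name, (achieved, oi) in functions.items():
    bound = roofline(PEAK, BW, oi)
    limit = "memory" if oi < ridge_point(PEAK, BW) else "compute"
    print(f"{name:>8}: {achieved:6.1f} of {bound:6.1f} GFLOP/s "
          f"({100 * achieved / bound:3.0f}% of roof, {limit}-bound)")
```

A function left of the ridge point can only improve by moving less data per FLOP. A function right of it is limited by compute, so its distance to the flat roof shows how much room there is.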
18:28
So I want to touch on some of the research that we are working on: hardware/software co-optimization.
18:34
The main idea is this: if you have [unintelligible] cores, for example, is there a way to co-optimize so that you get better performance results [unintelligible]?
18:49
Say you've got lots of options: small [cores] or [big ones], fast or slower cores [unintelligible]. Basically, there's a large variety there, and it's getting even more complicated to understand. What we do is use [Sniper] to understand the different classes [unintelligible].
19:19
So for this problem, what we did was take a stencil. The computation [models] heat transfer [unintelligible]. As in other problems [of this kind], you want to go ahead in time a few steps instead of computing just one iteration at a time. That means you need extra data around the block that you compute, so that you can move ahead in time without communicating. Normally, for every time step, you have to communicate with your neighbors. Now we're saying: let's not communicate, let's do extra computation instead. [unintelligible] So here we have [unintelligible], and what we see here is what happens as we increase the overlap, which means doing more redundant computation. At some point [unintelligible] the performance [unintelligible].
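The stencil experiment trades communication for redundant computation. With a ghost zone of depth k around each block, a block can advance k time steps between exchanges, at the cost of recomputing the ghost cells itself. The sequential Python sketch below illustrates that trade-off on a 1D heat-equation stencil. This is an assumption, since the talk does not say which stencil or dimensionality was used. The sketch checks that the blocked version matches a reference that effectively communicates every step, and it counts exchanges and redundant updates. The names `heat_step` and `blocked_heat` and all parameter values are hypothetical.

```python
def heat_step(u, alpha):
    """Reference: one explicit step of the 1D heat equation, fixed end points."""
    inner = [u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1]) for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def blocked_heat(u, alpha, steps, block, halo):
    """Advance `steps` steps, exchanging data between blocks only every `halo` steps.

    Each block copies its own cells plus a ghost zone of up to `halo` cells per
    side and recomputes the ghost cells itself instead of communicating.
    Returns (result, number of exchanges, redundant cell updates).
    """
    n = len(u)
    u = list(u)
    exchanges = redundant = done = 0
    while done < steps:
        k = min(halo, steps - done)                 # steps until the next exchange
        new_u = list(u)
        for start in range(0, n, block):
            end = min(start + block, n)
            lo, hi = max(0, start - k), min(n, end + k)
            local = u[lo:hi]
            vl, vr = 0, len(local)                  # local cells that are still valid
            updates = 0
            for _ in range(k):
                ul = 1 if lo == 0 else vl + 1       # global end points stay fixed
                ur = len(local) - 1 if hi == n else vr - 1
                inner = [local[i] + alpha * (local[i - 1] - 2 * local[i] + local[i + 1])
                         for i in range(ul, ur)]
                local = local[:ul] + inner + local[ur:]
                updates += ur - ul
                vl = 0 if lo == 0 else vl + 1       # valid region shrinks each step
                vr = len(local) if hi == n else vr - 1
            owned = sum(1 for i in range(start, end) if 0 < i < n - 1)
            redundant += updates - k * owned        # work spent on ghost cells
            new_u[start:end] = local[start - lo:end - lo]
        u = new_u
        exchanges += 1
        done += k
    return u, exchanges, redundant


u0 = [float(i % 7) for i in range(40)]
ref = u0
for _ in range(12):
    ref = heat_step(ref, 0.25)
for halo in (1, 2, 3, 4):
    out, exchanges, redundant = blocked_heat(u0, 0.25, 12, block=10, halo=halo)
    assert out == ref  # identical values to the step-by-step reference
    print(f"halo={halo}: {exchanges} exchanges, {redundant} redundant updates")
```

Deeper halos mean fewer exchanges but more redundant updates. Where the balance lies depends on the relative cost of communication and computation on the target machine.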
21:07
To summarize: with co-optimization you can do better than by optimizing just the software or just the hardware. OK, and [unintelligible]
21:20
[unintelligible] It is really easy to get started: you can [join] the project mailing list [unintelligible].
21:40
[Audience question, inaudible.] That is what we are looking at next. [unintelligible] If we did it that way, it might slow down the simulator and lose all the benefits of being a parallel simulator. [unintelligible] We have validated [Sniper] [unintelligible], and you can see the results [unintelligible]. I think that if you use different tools, it is very difficult to get the results to align. That's a broader problem, because different tools have different ideas [of what they measure]. [unintelligible] If you want more details on accuracy, email me and we can discuss it offline.
24:17
[unintelligible] Our plan is to validate that, since all of this is just theory [for now]. [unintelligible]