Graph Processing with PUMA
Formal Metadata
Number of Parts: 490
License: CC Attribution 2.0 Belgium. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/46978 (DOI)
Transcript: English(auto-generated)
00:05
So, let's continue. I'm very happy to announce Steyn, who is working with Intel, and he's talking about the programmable unified memory architecture, PUMA, and how you do graph processing with it. Okay, thank you. I'm afraid this is going to be another hardware talk, but I'll keep it simple and hope that
00:26
you appreciate the fact that the hardware is also evolving and that we're looking at more efficient processors for graph processing. So, as you all know, graph processing is getting bigger and bigger, graphs are getting bigger and bigger, and we want to process more data.
00:46
So, we found that we need to increase the efficiency of graph processing versus existing architectures such as CPUs and GPUs. To that end, we propose the programmable unified memory architecture, abbreviated as PUMA.
01:07
And in this talk, I want to explain to you why graph processing is actually challenging on the existing architectures, what makes PUMA fit for graph processing, some high-level details about PUMA, and how it performs.
01:23
And after this presentation, you will probably have the question: can we buy PUMA? Unfortunately, not yet, because it's still under development. Okay, as you all know, Intel is the market leader in high-performance processors, and in these
01:40
processors, we've implemented a lot of things for regular applications that work very well, such as branch prediction. Regular applications are very predictable, so we use that to predict branches ahead of time, such that we can keep the instructions flowing in a continuous way. We have caches that assume that if you access some data, you will access it again in the near future, or access its neighboring data.
02:08
We have vector operations that perform the same operation on neighboring data, and all of that works well for regular applications. But graph applications, as the previous speaker also explained, are not that nice for these architectures.
02:24
For example, many of the graph applications have branches that are data dependent, so the actual outcome of the branch depends on the data in the graph, which is, of course, not predictable, so branch predictors don't work well.
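To make this concrete, here is a minimal sketch of a neighbor loop over a CSR-stored graph; the data layout and names are illustrative and not taken from the talk. The branch outcome depends on per-vertex data, and, as discussed next, the neighbor accesses jump unpredictably through memory, so branch predictors and caches get little traction.

```c
/* Illustrative CSR traversal: the branch depends on per-vertex data and
 * the neighbor ids are scattered, which defeats branch predictors and
 * caches on a conventional core. */
#include <stdint.h>

typedef struct {
    int64_t  num_vertices;
    int64_t *row_ptr;   /* offsets into col_idx, length num_vertices + 1 */
    int64_t *col_idx;   /* neighbor ids, scattered over the vertex range */
    double  *prop;      /* one 8-byte property per vertex */
} csr_graph;

double sum_unvisited_neighbors(const csr_graph *g, int64_t v,
                               const uint8_t *visited) {
    double sum = 0.0;
    for (int64_t e = g->row_ptr[v]; e < g->row_ptr[v + 1]; e++) {
        int64_t dst = g->col_idx[e];   /* scattered, cache-unfriendly load */
        if (!visited[dst])             /* data-dependent, unpredictable branch */
            sum += g->prop[dst];       /* another irregular 8-byte access */
    }
    return sum;
}
```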
02:45
Data is also accessed in a scattered way, so that was also introduced by the previous speaker, so you access the neighbors, and the neighbors are not the next nodes in your list of nodes. They are scattered all over the place, so caches don't work well because you don't
03:01
use the neighboring data or reuse the same data, and the same holds for vector operations. So we see very low performance on regular CPUs for graph applications. Many people have proposed to use GPUs. They are actually better in general because
03:22
of higher bandwidth and more threads, but they actually suffer from the same problems. If branches diverge, so go in different directions, then the parallelism cannot be fully exploited. Scattered memory accesses prevent the use of memory coalescing, meaning that these efficiency features cannot be used in
03:44
GPUs either. And as we also saw in previous presentations, there is a problem of memory capacity and scaling out. So what are potential solutions? There have been proposals for graph accelerators, that is, specific chips made for graph applications with some fixed functionality, for example implementing sparse linear algebra or vertex-centric algorithms.
04:11
The problem there is that you're fixed to that functionality, so if your application or algorithm works best within sparse linear algebra, but your accelerator is a vertex-centric accelerator, then you potentially need to transform your application.
04:29
You also need a host, so a CPU that controls this accelerator, and there is data transfer between both. Another solution is to have a general instruction set processor, like the normal CPUs, but one that's been optimized for
04:45
graph applications, which is then more flexible because you have an instruction set, you can implement other algorithms too, and it's self-contained, so you don't need a host per se. And that's actually the approach we took with PUMA.
05:01
Okay, I've talked about the challenges that graph applications pose, so how did we solve them in PUMA? Most graph applications are very memory-bound, so most of the time, if you have a very fast core, a Xeon core for example, it's just stalled on the slow memory, waiting forever, so you cannot use all of its speed.
05:26
So instead, in PUMA, we have much lighter cores, much slower cores, but we have many of them, so they still wait for memory, but because we have many of them, the total throughput is higher.
05:40
I said it before, caching was a problem, so the problem is that if you load data, you only need that one element that you load, not the full cache line. So in a conventional architecture, you load the full cache line from memory through the memory bus into the cache and then to the core, but as you see, only the blue points are used.
06:02
It's a very inefficient use of the cache capacity and of the memory bandwidth, so you waste a lot of potential. For PUMA, we have a lot of cores, but we also optimize the memory accesses such that you can access only a single element.
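As a rough back-of-the-envelope illustration of that waste, assuming a conventional 64-byte cache line (a figure not stated in the talk): each scattered 8-byte access drags in a full line, of which only one eighth is actually used.

```c
/* Cache-line utilization for scattered 8-byte accesses, assuming a
 * conventional 64-byte line; the numbers are illustrative. */
#include <stdio.h>

int main(void) {
    const int line_bytes    = 64;  /* typical CPU cache line */
    const int element_bytes = 8;   /* one vertex property */
    /* A random neighbor access touches one element per line fetched. */
    printf("useful fraction per fetch: %.1f%%\n",
           100.0 * element_bytes / line_bytes);   /* prints 12.5% */
    return 0;
}
```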
06:21
We bypass caches, because they're not efficient here, to free up chip area for cores and other things and to get a more fluent flow of memory operations. Another thing is the size of the graph: if you have a very large graph,
06:40
you need to partition it and put it onto multiple nodes, multiple compute nodes, multiple servers. But then, when the graph algorithm wants to access data on another node, it needs to go through the whole communication stack, which takes a lot of time. Unfortunately, graph algorithms are not very predictable in their locality, so it often
07:03
occurs that we need to go off-node, giving a large performance penalty. So for PUMA, we have hardware distributed shared memory, so there is a shared memory across the whole system. You don't need to think about communication. The network is high bandwidth, low
07:22
latency, just to reduce the latency or the performance impact of remote accesses. Another thing that we saw is that there are a lot of common patterns in graph applications, and we designed some offload engines that efficiently execute these patterns.
07:44
For example, atomics are used very much in graph algorithms, but for a core, that's a very intensive operation: you have to lock the data, load it, update it, write it back, and unlock it. So the core is often stuck for a long time performing these atomics.
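To illustrate why atomics are so common, here is a sketch of the kind of shared-counter update a parallel graph kernel performs, written with plain C11 atomics for a conventional core; PUMA's own offload instructions are not public, so this only shows the pattern that gets offloaded.

```c
/* A typical atomic accumulation in a parallel graph kernel, using plain
 * C11 atomics on a conventional core. */
#include <stdatomic.h>
#include <stdint.h>

/* Every thread pushes a contribution to a shared per-vertex counter. */
void push_contribution(_Atomic int64_t *vertex_score, int64_t delta) {
    /* On a conventional core the thread stalls while the location is
     * acquired, read, modified and written back. */
    atomic_fetch_add_explicit(vertex_score, delta, memory_order_relaxed);
}
```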
08:03
In PUMA, we have an offload engine that performs the atomics and relieves the core of this work. So the core just issues an atomic instruction but leaves the execution to the offload engine, and the core can continue executing other instructions in the meantime.
08:22
So the operation is done in the background, and the offload engine looks at where the data is located and performs the update locally. Similarly for gather operations: for example, you may want to gather some characteristics of the neighbors of a vertex.
08:41
The normal operation is that you load the index of the first neighbor, then you load the data of the neighbor and store it somewhere. So that's a very intensive process for the core. So we have this DMA gather offload engine, so the core again just issues this DMA gather instruction and then continues executing.
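For reference, here is a sketch of the conventional, core-driven gather loop being described; the layout and names are illustrative. The core loads a neighbor index, then loads that neighbor's data and stores it into a local buffer, one element at a time. The DMA gather instruction hands exactly this kind of loop to the offload engine.

```c
/* Conventional gather: load a neighbor index, load that neighbor's data,
 * store it locally. On PUMA this whole loop is replaced by one DMA gather
 * instruction handled by the offload engine. */
#include <stdint.h>

void gather_neighbor_props(const int64_t *col_idx,  /* neighbor ids of one vertex */
                           int64_t degree,
                           const double *prop,      /* global per-vertex property */
                           double *out) {           /* local buffer */
    for (int64_t i = 0; i < degree; i++) {
        int64_t dst = col_idx[i];  /* load the neighbor index */
        out[i] = prop[dst];        /* load its data, store it locally */
    }
}
```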
09:02
And in the background, the offload engine performs the necessary memory accesses without actually needing to move all the data to the core itself. Other operations are memcpy, barriers, queues, and so forth. So going into a little bit more architectural detail of a single PUMA core.
09:25
So that's a schematic overview of a single PUMA core. It consists of multiple pipelines, so each of these is a pipeline. Each pipeline supports multiple threads. Why? As the previous presenter said, most of the time we are waiting for memory operations.
09:42
So if we are waiting for memory operations, we just switch to another thread. If this one is also waiting, we switch to yet another thread, and so on, just to hide all of these memory latencies. We have limited caches. You have an instruction cache and a data cache, which are very small. They're for very local data, such as the stacks of your threads.
10:06
The instructions executed by these cores are a novel instruction set that we designed. It's based on a reduced instruction set, so simple instructions. And we added specific instructions that can be used by graph operations, such as a single-instruction indirect load.
10:29
Because we have limited caching, we don't want to go to main memory all of the time for data that's consumed and produced locally. So we have a manually accessible scratchpad, and the programmer can decide whether to cache their data, use the scratchpad, or go to main memory.
10:48
It's part of the global address space, so other cores can also access the scratchpad of this core. I talked about offload engines, so they perform these operations in the background.
11:00
We have a memory controller that's been optimized for these small, 8-byte accesses. And we have a network interface to connect to other cores, also, of course, using 8-byte packets, because most of the data is accessed at these small granularities.
11:21
The full PUMA system, as we envision it, is built hierarchically. You have multiple of these cores that form a tile, multiple of these tiles form a node, and multiple of these nodes form a system. And that means that, since you already have that many threads on a core, that many cores on a tile, and so on,
11:44
the full system can easily consist of millions of hardware thread contexts. Again, the global address space and the memory are shared across the full system, so you can access any data from any point in the system.
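As a purely illustrative back-of-the-envelope calculation, with counts that are not given in the talk, a modest hierarchy already lands in the "millions of hardware thread contexts" ballpark.

```c
/* Purely illustrative arithmetic; none of these counts come from the talk. */
#include <stdio.h>

int main(void) {
    long threads_per_core = 64;   /* hypothetical */
    long cores_per_tile   = 32;   /* hypothetical */
    long tiles_per_node   = 32;   /* hypothetical */
    long nodes            = 16;   /* hypothetical */
    printf("hardware thread contexts: %ld\n",
           threads_per_core * cores_per_tile * tiles_per_node * nodes);
    /* 64 * 32 * 32 * 16 = 1,048,576, on the order of a million. */
    return 0;
}
```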
12:01
Of course, a globally shared memory like this requires a very high-bandwidth, low-latency network, so we use a HyperX network topology. This is the textbook representation of it: you have a hierarchical design where, within each level, everything is fully connected, and then on the next level there are connections between the levels.
12:24
Furthermore, we plan to have optical connections between the tiles and between the nodes, again to increase bandwidth and reduce latency. More interesting for you, maybe, is how we program this beast. That's work in progress, so we're also working on a software infrastructure for this architecture.
12:45
Initially, what we did our first experiments with is simple SPMD-based parallelism using C, and we use special intrinsics for the PUMA-specific instructions and LLVM for building a compiler.
13:05
In the meantime, there has been increasing support for C++ std::threads, OpenMP, and some tasking. And we're also looking at implementing common graph libraries, other programming languages, and a Python front-end.
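As a rough illustration of what code for this stack might look like, here is a minimal SPMD-flavored kernel in plain C with OpenMP, which the talk mentions is supported; the PUMA-specific intrinsics are not public, so none of them appear here.

```c
/* Minimal SPMD-style vertex kernel in plain C + OpenMP; on PUMA the same
 * loop would be spread over many lightweight hardware threads. */
#include <stdint.h>
#include <omp.h>

void scale_vertex_props(double *prop, int64_t num_vertices, double factor) {
    #pragma omp parallel for schedule(static)
    for (int64_t v = 0; v < num_vertices; v++)
        prop[v] *= factor;   /* each thread owns a contiguous slice of vertices */
}
```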
13:24
Okay, now, importantly, does PUMA fulfill its promise, namely that it's more efficient to do graph analysis on it? Of course, the chip is not available yet.
13:41
We are also looking at an FPGA model, but that's not ready yet. Basically, the results that I will show are based on simulation, where you take a PUMA binary and you have a functional simulator that decodes the instructions of the binary and simulates the operations,
14:00
and then you have the timing simulation that models all of the hardware structures, such as the cores, the memory, the scratchpad, the network, and so on. We also provide developers with a profile of what the execution looks like, showing where cores
14:22
are idle, that's the pink part, or in use, or waiting for memory, that's the light blue part, such that the developers can find the bottlenecks and optimize their application. We also have an analytical model, which I won't go into in detail, because simulation is a very intensive process,
14:43
so we can simulate up to a few tiles, but after that it quickly becomes infeasible; therefore, we use this analytical model, which we then validate using simulation on the smaller configurations. Then we took a whole bunch of graph kernels and applications.
15:03
We ran them on a high-end Intel Xeon server with four sockets. We optimized them where needed, and we also ported and optimized these applications for PUMA. The initial estimates show that this high-end Xeon server with four sockets
15:22
will consume approximately the same power as a PUMA node will consume in the future, so comparing their performance is also an energy efficiency comparison. We also did some multi-node runs on Xeon, but that didn't work well, because you have this communication overhead.
15:47
The result was that we saw no speedup for most applications, or even slowdowns. We projected to 16 PUMA nodes, and as I will show shortly, this scales much better.
16:01
Here is an overview of some of the applications and performance results. It shows the speedup of a single PUMA node versus a high-end Xeon server. You see that there is always a speedup, but there is a large range: it goes from three times to more than 200 times faster than Xeon.
16:21
That's because some applications are more compute intensive, so they do more compute, and of course there the Xeon can use all of its resources efficiently. On the other hand, if it's purely memory bandwidth bound, such as random walks, then we see a huge performance increase.
16:42
And the 16-node PUMA projections also show that it scales much better than Xeon does. Okay, that was basically it. I showed you that PUMA is a programmable instruction set processor for graph applications. It contains many features, which I discussed, and we showed through simulation and modeling that it's one
17:06
to two orders of magnitude faster than an equal-power Xeon, and that it scales well to multi-node. And it's still in development, of course. Okay, that's it. Thank you very much, a very interesting perspective on graph processing.
17:27
Questions? Do you think that this will be affordable as a workstation, or would you say no, this is a cloud installation that sits somewhere in a server center, or where do you see this?
17:42
So the question is whether this is a workstation PC or a desktop-based thing. No, it would be more in a data center or a big supercomputer setup. Is it possible, kind of going on from what he asked, to have these cores as, like, co-processing cores,
18:09
like, do you have a few big normal cores and a few PUMA cores? Would that ever be possible? So the question is whether we plan to have, let's say, host nodes. Yeah, we're still working on that, so there is
18:27
an option to have potentially x86 nodes that do more of the other stuff that this is not efficient for, but it's still under development.
18:55
Sorry, I...
19:00
It's not in the same way as on Xeon, but I'm afraid I cannot go into the details of how it will work. It's sort of a shared memory architecture, and then there are different sorts of processing units. Does it matter where in memory
19:23
the values needed by certain cores are located? So if they're close by, will access be faster, or does it not matter since it's an all-to-all connection? And what's the penalty for data being further away in memory from the core that needs it? Okay, so the question is, for data that's further away in memory, will it be slower to access? Yes,
19:45
of course, because it's a very large system; the system will be distributed across multiple nodes, multiple racks, so there will be a larger penalty. But because of all of these memory latency hiding techniques and these offload engines that perform the accesses very efficiently, you won't notice it as much as you would in a conventional multi-node setup.
20:06
Maybe take more questions. Thanks again.