
High Performance Python on Intel Many-Core Architecture

Formal Metadata

Title: High Performance Python on Intel Many-Core Architecture
Part Number: 100
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Ralph de Wargny - High Performance Python on Intel Many-Core

This talk will give an overview of the Intel® Distribution for Python, which delivers high-performance acceleration of Python code on Intel processors for scientific computing, data analytics, and machine learning.

In the first part of the talk, we'll look at the architecture of the latest Intel processors, including the brand new Intel Xeon Phi, also known as Knights Landing, a many-core processor which was just released at the end of June 2016. In the second part, we will see which tools and libraries are available from Intel Software to enable high-performance Python code on multi-core and many-core processors.
Transcript: English(auto-generated)
Please join me in welcoming Ralph de Wargny, who's going to be talking about HPC on Intel hardware. Can you hear me? Or maybe I'll take this mic. That works? Okay, now you can hear me.
Alright, good afternoon. Welcome to this talk. My name is Ralph de Wargny from Intel. Actually, more Intel Software, because everybody knows Intel as a hardware manufacturer, a semiconductor company, but it's also a very large software company. We are about 15,000 software developers at Intel,
who develop things from low-level BIOS to high-end big data solutions. And my talk today is going to be about high-performance Python on Intel architecture. Actually, I come more from high-performance computing, where traditionally,
the traditional language is, of course, Fortran. Intel, we have a Fortran compiler. Who is using the Fortran compiler? Nobody. That's good. Thank you. Okay, normally when I go to high-performance computing specialized conferences,
a lot of people still do Fortran, right? Because the code didn't change since 1960. And so it's still running. It's running really fast, and it runs on the latest architecture. But more and more, of course, people do C and C++. But what we saw in the past couple of years, since I'm in high-performance computing,
is that more and more people are doing Python. So that's why I'm here, actually. So that's why Intel is in Python. It's no wonder. And for the last couple of years, we also love Python at Intel. And so who here in the room is from high-performance computing,
or does some kind of high-performance data analysis, data scientific computing, something like that? All right. Who is using Intel processors? Very good. Who is using GPUs? Don't worry, don't worry. I like GPUs also.
Okay, so today I'm going to talk first a little bit about Intel architecture, where are we at the moment, and after that I'll go into what are we doing for the Python community to make the bridge between software development and our high-performance processors.
I will mostly talk about high-performance processors like Intel Xeon, or Xeon Phi, or even your regular Core i7 in your Mac laptop, not for the embedded stuff. All right. So, wow. What's that? Who knows?
It's what? What is it? Xeon Phi. Wow. You can get a t-shirt later from me. Xeon Phi. Who knows Xeon Phi? Who heard about Xeon Phi?
All right, half of the room. Great. So this is the latest Xeon Phi. We just released it at the end of June. It was called KNL, Knights Landing. That's a code name. We love code names at Intel. So that's Knights Landing. It has 72 cores. You can count them. Every core has four threads, and they're all connected through Omni-Path fabric.
So we hope it's going to be a really fast system. So what does it look like in reality? Here on the left you have the version with the Omni-Path fabric integrated. There is also a version without the integrated Omni-Path fabric.
It's like a small cluster, right? You can imagine this as a cluster with a high-performance interconnect between all the cores. And the good and new thing with this Xeon Phi is that it's not a coprocessor anymore. This is the third generation, and until now it was a coprocessor.
So you had to plug it into a PCI slot, and then you had to pass all your data through this small slot, right? And that gave you a lot of limitations. So the new one will also have a coprocessor version, but the version we just launched is actually a bootable version.
So you don't need a host system anymore. It's the host and the coprocessor in one. So you can boot Linux on it at the moment. It can run a lot more workloads than the coprocessor, right? It has fewer limitations. It has MCDRAM on the chip, right?
16 gigabytes of MCDRAM, which is really fast RAM. And it can run normal IA code. So Python, normally, is IA code. So Intel architecture code is really easy to program. That's why it says programmability. It's power efficient, all right?
It has a large memory; it can use up to 384 gigabytes per server. And it's really scalable. You can build a cluster out of it. So like I said, there is a regular version without the integrated fabric.
This is the integrated fabric one. And it goes onto a host. It's its own host processor, right? So you can use it as a regular workstation, right? Currently we have a lot of customers, software developers, who have a regular PC. It looks like a PC. And instead of a Xeon or Core i7, they have the Xeon Phi processor in there.
And at a later stage, it will also be available as a coprocessor, like a current GPU. Just to go back to a little bit more of the hardware. So in the high end, we have two processor families.
We have Xeon. Xeon is the regular processor which powers most of the servers at the moment. All the cloud, the internet, everything runs on Xeon. And Xeon Phi is our new processor, targeted at high-performance computing, but also big data analytics and machine learning.
And the good thing is it features 72 cores with 288 threads. And it can run, it has vector units for up to 512 bits. So it can run AVX 512 extensions, which give you a really great scalability
if you're doing mathematical operations with single instruction, multiple data vectorization. And so this is what we have now. And the future will be parallel. We will have still more cores, so at least in the next five years, we've got more cores, more threads, and more vectors.
So that's going to be really essential if you want performance from your applications with numerical, mathematical workloads, at least in high-performance computing. And the current version of Xeon is Broadwell. Broadwell, this one here, still doesn't have AVX-512.
It has AVX 2 with 256-bit wide SIMD. The next version will be Skylake with AVX 512 on the server. So what does it have to do with Python? So as a lot of Python developers are using these platforms,
they want to use the newest and latest, they want to have performance. Of course, you are paying for all these transistors. If you buy a Xeon Phi, you are buying five billion transistors. But if you run a regular Python code on it, you will not use five billion transistors. But you paid for these five billion transistors.
So we want to help you get more performance out of it and use all these transistors, especially for production, not for prototyping, but for production. But we are also seeing that for normal coders, it's really difficult to use these high performance extensions.
Even for us, it's sometimes really tricky. So it's really hard to combine Python and those high performance extensions. So that's why this year, in two months from now, we are releasing our own distribution of Python, which will be called Intel Distribution for Python.
At the moment, it's still in beta. And it's going to be released in the first week of September. Let me check the timing. All right. So our aim is to give you, as a Python programmer,
easy access to high performance in Python, of course. And so it's based on CPython. We recompiled it with our low-level libraries. For instance, the most important one is MKL. Who knows MKL already? Oh, great. So MKL is always at the forefront of performance.
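You can inspect which BLAS/LAPACK your own NumPy build is linked against; `numpy.show_config()` is standard NumPy, and under an MKL-backed build its output reports the MKL libraries. A small sketch (illustrative, not from the talk):

```python
import numpy as np

# Print the build configuration; on an MKL-backed distribution the
# BLAS/LAPACK sections list MKL libraries.
np.show_config()

# Dense linear algebra is the classic beneficiary: the @ operator
# (np.dot) calls into the optimized BLAS gemm routine underneath.
a = np.random.rand(500, 500)
b = np.random.rand(500, 500)
c = a @ b
print(c.shape)
```

The same script runs unchanged on a vanilla build; only the library doing the work underneath differs.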
It's always optimized for the latest processor technology. And we have been able to compile it into our distribution. But we are not only using MKL. We are also using other libraries like DAAL, which is a new library. Its Python interface is going to be called pyDAAL.
DAAL, the Data Analytics Acceleration Library, is also in there. And we are also including TBB for parallel programming. So what is required for making Python performance closer to native code? Of course, in HPC, you always want to have native code.
That's why most of them, programmers are using C++ or C or Fortran. And here it gives you an example of what's required. So there's a very interesting book that came out two years ago from one of my colleagues about high performance computing on Xeon Phi.
It's called High Performance Parallelism Pearls. A lot of people in that book are writing articles about how they parallelized their code for high performance. So we took here a very simple example, which is the optimization of Black-Scholes pricing. It's really easy to parallelize that formula.
And if you run it in pure Python, right here it's the number of options, thousands of options per second, that can be calculated. If you use pure Python, for the same time frame, for one second, you can have maybe a hundred thousand per second.
If you move this to a native C implementation, you can get a 55 times performance improvement with static compilation. But if you really use the hardware, all these vector units, all these cores,
whatever is included in the Xeon Phi processor, with vectorization, threading, and data locality optimization, you could get up to 350 times more performance. So what are we putting into Python to make that happen for you, without you having to code everything in C?
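For reference, the Black-Scholes benchmark mentioned above reduces to a closed-form formula that vectorizes naturally over arrays of options. A minimal NumPy sketch (our own variable names, not the book's code):

```python
import math
import numpy as np

# Vectorize the scalar error function so the normal CDF works on arrays.
_erf = np.vectorize(math.erf)

def norm_cdf(x):
    return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call price, vectorized over arrays of spots and strikes."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm_cdf(d1) - K * np.exp(-r * T) * norm_cdf(d2)

# Price 100,000 options in one vectorized call.
n = 100_000
S = np.full(n, 100.0)   # spot prices
K = np.full(n, 100.0)   # strikes
prices = black_scholes_call(S, K, T=1.0, r=0.05, sigma=0.2)
```

Because every operation is an array expression, the work maps directly onto the vector units and threads the speaker describes.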
So first of all, we are accelerating the numerical packages of Python with our libraries: MKL, as I said, DAAL, and a little bit of IPP, which is a smaller-scale library. We are implementing TBB for better parallelism,
you know, to get rid of oversubscription, for instance. And in some cases also MPI, if you're using the small cluster version. We also have VTune, which is a profiler. I'm going to show you quickly what that means. We are also optimizing other extensions in Python,
like Cython, NumPy, et cetera. And we are also working on big data and machine learning platforms and frameworks like Spark, Theano, Caffe, et cetera. So what is in there? It's going to be MKL, so I'm repeating myself, but we're also optimizing NumPy, SciPy,
scikit-learn, PyTables, et cetera, all that stuff. I can give you a really long list describing what packages we're optimizing. We are providing a specific interface for DAAL, called pyDAAL; it's going to be available through this distribution, and also,
it's available from Anaconda, so from Continuum Analytics as a Conda package. And of course, we like the open source community, the Python community, it's amazing now, I really like this community, it's amazing what happens here. And of course, we want to bring all the good things
back to the community, and we, in the end, eventually, we are going to also optimize all the other packages of Python. So a quick overview of MKL, what is in MKL. So at the moment, for the first version, we are going to include BLAS, LAPACK,
we are going to include multi-dimensional FFTs, some vector math, and RNGs, which are random number generators, very strong random number generators, which can be used in Monte Carlo simulations. Here is an example of what can happen if you use our beta version of Python.
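The code that benefits here is ordinary NumPy: FFT calls and random draws. A hedged sketch of both (plain NumPy; under an MKL-backed build these dispatch to MKL's FFT and random number generators):

```python
import numpy as np

# FFT: recover the dominant frequency of a noiseless sine wave.
n = 1024
t = np.arange(n)
signal = np.sin(2 * np.pi * 50 * t / n)        # 50 cycles over the window
spectrum = np.fft.fft(signal)
peak = int(np.argmax(np.abs(spectrum[: n // 2])))
print(peak)                                     # dominant frequency bin

# RNG: Monte Carlo estimate of pi from uniform draws,
# the kind of simulation that strong generators feed.
rng = np.random.default_rng(seed=42)
x, y = rng.random(1_000_000), rng.random(1_000_000)
pi_est = 4.0 * ((x * x + y * y) <= 1.0).mean()
print(pi_est)
```

Nothing in the script is vendor-specific; the speedups in the talk come from what sits behind `np.fft` and the generator.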
So this is an FFT implementation. You can see on the right side a comparison: if you use regular Python and change it from one thread to 32 threads, you can get up to 10 times acceleration.
Same if you use vanilla Python here: from one thread to 32 threads, you can get up to five times acceleration. Okay, random number generators. I don't know if it's interesting to you; who uses random number generators? A few. So we really can get very nice results
on random number generators, up to more than 50 times more performance than regular Python. Okay, DAAL. DAAL is optimized for machine learning and statistics and big data analytics, so it has a lot of components,
and we are currently working on making that available through pyDAAL as a real Python library. So let's have a look at VTune. VTune is a low-level profiler. Who is using VTune, or who knows VTune? VTune is an old product from Intel
which we make current every year. It's mostly used by C program or C++ Fortran to find hotspots in the code, right? Where is my hotspot? Where is my application running slow? Is there any performance gain I can implement? Am I using all my cores, all my threads?
Am I really using the processor the right way at a low level? It's a very visual tool; it's lightweight, it doesn't use much of the performance, right? And it can give you visibility down to the code, to the source code.
So it pinpoints the source code where your hotspot is. That worked until now in C, C++, and Fortran, but we made it also available for Python, because up to now, if you used it with Python, it would show you the C code of the library, and that was not the intent, right?
So how it works, here you have some Python code, it's something very simple, we have two routines, one is slow and one is fast. That's easy, right? And we want to see if VTune is able to run at the same time as this small program
and find the code that is slow, but we know which is slow, right? So we start it, it runs at the same time as your code, it can even analyze the performance measurement units, the PMUs in the processor, and see what's happening in the processor directly.
And so it shows you here, this is the result, it's very visual, normally you have it on a much bigger screen, so this is a small screen, and it shows you where in your code the hotspot is, so here is slow encode, surprise, it's here, right? Fast encode is okay, it runs fast,
the slow encode runs slow. And if you click on this, you'll get directly to the Python source code. So that's the goal. So VTune is now available also for Python and recognizes Python code. And VTune is really a low-level profiling tool
which doesn't add a lot of overhead, like 1.1x to 1.6x. It works on Windows and Linux and can profile Python 3.4, 3.5, all the versions basically. It has a really rich graphical user interface; I would not say it's easy to use,
but we try to make it intuitive more and more. It supports different workflows, so you can start your application, start VTune at the same time, and wait for VTune to end and analyze the results, or you attach it to an application and you only profile a certain area of your code.
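VTune itself can't be shown in text, but the slow/fast demo from the talk can be sketched with the standard-library `cProfile` as a stand-in hotspot finder (function names and workload are ours, not the talk's actual demo code):

```python
import cProfile
import io
import pstats

def slow_encode(data):
    # Repeated string concatenation: this is the hotspot a profiler finds.
    out = ""
    for ch in data:
        out += ch * 2
    return out

def fast_encode(data):
    # Same result via join: a single linear pass.
    return "".join(ch * 2 for ch in data)

data = "abc" * 20_000
profiler = cProfile.Profile()
profiler.enable()
slow = slow_encode(data)
fast = fast_encode(data)
profiler.disable()

assert slow == fast  # both routines produce identical output

# Rank functions by cumulative time: slow_encode dominates the report.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(5)
```

The workflow mirrors what the speaker describes: run the program under the profiler, then read the report to see which routine the time went into, down to the source line.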
Alright, so that's the end of my talk, I still have 30 seconds. So you can download our version of Python at the moment from software.intel.com. It's still in beta, like I said, in the beginning of September it's gonna be released.
It's really supporting the full stack for high-performance computing and big data and machine learning, whatever. Alright, thank you very much. Thanks very much, Ralph. Any questions?
In the latest processors, you are starting to embed FPGAs. Do you have something about that? Yes. Yeah. Please speak up. Yeah, yep.
So the question was about, what about FPGAs? Intel bought Altera a few years ago, and since this year it's part of Intel, Altera, and our plan is of course to put some FPGA technology
in our Xeon processors going forward. At the moment, I have nothing specific to say about that. It's still in the works. You mentioned machines that have Xeon Phi as the main processor. Who's building these? Who's building these?
Where can I get one? You can get one. So you can go to your local OEM and order them so they can take orders. At the moment, on the market, we have the software development product. So the one I said, it's like a workstation.
You can buy it from Colfax or from a small German OEM; I don't recall the name now. So you simply go on the Xeon Phi web page and you have the list; there are two small OEMs who build this workstation.
You can order them. It's like $6,000, one workstation. But other than that, if you want to have more information, you have to go to HP or there and they are building currently their offerings. The Intel distribution for Python
is only for the Xeon series of processors, also for others? It's for Intel architecture. So you can use it on Core i3 or in a laptop. i3, i5? It's all the same, basically. Core i3 is basically the same as a Xeon E5.
The configuration is different but the basic architecture is the same. That's why it's really good for programmability. More questions? We have a couple more minutes. So I'll have one.
So the basic idea of the special Intel Python distribution is that you've just taken the exact same CPython source code and compiled it with different flags which are going to magically optimize the way Python works, or do you then have to also rewrite things like NumPy slightly to... As far as I know, I'm not the full Python expert.
We recompiled CPython using our low-level libraries like MKL. So you should not... I'm not promising anything, right? Don't record me now. It's... Yeah, it should work out of the box. Thanks. It's compiled with ICC, right?
It's compiled with ICC, right? The Intel compiler. That's the plan, yeah. Okay. So we are... ICC is the Intel C compiler, C++ compiler. It's our C++ compiler. We optimize it to the maximum, to what is doable and we of course use it for ourselves in this case.
But we also work with GCC so we also optimize GCC. It's not that we're not... We love the open source community. The question is if I have a library that's compiled with GCC if I have any issues with interfacing to it
from the ICC compiler. So theoretically, ICC and GCC are binary compatible, so you should not have any problem. You can mix code from both compilers. Any more questions? We probably have time for one more, so it might be the best question yet. It will have to fit into 45 seconds including the answer.
42 seconds. So one thing, the distribution is free. Of course it's free. It's not open source, it's free. The MKL is not open source. It's our proprietary library. But the distribution will be free to download, free to use and with community support. If you want premium support from Intel
you will have to pay through the Intel Parallel Studio package. Great. Well then will you join me in thanking Ralph? Thank you very much.