High Performance Python on Intel Many-Core Architecture

Video in TIB AV-Portal: High Performance Python on Intel Many-Core Architecture

Formal Metadata

High Performance Python on Intel Many-Core Architecture
Title of Series
Part Number
Number of Parts
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Ralph de Wargny - High Performance Python on Intel Many-Core Architecture. This talk gives an overview of the Intel® Distribution for Python, which delivers high-performance acceleration of Python code on Intel processors for scientific computing, data analytics, and machine learning. ----- In the first part of the talk, we look at the architecture of the latest Intel processors, including the brand-new Intel Xeon Phi, also known as Knights Landing, a many-core processor released at the end of June 2016. In the second part, we see which tools and libraries are available from Intel Software to enable high-performance Python code on multi-core and many-core processors.
Please join me in welcoming Ralph de Wargny.

Good afternoon and welcome to this talk. My name is Ralph de Wargny and I'm from Intel, actually more from Intel Software, because everybody knows Intel as a hardware manufacturer, a semiconductor company, but it is also a very large software company, with about 15 thousand software developers. At Intel we develop everything from low-level BIOS up to big data solutions, and my talk today is going to be about high-performance Python on Intel hardware.

I actually come from high-performance computing, where the traditional language is of course Fortran. Is anybody here using our Fortran compiler? Nobody, that's it, thank you. OK. When I go to specialized high-performance computing conferences, a lot of people still do Fortran, because the code hasn't changed since 1960; it's still running, it runs really fast, and it runs on the latest architectures. More and more people are of course doing C and C++, but what we saw in the last couple of years, since I've been in high-performance computing, is that more and more people are doing Python. So for the last couple of years we have also had Python on our radar.

Who here in the room does some kind of high-performance computing, high-performance data analysis, or scientific computing? All right. Who is using Intel processors? Very good. Who is using GPUs? Don't worry, don't worry, we take note. So today I'm going to talk first a little bit about the Intel architecture, where we are at the moment, and after that about what we are doing for the Python community to bridge between software development and our high-performance processors.
I will mostly talk about high-performance processors like the Intel Xeon and Xeon Phi, although even the regular Core i7 in your laptop works the same way, not to mention the embedded parts. So, what is this? Who knows what it is? It's a Xeon Phi. Who has heard about the Xeon Phi? About half of the room, right. This is the latest Xeon Phi, which we just released; its code name is Knights Landing. It has 72 cores, and you can count them; every core runs four threads, and they are connected through an on-die mesh interconnect, so we think it's going to be a really fast system.
So how does it look in reality? Here on the left you have the version with the fabric integrated; there is also a normal version without the integrated fabric. You can imagine it as a small cluster: a high-performance interconnect between all of the cores. The good and new thing with this Xeon Phi is that it is no longer only a coprocessor. This is the third generation, and until now it was a coprocessor: you had to plug it into a PCIe slot and then pass all your data through that small connection, and that was the limitation. For the new one we also have a coprocessor version, but the version we just launched is actually a self-boot version, so you don't need a host system anymore. It is the host, and it runs Linux directly; it can run a lot more, and a lot closer to the data, than the coprocessor could because of that limitation. It has MCDRAM on the package, 16 gigabytes of MCDRAM, which is really fast, and it runs a normal OS. So with that architecture it is really easy to program (that's the point, programmability), it is power efficient, it has a large memory (you can now use up to 384 gigabytes), and it is really scalable: you can build a cluster out of it.
So there is a regular version without the integrated fabric, and there is this one, the integrated-fabric version. It boots as the host; it is a host processor, so you can use it in a regular workstation. Currently we have a lot of customers, software developers, who use it in what looks like a regular PC: instead of a Core i7 they have a Xeon Phi inside. At a later stage it will also be available as a coprocessor.

To go back a little bit to the hardware: at the high end we have two processor families. The Xeon is the regular processor, which powers most of the servers at the moment, all the cloud, the internet. The Xeon Phi is our new processor targeted at high-performance computing, big data analytics, and machine learning, and it has the right features: 72 cores with 288 threads, and vector units that are 512 bits wide, so you can run the AVX-512 extensions. Those give you really great scalability if you are doing mathematical operations with single-instruction, multiple-data vectorization. This is what we have now, and in the future there will be still more cores: at least over the next five years we will grow more cores, more threads, and wider vectors, and that is going to be really essential if you want performance for applications with numerical, mathematical processing. The current Xeon version is Broadwell, this one here; it still doesn't have AVX-512, it has AVX2 with 256-bit-wide SIMD, but the next server version will get AVX-512 as well.
So what does this have to do with Python? A lot of Python development happens on these platforms; people want to use the new silicon and they want performance. Of course you are paying for all these transistors: if you buy a Xeon Phi you are buying billions of transistors, but if you run regular Python code on it, you will not use all the transistors that you paid for. So we want to help you get more performance out of it and use those transistors, especially for production, not only for prototyping. We are also seeing that for normal developers it is really difficult to use these high-performance extensions (even for us it is sometimes really tricky), and it is really hard to combine Python with them. That is why, two months from now, we are releasing our own distribution of Python, which will be called the Intel Distribution for Python. At the moment it is still in beta, and it is going to be released in the first week of September. Our aim is to give you, as a Python programmer, easy access to high performance. It is based on CPython, and we combine it with our libraries. The most important one is MKL; who knows MKL already? MKL is always at the forefront of performance, always optimized for the latest processor technology, and we have been able to wire it into our distribution. We are not only using MKL: we are also including DAAL, a new library called the Data Analytics Acceleration Library, and TBB for parallel programming.
What is required to bring Python performance closer to native code? In HPC you always want native code; that is why most programmers there use C, C++, or Fortran. Let me give you an example of what is required. There is a very interesting book that came out two years ago from one of my colleagues, High Performance Parallelism Pearls, in which people write articles about how they parallelized code for high performance. I took from it a very simple example, the optimization of Black-Scholes pricing. It is really easy to parallelize that formula.
If you run it in pure Python, you can calculate a certain number of options per second, on the order of a hundred thousand in a given timeframe. If you move it to a native C implementation, you can get about a 55 times performance improvement on the same computation. And if you really use everything that is in the Xeon Phi processor (vectorization, threading across all the cores, locality of data) you can get up to 350 times more performance. So what are we putting into Python to make that happen for you, without you having to code everything in C?
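As an illustration of the gap the speaker describes, here is a minimal NumPy sketch of vectorized Black-Scholes call pricing. This is not the book's code: the normal CDF is built from the Abramowitz-Stegun 7.1.26 erf approximation so everything stays in array operations, and all input values are made-up samples.

```python
import numpy as np

def erf(x):
    # Abramowitz & Stegun 7.1.26 polynomial approximation of erf,
    # written with NumPy ufuncs so it vectorizes over whole arrays.
    sign = np.sign(x)
    x = np.abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
           + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * np.exp(-x * x))

def norm_cdf(x):
    # Standard normal CDF from erf.
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """Price European call options for whole arrays of inputs at once."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm_cdf(d1) - K * np.exp(-r * T) * norm_cdf(d2)

# One call prices a million options; the loops run inside native
# vectorized kernels instead of the Python interpreter.
n = 1_000_000
rng = np.random.default_rng(0)
S = rng.uniform(10.0, 50.0, n)
K = rng.uniform(10.0, 50.0, n)
prices = black_scholes_call(S, K, T=1.0, r=0.05, sigma=0.2)
```

Writing the same formula as a Python-level loop over a million options spends its time in the interpreter; here each NumPy expression processes the whole array in one native call, which is the kind of rewrite the 55x and 350x figures build on.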
First of all, we are accelerating the numerical packages of Python with our libraries: MKL, DAAL, and a little bit of IPP, which covers the smaller primitives. We are wiring in TBB for thread parallelism, for instance to get rid of oversubscription, and in some cases also MPI, for the cluster versions. We have also made VTune, which is a profiler, work with Python; I am going to show you what that means. We are also optimizing other extensions, like Cython and Numba, and we are working on the big data and machine learning platforms and frameworks, like Spark and Caffe.
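One concrete face of the oversubscription problem mentioned here: a process pool whose workers each call into a multi-threaded BLAS can end up running workers times BLAS-threads threads. The Intel distribution addresses this with TBB; the sketch below shows a generic mitigation that also works with stock NumPy builds, capping each worker's BLAS pool through well-known environment variables. The workload itself is an arbitrary example of mine.

```python
# Must be set before NumPy (and its BLAS) is first imported in the process.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")       # honored by MKL builds
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")  # honored by OpenBLAS builds

import numpy as np
from multiprocessing import Pool

def worker(seed):
    # Each worker does its own matrix multiply on a single BLAS thread;
    # parallelism comes from the process pool, not from nested thread pools.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((200, 200))
    return float(np.trace(a @ a.T))  # Frobenius norm squared, always > 0

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(worker, range(4))
    print(results)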
So what is in there? I am going to repeat myself a bit, but we are optimizing NumPy, SciPy, and scikit-learn, among other packages. There is a really long list describing exactly what we are optimizing in which packages, and we are providing a package interface for it, so it is going to be available in this distribution and also from Anaconda, from Continuum Analytics.
And of course we like the open source community; this community is amazing, what happens here is amazing, and we want to give the good things back to the community. Eventually we are going to optimize the other packages as well. A quick overview of what is in and what isn't: at the moment, for the first version, we are going to include the BLAS functionality, multi-dimensional FFTs, the vector math functions, and the random number generators, very strong random number generators which can be used in Monte Carlo simulations.

Here is an example of what can happen if you use our accelerated version. This is the FFT implementation; you can see it on the right side in comparison to regular Python. In one configuration you can get up to 10 times the acceleration, and in the other up to 5 times.
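Whether you see speedups like these depends on which BLAS/FFT backend your NumPy is linked against (MKL in the Intel build). A quick, hedged way to check the backend and to confirm that the heavy lifting happens in native code:

```python
import numpy as np

# Print the build configuration: an MKL-backed build shows "mkl" in the
# BLAS/LAPACK sections; stock wheels typically show OpenBLAS.
np.show_config()

# Regardless of backend, large matrix products are computed inside the
# linked BLAS (dgemm), not in the interpreter:
a = np.ones((500, 500))
b = np.ones((500, 500))
c = a @ b          # dispatched to the linked BLAS
print(c[0, 0])     # each entry is a dot product of two length-500 one-vectors
```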
Random number generators: who is using random number generators? A few of you. There we can get really nice results, more than 50 times the performance of regular Python.
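The shape of that comparison can be reproduced with stock tools. The absolute numbers are machine-dependent, and the Intel build additionally backs the generators with MKL streams, but the gap between a Python-level loop and one vectorized draw is already large:

```python
import time
import random
import numpy as np

N = 1_000_000

# Python-level loop: one interpreter round-trip per variate.
t0 = time.perf_counter()
xs = [random.gauss(0.0, 1.0) for _ in range(N)]
loop_s = time.perf_counter() - t0

# Vectorized draw: one call, the loop runs inside the native RNG kernel.
rng = np.random.default_rng(42)
t0 = time.perf_counter()
ys = rng.standard_normal(N)
vec_s = time.perf_counter() - t0

print(f"loop: {loop_s:.3f}s  vectorized: {vec_s:.3f}s  "
      f"speedup: {loop_s / vec_s:.0f}x")
```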
DAAL is optimized for machine learning, statistics, and big data analytics. It has a lot of components, and we are currently working on really making it available to Python data scientists.
Now let's look at VTune, the low-level profiler, itself. VTune is an old product which we keep current every year, and it is mostly used by C programmers. You use it to find hotspots in the code: where is my hotspot, where is my application spending its time, is there any performance gain I can implement using all my cores and all my threads, am I really using the processor at the low level in the right way? It is a very visual tool, it is low overhead (it does not eat much of the performance), and it can give you visibility down to the source code, pinpointing the source lines where your hotspot is. Until now it was unfortunate with Python: if you profiled Python code, it would show you the C code of the interpreter, and that was not the intent. Now we have made it work for Python as well.

Here is how it works. You have some Python code, something very simple: we have two routines, one slow and one fast, and we want to see whether VTune, running at the same time as this small program, can find the code that is slow (we know which one it is). So we start it; it runs alongside your code, and it can even analyze the performance monitoring units, the PMUs, in the processor, and see directly what is happening there. Then it shows you the result visually. I have it on a much bigger screen normally; this is a small window. It shows you where in your code the hotspot is, and no surprise here: the fast routine runs fast, the slow routine runs slow. If you click on it, you get directly to the Python source code. So VTune is now available for Python too, and it recognizes Python code.
VTune is really a low-level profiling tool which does not add a lot of overhead, something like 1.1 to 1.6 times. It works on Windows and Linux and can profile basically any Python version. It is a rich graphical user interface; I would not say it is easy to use, but we are trying to make it more and more so, and it supports different workflows: you can start the application and the profiler at the same time and wait for the results, or you can attach it to a running application and profile only a certain area of your code.
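The two-routine demo described above can be approximated with a script like this (the function names are mine, not from the talk): both functions compute the same reduction, and a sampling profiler pointed at the script will attribute nearly all of the time to the interpreted loop.

```python
import time
import numpy as np

def slow_sum_of_squares(values):
    # Pure-Python loop: every iteration goes through the interpreter, so a
    # profiler attributes almost all of the script's time to this function.
    total = 0.0
    for v in values:
        total += v * v
    return total

def fast_sum_of_squares(values):
    # Same reduction, pushed down into a single native NumPy call.
    arr = np.asarray(values)
    return float(arr @ arr)

data = list(range(200_000))

t0 = time.perf_counter()
s1 = slow_sum_of_squares(data)
t_slow = time.perf_counter() - t0

t0 = time.perf_counter()
s2 = fast_sum_of_squares(data)
t_fast = time.perf_counter() - t0

print(f"slow: {t_slow:.4f}s  fast: {t_fast:.4f}s  same result: {s1 == s2}")
```

Running this under any sampling profiler (VTune's Python support being the one shown in the talk) points the hotspot at `slow_sum_of_squares` and links back to these source lines.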
That is the end of my talk, and I still have 30 seconds: you can download the beta version of the Intel Distribution for Python right now from the Intel software site. It is still in beta; at the beginning of September it is going to be released. It really supports Python for high-performance computing, big data, and machine learning. Thank you.
Question: in the latest processors, FPGAs are starting to emerge; do you have something there? So the question was about FPGAs at Intel. Altera became part of Intel this year, and the idea is of course to put FPGA technology into our different processors going forward. At the moment I have nothing interesting or specific to say about that; it is still in the works.

Question: you mentioned machines where the Xeon Phi is the main processor; who is building those, and how do I get one? You cannot yet go to your local shop and order them. What is on the market at the moment is the software development platform: as I said, it is like a workstation, and you can buy it from Colfax or from a smaller German vendor whose name I don't recall. You simply go to the Xeon Phi page, and from there you can order one of these workstations; it is something like six thousand dollars for one. Beyond that, if you want more, you have to go to HPE, who are building and currently offering them.

Question: is the Intel distribution only for the Xeon series of processors, or also for other architectures? Can I use it on a Core i3 laptop? Yes. I would say it is all the same: a Core i3 is basically the same basic architecture as a Xeon or a Xeon Phi. The configuration differs, but the basic architecture is the same, and that is why it is really good for programmability. Any more questions? We have a couple more minutes.

Question: is the basic idea of the special Intel Python distribution that you just take the exact same CPython source code and compile it with different flags, which then magically optimize the way Python works, or did you also have to rewrite things like NumPy slightly? As far as I know we did not fork anything; don't expect a fork. We recompiled Python and built the packages against our lower-level libraries, like MKL. I am not promising anything, but it should work out of the box. There are two components, so to speak. The first component is the C compiler: we optimize ICC to the maximum of what is doable, and of course we use it ourselves in this case. But we also work with GCC; we optimize for GCC as well, because we love the open source community. A follow-up question: if I have a library that is built with GCC, will I have any issues interfacing it with code from ICC? ICC and GCC are binary compatible, so you should not have any problem; you can mix code from both.

One more question; it will have to fit into 45 seconds including the answer. 42 seconds. Question: is the distribution free, and is it open source? The distribution is free, of course, but it is not open source: MKL is not open source, it is our proprietary library. The distribution will be free to download and free to use for the community, and if you want premium support from Intel you will have to pay. Thank you very much.