High Performance Python on Intel Many-Core Architecture

Speech Transcript
[Moderator] Please join me in welcoming Ralph de Wargny, who will talk about high-performance Python on Intel many-core.

Good afternoon and welcome to this talk. My name is Ralph de Wargny and I am from Intel, actually more from Intel Software, because everybody knows Intel as a hardware manufacturer, a semiconductor company, but it is also a very large software company with about 15,000 software developers. At Intel we develop everything from low-level BIOS up to high-end big data solutions, and my talk today is about high-performance Python on Intel Xeon Phi.

I come more from high-performance computing, where the traditional language is of course Fortran. We have a Fortran compiler; is anybody here using the Fortran compiler? Nobody, that's it, thank you. OK. When I go to specialized high-performance computing conferences, a lot of people still do Fortran, because the code hasn't changed since the 1960s, it's still running, it runs really fast and it runs on the latest architectures. But more and more people are of course using C and C++, and what we have seen in the last couple of years, since I have been in high-performance computing, is that more and more people are doing Python. So it's no wonder that for the last couple of years we have also been here at EuroPython.

Who here in the room is from high-performance computing, or doing some kind of high-performance data analysis or scientific computing? Alright. Who is using Intel processors? Very good. Who is using GPUs? Don't worry, don't worry.

So today I am first going to talk a little bit about the Intel architecture, where we are at the moment, and after that about what we are doing for the Python community to bridge the gap between software development and our high-performance processors. I will mostly talk about high-performance processors like the Intel Xeon and Xeon Phi, but it also applies to the regular Core i7 you might have in your laptop, not so much to the embedded stuff. So first of all, who knows what a Xeon Phi is?
A Xeon Phi, not a regular Xeon. About half of the room, right. So this is the latest Xeon Phi, which we released just at the end of June; the code name is Knights Landing. It has up to 72 cores, you can count them, with four hardware threads per core, and they are connected through an on-chip mesh. We expect it to be a really fast system.
So how does it look in reality? Here on the left you have the version with the fabric integrated; there is also a normal version without the integrated fabric. It is like a small cluster on a chip: you can imagine it as a high-performance interconnect between all of the cores. The new thing with this generation is that it is not only a coprocessor anymore. Until now the Xeon Phi was a coprocessor, so you had to plug it into a PCIe slot and then pass all your data through this small pipe, and that was the limitation. With the new one we still have a coprocessor version, but the version we just launched is a self-boot, socketed version, so you don't need a host system anymore: it is the host, and practically anything that runs on Linux can run on it. It can run a lot more, a lot closer to the data, than the coprocessor, because it doesn't have those limitations.

It has MCDRAM on the package, 16 gigabytes of MCDRAM, which is really fast, and it runs a normal operating system. So the qualities of this architecture are that it is really easy to program, that is the programmability, it is power efficient, it has a large memory you can now use, up to 384 gigabytes of DDR4, and it is really scalable, so you can build a cluster out of it.

There is a regular version without the integrated fabric, and this one is the integrated-fabric version with Omni-Path. It goes into a host socket; it is a host processor, so you can use it in a regular workstation. Currently we have a lot of customers, software developers, who use what looks like a regular PC, but instead of a Core i7 it has a Xeon Phi inside. At a later stage it will also be available as a coprocessor.

To go back a little bit more to the hardware: at the high end we have two processor families. The Xeon is the regular processor, which is in most of the servers at the moment, all the cloud, the internet, everything runs on it. The Xeon Phi is our new processor targeted at high-performance computing, big data analytics and machine learning. It features 72 cores with 288 threads, and it has vector units that are 512 bits wide, so you can run the AVX-512 extensions.
AVX-512 gives you really great scalability if you are doing mathematical operations with single-instruction, multiple-data vectorization. This is what we have now, and in the future there will still be more cores: at least for the next five years we will grow to more cores, more threads and wider vectors. That is going to be really essential if you want performance for applications with numerical, mathematical processing. The current Xeon generation, Broadwell, this one here, still doesn't have AVX-512; it only exposes AVX2 with 256-bit-wide SIMD, and the next generation, Skylake, will bring AVX-512 to the Xeon server line as well.

So what does all of this have to do with Python?
A lot of Python developers are using these platforms and want to use the new silicon; they want performance, of course. You are paying for all these transistors: if you buy one of these you are buying more than five billion transistors, but if you run regular Python code on it you will not use a big part of the transistors you paid for. So we want to help you get more performance out of it and actually use those transistors, especially for production, not only for prototyping. We also see that for normal developers it is really difficult to use these high-performance extensions; even for us it is sometimes really tricky, and it is especially hard to combine Python with them. That is why this year, two months from now, we are releasing our own distribution of Python, which will be called the Intel Distribution for Python. At the moment it is still in beta, and it is going to be released in the first week of September.

Our aim is to give you, as a Python programmer, easy access to high performance. The distribution is based on CPython, and we combine it with our performance libraries. The most important one is MKL. Who knows MKL already? The Math Kernel Library is always at the forefront of performance, always optimized for the latest processor technology, and we have built it into our distribution. But we are not only using MKL; we are also using other libraries, like a new one called DAAL, the Data Analytics Acceleration Library, which is also in there, and we are including TBB, Threading Building Blocks, for parallel programming.
So what is required to bring Python performance closer to native code? In HPC you always want native code; that is why most programmers there use C, C++ or Fortran. Let me give you an example of what is required. There is a very interesting book that came out two years ago from one of my colleagues about high-performance computing, High Performance Parallelism Pearls, in which people write articles about how they parallelized their code. I took from it a very simple example, the optimization of Black-Scholes pricing. It is really easy to parallelize that formula. If you run it in pure Python, the number of options per second that can be calculated is, let's say, around one hundred thousand. If you move this to a native C implementation, you get about 55 times the performance for the same computation. But if you really use everything that is on the chip, everything that is included in a Xeon Phi processor, vectorization, threading, data locality, you can get up to 350 times more performance.
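As a rough sketch of what that vectorized style looks like in Python itself (illustrative code, not the benchmark implementation from the book; all names and parameter values are made up for the example), the Black-Scholes call price can be computed for a whole batch of options with NumPy array operations instead of a per-option loop:

    import numpy as np
    from scipy.special import ndtr   # standard normal CDF

    def black_scholes_call(S, K, T, r, sigma):
        # Vectorized Black-Scholes call price; S, K, T may be NumPy arrays.
        d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
        d2 = d1 - sigma * np.sqrt(T)
        return S * ndtr(d1) - K * np.exp(-r * T) * ndtr(d2)

    # Price one million options in one call; the array expressions map onto
    # the processor's vector units instead of a slow per-option Python loop.
    n = 1000000
    S = np.random.uniform(10.0, 50.0, n)    # spot prices
    K = np.random.uniform(10.0, 50.0, n)    # strike prices
    T = np.random.uniform(0.2, 2.0, n)      # years to maturity
    print(black_scholes_call(S, K, T, r=0.02, sigma=0.3)[:5])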
So what are we putting into Python to make that happen for you, without you having to code everything in C? First of all, we are accelerating the numerical packages of Python with our libraries, with MKL, DAAL and also a little bit of IPP, the Integrated Performance Primitives. We are using TBB for parallelism, for example to get rid of oversubscription when different libraries each create their own threads, and in some cases also MPI, for the cluster versions. We also bundle VTune, which is a profiler; I am going to show you what you can do with that. We are optimizing other pieces of the Python stack such as Cython and Numba, and we are also working on big data and machine learning platforms and frameworks like Spark and Caffe.

So what exactly is in there? I am repeating myself a little bit, but we are optimizing NumPy, SciPy, scikit-learn and all that stuff; I can give you the really long list of the packages we are optimizing. We are also providing a specific interface for DAAL, pyDAAL, which is going to be available in this distribution, and the distribution is also available through conda, from Continuum Analytics.
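A quick way to see whether a given NumPy build is actually backed by MKL, and to get a feel for the difference on a BLAS-heavy operation, is something like the following; this is a generic sketch, not code shipped with the distribution:

    import time
    import numpy as np

    # Shows which BLAS/LAPACK the NumPy build was linked against; an MKL-backed
    # build such as the one described in the talk lists mkl libraries here,
    # while a stock build typically shows OpenBLAS or the reference BLAS.
    np.show_config()

    # A large matrix multiply spends nearly all its time in BLAS, so it is a
    # simple proxy for how much the accelerated build helps.
    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    t0 = time.time()
    np.dot(a, b)
    print("2000x2000 matmul: %.3f s" % (time.time() - t0))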
And of course we like the open source community; this community is amazing, what has happened here is amazing, and we of course want to give the good things back to the community, and eventually we are also going to optimize the other packages of the Python stack.

A quick overview of what is and is not in there: for the first version we are going to include BLAS and LAPACK, one- and two-dimensional FFTs, the vector math functions, and random number generators, very strong random number generators which can be used in Monte Carlo simulations.
Here is an example of what can happen if you use our beta version. This is the FFT implementation; you can see on the right side the comparison to regular Python: depending on the configuration you can get up to 10 times the acceleration in one case and up to 5 times in the other. And the random number generators, if that is interesting to you, if you use random number generators a lot: there we really get very nice results, more than 50 times more performance than the regular packages.
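For reference, the two operations behind those numbers can be exercised with a few lines of NumPy; this is just a generic timing sketch with arbitrarily chosen array sizes, covering the FFT and random-number workloads the benchmarks above refer to:

    import time
    import numpy as np

    x = np.random.rand(2 ** 22)          # input signal for the FFT timing

    t0 = time.time()
    np.fft.fft(x)                        # the operation the FFT benchmark exercises
    print("FFT of %d points: %.3f s" % (x.size, time.time() - t0))

    t0 = time.time()
    np.random.standard_normal(10 ** 7)   # the operation the RNG benchmark exercises
    print("10 million normal draws: %.3f s" % (time.time() - t0))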
DAAL is optimized for machine learning, statistics and big data analytics; it has a lot of components, and we are currently working on really making it available to Python as a proper Python package.
Let's now have a look at VTune, the low-level profiler itself. VTune Amplifier is an older product which we update every year; it is mostly used by C programmers to find the hot spots in the code: where is my hot spot, where is my application spending its time, is there any performance gain I can get by using all my cores and all my threads, am I really using the processor at the low level in the right way? It is a very visual tool, it is low overhead, it does not eat much of the performance, and it gives you visibility down to the source code, so you can pinpoint the line where your hot spot is. Until now it was unfortunate that if you profiled a Python program it would show you the C code of the interpreter, which was not what you intended; now we have made it recognize Python as well.

So how does it work? Here you have some Python code, something very simple: we have two work routines, one is slow and one is fast, and we want to see whether VTune, running at the same time as this small program, can find the code that is slow; we of course already know which one it is. So we start it; it runs alongside your code, it can even read the performance measurement units, the PMUs in the processor, and see directly what is happening in the processor. And it shows you the result here; this is the visual output, normally I have it on a much bigger screen, this is a small window. It shows you where in your code the hot spot is: the slow function it flags, surprise, as slow, the fast function is OK, it runs fast. And if you click on it you get directly to the Python source code. So VTune is now available also for Python and it recognizes Python code.
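The demo script itself is not shown in the transcript, but the setup described, one deliberately slow routine and one fast one, can be reproduced with a few lines like these (purely illustrative names and sizes), which is the kind of program you would point the profiler at:

    import numpy as np

    def slow_sum(n):
        # Pure-Python loop: this is the hot spot a profiler should point at.
        total = 0.0
        for i in range(n):
            total += i * 0.5
        return total

    def fast_sum(n):
        # The same computation expressed as a NumPy array operation, so it
        # runs in optimized native code instead of the interpreter loop.
        return float(np.arange(n, dtype=np.float64).sum() * 0.5)

    if __name__ == "__main__":
        print(slow_sum(5000000))
        print(fast_sum(5000000))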
VTune is really a low-level profiling tool which does not add a lot of overhead, something like 1.1 to 1.6 times, and it works on Windows and Linux, basically with all the Python versions we provide. It is a rich graphical user interface; I would not say it is easy to use, but we are trying to make it simpler and simpler. It supports different workflows: you can start your application from the profiler, wait for it to finish and analyze the results, or you can attach it to a running application and profile only a certain area of your code.
So that is the end of my talk; I still have 30 seconds. You can download our distribution of Python at the moment from the Intel software site. It is still in beta, but at the beginning of September it is going to be released, and it is really aimed at supporting Python for high-performance computing, data analytics and machine learning. Thank you.
[Audience question about FPGAs appearing in the latest processors]

So the question was about FPGAs at Intel. Intel acquired Altera, and since this year it is part of Intel, and the idea of course is to put FPGA technology into our different processors going forward. At the moment I have nothing interesting or specific to say about that; it is still in the works.

[Audience question: you mentioned machines where the Xeon Phi is the main processor; who is building these, and how can I get one?]

You cannot just go to your local shop and order one; they cannot take orders off the shelf at the moment. What we do have are the software development platforms: as I said, it is like a workstation, and you can buy it from Colfax or from a smaller German reseller whose name I do not recall right now. They are listed on the Xeon Phi page, and with these workstations you can place an order; it is around six thousand dollars for one workstation. Other than that, if you want more machines or more information, you have to go to HP and the other OEMs, who are currently building and offering systems.

[Audience question: is the Intel distribution only for the Xeon series of processors, or for all IA architectures, so I could use it on a laptop with a Core i3?]

Yes, I would say it is all the same. A Core i3 is basically the same architecture as a Xeon Phi; the configuration is different, but the basic architecture is the same, and that is exactly why it is really good for programmability.

[Moderator: any more questions? We have a couple more minutes.]

[Audience question: is the basic idea of the special Intel Python distribution that you take the exact same CPython source code and just compile it with different flags, which somehow magically optimizes the way Python works, or did you also have to rewrite things like NumPy?]

No, as far as I know we are not rewriting NumPy from scratch; we rebuilt Python and the numerical packages on top of our lower-level libraries like MKL. I am not promising anything, but it should work out of the box. It is compiled partly with the Intel C compiler: we optimize that compiler to the maximum of what is doable, and of course we use it ourselves in this case. But we also work with GCC, so we also optimize GCC; we love the open source community.

[Follow-up question: if I have a library that was built with GCC, will I have any issues interfacing with it from ICC-built code?]

ICC and GCC are binary compatible, so you should not have any problem; you can mix code from both compilers.

[Moderator: time for one more question; it will have to fit into 45 seconds including the answer.]

[Audience question: is the distribution free, and is it open source?]

42 seconds. The distribution is free, of course; it is not open source. MKL is not open source, it is our proprietary library, but the distribution will be free to download and free to use for the community. If you want premium support from Intel, you will have to pay. Thank you.

Metadata

Formal Metadata

Title: High Performance Python on Intel Many-Core Architecture
Series title: EuroPython 2016
Part: 100
Number of parts: 169
Author: Wargny, Ralph de
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify and reproduce the work or content in unchanged or modified form for any legal, non-commercial purpose, and distribute and make it publicly available, provided that you credit the author/rights holder in the manner specified and pass on the work or content, including in modified form, only under the terms of this license.
DOI: 10.5446/21209
Publisher: EuroPython
Publication year: 2016
Language: English

Content Metadata

Subject area: Computer Science
Abstract: Ralph de Wargny - High Performance Python on Intel Many-Core. This talk gives an overview of the Intel® Distribution for Python, which delivers high-performance acceleration of Python code on Intel processors for scientific computing, data analytics, and machine learning. ----- In the first part of the talk, we look at the architecture of the latest Intel processors, including the brand-new Intel Xeon Phi, also known as Knights Landing, a many-core processor which was just released at the end of June 2016. In the second part, we see which tools and libraries are available from Intel Software to enable high-performance Python code on multi-core and many-core processors.
