
Addressing multithreading and multiprocessing in transparent and Pythonic methods

Video in TIB AV-Portal: Addressing multithreading and multiprocessing in transparent and Pythonic methods

Formal Metadata

Title
Addressing multithreading and multiprocessing in transparent and Pythonic methods
Alternative Title
Addressing multithreading and multiprocessing in transparent and Pythonic ways
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
With the increase in computing power, harnessing and controlling one's code outside the single-threaded realm becomes an ever-increasing problem, coupled with the desire to stay in the Python layer. With the recent tools and frameworks that have been published, escaping the GIL cleanly is much easier than before, allowing one's Python code to effectively utilize multi-core and many-core architectures in the most Pythonic ways possible. In this talk, learn how to utilize static multiprocessing for process pinning, and how to effectively balance thread pools with a monkey-patched import of threading modules.
Overview:
- Introduction to multithreading and multiprocessing in Python
- History of multithreading+multiprocessing in Python, classic frameworks
- Problems that can occur (oversubscription, nested parallelism issues, process hopping, pool resource on shared machines)
- Python accessing bigger hardware over the last few years (28+ cores, etc.)
- When to stay in the GIL, and when to escape it
- The advantages and safety of the GIL
- Python-level exiting of the GIL; analysis of when to return to single-threaded, and when threading is a deceivingly bad idea
- Accountability of frameworks that natively exit the GIL
- The new multithreading and multiprocessing libraries and techniques
- static multiprocessing module (smp) (and monkey patching of multiprocessing)
- thread pool control with command line calls of Python (python -m tbb -p 8)
- Putting it all together
- Examples of using static multiprocessing on a large machine to stop oversubscription
- Example of pseudo-daemon process on 4-core machine by processor pinning
- Thread pool control on a simple NumPy example
- Summary - Best practices for using above methods to control multithreading+multiprocessing
- What needs to be done in the space (frameworks and things that need to be exposed)
- Problems that still exist in the area
- Q&A
Thank you so much. Good morning. My name is David Liu; I'm a Python technical consultant engineer for Intel, and today I'll be talking about addressing multithreading and multiprocessing in transparent and Pythonic methods. So, just kind of a general
overview of this talk: one of the things I'm going to do is state what the current state of concurrency and parallelism is in the industry. I'm going to talk a little bit about nested parallelism and oversubscription and what those problems are. We're going to talk a little bit about composable methods and thread control, and then about how some of the packages that address these issues work under the hood, what it means to have a Pythonic style, and the future of Pythonic style for parallelism. So one of the things I
kind of like to do is point out that the Python language itself has had a lot of luck in attracting good talent, a lot of the best people for addressing concurrency and multiprocessing, and we can see that list here. We're one of the few languages that encompass all of these frameworks. Over the years you can see that the progression of the frameworks has included general threading, multiprocessing, and task-parallel types of workflows, and the large number of packages now in this space helps fill out the Python ecosystem, such that we have a lot of options when we choose to go for parallelism. We're one of the few languages to have that. From 2008 to 2017 you can see the large number of packages that we have, and a few of those packages have actually been talked about at this conference. So again, like I
said, the options in the space are very good compared to other ecosystems, and the majority of them do a very good job of playing nicely with the global interpreter lock. If you were expecting this talk to get rid of the global interpreter lock, this is probably not the talk for you. But we do a very good job by using distributed or vectorization techniques, or by working nicely with the GIL, and that's one of the biggest benefits of the Python ecosystem and the packages that are included in it. For the more domain-specific areas, one can rely on HPC libraries to do that type of work for you, to harness parallelism and threading. SciPy and NumPy do a great job of this, right? When you make a NumPy call, under the hood it's calling a C library, which does the majority of the data-parallel work that is required to get the job done quickly. With that being said, one of the recent trends in the industry is increasing core counts and thread counts. That's becoming more commonplace in the server space, and even in the laptops that you use you're seeing an increasing number of cores and threads becoming available. Because of that, nested parallelism and oversubscription are now quite possible in the kernels that you're writing. Some of you may be asking, well, what exactly is that? We'll go into that in a little bit, but let's first talk a little bit about the GIL, because this topic gets talked about a lot, right? The GIL has been complained about by many people in this space, and many efforts have been made to remove it. There have been a few talks in the last few years at PyCon about trying to do it, and they're very valiant efforts, too. But as it stands, what the GIL provides us is relatively important, and it's kind of hard to ignore some of those things, right? You
know: the read/write safety of Python objects, predictable behavior, right? The language really wasn't written to be thread-safe, and the guarantees that you get with types and everything come from the guarantee that the GIL provides you. In addition to that, when you're developing your own modules and extensions and other things, putting that type of expectation on the developer is a very hard expectation. It says: if I'm developing a framework, I now have to expect that it's not going to be single-threaded, and other people may be accessing my objects. That's extremely hard to test, so passing that burden onto the developers is also not that great an idea. Again, that's why the GIL provides something that allows you to easily work on and create extensions for Python. And because the GIL provides that safety, and we have so many good frameworks, it's kind of a non-issue today, right? There are many frameworks that have found a way to cleanly step around the GIL. SciPy and NumPy are great examples of this: you basically send a command, numpy.dot or something similar; it gets dispatched to your BLAS API, where you're using Intel's Math Kernel Library or OpenBLAS depending on your implementation; and it gets vectorized and parallelized inside the CPU, completely transparently to you. So NumPy and SciPy do an amazing job of this, and that's one of the examples of cleanly stepping around the GIL by understanding what that data flow is. Again, there are a lot of other frameworks that use this type of vectorization, right? Numba, numexpr, and Cython all do this type of vectorization work for you while allowing you to stay within the Python layer. The multiprocessing frameworks that have now been included in the main library of Python, as of Python 3, have great
ways of escaping via separate processes, not necessarily just stepping out along the vectorization line, but you can also have separate threads within those, and that's where some of the oversubscription problems can happen. Generally, exiting the GIL with a C library is the most Pythonic-ish way of doing things, and this has been talked about by a lot of people in the numeric space: if you understand the abstraction of your computational flow, you can write a library that does this type of work, wrap it in Python, and that essentially is the most Pythonic way of operating. Composing these abstracted flows by splitting off into multiple processes can also be a cleaner way of escaping the GIL. It's very rare to absolutely need the language to be thread-safe; there are very few instances where we would ever really need that, and I think the advantages of Python going away would probably be the main detractor if we started doing that.
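The dispatch pattern described above can be illustrated with a minimal sketch. It assumes a NumPy build linked against a BLAS library (MKL or OpenBLAS, for example); the function names `gram` and `run`, the matrix size, and the worker count are hypothetical choices for illustration, not anything shown in the talk. The matrix product is handed to the compiled library, which is where NumPy can release the GIL, so a plain thread pool can overlap several of these calls while the calling code stays in Python.

```python
from multiprocessing.pool import ThreadPool

import numpy as np


def gram(matrix):
    # The product is dispatched to the BLAS library that NumPy is linked
    # against; the heavy lifting happens in compiled code, not in Python.
    return matrix.T @ matrix


def run(n_tasks=4, size=200, workers=2):
    # Hypothetical sizes, chosen only to keep the sketch quick to run.
    rng = np.random.default_rng(0)
    data = [rng.random((size, size)) for _ in range(n_tasks)]
    # A small thread pool: each worker spends most of its time inside BLAS.
    with ThreadPool(workers) as pool:
        return pool.map(gram, data)


if __name__ == "__main__":
    results = run()
    print(len(results), results[0].shape)
```

Note that this already combines two layers of threading (the pool and whatever threads the BLAS library uses), which is the composition the rest of the talk is concerned with.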
So if we start breaking up the space now into three main areas with this Venn diagram, looking at application-level parallelism, single-threaded concurrency, and a data-parallelism focus, we can split up the majority of frameworks in this space, categorize them, and see what areas overlap. You see the area that has been talked about a lot, with Trio and Tornado or Celery, or really anything that lies within concurrent.futures. That's another big thing: when Python came down and said concurrent.futures is the API that we really want to support, that was huge, because a lot of these frameworks are now designing towards it. You can see that that area is more along the line of single-threaded concurrency, and when most people think they need parallelism, they most likely just need concurrency. When you get to application-level parallelism, you're seeing multiprocessing or joblib or similar frameworks in that space; Dask also encompasses part of it. When you get to the data-parallelism focus, you can see that the packages we talked about in the numerical space, NumPy, SciPy, Numba, Cython, and numexpr, all sit in that area, because they understand that data parallelism is the area they want to focus on, and by abstracting that call you can exit the GIL and do the data-parallel work while still being able to return all of that back into the Python layer. And when you get into areas where you need both single-threaded concurrency and data parallelism, you can get mpi4py or some really weird types of concurrency-and-data-parallelism focus that lie in that area. And then the center area, obviously: maybe it's a unicorn, maybe it's mpi4py; obviously that's also a little harder to work with. So this hopefully gives you an understanding of what the different areas
are and what they encompass. What I want to do with this, though, is focus down and talk about two specific areas today. We're going to look at application-level parallelism and the data-parallelism focus; this is where a lot of the final frontier has been sitting. If we expand that now into three areas, we have Python multiprocessing, Python multithreading, and the data-parallelism focus, and now you can see where some of these frameworks lie. Dask is clear in the middle of it; it's one of those actually interesting types of frameworks. If we were in the U.S. and Matthew Rocklin were here, he'd be very happy; he's one of the main maintainers of Dask. That being said, now that we understand what space I want to talk about today: this area that's the intersection of them is where nested parallelism and oversubscription occur. When you start mixing these different libraries, multiprocessing with NumPy, or Numba with other elements that have been composed on top of it, or you start getting into multithreading, this is the area where oversubscription and nested parallelism can occur. So you may
ask, what does that actually look like? The answer is that it can look like relatively benign code, right? For many of you this may look like a very simple type of thing that you would run into if you were just developing with NumPy, or just trying to scale a little bit. Here we have a NumPy call with random, we have a multiprocessing pool with ThreadPool, and then we do a pool.map on a NumPy call. Well, what exactly have you done here? The problem is that you've now done a composable type of nested parallelism without even knowing it, and that's where it can get really scary, because now you can have threads being spawned inside a nested parallel process, and if you put this on a larger compute system it can go out of control. You go from P Python threads to the threads that are in NumPy, and then that nested capability will nearly double, or make quadratic, the number of threads if you're not careful, depending on what's available in the system. And so you go from a relatively known
set of threads, where you're like, okay, this is one call in NumPy, I know what's going on, right? Well, now if I call that with multiprocessing on top of it, I've just created a messy tangle of threads, because now each one can spawn a bunch of other ones, and it's relatively uncapped: it doesn't have any rules being passed down. So what are
the problems that you have with that? You essentially get oversubscription: you have a lot more threads than are actually mapped to the CPU, and those then spawn other ones, with no rules controlling it. With that many threads you have direct OS overhead for switching out threads, the CPU cache becomes cold, you get a performance hit, and you say, well, this actually worked faster on my laptop; how did that happen? It's kind of an invisible impact if you're not used to it. Other threads are waiting for the other ones to return, and there are just way too many threads for the actual logical cores. Now, a lot of the popular frameworks that have this problem have solved it in a relatively simple way, though not the cleanest one: they lock the number of threads to one for that specific process, which is an okay-ish solution, but it doesn't scale well. They'll set OMP_NUM_THREADS, they'll set the block time lower, but it's not always the cleanest way. scikit-learn definitely has this; if you use grid search you'll see it. PyTorch and TensorFlow all exhibit this problem, because of the type of composable parallelism they use to give you machine learning or other forms of work on top; this is one of the issues that they run into. I'll talk a little bit more about SMP when we get to it, but SMP is one of the packages that addresses it. So now let's talk about the composability modules that help address this space. One of them is TBB4Py. It's included with our Intel Distribution for Python, it's free, and it's actually a Python C extension for managing nested parallelism using a dynamic task scheduler. So if you use our version of scikit-learn in our
Intel Distribution for Python, we actually use TBB under the hood for some of those. But when you look at what it's providing, and this is the focus today, it's the dynamic task scheduler. If you have dynamically mapped tasks, and some of them finish a lot faster than the others, it's able to put those threads back into the thread pool and allow you to spawn new tasks even if they're unbalanced, so it handles unbalanced work relatively well. It instantiates via monkey-patching of Python's pools, enabling the TBB threading layer to be interchanged with the MKL one here, and so no code changes are required on your part because of that monkey-patching capability. Another one that we use in this space is static multiprocessing, or SMP. It's a pure Python package that manages nested parallelism through coarse-grained static settings. What that means is it tries to augment your parallelism by taking the rules that have been defined by your parallelism and the relevant environment variables and passing them down to the inherited processes, to control oversubscription that way. So it handles workloads that are a little more structured in that way. It again instantiates via monkey-patching, and it uses affinity masks and OpenMP settings to statically define and allocate those resources to avoid excessive threads. So now if
we go back and look at this example, you can see that these two packages can address the issue that we have here. So with the nested parallelism, how does that
actually work? TBB tries to accomplish this by saying: okay, you have your application, you have your OpenMP threading, and you have separate but uncoordinated OpenMP parallel regions. What happens is that they create too many software threads, which compete for all the logical processors. What running under the TBB module essentially does is say: this is the pool that's defined, and it can dynamically allocate and release threads to operate within that pool. So it keeps them mapped to the logical processors while keeping a hold on the oversubscription, because if one of these starts spawning five or ten of them while the other one spawns one, you can see where the problem occurs; whereas here, whichever one starts spawning will still be pulling from the same pool and still be mapped to an actual logical processor. Now, SMP does it in a completely different way. With the same problem that we had before, it says we want to take the thread pool implementation and propagate the masking settings down to each of the individual spawned processes. So you're essentially augmenting your MKL or BLAS threading to have the augmented settings passed down to each of the threads that are created from those processes. One of the advantages here is that you can actually mix the types of threading; it can handle both types of OpenMP threading in this case, which is relatively powerful. So one
of the things I'm going to do here is show you a small demo of what this looks like when you start having oversubscription. I'll be running this on one of our relatively large two-socket servers to show you what that looks like, and then show you how these frameworks address that problem.
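The benign-looking pattern described earlier (random NumPy data, a multiprocessing ThreadPool, and a pool.map over a NumPy call) can be sketched as follows. This is a reconstruction under stated assumptions, not the code from the slides: the helper names `worst_case_threads` and `nested_qr`, the array sizes, and the choice of `np.linalg.qr` as the inner call are illustrative. The arithmetic helper shows why the thread count can approach a quadratic figure when both layers size themselves to the machine.

```python
import os
from multiprocessing.pool import ThreadPool

import numpy as np


def worst_case_threads(outer_workers, inner_threads):
    # Upper bound on software threads when every outer worker triggers an
    # inner threaded region of the given size at the same time.
    return outer_workers * inner_threads


def nested_qr(data, outer_workers=None):
    # Outer layer: a Python thread pool (defaults to one worker per CPU).
    # Inner layer: each np.linalg.qr call may itself use a threaded
    # LAPACK/BLAS backend, so the two layers multiply.
    with ThreadPool(outer_workers) as pool:
        return pool.map(np.linalg.qr, data)


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    # Assumption: both layers default to roughly the logical CPU count.
    print("logical CPUs:", cpus)
    print("worst-case threads:", worst_case_threads(cpus, cpus))
    data = np.random.random((8, 64, 64))  # eight small matrices
    results = nested_qr(data)
    print("decompositions computed:", len(results))
```

The blunt workaround mentioned in the talk, capping the inner layer at one thread, typically corresponds to setting an environment variable such as OMP_NUM_THREADS=1 before NumPy is imported; whether that variable is honored depends on which threading backend the NumPy build uses.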
So right here I'll show you what type of setup this is. With hyper-threading it has about 88 cores; it's a two-socket system and it's one of the Xeons. So one of the things I'll do here, and this will take a bit of a while, but let's see.
Let's hope the SSH tunnel works today. This will take a little bit of time. As for what this code is actually running, and I'll show you here, it's a
relatively benign piece of code. Again, we have a for loop, we have a thread pool map from multiprocessing, and we have a NumPy call inside of it. One of the things you can see from this is that it's a relatively small amount of code that we might have written ourselves and that could actually cause this problem. If I were to run this on my local laptop, it wouldn't be too bad, but if I'm running it on a system with that many cores and that many threads, it's going to take a while. Here this example will repeat three times, and it'll display the time it took to complete each pass. So the first one took 39 seconds, and now I have to burn off another two times 39 seconds while I'm talking here to let this complete. So again, while
we're letting that run here: essentially, we have our data, which is created by numpy.random; we have our thread pool, created through the multiprocessing pool; we have a loop that runs three times; we time it; it's a relatively simple call here, a QR, mapped over the data in that range. Let's see... okay, we've hit the second one now, so we just have to burn off another 39 seconds here. Again, this is because of what we showed in that last slide: you're essentially hitting oversubscription, because it's saying, oh, I have all these threads that I can address, and multiprocessing is then mapping to the threads, and it's like, hey, look at all the threads I can create, and it just creates as many as it can. And that's where you can get in trouble: you've written your application, it works great on your laptop, you scale it to your server or your production machine, and, oh, why is it slower? Why is it so much slower? So one of the things that you can do now is actually run it like this. What TBB is going to do is say, I'm going to set my pool size. I think there are some defaults, and you can look at what those are by calling TBB and asking for --help, and it'll show you what the default sizes are; you can set that dynamic pool size. If I start running this, I'm probably not going to have enough time to actually finish my discussion here before it decides to clean itself up, but yeah, there you go. When we talk about combating oversubscription, the impact that that problem and nested parallelism can have is very evident now, right? With something as simple as the demo that I just showed you, TBB could just handle that, and that's relatively Pythonic: you can actually
call your script under TBB, and you make no code changes. I made zero code changes to this thing,
and it actually did that, right? SMP handles it a slightly different way. If we now call it under SMP and run it, then again it's taking those settings, the augmented style of parallelism, and now it's completing relatively quickly. So here you can also see that it handles nested parallelism and oversubscription in a completely different way, but it still addresses it, and it's able to handle that in a relatively simple way by allowing you to run under SMP without making any code changes. Okay.
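For readers who want to try something similar, here is a rough sketch of a timing harness in the spirit of the demo: a loop of three passes, each timing a ThreadPool map over a NumPy QR call. The function name `time_nested_qr` and the array sizes are placeholders, and the original demo script is not reproduced here. The launcher commands in the comments are assumptions based on the talk and its abstract (`python -m tbb -p 8` appears in the abstract); they require the respective packages to be installed, and `demo.py` is a hypothetical file name.

```python
import time
from multiprocessing.pool import ThreadPool

import numpy as np

# Possible ways to launch this file, assuming it is saved as demo.py:
#   python demo.py               # unmanaged nested parallelism
#   python -m tbb demo.py        # under the TBB module
#   python -m tbb -p 8 demo.py   # pool-size flag as given in the abstract
#   python -m smp demo.py        # under the SMP module (invocation assumed)


def time_nested_qr(data, repeats=3):
    """Return wall-clock seconds for each pass of a pooled QR map."""
    timings = []
    with ThreadPool() as pool:  # default size: one worker per logical CPU
        for _ in range(repeats):
            start = time.perf_counter()
            pool.map(np.linalg.qr, data)
            timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Placeholder sizes; the talk does not state the demo's dimensions.
    data = np.random.random((64, 128, 128))
    for index, seconds in enumerate(time_nested_qr(data), start=1):
        print(f"pass {index}: {seconds:.2f} s")
```

Because the script itself never mentions TBB or SMP, the same file can be compared across launchers, which is the "no code changes" property the talk emphasizes.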
So now that we've seen this demo, I think it's time to bring it back a little bit and talk about the industry again. In Python's ecosystem of concurrency and parallelism, the concurrency and async areas are very rich with packages; there are a lot of packages in that space. We've done a lot of work with concurrent.futures, and it helps solve the needs of the majority of Python users. But when we look at the areas of true parallelism and data parallelism, it's a strong area, but its focus has been relatively small in comparison to the concurrency and async offerings. That's why, when we look at the packages in this space, there really hasn't been much shown in the area, and we're now trying to make headway in what is one of the final frontiers of parallelism in Python. Most of the ways of achieving parallelism in this area rely on vectorization frameworks, multiprocessing, or distributed methods. So I think that raises the question of how you do it in a semi-Pythonic way. I'm going to introduce this kind of silly idea of "Pythonic-ish". I'm not saying it's truly Pythonic, because that's a whole different discussion, but let's just talk about Pythonic-ish. What makes something relatively Pythonic-ish? Relatively few code changes: you might have a small number of code changes. Maybe you have to modify the current behavior of one's framework to fit your needs, and you want to prevent a massive rewrite, so that's one of the things to be considered. Is it directly in the Python standard library? Is it writable from the Python layer? Do I have to drop into a different, lower-level language like C to be able to use it? Is the interface easy to understand? And does it keep you in the Python layer and not drop to an intermediate representation? So I think that then poses a question:
how close can we get? If we look through the lens of tbb4py, it meets quite a few of these, but two of them aren't met: it's not directly in the Python standard library, and it's not writable from the Python layer. On the other hand, you don't have very many code changes, and you're not modifying a lot of the framework's current behavior to make it work for you. It's a relatively easy interface: you saw that I just ran the script under the tbb module and made no code changes. You can set options with command-line arguments if you need to, and it keeps you in the Python layer and doesn't drop to an intermediate representation.

Looking through the lens of SMP, it's relatively few code changes, and it doesn't modify any current behavior of the framework. One of the interesting parts is that it is somewhat writable from the Python layer, because it does have an API that you can use. But you can also use it without that and run it just like I did, where I'm running the script under the smp module and letting it pass down the settings. It's relatively easy to understand, and it keeps you in the Python layer. The other thing to add here is that SMP is completely in Python. You can look at it on our GitHub; it's a pure Python package. So from the standpoint of integrating it into a solution or into other people's frameworks, it's relatively simple. It's still not in the standard library, but it's maybe a little closer to it, and it accomplishes the goal in a different way.

So I think this then raises these final four questions. The first is: how realistic is it to have a firm requirement for a pure Python implementation? TBB is not a pure Python implementation, but SMP is. And again, we're now talking in the light of addressing nested parallelism and oversubscription. The
second question would be: what is the best way to modify your Python code? Is it monkey patching? Is it a new, different framework? How do we want to address that space when we want to modify our Python code to operate under that augmented threading? Third, at what level should the parallelism be controlled? Should we be controlling it at the module call level, or when we're calling it from our own source code? And fourth, can an interface be agreed upon to operate on that parallelism? concurrent.futures did that relatively well; can we do the same?

So let's answer the first two questions. Now, with the demos shown for tbb4py and SMP, how realistic is it to have a firm requirement for a pure Python implementation? I would say it's not required, but it's highly recommended. We can see from the uptake of the packages we've released that people are trending towards the pure Python variants. There are also limits to what you can do from the pure Python layer, but maybe that's something vendors can work on with the PSF and the Python developers to try to figure something out.

Now, what's the best way to modify your Python code: monkey patching or a new framework? It seems like monkey patching is the new normal in this space. We're seeing a lot of examples where monkey patching is becoming the de facto standard when making packages that augment other packages' behavior. We see that with scikit-learn, and we see it in other places. So that seems to be the new normal, and it seems to be OK.

Then there's the question of at what level this parallelism should be controlled. Should it be controlled at the Python layer? Maybe. It sort of can be controlled from the Python layer. The challenge that you'll
start finding is that it needs directives for how additional layers can compose with it, so some type of composing directive may be useful in that space.

Can an interface be agreed upon to operate on that parallelism? I think the jury's still out on that one, because with every iteration we make of these packages we learn something new: something that works and something that doesn't. It seems like the Python community is still in that phase, and I urge you, if you're in this space, to continue pushing and seeing what makes sense. We're still very, very young in this space to know what is Pythonic, and what's the best way of operating on and augmenting your threading behavior while still being able to scale when you actually deploy to your production cluster or something similar. But with SMP we do get a slightly clearer picture of what it could look like. So now that
I've talked about all of this, I think you can see that tbb4py and SMP attempt to address the Pythonic-ish criteria that I've set out and to augment the way you do multithreading and multiprocessing. It's hard to do that in a way that doesn't require you to modify a lot of your code. I would still say it's best to leave multiprocessing and multithreading at their current levels, and not change too much of how we interact with them, at least at the Python and C levels. Try to keep them at their respective levels. Most multithreading is domain-specific. As I've said, when you do something that's data parallel, you typically know the domain you're operating in, and that seems to be the best choice. You have a lot of options for staying in Python, or you can drop down to C if you need to. NumPy has decided to be in C, while other frameworks allow you to stay within the Python layer without having to build against some type of C-based library yourself. Numba, numexpr, and Cython are great examples of this.

One of the thoughts is: what if you had some type of directive to say, "at this point I want only, maybe, 20% of the threads to be able to be spawned during this one section"? That might be better. You could leave that in comments, but doesn't that just sound like #pragma omp? So that raises the question of what is Pythonic, or Pythonic-ish, at that point. If we're leaving things in the code that literally look like C, is that really useful? Are we complicating the language? Or is that maybe the way we're going to achieve composable parallelism when we start combining frameworks? I think that poses a great question. Augmenting the threading behavior seems to be the more useful approach, based upon the experiments
that we've run, but that also means the bulk of the responsibility is on the users themselves. If you're a framework designer, how you choose to do your threading is really your choice, and I think that is a relatively heavy responsibility. It's not as heavy as expecting your code to always be completely thread safe, but it's still a high requirement. Threading in general for numerical work has a lot of known frameworks. And the thing is, if you're going to try to remove the GIL or do anything similar, you're going to be removing the ability to use just a Python object, and then you'll need stricter typing. That then poses the question: why are you actually using Python in this instance?

So, to summarize everything and wrap up: the Python ecosystem has a critical mass of good frameworks, which we walked through today, that look to address multithreading and multiprocessing. For those of you who are working on it, keep on pushing and seeing what the limits are. Today's demonstration shows what we're trying to do in our space, and we encourage you to either contribute to ours or find and propose other ways of doing so. Thank you, and with that I'm open for Q&A. [Applause]
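As an illustration of the monkey-patching approach discussed in the talk, here is a minimal sketch that caps thread pool sizes without touching the calling code. It uses only the standard library. The names `capped_thread_pools` and `library_code` are hypothetical and invented for this example, and this is not how tbb4py or SMP are implemented. It reads the private `_max_workers` attribute of CPython's `ThreadPoolExecutor` purely to make the effect visible.

```python
import concurrent.futures
from contextlib import contextmanager


@contextmanager
def capped_thread_pools(limit):
    """Temporarily monkey-patch ThreadPoolExecutor so no pool exceeds `limit` workers."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    original = concurrent.futures.ThreadPoolExecutor

    class CappedExecutor(original):
        def __init__(self, max_workers=None, *args, **kwargs):
            # Shrink any request that is unbounded or larger than the cap.
            if max_workers is None or max_workers > limit:
                max_workers = limit
            super().__init__(max_workers, *args, **kwargs)

    concurrent.futures.ThreadPoolExecutor = CappedExecutor
    try:
        yield
    finally:
        # Always restore the original class, even if the body raised.
        concurrent.futures.ThreadPoolExecutor = original


def library_code():
    # Stands in for third-party code that asks for more threads than we want.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        total = sum(pool.map(lambda x: x * x, range(10)))
        return pool._max_workers, total  # private attribute, shown for illustration


if __name__ == "__main__":
    with capped_thread_pools(4):
        print(library_code())
    print(library_code())
```

One limitation of this kind of patch: code that ran `from concurrent.futures import ThreadPoolExecutor` before the patch was applied keeps its reference to the original class and is not affected.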
Thanks very much. Do we have any questions?

Great talk, thanks. You were asking about a good interface for integrating this into systems that require parallelism in Python. You're probably aware of joblib?

Yes.

So how does that interact? Is your stuff below joblib, or do you directly integrate with it somehow?

So that's a great question. If we step back a little bit and look at where joblib sits within this picture: when you have joblib and you're calling things from joblib, those are two different layers of parallelism. That's what I was talking about: because there's no real communication between what joblib requires and the NumPy call that you put into joblib, that's where oversubscription can occur. I think joblib does a very good job (no pun intended) of separating those tasks out and making it easy to define a way to compose the jobs you need to do in a task-parallel format. Its closest comparison would be Dask, and both of them do a very good job in that space. But I think we still run into a composable parallelism problem. We have something that's clearly application-level parallelism and something that's clearly data or task parallelism, and given the way we've defined things in Python, that link either needs to be kept separate or we need a way of interlinking the two without breaking the APIs as they were defined. I think we lose the abstraction capability if we try to bring that layer down too much.

Thanks.

Thanks. Anyone else?

OK, so you mentioned CPU-bound parallelism, where it's very clear when oversubscription can happen. But there's also async/await, which is basically for I/O-bound processes, right? In the end it's very hard to determine the better approach, because you don't know whether the CPUs are starving while you're awaiting all those processes. How would you approach that?

So, one of the things I do as a consultant is work with customers that have this style of problem. To determine whether we have a CPU-bound or I/O-bound problem, and determine
what's actually the issue, we typically use a Python profiler. One of the products that we use is Intel VTune Amplifier. We take a look at what's going on from a code-profiling perspective and then try to look at the behavior of the code in that space. Sometimes it takes a bit of static analysis, or looking at I/O saturation with the right tools, to be able to determine that. But it's actually a very, very hard thing to detect, and you're right: it is extremely hard to know without tools. Even with the open-source profilers in the Python space, it's very, very hard to know what's going on.

OK, thank you.

Hello, good talk, thanks. Two small questions. Is SMP developed by Intel as well?

Yes.

OK. And which one was developed first?

TBB was the first to be developed. Threading Building Blocks has been around for quite some time. I think it became open source recently, but it's the one with the longer legacy. SMP was developed, I think, about a year ago to address this space, because some of the systems that we used had 63-plus cores per socket, and when you get a lot of them together and start scaling out, we started seeing that this problem existed. So that's why we developed it later in the game.

OK. And what is tbb4py doing [inaudible]?

It's operating with the TBB runtime, actually. It's one of the libraries that we ship, so it's operating directly with that dynamic library, that runtime. Basically, if you download any of our packages that use it, like NumPy or scikit-learn, it'll download the runtime and interact with that library at run time.

Thank you very much.

Thanks. Anybody else with a question?

Hi there, great talk. I'd like to ask just
what happens if I'd like to run a few programs under TBB? Will they understand that oversubscription could happen, or should I concentrate everything in one program?

That is a great question. The way both of these packages work, and what most tools that accomplish that type of control in this space require, is that you start from a single Python process. If you think about how joblib and Dask do it, they start from something that's multiprocessing and go down to something that's threading within those processes, but it all started from the same process. If you start them as different Python processes that don't share a common parent, that's probably bad, because they don't see each other, so they're going to have different pools. Whereas if you start from the same process, it's going to have the same pool, so it'll be better able to handle oversubscription. So it's better to start from a single Python process if possible.

I don't know if I got the difference between TBB and SMP. What are the use cases for one over the other?

So with TBB, it handles dynamic types of threading better. Say you have something that usually returns within ten seconds but has the chance of returning in one second. With TBB, it'll be able to say, "OK, this one is ending quickly, so we're going to put this thread back in the pool and let it come back out." SMP handles it in a different way, which is to pass down the settings for the number of threads that can be spawned from each process. Say I have OMP_NUM_THREADS equal to one or two: it's going to say, "OK, for this process it's going to be two, for this one it's going to be one." It's passing that down by saying: I'm having
these settings passed down, and that's how it controls it: by not letting the process go outside of them. It's better for structured work, because if the work is structured and SMP passes down those settings, the processes stay essentially semi-pinned to their processors instead of jumping to different processors all the time, which gives you cache issues. So symmetric work is generally better for SMP, and more dynamic types of parallelism, where tasks are unbalanced or have the chance of returning a little earlier, are better for TBB.

Thank you.

Thanks. Any more?

Hi. Can we use these packages without the Intel distribution, NumPy, scikit-learn and so on?

I mean, you can download the packages themselves. You can download tbb4py as a standalone and just run it. Again, this is the Pythonic-ish idea I talked about. You can download both of these packages independently of our distribution from our conda channel; you can just use the -c intel channel option if you're using conda to look up these packages.

Hi. Can you say something about platform compatibility? I guess it runs on Linux, but what about BSD, Windows, OpenSolaris and so on?

Great question. TBB runs on all platforms right now, so we have it for Mac, Windows, and Linux, including the majority of flavors of Linux. SMP right now is Linux only because of some of the items that we're using. We're looking to see what other options we have in that space, but it's currently only on Linux for the time being.

Any other questions? No? All right, well, thank you, David, again. Thanks. [Applause]
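The "pass the settings down" idea described in the answer about SMP can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SMP's actual implementation: `plan_workers` and `apply_settings` are hypothetical helper names, the even split of cores is just one possible policy, `OMP_NUM_THREADS` is the standard OpenMP environment variable the speaker mentions, and `os.sched_setaffinity` is only available on Linux.

```python
import os


def plan_workers(total_cores, n_processes):
    """Split cores evenly: return (threads_per_process, list of core-id sets)."""
    if total_cores < 1 or n_processes < 1:
        raise ValueError("total_cores and n_processes must be at least 1")
    threads = max(1, total_cores // n_processes)
    core_sets = []
    for i in range(n_processes):
        # Wrap around when there are more processes than cores.
        start = (i * threads) % total_cores
        core_sets.append({(start + k) % total_cores for k in range(threads)})
    return threads, core_sets


def apply_settings(threads, cores):
    """Call at the start of a worker process, before numerical libraries load."""
    # OpenMP-based libraries read this variable when their runtime initializes.
    os.environ["OMP_NUM_THREADS"] = str(threads)
    if hasattr(os, "sched_setaffinity"):  # Linux only
        # Only pin to cores this process is actually allowed to use.
        allowed = os.sched_getaffinity(0) & set(cores)
        if allowed:
            os.sched_setaffinity(0, allowed)


if __name__ == "__main__":
    threads, core_sets = plan_workers(total_cores=8, n_processes=4)
    for worker_id, cores in enumerate(core_sets):
        print(f"worker {worker_id}: OMP_NUM_THREADS={threads}, cores={sorted(cores)}")
```

Each worker would call `apply_settings` with its own entry from the plan, so the total number of threads stays close to the number of cores and each process stays on the same cores, which is the cache-friendliness argument made in the answer.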