The Gilectomy

Video thumbnail (Frame 0) Video thumbnail (Frame 1743) Video thumbnail (Frame 2636) Video thumbnail (Frame 3734) Video thumbnail (Frame 4637) Video thumbnail (Frame 5766) Video thumbnail (Frame 6917) Video thumbnail (Frame 8050) Video thumbnail (Frame 10150) Video thumbnail (Frame 11032) Video thumbnail (Frame 12546) Video thumbnail (Frame 13547) Video thumbnail (Frame 14978) Video thumbnail (Frame 15945) Video thumbnail (Frame 16957) Video thumbnail (Frame 18245) Video thumbnail (Frame 19295) Video thumbnail (Frame 20189) Video thumbnail (Frame 21326) Video thumbnail (Frame 22672) Video thumbnail (Frame 23569) Video thumbnail (Frame 24529) Video thumbnail (Frame 25782) Video thumbnail (Frame 26784) Video thumbnail (Frame 27759) Video thumbnail (Frame 28836) Video thumbnail (Frame 29900) Video thumbnail (Frame 30811) Video thumbnail (Frame 31724) Video thumbnail (Frame 32612) Video thumbnail (Frame 33665) Video thumbnail (Frame 34882) Video thumbnail (Frame 35931) Video thumbnail (Frame 37049) Video thumbnail (Frame 38121) Video thumbnail (Frame 39701) Video thumbnail (Frame 40709) Video thumbnail (Frame 41714) Video thumbnail (Frame 42768) Video thumbnail (Frame 43831) Video thumbnail (Frame 45073) Video thumbnail (Frame 46233) Video thumbnail (Frame 47356) Video thumbnail (Frame 48364) Video thumbnail (Frame 49286) Video thumbnail (Frame 50455) Video thumbnail (Frame 51394) Video thumbnail (Frame 52283) Video thumbnail (Frame 53233) Video thumbnail (Frame 54352) Video thumbnail (Frame 55283) Video thumbnail (Frame 56450) Video thumbnail (Frame 57891) Video thumbnail (Frame 58849) Video thumbnail (Frame 59723) Video thumbnail (Frame 61599) Video thumbnail (Frame 64035) Video thumbnail (Frame 65049) Video thumbnail (Frame 67052) Video thumbnail (Frame 68706)
Video in TIB AV-Portal: The Gilectomy

Formal Metadata

The Gilectomy
Title of Series
Part Number
Number of Parts
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Larry Hastings - The Gilectomy CPython's GIL means your Python code can only run on one CPU core at a time. Can we remove it? Yes, we can... in fact we already have! But is it worth the cost? ----- CPython's "Global Interpreter Lock", or "GIL", was added in 1992. It was an excellent design decision. But 24 years is a long time--today it prevents Python from capitalizing on multiple CPUs. Many people want us to remove the GIL. It turns out, removing the GIL isn't actually that hard. In fact, I already removed it, in my experimental "gilectomy" branch. But the GIL is one reason CPython is so fast! The "gilectomy" makes CPython shockingly slow. This talk will discuss the history of the GIL, how the GIL helps make CPython fast, how the "gilectomy" removed the GIL, and some ways we might be able to make the "gilectomy" version fast enough to be useful.
Goodness of fit Demon Red Hat Computer animation Software developer Projective plane Pattern language Game theory Object (grammar) Multiplication
Ramification Addition Computer animation Software developer Sheaf (mathematics) Value-added network Condition number
Overhead (computing) Dependent and independent variables Overhead (computing) Thread (computing) Deadlock Code Deadlock God Neuroinformatik
Point (geometry) Laptop Server (computing) Workstation <Musikinstrument> Bound state Deadlock Computer programming Neuroinformatik Computer animation Causality Single-precision floating-point format Core dump Game theory Multi-core processor Spectrum (functional analysis)
Point (geometry) Thread (computing) Computer animation Line (geometry) Multiplication sign Patch (Unix) Source code Freeware Thread (computing) Neuroinformatik Formal language Exception handling
Freeware Patch (Unix) Multiplication sign Control flow Variable (mathematics) Computer programming Wechselseitiger Ausschluss Computer animation Interpreter (computing) Single-precision floating-point format Blog Data structure Extension (kinesiology) Window
Freeware Computer animation Patch (Unix) Interpreter (computing) Patch (Unix) Computer hardware Physical law Order (biology) Projective plane Planning Counting
Reading (process) Trail Functional (mathematics) Implementation Run time (program lifecycle phase) Thread (computing) Code Multiplication sign Counting Mereology Atomic number Operator (mathematics) Energy level Extension (kinesiology) Multiplication Information Parallel computing Software developer Counting Ext functor Variable (mathematics) Type theory Computer animation Object (grammar) Quicksort Tuple
Reading (process) Addition Thread (computing) State of matter Code Parallel computing Electronic mailing list Set (mathematics) Ext functor Counting Event horizon Computer animation Atomic number Operator (mathematics) Object (grammar) Error message
Single-precision floating-point format Computer animation Code Blog Statement (computer science) Control flow Quicksort Control flow Extension (kinesiology) Extension (kinesiology)
Single-precision floating-point format Computer animation Source code Order (biology) Quicksort Control flow System call Extension (kinesiology)
Goodness of fit Computer animation Integrated development environment Execution unit Counting Speicherbereinigung Extension (kinesiology) Condition number Connected space Exception handling Speicherbereinigung
Server (computing) Multiplication Multiplication sign Projective plane Expression Database transaction Database transaction Event horizon Type theory Computer animation Read-only memory Semiconductor memory Software Interpreter (computing) Speicherbereinigung
Latent heat Computer animation Read-only memory Software Counting Extension (kinesiology) Database transaction
Reading (process) State of matter Basis <Mathematik> Variable (mathematics) Computer programming Exterior algebra Befehlsprozessor Computer animation Shared memory Atomic number Right angle Object (grammar) Speicheradresse Condition number
Point (geometry) Multiplication Thread (computing) Gamma function Parallel computing Multiplication sign Range (statistics) Control flow Ext functor Computer programming Mathematics Word Computer animation Meeting/Interview Atomic number Operator (mathematics) Core dump Object (grammar) Extension (kinesiology)
Type theory Functional (mathematics) Computer animation Block (periodic table) Bit Water vapor Object (grammar) Parameter (computer programming) Macro (computer science) Computer programming
Type theory Functional (mathematics) Computer animation Hash function Object (grammar) Multiplication sign Hessian matrix Data storage device Right angle Object (grammar) Perspective (visual) Field (computer science)
Point (geometry) Thread (computing) State of matter Projective plane Function (mathematics) Field (computer science) Number Computer animation Hash function Personal digital assistant Semiconductor memory Object (grammar) Right angle Data conversion Object (grammar) Table (information)
Wechselseitiger Ausschluss Building Computer animation Synchronization Block (periodic table) Content (media) Sheaf (mathematics) Spacetime Object (grammar) Speicheradresse Window Spacetime
Computer animation Atomic number Constraint (mathematics) Operator (mathematics) Canonical ensemble Quicksort Extension (kinesiology) Computing platform Spacetime
Point (geometry) Reading (process) Default (computer science) Building Multiplication sign Compiler Revision control Computer animation Configuration space Quicksort Extension (kinesiology) Macro (computer science) Computing platform
Point (geometry) Revision control Module (mathematics) Reading (process) Computer animation Different (Kate Ryan album) Point (geometry) Authorization Configuration space Extension (kinesiology)
Point (geometry) Functional (mathematics) Computer animation Lecture/Conference Single-precision floating-point format Pauli exclusion principle Moment (mathematics) Ext functor Extension (kinesiology) Macro (computer science)
Type theory Functional (mathematics) Run time (program lifecycle phase) Computer animation Computer configuration Multiplication sign Operator (mathematics) Pauli exclusion principle Negative number Ext functor Pattern language
Computer animation Multiplication sign Pauli exclusion principle Source code Sound effect Ext functor Quicksort Extension (kinesiology) Spacetime
Point (geometry) Code Blog Order (biology) Bit Quicksort Computer-assisted translation Reading (process)
Spyware Software developer Decision theory Dependent and independent variables Bit Number Architecture Computer animation Vector space Internetworking Atomic number Maize FAQ
Point (geometry) Computer animation Atomic number Interpreter (computing) Extreme programming Object (grammar) Entire function
Metre Trail Freeware Multiplication sign Electronic mailing list Electronic mailing list Computer animation Operator (mathematics) Atomic number Interpreter (computing) Speicherbereinigung Right angle Object (grammar)
Freeware Thread (computing) Information State of matter State of matter Electronic mailing list Mereology Variable (mathematics) Thread (computing) Number Computer animation Atomic number Speicherbereinigung
Suite (music) Freeware Thread (computing) State of matter Multiplication sign Electronic mailing list Formal language Mathematics Lecture/Conference Software testing Macro (computer science) State of matter Data storage device Electronic mailing list Bit Variable (mathematics) Thread (computing) Entire function Word Computer animation Atomic number Software testing Object (grammar) Quicksort Reverse engineering
Computer virus Word Befehlsprozessor Thread (computing) Benchmark Computer animation Lecture/Conference Multiplication sign Software testing Formal language Computer programming
Multiplication Functional (mathematics) System call Module (mathematics) Fibonacci number Code Content (media) Code Bit Benchmark System call Mathematics Medical imaging Benchmark Computer animation Function (mathematics) Operator (mathematics) Core dump Pattern language Endliche Modelltheorie Integer Form (programming)
Point (geometry) Medical imaging Thread (computing) Computer animation Multiplication sign Counting Right angle Software testing Line (geometry) Number 2 (number)
Revision control Group action Befehlsprozessor Graph (mathematics) Matching (graph theory) Computer animation Multiplication sign Number
Point (geometry) Enterprise architecture Type theory Computer animation Code Multiplication sign Core dump Content (media)
Cache (computing) Latent heat Thread (computing) Cache (computing) Computer animation Synchronization Multiplication sign Core dump Synchronization Sheaf (mathematics) Computer programming Number
Point (geometry) Cache (computing) Computer animation Computer crime Multiplication sign View (database) Energy level Computer-assisted translation Computer programming
Revision control Cache (computing) Cache (computing) Computer animation Causality Semiconductor memory Weight Counting Benchmark Food energy Computer programming Number
Cache (computing) Computer animation Causality Semiconductor memory Multiplication sign Combinational logic Counting Object (grammar) Quicksort Number
Greatest element Mathematics Thread (computing) Computer animation Atomic number Buffer solution Combinational logic Cuboid Counting Object (grammar) Counting Number
Mathematics Thread (computing) Computer animation Buffer solution Surface Multiplication sign Cuboid Object (grammar) Counting Data buffer
Overhead (computing) Thread (computing) Physical law Content (media) Counting Bit Counting Login Mathematics Computer animation Synchronization Buffer solution Single-precision floating-point format Buffer solution Right angle
Multitier architecture Thread (computing) Length Logarithm Electronic mailing list Letterpress printing Counting Library catalog Counting Process (computing) Computer animation Buffer solution Iteration Object (grammar)
Multitier architecture Process (computing) Thread (computing) Computer animation Semiconductor memory Logarithm Buffer solution Operator (mathematics) Electronic mailing list Object (grammar) Counting
State observer Computer animation Vector space Personal digital assistant Buffer solution Operator (mathematics) Multiplication sign 1 (number) Counting Computer programming
Email Thread (computing) Divisor View (database) Water vapor Line (geometry) Counting Computer programming Cache (computing) Computer animation Vector space Buffer solution Object (grammar) Operator (mathematics) Buffer solution Active contour model
Point (geometry) Email Thread (computing) Content (media) Counting Canonical ensemble Mathematics Computer animation Lecture/Conference Different (Kate Ryan album) Term (mathematics) Semiconductor memory Object (grammar) Buffer solution Object (grammar)
Email Email Computer animation Object (grammar) Multiplication sign Buffer solution Counting Real-time operating system Object (grammar) Disk read-and-write head
NP-hard Thread (computing) Computer animation Semiconductor memory Object (grammar) Operator (mathematics) Statement (computer science) Speicherbereinigung Object (grammar) Thread (computing)
Trail Concurrency (computer science) Reflection (mathematics) Branch (computer science) Thread (computing) Subset Computer animation Object (grammar) Order (biology) Buffer solution Speicherbereinigung Pattern language Object (grammar) Speicherbereinigung
Trail Thread (computing) Computer animation Object (grammar) Multiplication sign Order (biology) Object (grammar) Extension (kinesiology) Thread (computing) Subset Speicherbereinigung Extension (kinesiology)
Computer animation Object (grammar) Thread (computing) Number Speicherbereinigung Extension (kinesiology)
Complex (psychology) Linear regression Multiplication sign Electronic mailing list Set (mathematics) Coma Berenices Computer programming Number Neuroinformatik Medical imaging Array data structure Computer animation Object (grammar) Quicksort
Functional (mathematics) Context awareness Multiplication sign Direction (geometry) Coroutine Stack (abstract data type) Protein System call Computer programming Formal language Element (mathematics) Mathematics Pointer (computer programming) Semiconductor memory Endliche Modelltheorie Exception handling
Purchasing Point (geometry) Multiplication Thread (computing) Overhead (computing) Scaling (geometry) Code Multiplication sign Sound effect Parallel computing Theory Computer programming Revision control Internet forum Meeting/Interview Operator (mathematics) Core dump Software framework Object (grammar) Table (information) Local ring
Mathematics Graph (mathematics) State of matter Code Software developer Multiplication sign Core dump Extension (kinesiology) Measurement
OK let's welcome Larry
Hastings AKA AKD is dead for the removal
of the game so few alright yes this is a talk about
the colectomy which is a project to remove the guild from C Python but let me preface my comments by saying about this topic can be exceedingly technical but I'm going to go right
into the heart of the matter and so it's kind of designed for people who were already core developers who were familiar With the internals of Python I'm I'm hoping you'll understand multithreading pretty well I'm hoping you'll know least of the understanding of policy but works internally the concept of like objects in the reference counts on objects also if you don't understand the stuff of a good thing to do would be to watch my talk last year I don't give it here but I I gave the other talk my think it here actually 1 is called a patterns and the skill it's on youtube and in good if you go back in time and watch it for you came in the door but anyway so it's about the
guilt the goal was added in 1992 and barring the addition of condition variable to enforce various it has remained essentially unchanged in 24 years since then but I wanna make it clear that undergoes a wonderful designed it really solve the problem it was a fabulous
designed for 1992 yourself look today in a lot of ways but there there's some ramifications from this so 1st of all the Gill is very simple so
it's really easy to get right now and and see Section developers don't have any trouble understanding how to use the Yale internally would never have any problem but only go
not only known response to were not supposed to and since is only 1 deal we can have deadlocks and there's only
1 like you can have a deadlock between them more than once and the only ever analyst thread and there is almost no overhead of switching between them and the girl we do with all things which threats and so you could go through fast and the gods were little overhead your now if you're single-threaded your code has been running really fast and it's a really great designed for single if you're and I
O-bound multithreaded this works great but if and this is actually the original design for what about readiness for that when all computers for a single-processor anywhere almost all of them and this problem the problem is when you want have
a CPU-bound program and you wanna unintelligible or simultaneously because you just can't and that is the main point of the game so again in
1982 all the computers around us all on a single core region the big servers but the world has changed since 1992 these days we have
these wonderful laptops spectrum because and even the phones are multi-core in
our wristwatches and eyeglasses evolve core i workstation Honda has a 28
causing it to but become hybrid Freddy this 56 cause we live in a deeply multi-core world and Python is kind of Philip race to take advantage of
that I wanna point out this comment is still in the Python source code today has only rudimentary thread support I suggest that maybe it's time to consider
adding more sophisticated trading support to survivor after all the but the goal of foreign language really should be to expose all of the various things that your computer can do you take
advantage of all the different resources you offers and I think news although except for the multiple cost you have some kind of a sore point there was an attempt
back in the nineties this thing called a
free threading patch this is added to this was an attempt impact 1 of 4 1099 to get rid of the guilt but it's been
require changing the API said and breaks extensions which is been designed and what it is moved most of the global variables inside the integrand a single structure and added and a single new text walk around in Brim that I believe there's a Windows variant of it at the time the use of interlocked increment interlock decrement was to introduce you API is equivalent
to atomic of that or but this single mutex lock was a little on the slow side and if your program
will run between 4 and 7 times slower in which but the clear what everyone wants they wanted everything guild because the 1 use
multiple course because they want programmable faster so I say 0 we remove the girl and go slower nobody's excited so this was not a very exciting passes the time I've learned read more about it there was a lovely blog post by David
Beasley couple years ago and got it run on modern hardware is called an inside look at the global patch of law and so I will do that too but
let's talk about what I'm doing now so they collect but
I have a plan to remove the guild and actually what I should say is that I have removed the
I removed the go back in April the problem that it's terrible slow so but in order to remove the go you can either have a plan in place
you there a bunch of considerations you must account for in order to remove the guild and how the of the project might be successful in immediate merged for use by people some so all is there for technical considerations you must address when you remove the Gill those are reference counting
up again every at every object in the C Python runtime has a reference count the tracks the object lifetime and and this is
traditionally kind of unfriendly terrible multivendor approaches but there are global static variables there are not nearly as many as I thought there were part of the couple but there's some pertinent information which I think all of in 1 place now and there's a bunch of shared similar objects like that all the small investors like negative 1 through 16 or something like that but non true false empty a tuple and these are all Python creates 1 of them and every time you use an empty tuple uses the same empty tuple level because it's new but you need to address the extensions C extensions currently running at this wonderful world where they don't have to worry about locking because they go protects them and they've never run in a world where they can be culture multiple threads at the same time this certainly never on in the world where multiple threads that run in the same function same time and so there's a lot of code that depends on only a single thread running the function like you know if a static thing is equal the null then and initialize static thing all that sort of code is going to break when we go to the and finally we need to worry about the atomicity of operations and the price of land in so that the developers of the other Python implementations type
II error and of more strongly IronPython diphone discovered a lot Python code it implicitly expects a lot of operations to be atomic in the pipeline if you and appended to a list or a new set of value in depth of another thread to be examining that object and it must not see that picture that list an incomplete state needs so I will see it before we can this happened or after the event is happening so we need to guarantee that atomicity of operation that you can never see an object in an incomplete state from another thing
but in addition to these 4 technical considerations I says there are 3 political considerations that we must address because of it's not simply a technical the
solution we need also make sure that the the C Python like this whole world people using the Python out there and and there are demands that are going to be made on of the moving the guilt that are not strictly technical demands and I say
that these are we need not hurt single-threaded performance this was actually something we established in a blog post which I'll talk about in a minute but we need to not make single code slower not make multithreaded I O-bound codes that's a very high bar to me we need not break C extensions this is sort of my statement of C Python 3 . 0 broke every extension of there and it's been however many years 5 years 6 years since us apart 300 came out and there are still plenty of extensions other that haven't upgraded to the new exactly we need to try and avoid breaking the extensions
as much as possible and finally don't make it too complicated of course this
is this is a judgment call but some C Python things this rule of about the Python is that pretty easy to work on internally it's not that
complicated it's conceptually very simple and it's because very clean and it would be a shame if we broke that feature of this from source code in order to get rid of the gulf so let's try and preserve that now there are a couple of approaches that people have talked about the ways to get rid of the so that I don't think we'll work and so just as sort of set the stage I wanna talk about those 3 4
minutes you have this what I call tracing garbage collection is also mark-and-sweep garbage collection and this would let us get rid of reference counting and again reference can is traditionally very difficult to do in different environment and so this would be but this to be very favorable
to multithreaded and tracing garbage collection of it's not clear whether would be faster or slower than reference coming will be traditional wisdom about this
connection with them says that the garbage collection and good reference coming conditions are about the same speed and and people like to argue that unit for you but where this falls down is that this is going to break every C extension of their cognitive very different world going to pure garbage collection as opposed to reference counting so the exceptions are just not going to work anymore so that the accuracy extension I say we kind can it can afford to do it politically and can also be very complicated it's a much more complicated API and reference and reference has a relatively simple API and still people mess it
up you know it can be it can be a little of obscure times it can be a little hard to figure out exactly the right things to do with reference counting garbage collection of things can be done much worse need even more so than tracing expressions of cults offer transactional memory of Armand Rico just showed up today and he's been working on a server transactional memory as a research project with the type interpreter for a couple of years now and then it sounds like a fantastic technology is going to be fast enough yes absolutely
suffered transactional memory works it's going to be really fast and can be really great it's going to take a wonderful event multiple will cause you have very little what involved but
it really falls down on the other 2 it's going to break every 6 hours from their horribly and it's going to be incredibly complicated internally also again this is research quality stuff right now it's not clear to anybody when it's going to be ready for production of
and I don't think that's the Python is able to wait so let's
move on let's talk about my proposal and the specifics of my proposal
so again I said there were those for technical considerations the 1st is referenced
what I say we keep reference that way we don't break the extension so it can be the
same API we have now applying the empire that the important thing is the compile-time CPI does not change now and like I said I got rid of the deal in April
and what I did is I switched to what's called atomic better and this is where the CPU itself and has provides an instruction it says I can add 1 or subtract 1 to this memory address and you do it in such a way that is not possible to have a race condition with another works great cost 30 % of speed right off the the top and so this is working great this is this means that our programs are correct but it's awfully slow and when look for alternative approaches here the bubbles back variables kind of
handle that on a case-by-case basis and again for the all of the stuff is already the pipe for me as I was done a couple years ago I noticed on so that's when go shared singletons and this remained shared all those shared objects like the small investors Enola not to true
false those just get shared between threads and that's all points of getting rid the goal and running for is that a Python programs don't change see session Palestinian ranges see
there's just nothing for it and they're
going to be running on a multi-core world and they're going to be called times multiple core about of multiple threads simultaneously
and they just need to get rid of get with the program so it's gonna break C extensions all over the place atomicity of operations
words gamma whole bunch of locks every object in C Python that is mutable will have a lock on it will have a lot more you perform new operation so this is going to add a
new blocking API to see Python undergoing macros
pylon compound what this is going
to college turned into the uptight type object is going to have spread to new members of blocking on 1 block which I'm guessing bits of exposed to Python
programs as Dundar Lokhande and under online from all these functions they only take 1 parameter which is the object water unlocked and return voyage because they always work
in for objects that are immutable my claim is that this obligation on the problem of those communal so you just support
locking don't if you don't like and you don't need the functions in so what objects we like it it's all
mutable objects and I say all mutable objects I mean sea mutable not just type on you for example consider this parameter in from the Python perspective
stores are immutable right but internally they have a couple of Leslie computed fields
like perhaps the hashes initially initialize negative 1 which by the way if you look at the hash function it says will never turn negative 1 negative 1 means uninitialized internal so that's why I need 1 values in the legal hash value by so it's initialized negative 1 in the 1st time somebody says to me the Hessian of this object a distorted because computed
stories it and returns that so that's a mutable state now in the case of the hatch that's harmless if we had 2 threads they books saw the negative 1 they will compute the hash table override within the overriding the same number so that's harmless but the 2 more fields UTF-8 and W. sister
both of these are also easily computed the user conversions to UTF-8 a or are you just 16 respectively and those allocated memory and there was a race they saw no although often allocate memory in Nabokov of right here but you're
memory that point so we can have a lot around us so this project is currently not safe and I haven't done with the itself right now with the k-means have of so every object is going be locked inside of the glycan which means we have to have as like a lock as possible and I will call this
a userspace under Linux we have this wonderful thing called a few texts on which is
literally a lot you can declare any 4 byte aligned memory address is a locking you can wake up and it's really more of like a building blocks for writing your new taxes and so and other synchronization objects really great windows for 20 years have what they call the critical section which is used only until there is contention and OS 10 has difficulty from the text of a couple people now have told me that people mutexes guaranteed the space only until there is contention so we
have used space what's that we need for all the major platforms on I I don't know about the other platforms philosophy previously and all that sort of thing somebody else going into do that work but all the major platforms Python runs on we're going to have the support for users lots of orbital canonical Python will see now as for the political considerations of for my approach with neglecting I would say that yes it's not going to
have been explored and yes it's not going to break the extensions now this may be crazy because I am
declaring that I just told you a couple minutes because in a precarious extension of the because of atomicity operations and I'm making
it very law or by adding ultimately the the preference counting so how can those 2 statements be true at the same time the answer is that I said that we have to
just have to build so we would have Python built with the guild and without developed with the goal and everything's the same so this today and that way all this extensions continued work that be the default build for everybody on every
platform and then if you're some sort of futuristic person who wants to live in the multi-core world you can build finding special noble version at which point the compiler-compiler lock stock work so these macros these pi begin all out there and all 3 of these pilot compiler like those of either be no-ops for active depending on which build you're in
if you had a guild and begin often and author and something pilot and how much we know what's if you don't have to go all the lock
and unlock are going to do something and begin will end up there they're probably not along the Hudson work in to this also means that the a c
extension will never at accidentally run the wrong build because we can have different entry points for each 1 and if you have a guild if you have a module just called module that you have an entry point called in that module we could say OK if you run it the without Gilgal version and would have a different viewpoint the noble in front or something just to make them to different entry points that way but no 1 ever
run in no global accidentally and and it's strictly off in notes extension will run in O global until they're ready until they declare themselves they have no go and you
might actually be able to build a single it's extension that work in both by the way
but we could add those other things are macros we could add accuracy functions for them and if you were very careful he might be all right see a single out so that's a portable moment I don't know if that's interesting or not it's just something that I mentioned and as long as we're effectively declaring and you see a because it's
true but this is at this point is kind and you see a guy looks very similar to to the existing CPI but it has this you
know reference counting works a little differently and the accuracy of operations in the of of walking all over the place and you can't guarantee the you're going only run a single-thread this time and this might
be a good time to start inflicting some best practices of people across Europe are optional on it's actually truancy Python that you can look very on type statically and you can create an object with
it you can pass it into the sea Python runtime and see pattern is never seen the subject of this type of war and it has to work and we could stop allowing that there's a function is supposed to call hi brandy that's optional negative OK knowledge acquired from the same token there's a new contact 49 this is this thing called multiphase C
extension initialization from really understand that has something to do with initializes the extension well that's very well so this might be a good time to say all these things that used to be optional now they're required here on a on a new node will build you have 2
copies of the capital 49 you probably have to use the limited CPI all those sorts of things there was this collecting idea falls down is the don't make it too complicated idea and it
is getting a little complicated for effect talking about 2 different builds and running the same time from the same source space so a C Python horticulture would have to read the code and say OK I want that the active in the noble all time I begin allow threading that's in
the active in the with oval and so again after sort of read every bit of code twice this reacting with Guillen without but we also have to be very careful about where you lock but ultimately this is the price you're gonna have to pay in order to get really all I don't see any simpler way of doing it now as I said and this is something I've been working on
for a couple months started in February on this last point is whole thing confuse a cat or just pick something from Monty Python but then then the name gluttony came up in like
all those was done there was the name now as I mentioned that in 2007 we or a blog post called it is easy to remove the
guilt where he talks about what would have happened or remove the guilt and I agree with everything in this paper that he wrote it's all a really insightful except in the title it turns out if you know where to start you can remove the going about a week here's how step 1 the step
0 really Orthophonic Internet reviews which piping current hijacker to use atomic here in green vector my only support the 64 bit Linux right now so
I just want to GCC and use the intrinsic absorbance 46 you right now number 1
you have to take what kind of locked in the news again unlike some using of feed expected based locks and there's a paper from already dropper called the taxes are tricky where walks you through
how to write a new text based on the Texas and basically using his son step 2 you need to write
out through the locks on the entire deck object you cannot run up to Python interpreter without having working so that they can continue to be safe so you just need to go through every extremal entry point any place where someone's calling into the
public from outside the need to make sure it is locked properly and online properly step 3 same thing with
the list again the C Python uses dates lists internally for a lot of operations and you just can't have working the interpretability that both of those working so for on there about 10 3 lists inside of supply where but when you allocate an object
and it looks to see if there's a free 1 waiting there is it just uses that if there is no it has to go off to the al-Qaeda the surrealists make things called faster but obviously not thread-safe yet so you need to add a lot around them you do that about 10 times so 5 meter disabled the
garbage collector and GC tracking and track like the garbage collectors completely broken and in the collective right now as can be quite a long before we get that working again so which actually by the way that makes it
makes my numbers a little better than they really should because there should be some garbage collection of rare that I don't have but and the garbage collector is just really totally broken and neglect me and is completely shut off step 16 actually
murder the gills so and this was a pleasure why I do that part of there's just structure you you just like going out there that variable anymore take all the things that a switching the Gilman just stop
mountain come the matter whatever they all go away but step 7 there
is when you switch threads with the C Python internally there is a thread state that stored in global variables and everyone just refers to that so whatever Tregaron they just look in the same spot and that's always the information about the current threat and after you can do that anymore running
words on a so instead of every time the people refer to that they're actually going from macro needed change the macro so polls that red state variable out of storage but that was actually pretty easy get working using the macros happens really good about it and finally
need to stick some tests and specifically the only couple test that really broken I did this mainly they were sensitive to exist testing exactly how big the deck object but list object were and now that I've added this lot to it and they have gotten a little bit bigger and I just need to fix those are actually the the entire Python of reversion test suites sort work apart from the stuff that was actually using threads and the couple so back to language so
much of I announced that it it was about 3 and a half times lower by
time in about 25 times CPU time what I mean by that is I was worried test this virus that has the same way every time I run 7 threads all running the same program
and I times and I didn't with normalcy Python and I did it with removing like music and but why did that with 7 words it was 3 3 have times you just watch the clock on the wall but you cannot how much CPU time was use was used in some quarters as opposed to normalcy by I'm just using 1 quarter and so you look platinum about 7 and about 25 so that 25 times slower to do the same amount of work which kind depressing so this
is the official benchmark the collecting is what everyone has been running it's a really bad Fibonacci generator and I'm showing you this just to impress on you how horrible the benchmarking is so far and how much how how little code I can run through the core of the pattern right now but this does work I can run on on multiple
or simultaneously runtime hasn't very much coincides with
that of C Python and it's looking up the fitted function over and over and over in the model that so there's a single deck it's just getting slammed with lookup requests and sense of lot that means reducing contention around walk of the form a function call which is always that some of its a pre heavyweight operations in Python running a little bit about code and we're talking about small images like 2
1 0 actually all the small images because away like you use all small integers a whole lot and again this morning shared between threads and they'll have reference counts which means that were changing reference of constantly from multiple threads which is causing of lot performance turns out and doing little that mathematically is all so this is what it looks like and I got
some flak for not later my axes so their label y axis the vertical is time in seconds horizontal of number of course that are being used and and this is the overseas colectomies so this having the gills the blue line it's way faster than have your right now with a guy like any of this shows you that's taking it seems to be curving off so at some point it might actually go be minor be making that much lower attic or to it that's going be a way way out there's also this the brand for I don't know why it's there I think it's just the way that the tests interleaved I would say it more it assume it's not there but I had a show that's my actually show
but more interesting again this was really a wall time I think CPU time is more interesting so the number of the amount of time that it took to compute these 7 the matching numbers so as to the 30th in C Python that's nothing you compare that with running with the gluttony news goes crazy so obviously it's incredibly slow or how much lower this is a graph of how many times lower it is I'm comparing group of normalcy Python to neglect version and again this is different for I would say more about what this is telling us is that and it's about
twice as slow with 1 core and then it shoots up to about 10 times slower to cost and it is it's gonna going up and up and up I thinking of a 7 core of 7 cause about
90 times lower here so why is it so slow after all the ability is changing that much code always not yet so the 1st thing I would say is that I
don't know for certain of its kind hard to measure and at the the lowest printed steep I thought at the type on the complement of by his early June that point but there were some enterprises hung out with me and there features and they can confirm suspicions here of the 2nd thing is actually lock contention
and that's what everyone's probably something was number 1 in Section number 2 number 1 in synchronization in cache misses and this is what really slamming half of the collecting something considers that's nothing inside of C Python is private so likable normally the core program you might write on you might design around the multi-core complicated here's this thing thread-local this 1 and that 1 there's almost nothing in Python the thread-local everything is
shared across all course all the time and all the course was talking simultaneously and that's kind of of the fundamental thing that's going performances that we really don't have any thread specific stuff so Sacramento about why things are slow and fast so let's hear it so
this is cash and you can use this point have 3 levels of cash between them and around the talking to you and it's 1 x to talk to level 1 cache level to catch in about 2 times a slow and military catches 10 times the slow and talking around itself is about 15 times lower so you won't be talking to capture comes computers abuses so fast that normal slow ramp can keep up with them anymore so we have all this Slavic caching going in between and we can keep the cash that we can keep the view that we can keep your program running at the point that we break the cat were going to start slowing down program a great deal and that's really what's going on
in the pot you know the guy means the cash never gets to warm up so let's just as an example of these erroneous lied to me this morning so
let's talk about we have a program got for cause 0 1 2 and 3 and we have the number 2 there were running colectomy version of the Python and we're running out of energy benchmark which is using the number to a whole lot so all
of them currently have the number 2 in cash so they want to look at the
number 2 they can just look at the neighboring that access that weight but then let's say that 1 of these is is going to actually do something with the number 2 so it's going to pi increment the number so it's going to change
the reference count score 1 is changing the reference count is incremented and that means that the number 2 has changed that memory has changed which means that it must now and invalidate the cache for all the other course for that cash
catch 64 bytes which is more than enough to cover the entire long object and so now none of the other
course have that honoring cash anymore and so the next time they want to talk to the number 2 they have to go load and armored tells me that they can actually talk to the other course maybe polar but still lot slower than simply having in cash ready to go so it's this is happening constantly anytime you examine an object in the Python you changes referenced the
time you change its reference count you are changing the memory anytime you change the memory you invalid in cash for all the other course which means that the more cause you add the slower you go and that's what I'm observing in my numbers so there is a solution for this or we sort of a combination of approaches for a solution there is a technique called suffered reference
counting I wouldn't use this in combination something else so this is what it how Werkstatt that conceptually these these blue box at the bottom this was quarters in this light blue boxes they're all that surprising in object so although we're talking to this directly so right now if you want to examine object you increments reference down near increments reference have
just going to it you reach the object you change the number but that means that we have to synchronize that across using this atomic Ingram data which is slow I would like to do something with faster
so why don't we change it so that we can use but what if we change it so that all changes to reference counts were done from a single thread then we would have to use atomic even acronym or we could just use
local unsynchronized before and after that the or we can do that all we do is we change it so that instead of writing a reference can directly right into of along the but the memory buffer that just gets reference countries changes so every time you watching surface common object you don't change directly instead you right and what you say add 1 to the reference use right that a lot of you don't worry about and all this other thread this portal box where recruitment that's the Committee thread that that I was going to make reference can change so he walks
the log C is 0 I should add 1 to the recover all and just goes and does it but he's the only threat making reference them changes so he can use unsynchronized and bring that that's
great the problem is all done is moved the contention now instead of having convention around the reference count we have intention and this law so we need to lock and unlock the log we really haven't solved the problems we can fix that so let's go to a single or protracted now when trade 0 wants to increment of reference count on the rights and this reference log and then the 2 but if it comes along and make that
change now we have a single lot thread we have a single thread making the changes there's no synchronization overhead hardly at all we need to have a little bit when we stop these buffers around that's great now we have an ordering from let's say that thread 1
is running along and let's say that our object is stored in a list and this is the only place where it's and all the reference count settle out and so there's a reference count of 1 1 0 right now and
that's the reference were held list L is holding reference that so 1
cabin comes along and it says on iterate over the list and just print everything and thread 0 comes along later it says on a clear list this
means that there were a few catalog for 1 is going to increment and decrement and the length of the reference count 0 is is conducted the problem is what the process look for a 0 before 1 what that meant that reference the article the reference problems was once it's going to drop a
0 on the on the object now the process it looks for 1 later and explode referencing an and initializer memory might have been freed might be another object some
crazy things can happen it's not a good idea we can solve that actually 0 by the way I will on making clear if you were saying well what if you just swap those leaders are in front of 1 and the same general
solution because you could have some of marriage thing across 2 threads you have to listen to objects each thread in increments over 1 of list and then with the other 1 but you can solve that by reordering the operations what you can do it as
consider that and these 2 operations of incurring debts but if you have 2 operations 1 of them is maker in the vector of the other ones in the vector can use what the answer is in almost every case you can if give to anchors you can swap them that time if you 2 doctors because swap them that's harmless if you've Decker followed by
an you can swap them that's time the only time you have a problem is if you have an EGA followed by adapted use swap those you might have an incorrect problem program now so With this observation we don't have to preserves very strict
ordering on the operation of increasing vectors so we can do this
buffer reference counting on cheaper by just having 2 different lots preached that 1 is in it for a lot 1 of the Decalogue and all the views be very careful to process all the Incas before we process all the factors in our programs can run correctly and we have almost no water so this solves the problem of having atomic occurring Decker around reference can we still have the problem
about cache lines so we can solve that too but there is a
technique that contour is actually that is working in neglect the thread but it's not ready yet I think and he was he was taking
a kind of a different approach you had this idea of having on a different reference count for every object for every thread and then there would be no contention I'm not optimistic that's actually working long term but this is going to help provide
for reference all we do is we take the object and we break into 2 pieces we have a reference counter separate from the object and then we push somewhere pardon memories so they're not next to each other now the reference count is going to different cashline then the answer we combine that with buffer reference canon now we have a single thread
that's committing the changes and making changes to memory that is way far removed from the object itself which means that were not involved in anybody's Kathlin's anymore at that point I'm pretty optimistic that we can get a lot of the this performance that so I of remote object
headers Thomas and networking so I'm optimistic that'll work when it comes time to work with it and I've been trying to do a buffer reference counting and fundamentally Python is allergic to turns out to not having reference counts being accurate real time so but it doesn't work right now and I'm enacted have my head down and like the budget through we just haven't had the weakest recently I want to get that to work on the
optimistic that the neglect is doing a lot faster story go after
that there's an idea to make objects in world there auspicious specifically referenced counter immortal if we had a and moral reference count then and we're not
changing the memory which means we're not involving catchlines but that make things faster and virtually that's an if statement basically every Ingrand after it's hard it's hard to tell about you experiment after a
private walking the idea here is that most objects never escape the thread in which they were created so if you created debt and you'll never use that data on the current thread then you don't need to be expensive walking operations around it and it's only when the data was ever used by different threads that you would have to
actually really locative marker and so what if we could prove lock objects in such a way that the lichens basically free when was the local we get a bunch of performance that I have an idea for how I think it work and I'm going to have to talk about garbage collection
Sunday in the gluttony branch of against can be quite a ways away but in order for to be coded people can depend on the patterns can have work on the election but I think that there are a bunch of techniques for garbage collection that supports a lot
less concurrent access and it's superevent stuff I really don't understand it on currency Python garbage collection is basically stopped the world of reflection that seems as if it's acceptable and I think I can get back work so I think the initial approaches can be stopped the world and then if we get this to work on the Python the Kansas gluttony branch that's actually a viable thing then the super are brainy psychologist can come along and fix my reflection 1 idea by the way for bomb-making garbage collection of the so expensive again there's always locking involved around it I think we could do the same thing with evidence coming across our buffer tracking and tracking of reference kind of objects unjust tracker
subject track this object around it and often have for that the Committee later finally 1 guy Eric I think suggested that as a way of mitigating the breakage involved around extensions we could have the ability to order locks acceptance he called where
whenever you called into a c extension there would be an implicit lock involved that would be the only 1 that would prevent more than 1 thread from 1 brings out the seats of time and that probably get want up and running free quickly but again it's this is this can be way far down the line before we're going to be
raised to look at things like that so my final thought for you is that the
journey of a thousand miles begins with a single step the is performance looks terrible right now but this is simply there is no way to get rid of the Gill without starting get about starting at the data looks like so I'm still optimistic even the number the terrible I'm optimistic that in the long run this can work thank
you so this is I think I have about 5 minutes left for questions thank you thank
you thank you very much what instead
be and this is the regression the trial so in other things so other than the 1 at the 2 some complex and computationally know nothing
complicated so again so as it stands right added walking around the deck object in the list of months of the dataset used let the list on number like images and floats are immutable so those are safe to use anything that's mutable and not unless just said isn't safe use inside the gladly right now so if you trying to computation with a set of objects you're it's is going lower so I I haven't done any other programs because I didn't think may be all that interesting about and again this is early days anyway really like my hope is that there's a lot of work to be done around it like but adding safety to these other mutable objects like sets and byte arrays and all these sorts of things and without all the time so we say can we can run any Python program and we could test that so I think that's really were absolutely times yeah but that
correctly from and write the status approach a couple years ago it was models and approach to remove the deal and can you can compare that but no stack never attempted to
remove the elements that was the original concept around status was and the original written like a long time ago status and run for a long time the original idea was that this was if you have a Python program but say that it's heavily recursive you run out that and then you get the stack exception items with the exception of if we because the way that by function calls work in the Python is that they are actually implemented using C function calls so every time you make a function call in Python it turns into about for function calls and see and that's building up the the C stack and then eventually policies that you a memory if we could separate those 2 so that whenever you made a function on Python all did was used keep memory then we could make a function call of long day we never run out of stack and then all the other context 1st of of function calls in the stack and noted various easily switch between function call stacks which means that we can have coroutines and so that was kind of the direction status was going was just separating the static and the functions that the Python and they have used that for a long time to actually do this crazy stuff where they actually take the C stack a copy it often memory and a copy of some over somewhere use of some language to copy the change that such a changes the stack pointer instruction pointer and jumped into another the 2nd is really more about proteins anyway it's never been about
so with the other purchase of for example a sectarian twisted and always asynchronous Watcom frameworks that tends to handle they only our they they don't use swords basically so with the approach of a chill let me based C Python that is so you would run like a single-layer reacts even always occur then loop like in each thread and then what so those of overhead would you you can't be looking at it's just in theory on the full those 3 that but that never ever our told between threads all
well the is that these of the completely divorced and adding more cores would make a forum scale linearly in practice and they were ever going to get there but but so the the answer to that question is the answer to all the other questions about performance which is that of the collective becomes interesting at the point at which you can add causal program get faster rather than slower but then again for the long time before we get there i in general and how those people like me effects twisted and is in Cairo and asynchronous programming things I can only think that it would be good for them just like every other program in particular like that sounds something a reasonably parallel program like if they should run parallel and that's the reason the reason we run them on the local wars the reason we don't normally course right now is because of the legal but they're already basically parallel operations anyway you're going to have to keep locking overhead costs but you're gonna be all to have multiple programs multiple threads running simultaneously on the same code base with the same local data storable objects that are in the Python so I think it's like a put it this way it does make a program faster then switch the singleton version of the the table
question Larry you the only legal developers panel listen the habitat to questions are guided by the Habitat to questions about their colectomy during the core developers panel of which starts at 3 45 today and I am chairing so I'm forced to attend the state for the whole time thank you very much for the wonderful talk have you considered keeping C extension compatibility with for example like the global interpre justice she extensions like with reader works well I
considered it but it doesn't work the problem is anytime you the the the problem is that but if you have a global a lot that you just use for 6 extensions you have code that is paying any attention to it it's can be changing this extension expected that change the state of change from out from underneath the because it's holding walk right now measure problems and graph so it's just not a it's a non-starter
OK thank you very much Larry let's give a big hand