Tales from the abyss: some of the most obscure CPython bugs
Formal Metadata
Number of Parts | 131
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/69506 (DOI)
EuroPython 2024 | 18 / 131
Transcript: English (auto-generated)
00:04
I'm Pablo Galindo. I'm a CPython core dev, a Steering Council member, and release manager of 3.10 and 3.11. But today I am going to bring you a collection of weird stuff from CPython. You may think, oh, right, this is the kind of talk where you show us weird stuff
00:21
about bugs that you suffered, and we will learn a lot. Yeah, it's true except for the last part. It's just me whining for half an hour. But I tried to put some conclusions after every bug so we can learn something. Ideally this serves not only as an interesting collection of weird stuff that can happen
00:40
and of how C is a really bad language, but also gives you some insight into what it is like to contribute to CPython and why sometimes we take three weeks to solve a small bug. Awesome. So let's just start with one of my favorite ones. This is called exclusive syntax errors. So this happened when we were implementing the new cool f-strings in Python 3.12.
01:05
And the problem happened when you run this code, and you say, whoa, that looks okay, it's easy code, you should get this working quite quickly in the project. So we did actually, and this worked on my machine, and after the bug you will see that "it works on my machine" is less of an excuse and more of a miracle, but you will see.
01:24
So it turns out that when we submitted the PR and ran it in the CI, only on some machines we found this error. So it was a syntax error that only happened on certain architectures, quite weird. And it's complaining basically that a closing quote is not a curly brace.
01:42
And then you say, ah, but Pablo, I know what you're thinking, like this is in release mode, right? Like this is in the final version. CPython has a debug mode. So I should be able to just go, run this thing on debug mode, and get a much better error, right? Okay. This is the error that you get in debug mode.
02:01
Yes, yes. You can imagine how people feel here. If you don't find this discouraging enough already, the problem is you say, oh, okay, this is kind of weird, but I will try to reproduce these shenanigans. So this only reproduces on ARM64.
02:20
So if you have an M1 laptop, you can more or less use a Docker container, but at the time the only thing that we had was this Raspberry Pi, and these errors only happened on Raspberry Pis. It's a syntax error that only happens on Raspberry Pis, quite bad. So let's explain what's going on. So in the tokenizer, the part that basically grabs the source code and transforms
02:42
it into tokens so it can give them to the parser, there is this little piece of code that basically calls get token, and this function is supposed to read your text and return the next token, and it places this in a char variable called token. And if you go to get token, it's very easy, it's like a 2,000-line function,
03:03
but at some point it says, oh, if there is an error, I will return minus one. What is the problem? Ah, we need to be lawyers and go to the secret C11 standard, and the C11 standard is fantastic because it says, well, you know, the type char is sometimes unsigned, sometimes signed, sometimes I don't know, it's up to the compiler.
03:22
Haha, good luck. And yeah, so what was happening basically is that on ARM64, it turns out that chars are unsigned. So when you were returning minus one as an error, it magically transformed that into 255.
03:40
So when this code was executed and you stored that in the token, what was happening is that the error token basically transformed itself into 255, which is a valid token, which is curly braces close, and the tokenizer was saying, what is this curly braces close, what is going on, this is illegal. So syntax error, and that was the reason.
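As a rough illustration of the signedness problem, here is a minimal Python sketch that uses ctypes to mimic storing the error value in a signed and in an unsigned 8-bit char. The value -1 comes from the talk; the variable names are made up for this example and this is not the tokenizer's actual code.

```python
import ctypes

ERROR = -1  # error value returned by the token-reading function, per the talk

# Where char is signed, storing -1 keeps the value.
as_signed_char = ctypes.c_byte(ERROR).value

# Where char is unsigned, storing -1 wraps around modulo 256.
as_unsigned_char = ctypes.c_ubyte(ERROR).value

print(as_signed_char)    # -1
print(as_unsigned_char)  # 255
```

A later check such as "is the token equal to -1?" can never succeed in the unsigned case, which is why the error went unnoticed until the value was interpreted as a real token.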
04:03
Obviously, if you run in debug mode, we have a bunch of asserts around, so you get the other lovely error, and after, you know, one week of running this code quite a lot of times, the fix is very easy, you just change that to that, very easy. Now it's an integer. Now it works on all architectures.
04:20
Okay, so what are the conclusions of this small bug? There's only one: avoid C. It's really bad. There are no more conclusions. Good luck. Especially avoid C on Raspberry Pi, it's extremely dangerous. Okay, let me show you this other one. This one is very cool. So in Python 3.11, we have these cool error messages, and now
04:45
it's fantastic. It tells you where the error is, like, I wonder who did that, fantastic. So after the change, we received this interesting report. So someone complained that if you run this code, so basically, it doesn't matter what this is, but if you raise this runtime error over there,
05:02
and then you measure how much time this code takes to run, basically you are raising it and catching it. And then you compare this with another version in which you put a lot of lines. So it's just a lot more lines. So you compare raising the runtime error on line 10,000 with raising the runtime error on line one.
05:22
So this code is much, much, much slower. So someone made some experiments there and compared how much time it takes to raise an exception depending on the line where you raise the exception, and they gave us this lovely plot. The technical term is no bueno.
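The experiment described above can be approximated with a sketch like the following. The helper names (`make_raiser`, `time_raise_and_catch`) and the way the filler is generated are assumptions for illustration, not the reporter's actual script. The filler sits in a branch that is never executed, so only the position of the `raise` in the code object changes. Whether a slowdown shows up depends on the Python version, since the talk goes on to say the eager computation was later made lazy.

```python
import timeit

def make_raiser(padding_statements):
    """Build a function that raises after a skipped block of filler statements."""
    filler = "".join("        x = 0\n" for _ in range(padding_statements))
    source = (
        "def raiser(run_filler=False):\n"
        "    if run_filler:\n"
        "        x = 0\n"
        + filler +
        "    raise RuntimeError('boom')\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["raiser"]

def time_raise_and_catch(func, number=2_000):
    """Time `number` rounds of raising the exception and catching it."""
    def run():
        try:
            func()
        except RuntimeError:
            pass
    return timeit.timeit(run, number=number)

early = time_raise_and_catch(make_raiser(0))
late = time_raise_and_catch(make_raiser(10_000))
print(f"raise near the top: {early:.4f}s, raise after ~10,000 lines: {late:.4f}s")
```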
05:43
So yeah, that was kind of bad. So we have this lovely report and we have to fix it. So let me explain what is basically going on here. So this is the PR where we fixed it. So how did we fix it? Well, it turns out that if you have a function and you raise an exception, and you catch it, the exception has an attribute that is dunder
06:04
traceback over there. And that dunder traceback has an attribute that is called tb_lineno, you know, traceback line number. And it tells you three. So that's the way we can know where the exception is happening. And then the traceback machinery or your own code can basically print that.
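A small sketch of the attributes mentioned above, using only documented attributes (`__traceback__`, `tb_next`, `tb_lineno`, `co_firstlineno`). The function `fail` is a made-up example.

```python
def fail():
    x = 1
    raise RuntimeError("boom")  # third line of this function

try:
    fail()
except RuntimeError as exc:
    tb = exc.__traceback__          # starts at the frame containing the try block
    while tb.tb_next is not None:   # walk to the frame that actually raised
        tb = tb.tb_next
    # Line of the raise, relative to the `def fail():` line.
    offset = tb.tb_lineno - tb.tb_frame.f_code.co_firstlineno
    print(offset)  # 2: the raise is two lines below the def line
```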
06:21
The thing is that the way this number is basically computed is that we know the instruction number where the problem happened. And then we need to translate that instruction number to a line number, right? So like we need to do this translation. Unfortunately like having like a map of every possible instruction number in a code
06:41
object to the line number is quite long. Like if you can imagine, most instructions will have like there is going to be a bunch of instructions that probably will happen on the same line. So you are going to have things like instruction one happens in line one, instruction two in line one, instruction three in line one. So storing that is quite expensive. Like we don't like that. So something that we did, especially now that we store like also like all this information
07:03
in code objects because we need to tell exactly where it happens. So what we do basically is to store this in a compressed fashion. So basically if you imagine this, you have the start instruction and the end instruction. So for instance, for instruction zero to instruction six, then that is line one.
07:20
So instead of like saying zero one, one, one, two, one, we just say from zero to six is line one. And then from six to 50, that is line two, and et cetera. So this is still already good because we don't need to repeat every single instruction and every single line. But what we do here is that we compress this even further, and basically what we do is that we store only the deltas.
07:42
So we store the deltas both in the instruction offsets and in the lines. So here it says from zero to six, you need to add plus one to the line. So in this case, because the line starts at zero, it is one. And then it says that if you add 45, sorry, 44 to the instruction offset, then
08:00
you add one extra to the line delta. So basically, by adding these numbers, you can reconstruct what the line number and the instruction are. Here we have another problem, because this is compressed into specific types in C, and that 200 is too big for the thing that we use, so we need to break it apart into even more entries. So this is even more complicated. There is like basically like a silver color.
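The delta scheme can be sketched in Python as below. The table contents and the function name are hypothetical, chosen to match the numbers in the talk (offsets 0 to 6 are line 1, offsets 6 to 50 are line 2); the real table lives in C and uses a more compact encoding. The point of the sketch is that finding the line for one instruction offset means summing every delta before it.

```python
# Hypothetical table of (instruction_offset_delta, line_delta) pairs.
# Running sums give: offsets [0, 6) -> line 1, [6, 50) -> line 2, [50, 60) -> line 5.
DELTAS = [(6, 1), (44, 1), (10, 3)]

def line_for_instruction(deltas, target_offset):
    """Return the line for target_offset by scanning the table from the start."""
    end = 0   # end of the current instruction range (exclusive)
    line = 0  # line number reconstructed so far
    for offset_delta, line_delta in deltas:
        end += offset_delta
        line += line_delta
        if target_offset < end:
            return line
    raise ValueError("instruction offset is past the end of the table")

print(line_for_instruction(DELTAS, 0))   # 1
print(line_for_instruction(DELTAS, 6))   # 2
print(line_for_instruction(DELTAS, 50))  # 5
```

Because there is no random access into the running sums, the cost of a lookup grows with the position of the instruction in the table.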
08:21
I can show you just how to reconstruct this thing, but the most important thing for you to understand is not how to reconstruct it, it's that there is a loop there. So basically if you want to know, and this is the whole key of this problem, if you want to know the line that is associated with instruction 10,000, you need to compute all the previous ones, right? So you cannot just have it. You need to say, oh, what is like the line for instruction zero, and then you keep
08:44
iterating over the table, and then you will know at the end of the iteration what is the line for instruction 10,000. So this is the problem that was happening because we always compute this number and attach it to tracebacks. If you raise the exception in line 10,000, we need to go to this table and compute all the different lines until we have the one that we want, and then we stick it
09:04
into the traceback object, and doing this thing all the time is wasteful. Why is it wasteful? Well, because if you think about it, you only need the line number when you actually want to show where the problem is. And you only show where the problem is when nobody is catching the exception. But here we're computing this thing all the time, even if people are catching the exception
09:22
and doing something else. So the fix that we did is basically to make this lazy, because I mean, we need to do this anyway. It's how people handle the bug expressions in other languages as well. But we did this lazily. So in Python code, basically the idea is to have a property or something on the traceback object. Obviously, this is in C, the illegal language, but this is just an idea.
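The lazy idea can be sketched in Python roughly as follows. `TracebackSketch`, its fields, and the `(start, end, line)` table format are invented for this illustration; the real traceback type is implemented in C.

```python
class TracebackSketch:
    """Toy traceback: stores the instruction offset, computes the line on demand."""

    def __init__(self, line_table, last_instruction):
        self._line_table = line_table  # list of (start, end, line) ranges
        self._lasti = last_instruction
        self.computations = 0          # counts how often the scan actually runs

    @property
    def tb_lineno(self):
        # The linear scan happens only when somebody reads the attribute.
        self.computations += 1
        for start, end, line in self._line_table:
            if start <= self._lasti < end:
                return line
        raise ValueError("instruction offset not covered by the table")

tb = TracebackSketch([(0, 6, 1), (6, 50, 2)], last_instruction=10)
print(tb.computations)  # 0: creating the object did no line lookup
print(tb.tb_lineno)     # 2
print(tb.computations)  # 1
```

Code that raises and catches exceptions without ever reading `tb_lineno` never pays for the scan.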
09:41
So the idea is that instead of pre-computing this thing every single time we create a traceback object, we have a descriptor, a property that does this lazily, only when you access it. So the idea is that only when code actually needs this line number, then you actually compute it. It looks a bit more ugly in C, so this is the way you will do this in C, but the
10:01
important part here is that when you want to calculate the descriptor, this is the way you declare a descriptor in C, and then you call this other function, and that actually calls the C code for calculating addr2line, which is this algorithm. So funny enough, this is not the first time we have this problem. In Python 3.8, I think, or 3.7, we had a funny one as well.
10:24
So if you look at this, it's a similar one related to this. So if you look at this particular list comprehension, and then you look at the code objects, sorry, the bytecode here, you have a bunch of bytecodes here. You don't need to understand what it means, but because this is a loop basically with a conditional inside, there is a bunch of jumps, because the jumps will need to say,
10:42
oh, go to the next iteration of the loop if the element is false, because there is a conditional over here that is checking if the element is true or false. So you have two jumps, one if the element is false, and the other if it's true, so you need two jumps. So there are two jumps here, this pop jump if false, and then there is this jump absolute. And the good thing is that these two jump to the start of the iteration.
11:03
So for instance, if you go here, it tells you to jump to line four, so you jump to line four, and then you restart the loop of the list comprehension, and you arrive at this jump absolute. You also jump to line four, so you start again. So both jumps are just one jump. But then if you just create a new line in the list comprehension, then oh, no, something
11:23
happened. So this one now jumps to line four, the same as before. But look at this one. If the thing is false, instead of jumping directly to the FOR_ITER in line four, it jumps to line 16, which is this jump absolute, and then it jumps to line four. So instead of one jump, it's two of them. You will say, why? I mean, this is not a big deal.
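To look at the jumps yourself, a sketch along these lines lists the jump instructions for a one-line and a multi-line version of the same comprehension. The two functions and the helper names are illustrative. The opcode names and jump targets differ between Python versions, and the double jump described in the talk was fixed, so no particular output is claimed here.

```python
import dis

def one_line(items):
    return [x for x in items if x]

def multi_line(items):
    return [
        x
        for x in items
        if x
    ]

def iter_code(code):
    """Yield a code object and any code objects nested in its constants."""
    yield code
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            yield from iter_code(const)

def jump_instructions(func):
    """Return (opname, argval) for every jump in func, including nested code."""
    return [
        (ins.opname, ins.argval)
        for code in iter_code(func.__code__)
        for ins in dis.get_instructions(code)
        if "JUMP" in ins.opname
    ]

print(jump_instructions(one_line))
print(jump_instructions(multi_line))
```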
11:40
Well, I mean, if you have a big list comprehension, maybe it is. But think about it. What piece of software really likes to put things on different lines? Black. Black really likes to do that. So you know, Black was the first code unoptimizer. So Black, the code unoptimizer.
12:02
Fantastic. Formatting considered harmful. So we fixed this, because Łukasz was freaking out, because it was a PR problem. But, you know, PR as in public relations. And we fixed it. So now Black is just slightly unoptimizing, but I won't tell you where. So conclusions.
12:20
Conclusions. Important. CPython will always bite you back. It doesn't matter. If you create these nice error messages, it will always have some cool surprises. Also, calculating line numbers is linear. That's kind of important. It's not going away. It's just that now it's lazy. But if you, for instance, are calculating line numbers constantly, because I don't know, you have a tool that does this, you need to know that it's linear.
12:41
And you have a file that has, I don't know, 10,000 lines, so you need to calculate all the rest of it first. So I suppose the conclusion is don't have files that have 10,000 lines. It's better just to have little files. And then the other is like measuring side effects is kind of complicated, especially when you have some code base as big as CPython. Because if you think about it, in this case, what triggered all these problems is
13:03
that we added this code to have nice traceback formatting, and the rest of the code was the same. The code that was sticking the line numbers in the traceback objects was unchanged. That was always there. It's just that now all of these things put together trigger this non-lazy resolution, and now something that used to be O(1), it used to be constant, is now linear.
13:24
The funny thing here is that if you have as many users as CPython has, people will find all your bugs, which is both terrifying and cool, because at least you know all the ways you screwed up, which I don't know, maybe it's not great for your mental health, but it's good for the community.
13:42
Awesome. So next bug. This one is interesting. So the next bug is that Python 3.6 segfaults only after 2020. Wow. So let's look at what happened. So this happened when I was working at my company. You know, why do we execute Python 3.6?
14:01
It's like vintage Python, you know, like we have all Pythons, but people can do whatever they want. So someone was complaining that, you know, after 2020, if you execute like this setup.py, you know, nothing super complicated here, just Python segfaults, just segfaults. So we just investigated what was going on in the segfaults.
14:21
So we went and analyzed the whole thing. And then we found out that basically this was happening when CPython imports ctypes. And in ctypes, there is this function that is called just by importing it, which is creating this CFUNCTYPE lambda for some unknown reason. We don't judge, like we just, you know, we just recall, it's our job. And then, you know, when it's calling this function, it's segfaulting.
14:44
And if you go down and then say, okay, we need to go into the SQL and see what's happening, it's basically segfaulting in this assembly instruction, this movaps, which stands for move aligned packed single-precision floating-point values, because, why not? And by the way, I have an ongoing thing, which is that in every single talk that I have given for the past three years,
15:05
I always show assembly code in a Python talk, so check, I managed to do it again. So yes, it's basically segfaulting in this movaps instruction. So why is that? Well, it turns out that this is the code that generates that instruction. The important thing here is not what it's doing, it's that at the end, it's assigning two pointers.
15:23
So this self->callable and self->thunk. And that movaps instruction is a vectorized instruction. So the compiler says, wow, I can assign these two pointers at once, technology. But that is kind of a problem, because it's assigning pointers to a struct, right, this self.
15:41
So that struct looks like this. So why is it important? Well, it's important because the two pointers that are assigned are the two at the end, but the thing that is before that is this union, and this union has this long double. So this forces, I mean, you must believe me here, unless you know the legal language,
16:01
that this forces this thing to have an alignment of 16 bytes. That is what the compiler will think. Basically, the compiler thinks that both pointers must be aligned to 16, because the union must be 16-byte aligned. But if you check that, it's not aligned to 16 bytes. That is false. And if you go to, basically, the CPU architecture manual for the movaps instruction,
16:21
which is a long text that you don't need to read, it says that when you generate a movaps instruction, the operands need to be aligned to 16 bytes. And if you fail, well, that's the problem, and you get this nasty segfault. So what is the problem? Why is this not aligned to 16 bytes? Is this a compiler bug? Well, it's not a compiler bug. It's a CPython bug, because, obviously, it would be very cool to just blame the compiler.
16:41
So the problem is that in pymalloc, we are allocating space for this struct over here. So this struct is allocated with the Python allocator. And the Python allocator just handles memory manually. And when it's handling memory manually, it decides what to align things to. In particular, there is a constant in the code that tells, basically,
17:00
pymalloc what to align the memory to. So the fix was very easy. It's just changing that 8 to 16, and voila, it works. But then the question is, why was it only happening after 2020? Well, the reason is that after 2020, GCC, the compiler that was used to build this code, was upgraded, and now it's powerful enough to generate these vectorized instructions, except that, you know,
17:20
it starts hitting these bugs in CPython, and then it crashes. So that's the surprise of why it's only after 2020. So conclusions, you know, AVX instructions will hurt you. This is not good. Assigning two things at once is not worth it. The other interesting thing is that optimizations normally will invalidate your assumptions. And this is, you know, a tension between compilers
17:40
that try to be faster, and your code that is probably incorrect. So it's very common, especially for compiled languages, that code that is incorrect will be surfaced when the compiler starts to activate weird optimizations. So if you're dealing with programs written in C, you need to be ready to just go down, you know, look around in the code
18:00
that is being generated, and just check if that is happening correctly, especially if you're implementing memory allocators. And the other thing that is quite important is that you need to be prepared to read the specification. In this case, like, I just copied you the paragraph, but that paragraph was in basically, like, a book of, like, six million pages, and then you need to look for it, right? So yeah, yeah, it's not that simple. Okay, another one.
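The alignment mismatch in this bug can be sketched with a couple of helper functions. The names and the example address are hypothetical; the 8 and 16 are the values of the allocator constant before and after the fix, as described in the talk.

```python
def is_aligned(address, alignment):
    """True if address is a multiple of alignment."""
    return address % alignment == 0

def round_up(size, alignment):
    """Round size up to the next multiple of alignment (a power of two)."""
    return (size + alignment - 1) & ~(alignment - 1)

# Hypothetical address handed out by an allocator that only guarantees 8 bytes.
address = 0x1008

print(is_aligned(address, 8))   # True: fine for the allocator's own guarantee
print(is_aligned(address, 16))  # False: an instruction requiring 16-byte alignment faults

print(round_up(24, 8))   # 24
print(round_up(24, 16))  # 32
```

Roughly half of the addresses that are 8-byte aligned are not 16-byte aligned, so the crash depends on where a particular object happens to land.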
18:20
So this is interesting. So this one is: sometimes code objects become frozen sets. So we have this lovely error report, basically, titled Weirdest Failures Related to Tracebacks, because obviously, you know, that's it. And basically, someone was complaining that they were receiving this traceback with this error: 'frozenset' object has no attribute 'f_code'. So this is a code object that magically became a frozenset.
18:43
If you don't believe that this is something that people will be, you know, astonished by, Brandt, who is the person who was checking this bug, was really checking how this happened. And not only that, it doesn't only happen for code objects. It also happens for other stuff, so for frames.
19:01
So, mysteriously, a bunch of Python objects will become frozen sets. Quite alright. So this is a reproducer. I don't need you to understand anything here. I just need you to be afraid of this. So you can just say, wow, Pablo, the real question is, why is this actually even working? But yes, you create this GC loop,
19:21
and then you basically open a file that you don't close, which creates a warning, and then you make the GC collect this. It was basically crashing. The reason this is crashing is because before Python 3.11, we used to have one frame object, like, one. And it's a Python object, and every time we create new frames, we create one of these. But after Python 3.11, we optimized this to have two of these guys.
19:42
So this is the previous one, all of that in a gigantic struct. And right now, we have two of these, the Python object and one called _PyInterpreterFrame. If you think these are bad names, well, they are. But, you know... So the idea is that now we create the second one, which is more optimized and only in C, and only if we need to create Python objects
20:00
because someone asks for it or a traceback happens or, you know, there is a generator, for instance, we create the one on the left. But there is a dependency between the two. Like, the one on the left, if it's created, will own the one on the right, and will only delete it once that is created. So before we have that big chunk, now we have one smaller chunk, it's still big, and a smaller chunk for the Python object.
20:21
So basically two different ones. And then the idea is that when you get a frame, there is a bunch of checks, but you are calling this _PyFrame_GetFrameObject. And what happens here is that the fix for this is basically calling this thing called _PyFrame_IsIncomplete. So what is an incomplete frame? So the problem that was happening here is that
20:41
when you are creating these GC objects and whatnot, there was, like, another object being created in a generator, and then it's creating two of the Python object frames that are pointing to the same C frame, and all of them are fighting for ownership. And the problem is, like, one of them actually deletes the frame, while the other one still has a pointer to it, which frees the memory, and then the allocator says,
21:02
well, it would be a pity if I just grabbed this memory that now is free and I put a frozen set there, of all things. And somehow that goes through all of CPython without crashing until it shows you this beautiful error. Wow, what a ride. Interestingly, if you check how we decide if a frame is incomplete or not,
21:22
it's actually all related to generators, but it's not super complicated. The idea is that there is a way to check it, and the semantics of this keep changing. So conclusions of this. Frames are now kind of hard. If you read the code in CPython, you will be scared. It's getting a bit better, but all this ownership is quite complicated.
21:40
The other interesting thing that happens quite a lot in CPython is that when you have memory errors, then all hell breaks loose, right? Because the allocator will put objects in your memory that you don't expect, and then you need to deal with, like, what the hell is going on? So you can see all sorts of weird shit happening, like, you know, code objects becoming frozen sets and whatnot. And the other thing is that using Python in debug mode is really good,
22:01
because instead of having these frozen sets in the middle, it will add some special bytes to the allocator. So instead of having weird code that is working, but you don't understand what is going on, it just crashes, which, believe me, is better, at least for debugging. Okay, the last one that I'm going to show you is this cool one called, sometimes I see half-initialized tuples.
22:21
So that's kind of cool. So basically, the error was happening when someone was running some SQLite code. And this SQLite code was basically very easy. It creates a tuple of a bunch of elements, like it's for a record, so it calls this PyTuple_New. And then there is a lot of code basically fetching the elements from the database,
22:41
and then it calls this PyTuple_SetItem to set the items in the tuple. And then it was just hard crashing. So a reproducer for this that doesn't use SQLite is all this code. Again, you don't need to understand anything. It's just for looking smart and complicated. But the idea here is that you can make this code trigger by basically triggering the GC
23:02
in the middle of a tuple being created. So here you have a generator, so you are yielding objects, and then you call tuple over your generator. So what's happening here is that the tuple is receiving the items, and as more items are available, it resizes itself, becoming bigger and bigger. But if you call gc.get_referrers over the object
23:22
that you are yielding, you can see this tuple while it's being filled, like in the middle of it. It's a bit janky, and you can basically say, well, the surprising thing here is, why is this actually not crashing? But in 3.10, it was not crashing. So if you actually print that, you will see that Python will print the tuple, and it will show you a bunch of NULLs.
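A cautious sketch of the observation described above, not the reproducer from the talk: it only counts tuples that refer to a marker object while `tuple()` is consuming the generator, and it deliberately never prints, indexes, or keeps a reference to them, because the `gc.get_referrers` documentation warns that returned objects may still be under construction. The names `TAG`, `count_tuples_holding`, and `items` are made up here. The count depends on how a given Python version builds the tuple internally, so no value is claimed.

```python
import gc

TAG = object()  # marker object yielded first, so the tuple under construction holds it
observed = []   # number of tuples seen referring to TAG mid-construction

def count_tuples_holding(tag):
    """Count tuples referring to tag without keeping a reference to any of them."""
    count = 0
    for referrer in gc.get_referrers(tag):
        if type(referrer) is tuple:
            count += 1
    return count  # local references are dropped when this function returns

def items():
    yield TAG
    # At this point tuple() has already received TAG but is not finished yet.
    observed.append(count_tuples_holding(TAG))
    yield 1
    yield 2

result = tuple(items())
print(len(result), observed)
```

Holding on to such a tuple across the next `yield` is exactly the kind of thing that led to the internal error described in the talk, which is why the helper returns only an integer.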
23:40
This is actually without crashing. So it will show you that the tuple is incomplete. So it has, in this case, one object at the start, but all of the other pointers are null. This is kind of okay until the actual resizing happens, and then you will get this bad argument to internal function in tupleobject.c, because when it sees all those nulls in the middle of it, it doesn't expect them,
24:01
and then it blows up. So that is the problem. So what is the problem here? The problem here is that this PyTuple_New creates an empty tuple with a bunch of nulls and immediately starts tracking it in the garbage collector. And the problem is that if you are trying to create a bunch of objects to put in the tuple, well, what happens is that the garbage collector
24:20
can run at that point, and then it will be able to inspect the tuple mid-creation. And when the tuple is half-created, the garbage collector just blows up. So that is kind of a problem. And it can also happen in many other ways. So for instance, this is another one: if you check the PyTuple_New code here, you will see that it basically allocates the tuple over here,
24:41
and then immediately calls _PyObject_GC_TRACK, even if the tuple is full of nulls and mid-initialization. This happens in many other ways. For instance, Juri here, he doesn't remember, apparently, but I checked with him. He fixed a bug here that was the same problem, except not with tuple, but with one of the types in the empty code, because what happens is that you were creating an object
25:01
without initializing a bunch of fields and tracking it in the GC immediately. Which means that when you try to initialize the fields, the GC has already seen the object mid-creation, and when it tries to visit it, it blows up. So I don't blame Juri for not remembering. It's quite traumatic. So here's what happened. Basically, you were allocating a new GC object, and then you forget to initialize a bunch of things,
25:22
and then you track it in the GC. So everything that you do after this is going to be able to see the object mid-initialization. So the fix basically is to initialize those fields over there, and then when the garbage collector tries to access those fields, it's going to see that they are null, so it's not going to do anything with them. This is quite a problem. And the pattern basically is calling anything that allocates a GC-tracked object
25:40
and then trying to do a bunch of things. And this is still a bug that is open. Like if you see this, this is Victor Stinner trying to fix it in six million ways, and it's not working. So this is still an open problem. Here is Victor Stinner trying to explain the whole problem and people crying around. So it's kind of hard. So the way we basically fix this is because in CPython,
26:01
the garbage collector is only triggered when you allocate an object. This used to be the case for Python 3.11 and before. So this is the code here for _PyObject_GC_Alloc. So this is one of the biggest conditionals on the planet. But basically what it's checking is like, oh, I will create an object, and I will check in the GC if I have enough objects to run,
26:20
and then I will call this collect_generations, which basically runs the GC. So in Python 3.12, we changed this to basically, instead of running the GC immediately, we schedule the GC over here. And instead of running it just at the creation of the object, we wait for something called the eval breaker. So the eval breaker basically is some code. So this is the scheduling code. It basically sets a bunch of flags in the ceval code.
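That collections are driven by allocations of tracked objects can be observed from pure Python with the documented `gc.callbacks`, `gc.get_threshold` and `gc.set_threshold` hooks. The sketch below is only an illustration under some assumptions: the helper name `count_collections` and the chosen numbers are invented here, and the exact number of collections differs between interpreter versions, so it only reports how many started.

```python
import gc


def count_collections(n_objects=20000, threshold0=50):
    """Allocate many tracked containers and count how many collections start."""
    started = []

    def on_gc(phase, info):
        # gc.callbacks entries are called with phase "start" or "stop".
        if phase == "start":
            started.append(info["generation"])

    old_thresholds = gc.get_threshold()
    was_enabled = gc.isenabled()
    gc.callbacks.append(on_gc)
    gc.enable()
    # Lower the first threshold so that collections are requested very often.
    gc.set_threshold(threshold0, old_thresholds[1], old_thresholds[2])
    try:
        keep = []
        for _ in range(n_objects):
            keep.append([])  # every new list is a GC-tracked container
    finally:
        # Restore the previous collector configuration.
        gc.set_threshold(*old_thresholds)
        gc.callbacks.remove(on_gc)
        if not was_enabled:
            gc.disable()
    return len(started)


if __name__ == "__main__":
    print("collections started:", count_collections())
```

The callback cannot tell whether a collection ran inside the allocation itself or later at a safe point; it only shows that allocating tracked containers is what makes collections happen.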
26:43
And then the eval breaker basically is some code that runs in the middle of the evaluator loop, and it checks for a bunch of things. So now, it also checks for the garbage collector. But before, it was checking basically for requests to drop the GIL, async exceptions, pending calls, and signal handling. So basically, it's executing a bunch of instructions, and from time to time, it decides to say, I'm going to check for a bunch of things.
27:02
And this is a very safe place to actually run the GC, because the world is sane, there are no half-initialized objects, and you can actually go and check. Okay, so I'm a bit over time, so I'm going to go to the conclusions here. There are a lot of weird bugs over here, but let me jump to the end.
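The difference between collecting at allocation time and collecting at the eval breaker can be modeled with a small toy. Everything in this sketch is hypothetical (`ToyVM`, `UNSET`, `build_record` and the thresholds are invented for illustration) and it does not mirror CPython's real data structures. It only shows why deferring the collection to a point between instructions means the collector never meets a half-built container.

```python
UNSET = object()  # stands in for a NULL slot in a half-built container


class ToyVM:
    """Hypothetical toy interpreter, not CPython's real implementation."""

    def __init__(self, threshold, deferred):
        self.threshold = threshold
        self.deferred = deferred
        self.pending = 0          # allocations since the last collection
        self.gc_scheduled = False
        self.tracked = []
        self.half_built_seen = 0  # times the collector met a half-built container

    def allocate(self, obj):
        self.pending += 1
        if self.pending >= self.threshold:
            if self.deferred:
                self.gc_scheduled = True  # newer style: only raise a flag
            else:
                self.collect()            # older style: collect right now
        return obj

    def collect(self):
        self.pending = 0
        self.gc_scheduled = False
        for container in self.tracked:
            if any(slot is UNSET for slot in container):
                self.half_built_seen += 1

    def build_record(self, size):
        """One 'instruction': track an empty container, then fill it."""
        record = self.allocate([UNSET] * size)
        self.tracked.append(record)  # tracked before any slot is filled
        for i in range(size):
            record[i] = self.allocate(object())
        return record

    def run(self, n_instructions, size):
        for _ in range(n_instructions):
            self.build_record(size)
            if self.gc_scheduled:    # eval-breaker style check between instructions
                self.collect()


if __name__ == "__main__":
    for deferred in (False, True):
        vm = ToyVM(threshold=2, deferred=deferred)
        vm.run(n_instructions=5, size=3)
        print("deferred" if deferred else "immediate", vm.half_built_seen)
```

In the immediate mode the collector runs while a record still has unset slots; in the deferred mode it only runs once the current instruction has finished filling its record.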
27:22
Wow. Okay, conclusions. So users are quite hard, because they will find all your bugs. Every observable behavior, basically, will be relied upon by our users, because all of these bugs and things are basically being triggered by code that was working before, and if you change the assumptions, it will basically break.
27:40
And all XKCD comics will eventually become a reality, even if you see there. Okay, so what have we learned? We have learned that CPython always has bugs, so you need to be prepared to suffer a bit. Errors can be quite wild, so most of the time you will see error reports, and you will say, there is no way, but yes, it will happen.
28:02
And then there is always an answer, so if you spend enough months fighting the issue, you will eventually find it. So yeah, that's everything. Thank you very much for your time. I hope you enjoyed the talk.