
Mutation Testing in Python with Cosmic Ray


Formal Metadata

Title: Mutation Testing in Python with Cosmic Ray
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Mutation testing is a technique for systematically mutating source code in order to validate test suites. It operates by making small changes to a program’s source code and then running a test suite; if the test suite ever succeeds on mutated code then a flag is raised. The goal is to check that a system’s test suite is sufficiently powerful to detect a large class of functionality-affecting changes, thereby helping ensure that the system functions as expected. While not in widespread use, mutation testing is a fascinating topic with great potential that has valuable lessons for the broader software development community. In this talk we’ll look at Cosmic Ray, an open-source mutation testing tool for Python. Mutation testing presents some difficult and fascinating challenges - both conceptually and from an implementation point of view - so we’ll look at how Cosmic Ray addresses (or plans to address) these complexities. While some of these details will necessarily be Python-specific, there are lessons in Cosmic Ray for the development of mutation in any language. Mutation testing is still a rather exotic testing technique, but it can produce genuinely useful and surprising results. To show this, we’ll look at a number of cases where Cosmic Ray has helped developers improve their test suites and tighten up their implementations.
Transcript: English (auto-generated)
So, mutation testing in Python with cosmic-ray, make sure you're in the right room here. We're going to talk about theory and practice, what is mutation testing, broadly ignoring the Python context, and then we'll look at a tool we've written called cosmic-ray, which does mutation testing for Python.
I'm Austin Bingham. I'm a co-founder and CTO, or whatever, technical director, of Sixty North, a small software consulting and training company in Norway. If you like what I present today, we've got some books you can buy on Leanpub; they cover everything from very intro Python to deep, deep stuff that you'll never really
need to use, but it's very interesting. If you've seen any of our training on Pluralsight, it's the same material, so don't waste your time unless you like books. Right, so, I have to look at the screen here because I can't see what's going on. So we're going to do an introduction to the theory of mutation testing, and we'll look at some of the practical difficulties. This is the reason a lot of you may have never heard of mutation testing or used it: it's really hard to do in practice, not because it's complex, but because it takes a long time, that's the real problem. And we'll look at Cosmic Ray a bit. We won't be able to do a demo, I'll describe the demo to you, and then we can take some questions.
So, mutation testing, what is it? This definition is from PIT (pitest), which is the gold standard for mutation testing tools that I'm aware of. It's for Java, it's really good, widely used, better than Cosmic Ray, but it probably has a lot of people writing code for it, not just me. So mutation testing is conceptually quite simple: we automatically seed faults, breakages, into your code under test, not your test suite, and we run your test suite. If your test suite passes with that mutation in place, it means your test suite isn't sufficiently powerful to detect a behavior-changing modification. And that's the nature of these changes, they need to be visible as a change in behavior in your code, so
you don't make changes that can't be detected at all. So if the test suite fails, that's good, it means your test suite is strong and has detected this change, and you make a bunch of changes, run the test suite a bunch of times, and this is to validate that your test suite is telling you what you think it's telling you, that your code actually works.
Okay, so we have a couple of parts to understand here, there's the code under test, your package, your program, your app, whatever that is, and the test suite, so we treat those as very separate things, which conceptually we usually do anyway. We introduce a single change to the code under test, we'll look at examples of those kinds of changes, we run your test suite, and then ideally, all the changes we ever
make will cause a failure in your test suite. That's the sign of a perfect test suite, right? Better than 100% coverage. Okay, the core algorithm, so to speak, is something like this: for every operator in our set of mutation operators, so every class of change I could make, and for every site in my code where one of those operators could make a change, we make that change
and run the tests. I should add a line here, which is undo the mutation after running the test, because of course you don't want to accumulate a bunch of mutations as you go. So there are three basic outcomes to any test run after you've made a mutation. One, and this is the term of art, by the way, I didn't make it up: killed. We killed the mutant, right, the test suite failed, and this means that we detected the mutant, it didn't get past the perimeter, and this is successful from a mutation testing standpoint. Another possibility is I've created what's called an incompetent mutant, a mutant that can't run for some reason.
An example in Python would be if I swapped two base classes. Say I had a class that uses multiple inheritance, and I swapped the base classes; that can make the program invalid, because now maybe you can't construct a method resolution order, right, the C3 linearization will fail, if you've ever looked into how Python resolves method lookup.
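A tiny self-contained illustration of that failure mode (plain Python, not Cosmic Ray code): swapping the bases of a multiply-inheriting class can make class creation itself fail, before any test can even run.

```python
class A:
    pass

class B(A):
    pass

class Works(B, A):   # subclass before base: C3 linearization succeeds
    pass

try:
    # The "swapped base classes" mutation: base before subclass.
    Broken = type("Broken", (A, B), {})
except TypeError as exc:
    # Python cannot build a consistent method resolution order.
    print("incompetent mutant:", type(exc).__name__)
```

Because the error happens at class-creation time, the mutant never gets as far as the test suite: it is incompetent by construction.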
You can create a program that just can't run in any event, so you can't even test it, you just say I can't begin to test this, and so it's incompetent. Maybe it goes into an infinite loop; that happens a lot too. And finally, it could survive, right, so your test suite could pass with a mutation in place, and this is what you're trying to find. This is bad, this means your test suite isn't high-fidelity enough to tell you if a behavior has broken in your code.
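The loop and the three outcomes can be sketched in a few lines of plain Python. All names here are illustrative, none of them come from Cosmic Ray; working on a copy of the source makes the "undo the mutation" step implicit.

```python
def mutation_testing(source, operators, run_tests):
    """For every operator, and every site where it could mutate the code,
    apply exactly one change and record what the test suite says."""
    outcomes = []
    for name, find_sites, apply_at in operators:
        for site in find_sites(source):
            mutant = apply_at(source, site)   # one change; `source` untouched
            outcomes.append((name, site, run_tests(mutant)))
    return outcomes

# Toy operator: flip '>' to '<' at each occurrence in the source text.
gt_sites = lambda src: [i for i, ch in enumerate(src) if ch == ">"]
gt_flip = lambda src, i: src[:i] + "<" + src[i + 1:]

code_under_test = "def is_big(x):\n    return x > 1\n"

def run_tests(src):
    ns = {}
    try:
        exec(src, ns)
        assert ns["is_big"](5) and not ns["is_big"](0)
        return "survived"      # tests passed despite the mutant: bad
    except AssertionError:
        return "killed"        # the test suite caught the change: good
    except Exception:
        return "incompetent"   # the mutant can't even run

print(mutation_testing(code_under_test, [(">-to-<", gt_sites, gt_flip)], run_tests))
# the single mutant (x > 1 becomes x < 1) is killed by the test
```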
So this means either one of two things, that you need to improve your test suite, or you can get rid of that code. This may be dead code, that you found it's not under test because you don't need it, it's not behavior you care about, and you didn't write a test for it. So this is, this is the guy who wrote dead code, is he in the room?
It's another form of dead code detection, a much more complicated and onerous form of dead code detection. So those are the situations, this is what we want to do, if you're old enough to know this meme, okay, we want to kill all the mutants, right, great. So what are the goals of mutation testing then?
I love this, these guys are so proud of themselves because they have 100% test coverage, right, who here has, or aims to have 100% test coverage? It's a laudable goal, right, there's value to it, but are you sure, just because I've executed a line of code, am I really sure that I have tested that that line
of code is doing what I think it's doing, right? You don't, I mean you might know, or it might just be getting executed, and you know, the code coverage tools are like, yeah, you executed that bit of byte code, but is it broken or not? You don't know. That's what mutation testing is helping you to say, if I have 100% coverage, which is a bit of a prerequisite to doing mutation testing, can I detect when my program is
actually broken? So not only do I see everything, but I'm actually testing functionality, behavior, right? We would like to know that our functionality is verified. They're so enthusiastic, there's something about, you know, they're just really proud of themselves. I guess that's okay.
What is this? What's this a picture of? I'm not a doctor, but I kind of recognize it. The thing at the end there, at the very bottom, the vermiform appendix, right? Do we need the appendix? I mean, I was taught in school that we don't, but then I read later that, well, we kind of do. It's not critical to life, but it does something in the body.
So when we have a mutant that survives, we need to look really hard at the code involved in that area and decide, is this something that I don't need? Can I get rid of it? Can I cut the appendix out? Or is it actually critical that I need to update my test suite? So mutation testing can help us find unnecessary code, but a failing mutant, a surviving
mutant doesn't necessarily mean that that code is unnecessary. It may mean we have to write tests. So we have to put on our thinking caps. So I can't obviate the need for thinking, but I can help you think about where to think in your code. Does that make sense?
Yeah. Okay. So what are some examples of mutations, right? Here's a really simple one. Okay. First I want to start off in the background, there's a picture of a moth. Does anybody know why I chose that moth as the background picture? It's a famous moth. Yeah.
Exactly. So these moths were pre-coal in the UK, they were white, largely white, and they would land on these white buildings in Birmingham that were made of limestone and they'd blend in well. Coal comes along, all that white limestone becomes black. Coal is black. It turned it all black.
And the ones that were white got eaten by the birds, but the ones that mutated to be black, they survived better. Now we cleaned up the coal. A lot of that limestone is white again. They've turned back white. It's an example of real-time mutation within human knowledge span. So here's an example, a common example of a mutation operator. I have X greater than one in my code under test, and I'm going to change that greater
than to a less than. I ought to be able to write a test to detect that, right, unless that's completely dead code. Right. That's an example. Or maybe I exchange a break with a continue in my code. What's likely to happen if I make that second kind of change? What's a common outcome, you think?
Infinite loops. Exactly. This is one way of creating incompetent mutants, right? So there's research, and people have assigned names to a lot of these kinds of mutations, and you can read that list there, memorize that. There'll be a quiz at the end, so quick. We don't cover all those in cosmic ray yet, but it's an extensible system, so it's
pretty easy to add these things. And maybe not all of them apply to Python. I haven't actually checked. So here are some mutations that are common across many languages: constant replacement maybe changes 0 to a 4, or replace a variable with a scalar here, so I replace x with 42, or replace an arithmetic operator, plus to times, and other things like that.
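To make the arithmetic-operator replacement concrete, here is a hand-applied `*` to `+` mutation, plus an illustration of why the choice of test inputs matters (plain Python, not Cosmic Ray output):

```python
original = "def area(w, h):\n    return w * h\n"
mutant = "def area(w, h):\n    return w + h\n"   # arithmetic operator: * -> +

def run(src, w, h):
    ns = {}
    exec(src, ns)        # define area() from the given source text
    return ns["area"](w, h)

# A good test input kills the mutant...
print(run(original, 3, 4), run(mutant, 3, 4))   # 12 7 -> detected
# ...but a careless one lets it survive, since 2 * 2 == 2 + 2.
print(run(original, 2, 2), run(mutant, 2, 2))   # 4 4 -> undetected
```

The second pair of calls is the interesting one: a test suite that only ever checks `area(2, 2)` would report this mutant as surviving.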
You could imagine that being used in F sharp or Haskell or Elm or any language, really. I just picked a bunch of functional languages, didn't I? There are some mutations that really only kind of make sense in, say, OO languages, so maybe making something public, private, or changing base class order, and we kind
of talked about that a second ago, or removing overloads, so some mutations don't apply to all languages, so we have to pick and choose the ones that are appropriate to Python. And then, actually, when I wrote this slide, this didn't apply to Python, and now it kind of does, because we have pattern matching in Python, great, so we've evolved. So you can change the order of pattern matching, and that can have a big effect on a program's
behavior, of course, because you get one thing happening before the other. So these are examples of the kinds of things that a mutation operator might do before running your test suite. Is it all clear so far? Yeah, I'm moving a little bit quick, just because I realized we got a late start. This sounds great, why aren't we all doing this all the time, right?
There are some serious complexities with mutation testing. Okay, another quiz, what is that a picture of? Famous science experiment, right, it's called the pitch drop experiment, it's TAR, it's something that drips very slowly, like, I don't know, every 17 years or something, a drip comes out of that, so this guy's had it set up in his office for, I don't
know, 50 years, waiting to see the drop, and I think he's never seen it, so, anyhow, mutation testing can take a long, long time, you know, how long does your test suite take to run? Let's say 10 seconds, maybe, how many mutations could I make to your code? 100 million, so multiply those together, and that's how long it takes to do a full proper
mutation testing suite, in principle. So we can do things like try to parallelize, it is fortunately an embarrassingly parallelizable problem, it is like the example of embarrassingly parallel, you can also try to do things like baselining, where you can do a full run, and then try to have a correspondence
between your tests and your code, and as I change my code, only run the tests that are, I consider appropriate to the code that's been modified, and then only mutate those bits of code, that's another strategy you can try if you believe there's a correspondence that exists after changes and stuff, there's a lot of caveats to this approach, but it
can be used to make this a practical kind of operation, or you can just make your test suite faster, you ever think of that? We always have these various, we all say we're going to have super fast test suites, but nope, we always have some integration tests or something that take five minutes and have to talk to a satellite and launch a missile or something like that, right,
but it takes a long, long time, this is I think the reason it's not used all that often, or as often as maybe it should be. Another is incompetence detection, how do I determine when I've created an incompetent mutant, right, so why would Alan Turing say good luck with that, famously, the halting
problem. Yeah, he proved that you cannot look at a program and tell if it's going to stop, so there's no magic wand that Cosmic Ray can wave and say, oh, I've made this mutation and now it's going to loop forever. Sometimes you can figure that out, but we don't attempt to; there's mathematically, provably, no way to do that in general. This is one of the things he's very well known for. In Keynote, this is a great slide, because first off it's a picture of Benedict Cumberbatch, and then it's a great joke, we see he becomes actually Alan Turing. Anyhow. So, yeah, this is a really interesting Python-specific example of what's called equivalent mutants, right, make a change that
is actually undetectable in practice, so this code, does anybody recognize this code, a lot of you've probably seen it, it's in the standard library, used to be at least, this is an implementation of a function called consume, and the idea is I want to pass in an iterator, and I want to just iterate through it, I don't want to do
anything with the stuff coming out of it, I just want to consume it, right, and hence it's called consume. So this is recommended code, I'm going to walk over here, right. They're saying that you just pump the iterator into a zero-length double-ended queue, and that'll work. It's not going to allocate any memory, well, a little bit of memory, but it's not going to allocate, you know, like an array or something like that. It's just going to feed things in, by the nature of the deque initializer, into that thing, and throw the results away. Cool. Cosmic Ray is going to look at that code, and what's it going to say? I see a zero, I see a literal, I know what to do with a literal, I'm going to change that to a thousand, I'm going to change that to a million or a billion,
something like that, are your tests going to notice that, is there any way to test for that? In principle, if you really wanted to, you might be able to detect an extra allocation of memory, but who writes tests, you're not going to write that test, I'm not going to write that test, so, there are cases where a change can be made that is considered undetectable,
it's an equivalent mutant, and we have to somehow deal with those, mark them somehow, or be prepared to not deal with them, not make those mutations. This is another complexity, because these arise, it's another great example, these are almost all done, everybody's written code like that, right? So, if I mutate something inside that block, when is that going to be run?
Not in my test suite, right? This is not going to be true in my test suite, I'll never get there, but I've mutated the code anyway. Okay, so these are classes of things we need to be concerned about, with mutation testing. And we'll talk about some strategies for dealing with those, but you have to be creative sometimes.
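Returning to the consume example above: the recipe the speaker describes is the zero-`maxlen` case from the itertools documentation, and a constant-replacement mutant of it behaves identically from any ordinary test's point of view.

```python
from collections import deque

def consume(iterator):
    # The recipe from the itertools docs: a maxlen=0 deque discards
    # everything fed to it, so this drains the iterator without storing it.
    deque(iterator, maxlen=0)

def consume_mutant(iterator):
    deque(iterator, maxlen=1000)   # constant replacement: 0 -> 1000

# Both drain the iterator completely; only peak memory use differs, which
# no ordinary test observes -- a practically undetectable "equivalent" mutant.
it_a, it_b = iter(range(10)), iter(range(10))
consume(it_a)
consume_mutant(it_b)
print(next(it_a, "drained"), next(it_b, "drained"))   # drained drained
```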
Okay, so, doing good on time, alright. So, cosmic ray, this is my implementation, my 16th implementation of a tool for doing this kind of work in Python, right? The project is at this point about, it's probably like 7 or 8 years old, but it gets very little love these days.
I put a lot of energy into it first, it was a very exciting project, but I just don't have time anymore. One of the reasons I like to give this talk is to try to get other people interested. It's an open source project, you can find it there, if you have interest in writing new operators, you know, addressing some of the complexities, doing fancy things with the code to make it a more practical tool,
that's great, we always love to have people contributing. So, this is off of what used to be called Twitter, and I have to put up here, so it makes me feel proud, that they're talking about mutation testing tools, and at the bottom, the Python 1 cosmic ray has a decent design.
What else could I, I mean, this is, on Twitter, this is like, you know, it's like the Nobel Prize, right? A decent design, there you go. So there are, there's clearly implementation challenges, we have to write the code to do this stuff, what are the things we have to figure out how to do? One is to determine which mutations to make, given a body of code, what am I going to change?
How do I model that then in cosmic ray as something that people can, you know, extend and use? I then need to write a system, a machine, that makes mutations one at a time, runs a test, and then unmakes those mutations. So that's the kind of part two, what we have to do.
And then as I make those mutations, I have to run the test suite against each mutant. And all the time I need to keep in the back of my mind, dealing with the complexities we talked about, the long run times, the equivalent mutants, incompetence detection, things like that. Yeah, I'll be honest, I try not to worry too much about the long run time problem.
It's not something I can solve with magic. We know how to solve it, it's parallelization, and it's narrowing down of your test suites and things like that. I think trying to solve that problem is in some ways premature optimization, you know, the root of all evil in software.
So I haven't focused too strongly on that, except to make it a parallelizable framework. Right, and the Tower of Babel over there, which is sometimes what it feels like rebuilding. Okay, so at the core of CosmicRay is something called an operator. An operator, this is an animated slide, so you're going to lose the animations, I apologize for that.
But an operator does kind of just two things. It identifies places in the code where it thinks it can make a change. So an operator might be the change plus to change to minus operator, right? And every time it sees a plus in the code, it's going to say, well, I can make a change there. And that gets noted in a little database.
And then later on, in a subsequent pass, we look through that database and say, operator, you said you could make a change there, go make that change there now. So that's the other thing that an operator needs to do is make the change at a place that it claimed it could make a change. It's not its job to decide when to do that. There's another machine that's driving the operators.
Let's look at the API a bit here. So the first function here, the abstract method mutate positions, that's the detection: where the operator is supposed to claim, yes, this AST node, I can make a change to that. And that gets noted. And then later on, when we're driving the operators, we come to the mutate function, the second abstract method there.
And it's saying, okay, on this AST node, make the nth change you could possibly make to it. It's possible in some cases for a single AST node to be mutated multiple times by a single operator. It just turns out that that's the case because of Python ASTs. And so we have to build that into it.
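In outline, the operator interface the talk walks through might look like this. The method names and the toy leaf node here are illustrative approximations of the shape described above, not copied from Cosmic Ray's actual base class.

```python
from abc import ABC, abstractmethod

class Leaf:
    """Stand-in for a Parso leaf node: a token type plus its text."""
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

class Operator(ABC):
    @abstractmethod
    def mutation_positions(self, node):
        """Phase 1: yield an entry for each mutation this operator could
        make at `node`. These claims get recorded in the work database."""

    @abstractmethod
    def mutate(self, node, index):
        """Phase 2: actually apply the index-th possible mutation at `node`.
        A single node can sometimes be mutated multiple times, hence index."""

class PlusToMinus(Operator):
    def mutation_positions(self, node):
        if node.type == "operator" and node.value == "+":
            yield node                 # exactly one possible mutation here

    def mutate(self, node, index):
        node.value = "-"
        return node

op = PlusToMinus()
leaf = Leaf("operator", "+")
if list(op.mutation_positions(leaf)):  # the operator claimed a site here
    op.mutate(leaf, 0)
print(leaf.value)                      # -
```

Note the division of labor: the operator only detects and applies changes; a separate driver decides when each recorded mutation actually gets made.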
We have some support for providing arguments to operators. It's a nascent and poorly understood feature at this point. And this is kind of neat. We force operators to provide us a list of examples. And these examples are, I'm going to go from this code to this code. They're just strings of code.
And then our test suite for every operator, we just verify, does it actually make that change? It's a nice little feature. But the two ones at the top here, those are the important ones for our purposes here. Okay, this really suffers from not having animation. But in the back there is an AST, which is like it's one plus two times three.
And those green blobs are the AST. You see, we use a library called Parso. I think a very cool sci-fi kind of name for parsing the code. Has anybody else used Parso? You've almost certainly used a tool that uses Parso. It's incredible, right? It's amazing. If anybody here has worked on it, thank you. It's excellent.
But yeah, Parso, I just throw some code at it and it says here's the AST for that. We then crawl that AST with all the operators and they're making notes about what they can do. The superpower that Parso gives us is that I can then modify an AST node. I don't have to modify code. I can work on these in-memory data structures and say I'm going to change that plus operator,
which is represented as an object, and say now you're a minus operator. And then I can tell Parso write that AST back out to file and it will do that. And it's high fidelity. It remembers comments and indentation, all that stuff. It's excellent. So that's our basic strategy. We load it up into Parso, make a change, tell Parso to write it back out to file,
and then we can run our test suite against modified files on disk, the most natural way to run your tests. Initial versions of Cosmic Ray did much more exotic, ridiculous things. Like we had custom finders and loaders for the import library
and we would intercept code objects and try to do these changes in memory. And it was amazing. As a young programmer, I thought it was the coolest thing ever, but looking back in my wisdom now I realize it was idiotic. But it was super cool. It was fun code to write and I learned a lot about Python then. But this is much simpler, much more robust. This is the way we do things now.
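The parse / modify-in-memory / write-back cycle is easy to demonstrate. Parso may not be installed everywhere, so this sketch uses the stdlib `ast` module for the same round trip; the important difference, as noted above, is that Parso preserves comments and formatting while `ast.unparse` does not.

```python
import ast

source = "total = a + b\n"
tree = ast.parse(source)

# Walk the in-memory tree and flip the addition to a subtraction --
# the same kind of node-level change Cosmic Ray makes on Parso trees.
for node in ast.walk(tree):
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        node.op = ast.Sub()

mutated = ast.unparse(tree)   # requires Python 3.9+
print(mutated)                # total = a - b
```

With Parso the flow is analogous: parse the source, reassign an operator leaf's value, and ask the tree for its code again, with comments and indentation intact.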
If you have to deal with source code at all, look at Parso. It's very, very nice. It's at the core of how we do the modification of code on disk. This is how brave I am. I used, what's that font? Comic Sans, yeah, I used it up here. I saw a talk by the guy who wrote Haskell.
And he's super famous, very, very smart guy. And he loved Comic Sans. And he spent like half of his talk just defending the use of Comic Sans in presentations. So I figured I should do it too. Okay, so there's kind of two phases ultimately to Cosmic Ray. There's the init phase where we ask Parso to generate an AST.
Then the operator calculates mutation sites. And then there's the mutation phase where step three, we actually mutate the AST, convert it back to code, and then we run a test. And then we just swap the code back to its original state after doing that. So that's, you know, this is the loop that Cosmic Ray runs in.
Excuse me. So to summarize all that, we use Parso to transform source code into abstract syntax trees. We use operators which detect sites and then perform mutations. And then we use Parso again to blast it back down to disk. Another quiz. What is that? I've used that a couple times, that image there.
Recognize that? It's apparently the tree of life. I'm not a biologist or a zoologist, but this is some kind of representation of the taxonomy of life on Earth. Seems small. Okay. Figuring out what to mutate. Interesting question.
Cosmic Ray operates essentially at the level of packages. When you set up a configuration for Cosmic Ray, you tell it there's this package. Maybe the top level of a big tree, but you tell it a single package. I should really say a single module that might be a package. And its job then is to operate on everything in that tree.
We'll scan the entire package for all of its submodules. There are of course limitations, obvious limitations if you've been following, to the kinds of things we can and cannot mutate. What's an example of a module we probably can't mutate? Well, no, no. Parso we probably could. We don't ever look at anything in site-packages.
Let's say it's a package written in Rust or C++. We're just not in a position to do anything about that. That would be an interesting mental exercise though. There are certainly tools that can do C++ and I presume Rust mutation testing, but they're not integrated with Cosmic Ray at all.
It is also possible to exclude modules which you don't want to be mutated. Maybe you don't have perfect coverage of some modules, or maybe they are modules that you know lead to a lot of incompetents, for example. This is a way you can filter out, say, your main module. We looked earlier at that block of code that will never be executed in your test suite because Python won't let it be executed effectively.
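For reference, that setup lives in a TOML configuration file. A rough sketch is below; the key names are from memory, so check the Cosmic Ray documentation before relying on them.

```toml
[cosmic-ray]
module-path = "mypackage"        # the package/module tree to mutate
timeout = 30.0                   # seconds before a mutant counts as incompetent
excluded-modules = ["mypackage/__main__.py"]  # modules to leave unmutated
test-command = "python -m pytest -x tests/"

[cosmic-ray.distributor]
name = "local"
```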
So you can make notes in the configuration saying, please don't mutate this bit of code. That's one convenience we have to make this more practical tool. So we operate at the package level. It's pretty straightforward. Running tests.
The testing overview is, we've kind of talked about this a little bit, we figure out what to mutate. We scan the AST. We create a mutant. We write that mutant to disk and we run the test command. That's supposed to be animated, but we run that in a separate process. This is mildly contentious. I suppose other mutation testing tools in Python do things differently, for good reasons.
But I've decided that running the test in a separate process is the only way that I will sleep well at night. Because if I make a mutation to some code, right, well-behaved code, and I mutate it and then I run it, in my process, it could, who knows what it could do, right? It could wipe out my sys.modules.
It could do all sorts of crazy things. And I would have kind of no way of knowing. And so I wouldn't know the state of Cosmic Ray after running the test. So I feel, ultimately, this is the safest way to run tests. So, typically, it's a pytest kind of thing, or a unittest, you know, command that's being run in a separate process.
And then we, after that, we write back the original code. Yeah, the reason you might want to run it in process is for performance, possibly, because you can introspect it a little bit. There's good reasons, maybe, to run it in process, but I feel much better. I feel it's a simpler, more manageable tool by running it this way.
Dealing with incompetents. The most important incompetents to deal with are those that have gone into an infinite loop. It almost always happens on any sizable code base. The only real strategy that Cosmic Ray has right now is to set a timeout, right? You can, in your configuration, you can say, well, if the test suite takes longer than 30 seconds,
consider that as incompetent. It's gone into a loop that it's not going to recover from. So if your test suite normally runs in one second, and it's taking 30 seconds to run, you can heuristically conclude, I suppose, that it's never going to end. You can also, I think this feature still works.
We had a way of having Cosmic Ray calculate a baseline execution time for your test suite so it automatically figured out what the timeout was supposed to be. I think that feature is still in place, but I'm not positive. The problem with working on a piece of software for so long is that you have memories of different states of the code, and so I honestly don't remember if that feature still works.
Right. But this is the basic strategy. I'm not sure what other ways you would do it. Maybe you could analyze the Python runtime and see if it looks like it's in a cycle or something like that. Maybe that's a possibility if anybody's feeling energetic.
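The baseline idea mentioned above fits in a few lines. This is an illustration of the heuristic only, not Cosmic Ray's actual code; the multiplier is an arbitrary safety margin.

```python
import subprocess
import sys
import time

def baseline_timeout(test_cmd, runs=3, multiplier=10.0):
    """Estimate a kill-timeout by timing the unmutated test suite."""
    start = time.monotonic()
    for _ in range(runs):
        subprocess.run(test_cmd, capture_output=True)
    average = (time.monotonic() - start) / runs
    # Any mutant whose test run exceeds this is assumed to be looping forever.
    return average * multiplier

# Example: time a trivial "suite" that does nothing.
timeout = baseline_timeout([sys.executable, "-c", "pass"])
print(timeout)
```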
And there's another set of interesting pieces of technology inside Cosmic Ray. It is heavily plugin-based, so all the operators are provided as plugins, so you can provide your own operators. This has been done on projects I'm not allowed to talk about, private projects, but they provided what essentially were domain-specific operators. They knew that there were things in their code that they wanted to modify in special ways based on the behaviors in their domain. I'm waving my hands a bit here. So they just plugged in their own operator via stevedore, I think, which we're using to load all that stuff up, and it just works. Cosmic Ray will just scan all plugged-in operators and run them all. That's kind of how it behaves. Another thing that's pluggable in Cosmic Ray right now is what we call a distributor, and this is how we can distribute jobs remotely.
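Since plugin discovery goes through setuptools entry points (via stevedore), registering a custom operator or distributor is roughly a matter of declaring an entry point. The group names and identifiers in this sketch are illustrative guesses; check the Cosmic Ray documentation for the real ones.

```toml
# pyproject.toml fragment (hypothetical names throughout)
[project.entry-points."cosmic_ray.operator_providers"]
my_ops = "mypackage.operators:provider"

[project.entry-points."cosmic_ray.distributors"]
my_distributor = "mypackage.dist:MyDistributor"
```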
So we have, the one you'll normally start with is the local distributor. It just runs everything one at a time on your local machine, but you can also set up these little HTTP listeners wherever you want, and they can receive requests to do mutations, and so you could have, you know, say 10,000 of these running in AWS if you're rich,
or you could have a distributor that uses Celery or a distributor that uses whatever. So if you really need to parallelize, we have ways to support that. We use a little database to keep track of what's going on. This is actually, I think, a really nice feature. So as I mentioned, when we run the initialization phase of Cosmic Ray,
it scans all the code and figures out all the places that are going to get mutated, and it just makes a row in a database for each one of those. So I may have, you know, 10 million rows in the database, each one representing a single mutation, with a slot, an empty slot for the result, right? So then as I start running, executing tests, the results are coming in, I start filling in those slots,
and this has the nice feature that I can kill the execution. I can control-C it and walk away, but the state is remembered in this database. I can come back later, restart the execution, and then just pick up where it left off. And so, because sometimes you need to, you want to pause the test for some reason. So you get this little database, an SQLite database,
and we use SQLAlchemy on top of that to manage it, and it's good. It's a feature that I wasn't sold on initially, but it's turned out to be a really, really good idea. We use Click, which is a wonderful command line tool. If you've never used Click and you're writing command line tools, look at Click. Does anybody use Click? All right, people, right. It's wonderful, and it's turned out to be, we use it in all our stuff at Sixty North, actually. It's everywhere. Really nice tool, so check it out if you've never heard of it. Spor. This is a talk, literally a talk in its own right.
One of the early questions that came up in the development of Cosmic Ray was, how can I mark a bit of code that I don't want to get mutated, right? There's an obvious approach that's used by a lot of tools, which is to use a comment, you know, # noqa or something along those lines. That's always rubbed me the wrong way. It's just an aesthetic thing, I guess.
I feel like there ought to be a better way. So that led to the construction of a tool called Spor, which is Norwegian for trail or track. I'm not Norwegian, but I live in Norway, so I just choose Norwegian names for a lot of my projects because they're never taken. I got five, wow, okay, I got five minutes. Basically, it lets you externalize information, but attach it, anchor it to pieces of your code, and it has some nifty technology built in for updating those anchors as the code changes. That's its party trick. That's what it's there for. So you can check out Spor. Yeah, I think there's a link there. There's a Spor filter for Cosmic Ray, which allows you to kind of plug Spor into Cosmic Ray, but Spor is a standalone tool you could use to attach data to anything, and it seems to work, and I'd love to hear if it does or does not work for you. If you try it out, it's a neat tool. Right, so there's some remaining work. There's actually a huge amount of remaining work, but this is, you know, improving the metadata anchoring. This is Spor. I'd like to see that pushed to a conclusion of some sort.
Support for more kinds of modules. We talked about, you know, can I handle C++ modules? Can I handle Cython? That kind of stuff. Maybe better integration with coverage testing. Just something smoother, more polished than it is now. More operators. We have more than a handful of operators in Cosmic Ray already, but the more we have, the better, especially if they catch important cases that I haven't thought about, because the language evolves. For example, the match statement was added, and there's nothing that handles the match statement in Cosmic Ray right now, so that would be an interesting area to explore. Higher order operators.
This is a theoretical, this comes from research, the idea of trying to speed things up essentially by making multiple mutations at a time, and then essentially bisecting down to the one that caused the problem. This is something we've started to add support for in the core of Cosmic Ray, but I don't really have a great depth of support for it right now. In fact, you really can't do it. The groundwork is being laid, I think, which is to say, I think what I've done is a good groundwork for it, but I'm not sure that all the work has been done. Right, practical results. God, I've got a few minutes, we're not going to have time, we won't be able to do the demo, but I can talk through this a bit. Okay, does anybody recognize that?
Nobody works in oil and gas. Right, okay, so I used to work in oil and gas. This is a piece of software for visualizing, doing geophysical analysis of the Earth. Something called segpy, which we wrote at Sixty North for loading up these files called SEG-Y files, which are big in that field. Anyhow, part of SEG-Y, because it's ancient,
is that its floating points values can be stored in IBM floating point from the old OS 360 days. Right, the good old days, as they said. We were asked the other day, have you ever hugged a mainframe? I mean, these guys probably have. Anyhow, we had to write a function,
so this segpy has been run through Cosmic Ray, it has a high-powered test suite, and we had this function, IBM to IEEE, because of course we wanted IEEE 754 so we can do actual math on these things. And so this function converts those. And there's this bit of code here at the bottom. If A equals B equals C equals D equals zero, that's zero, and there's no further analysis
that needs to be done. It's an optimization for this function. Cosmic Ray looked at this and came back and said, well, if I muck around with A, if I remove A, nothing changes. This mutant survives, and that's a problem. And we looked at it, and it turned out that indeed the A is unnecessary because if all the others are zero, the value is zero regardless of A. So it's something that's obvious in retrospect, but we never would have caught it. It's not a huge optimization, it's not a huge win, but it gives you some sense of the sort of introspective powers of something like Cosmic Ray that just mindlessly beats your code up until it screams, and you can pull out the parts
that don't need to be there anymore. So this is a neat packaged result. We just took the A out, and everything still works as expected. I would love to give you a demo. It's an amazing tool. You'd have cried, you'd have laughed, you'd have wept, but we can't do that, so I guess we have time for questions then.
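For the curious, the conversion under discussion looks roughly like this. It's a sketch of IBM System/360 single-precision decoding, not segpy's actual implementation, but it shows the all-zero shortcut the mutant exposed.

```python
import struct

def ibm32_to_float(raw: bytes) -> float:
    """Decode 4 big-endian bytes of IBM single-precision hex float.

    Layout: 1 sign bit, 7-bit excess-64 base-16 exponent, 24-bit fraction.
    """
    (word,) = struct.unpack(">I", raw)
    sign = -1.0 if word >> 31 else 1.0
    exponent = (word >> 24) & 0x7F
    fraction = word & 0x00FFFFFF
    if fraction == 0:
        # The shortcut from the talk: an all-zero fraction is zero,
        # whatever the sign/exponent byte happens to contain.
        return 0.0
    return sign * (fraction / float(1 << 24)) * 16.0 ** (exponent - 64)

print(ibm32_to_float(b"\x42\x76\xa0\x00"))  # → 118.625
```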
Fantastic talk, Austin. Also, thanks for managing the time too. Oh, no problem. So question section, yeah, sure. Hi. I would like to ask how safe is all this, because we are changing the code.
We are replacing some parts with some other parts. Anything can happen. Let's say, just as an example, that I have a CLI, and I have an install function that just deletes a directory with my configuration. And Cosmic Ray might replace that configuration path with /home or something like that.
So is it recommendable to run these in some kind of isolated environment like jail or container? I think you would need to use your head to decide. Yeah, if you fear that a mutation could put your code in a state where it's going to be destructive like that, yeah, you, I mean, it's always possible, I suppose, that we make a mutation that would cause damage
when you run the test suite. But if that's the case, then it's going to happen if you've made a mistake as well, so you probably need to be prepared for that case in any event. But yeah, it's something to consider for sure. Arbitrary things could happen. The caveat though is that the mutations are almost, they are always very small, right? Unless you write a mutation operator
that does great big things. So typically, I've never heard of something like that happening in practice, if that makes any sense. Okay, thank you. Clear the way. Thank you for the presentation. I wanted to ask, I don't know anything about mutation testing.
I just talked about it for an hour. Now I do know something. So my question is, could there be code that cannot be killed through mutations? So is there, or can any type of code be broken down to these simplest operators that could be mutated? Well, so all code, all Python code
can be ingested by Parso into an abstract syntax tree and that's what we work on. So anything that's in, if those words don't mean anything to you, we can talk about it later, but it's essentially a data structure that I can manipulate. So if it's in the AST, operators have the chance to work on it. So there's nothing, somebody tell me if I'm wrong, but I can't imagine even theoretically how you could have Python code that's not in the AST when it's slurped up into Parso. So I don't think there's anything that isn't accessible through this kind of system. Okay, thank you. Yeah. But if I'm wrong, somebody tell me. I'd be very, very curious to know if I have a blind spot about this.
So any more questions? Yeah. There is one really hacky thing which is encodings in files which can manipulate the source before you have access to it, but that's really unlikely that you'll get that. Okay, so let me turn the crank on that because isn't Python 3 forcing you to use UTF-8 encoding?
It's not even Python code if it's not in UTF-8, right? That's the default, but you can pick any encoding. Okay, that's... Or any custom ones. Okay, that is worth noting at least. Thank you. Yeah. Yeah. Damn it. Thanks very much. Maybe you said it and I missed it, but so you mentioned that you remember the state of the test throughout one run.
Do you also keep track of it between runs? Because I'm thinking, say I notice some code that's like, I'm not testing well enough. I add another test, but clearly that all the mutants that die in the first run would still die in the second, right? Do you remember that to reduce the run time between improvements?
Cosmic Ray in its core doesn't do that, but we essentially publish what the database looks like. So if you want to go manipulate the database, you can. You can remove records and that would just force their re-running if you ran a second time. So say I had a result for some mutation. I just delete that result from the database and the next time I run Cosmic Ray,
it would just run that mutation again. You could also add new lines. You can null them out if you don't want them to run. And that's our strategy for letting users manipulate what gets run at a fine-grain level. Yeah. Any more questions? I'm sorry, that's all the time we have.
No, no, please. So you mentioned higher-order mutations, and I was just trying to run through them in my head, because I think if you, let's say you apply four mutations together and that combined mutant is killed, you have no idea which one of those was actually detected.
Right, and it's okay. I am far from an expert in the theory of higher-order mutations. It's an area that I need to educate myself on, but it's one that's established. I think the idea is if something survives after making multiple mutations at a time,
you then have to sort of individually apply those. I think I see what you're saying because this is what I don't entirely understand about higher-order mutation. Suppose they're offsetting mutations, and I don't detect those. I make two mutations, each of which ought to produce a survivor, but then they offset each other somehow.
That's always a possibility, but if there's a survivor in the, say I've made four mutations and there's a survivor, then I just have to try each of those, I think, individually to figure out which is the one that actually is the culprit. So does that answer your question? I'm sure. I'm happy to chat to you after this
and work through my thought process later. Okay, yeah. So there is not much time now. Is it possible maybe to have a discussion later? Okay, sorry. So anyway, it was like a fantastic talk, Austin, and please give again a round of applause. Thanks a lot for coming.
And thanks for the laptop.