Stop Writing Tests!
Formal Metadata

Title | Stop Writing Tests!
Number of Parts | 115
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/58842 (DOI)
Transcript: English (auto-generated)
00:07
So, this is Stop Writing Tests, in which I will be somewhat provocative, but I'm also kind of serious about this, where I think that while many projects are under-tested,
00:21
we should nonetheless spend less time writing tests by hand than we do at the moment. But before we really jump into it, I want to start with an Australian tradition called an Acknowledgement of Country. This image is my hometown of Canberra. I live just off to the left, and work kind of in the central foreground of the Australian National University.
00:40
But before the town of Canberra was here, and before the Australian National University was here, this was the land of the Ngunnawal and the Ngambri peoples, for tens of thousands of years. And I want to pay my respects to their elders, past and present, and their emerging leaders, and acknowledge that their land and waters were never ceded.
01:01
The main body of my talk, though, is about testing, and so it behooves me to quickly define testing. It's that thing where you write your code and check whether or not it did the right thing. Usually, we're either looking to find new bugs, or checking for regressions. These are often kind of separate activities, even if we use the same tools for each of them, the workflows are often quite different.
01:22
But fundamentally, the checklist goes like, choose inputs, run the thing we want to test, check that it did the right thing, or that it didn't do the wrong thing, and then repeat as necessary. So let's use an example. In deference to my friend David, who probably doesn't want to drink quite so much as this
01:40
tweet would indicate, instead of talking about reversing a list we're going to look at sorting a list as our example. So here are some tests we might write for the sorted built-in, if for some reason we'd lost all trust in the Python core developers but still used Python. We can see here that if we sort the list one, two, three, we get the list one, two,
02:03
three. And if we sort the list three, two, one, we get one, two, three, even if they're floats, and that preserves the output type. And we can see that we can sort strings as well as numbers. So that's kind of nice. And if we're thinking, like, don't repeat ourselves, well, like all good software engineers, we might use a pytest parametrize.
02:23
So in this case, we've got semantically the same test, but we list out our input and output data, and then have a kind of data-driven or table-driven parametrized test. So this really helps reduce boilerplate. Okay, to be honest, when you've only got three cases it hasn't helped much, but it makes it much easier to add further input-output pairs in the future.
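A minimal sketch of the parametrized test described here (the exact cases on the slide may differ):

```python
import pytest

@pytest.mark.parametrize(
    "data, expected",
    [
        ([1, 2, 3], [1, 2, 3]),
        ([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]),
        (["c", "b", "a"], ["a", "b", "c"]),
    ],
)
def test_sorted(data, expected):
    # One test body, many input-output pairs.
    assert sorted(data) == expected
```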
02:45
The problem is that this isn't really automated testing. Like the Mechanical Turk of contemporary Amazon or the 1800s magician's trick, what we have done is not so much automated a process as hidden the human labor involved in it.
03:01
And I'm going to claim that not only is this not particularly automated, it also doesn't scale particularly well. So what could we do that would make writing these kinds of tests easier? Well, one thing would be to go, is there a way that would let us get away from having to define the output so we only had to think up the inputs?
03:21
Because here, remember, we have to define by hand what the correct result is for every possible input. Here, we only have to come up with the input, and by comparing it to a trusted equivalent sort function, we can automatically check that it behaved correctly. So we've already saved ourselves a chunk of work with this approach.
03:40
And you might be thinking, like, how often do I have a trusted alternative version? Well, every time you're refactoring, the version before the refactor and the version afterwards should do exactly the same thing. Or if you have multi-threaded code, if you run it with one thread or with many threads, it should do the same thing. Or even you might have, like, a mock version of your database that fits in memory instead of being a distributed system, you can check the equivalence of these kinds of things as well.
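A minimal sketch of that idea; `my_sort` here is just a hypothetical stand-in for your own code under test, say the pre-refactor version or a single-threaded mode:

```python
def my_sort(items):
    # Hypothetical implementation under test; imagine your own function here.
    return sorted(items)

def test_my_sort_matches_trusted_implementation():
    # Inputs are still chosen by hand, but the expected outputs come from a
    # trusted equivalent rather than being written out manually.
    for data in ([1, 2, 3], [3, 2, 1], ["b", "c", "a"]):
        assert my_sort(data) == sorted(data)
```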
04:05
But even if you don't have that kind of thing, all is not lost. We can leverage particular properties, that's why it's called property-based testing, of our functions. And so for the sorted built-in, we know that no matter what the input should be, the output should always be in ascending order, right?
04:23
If you take the pairs of numbers or elements in the output, then the first one should always be less than or equal to the second of every possible pair. Do we think that this would be a sufficient test for the sorted built-in?
04:41
I'm going to go with no, because return the empty list is a great performance optimization which would pass this test. So we might want to say, well, if the output is in order, and we have the same number of elements, and we have the same set of elements as before, then we know that we've sorted the list correctly. Does this one seem right?
05:05
It turns out this one is also kind of subtly buggy, because if we had the list, for example, 1, 2, 1, we could replace it with the list 1, 2, 2, which would have the same length, the same set of elements, and be in sorted order, but would not be a correct sorting function.
05:21
So we could use the mathematical definition, that it's a permutation of the input such that it's in sorted order. The only problem with this one, though it is a fully correct test for sorting, is that it's hideously slow. So we'd really want to use the collections.Counter class. And I think this is a pretty good test for sorting.
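A sketch of that property-based check, roughly as described (still with hand-picked inputs at this point):

```python
from collections import Counter

def test_sorted_properties():
    for data in ([1, 2, 3], [3, 2, 1], [1, 2, 1], ["b", "c", "a"]):
        result = sorted(data)
        # Property 1: the output is in ascending order.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output has exactly the same elements, with
        # multiplicity (this is where collections.Counter comes in).
        assert Counter(result) == Counter(data)
```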
05:43
In the process, we've kind of rediscovered the idea of property-based testing, that we can check whether our code is buggy without needing to know exactly how to reimplement it. In this case, sorting is fully specified by just these two simple properties, that the output should be in order and that the result has
06:01
the same elements as the input. I do want to note, though, that partial specifications, even if you can only test one of these properties, or can test for some kinds of bugs but not others, are still super useful, and more tractable on more complicated, business-logic-y kind of code.
06:21
The remaining problem with this is that no matter how hard you think, many bugs are actually caused by the interaction of our code with inputs or situations that we never expected or never thought of. And that means that the bit where we have to write the test cases by hand, you know, 123, 321, BCA, is limited to the things we can think of, and
06:50
kind of by definition, we've probably also written code that handles the things we think of. And that's where my library Hypothesis comes in. The job of Hypothesis is that if you describe
07:01
what kind of data should be possible, hypothesis will find particular examples that you wouldn't have thought of. And so, here, we've written exactly the same test body, but instead of a pytest parametrize, we're saying from hypothesis, use the given decorator to provide inputs and the argument that that provides should be either a list of mixed integers and floats
07:22
or a list of unicode strings. And even if this seems like a pretty direct translation of what we were doing before, this test will fail. And it fails because of the floating-point value NaN (not a number), which compares unequal to itself.
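A sketch of that test, assuming the same two properties as before; as described, this fails as soon as Hypothesis generates a NaN:

```python
from collections import Counter

from hypothesis import given, strategies as st

@given(data=st.lists(st.integers() | st.floats()) | st.lists(st.text()))
def test_sorted_properties(data):
    result = sorted(data)
    assert all(a <= b for a, b in zip(result, result[1:]))  # fails for NaN
    assert Counter(result) == Counter(data)
```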
07:43
And so, it turns out that if you try to sort a list containing NaNs, Python will sort each of the sub-lists but won't reorder anything across a NaN. Which is kind of wild, but it's unclear what else the behavior should be.
08:01
So, the short version is that I want you to be able to adopt property-based testing, and then to actually do it. The foolproof plan for that: you just pip install hypothesis, you skim the documentation, and then you find a lot of bugs. I hope you're into that kind of thing. To be more specific, Hypothesis has minimal dependencies, just two pure Python
08:26
libraries that we use for some data structures. And it works on every version of Python supported by the Python Software Foundation, from 3.6 to 3.10. It can be conda-installed if you like conda.
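For reference, the installation commands mentioned here:

```shell
pip install hypothesis
# or, via conda-forge:
conda install -c conda-forge hypothesis
```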
08:42
So, let's have a look at how you would actually write some tests. And when I say how you would write some tests, the problem with this one is that you still have to write the whole thing by hand. And I'm not into that. So, let's look at the ghostwriter as a way to let hypothesis write your tests for you. Once you get around to pip installing hypothesis,
09:06
you can check out the hypothesis shell command. And if you ask hypothesis to write your tests for you, you'll see that there are a bunch of different kinds of things you can do. You can write things based on type annotations, though they're strictly optional.
09:21
A bunch of examples you can write in pytest or unittest style. But let's just jump in. Let's write a test for our sorting function. Of course, that should be sorted. And so, hypothesis spits out a test which reminds us that the test I wrote by hand earlier
09:40
neglected to consider the key function and the reverse flag. But in the body of the test, the ghostwriter doesn't have any particular knowledge of sorting. So, it just calls the function and hopes it doesn't crash. Which is a pretty good start. I would usually use this as a template to actually extend myself. But you could also tell hypothesis that the sorted function is idempotent. That is, if you call it on its output,
10:06
the result should be the same as the first one. And so, there we are. That's a complete test for sorting. We could also test that our two functions are equivalent. Now, I don't have a trusted equivalent to the sorted function for the standard library,
10:23
but the eval built-in should be the same as the ast.literal_eval function for every string which represents a Python literal. So, let's see what that test looks like. Hypothesis spits out a test where we have our global and local namespaces, which can be None or, if you want to upgrade the test, a dictionary or a namespace of things.
10:42
And then we have the node_or_string and the source arguments. And in this case, that's because the names of the arguments to eval and to literal_eval actually don't overlap. So, you need to edit that down a little, and probably use my Hypothesmith project to actually generate valid source code. But I feel this is a pretty good start.
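Roughly, the ghostwriter invocations demoed in this section (the command-line interface needs the extra: pip install "hypothesis[cli]"):

```shell
hypothesis write sorted                              # calls sorted() and checks it doesn't crash
hypothesis write --idempotent sorted                 # checks sorted(sorted(x)) == sorted(x)
hypothesis write --equivalent eval ast.literal_eval  # equivalence test for the two functions
```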
11:02
The other cool kind of properties, and this is where the name property-based testing originated in Haskell about 20 years ago, was from properties of things like binary operators. We have associativity, commutativity, identity elements and so on. To be honest, I don't use this very often. But if your functions do have these kinds of properties,
11:23
it should be really easy to test. We could also have a look at round trips. And this is the last general category of property. And in particular, this is the one that you probably all think about going away and using. A round trip property is where you call a function
11:40
and then you call some other function that undoes that. So, in that case, hypothesis goes, well, if I compress and then I decompress, having found that matching name in the same module, then I should get back the original input. This is a really general and really powerful form of test or design pattern for tests. Because we do it all the time.
12:00
We encode and then we decode our data. We serialize it and we deserialize it. We save it to a database and retrieve it from a database. We send it across a network and receive it from a network. And in each case, these round trips tend to cross a lot of layers of our stack. They tend to operate on our core data structures and they are often absolutely crucial to get right. If saving your data to your database and bringing it back out gives you
12:25
different data, you have a critical problem on your hands. So, let's look at a more complicated case than just compression. How about JSON encoding? So, we'll dump it to a string and then load it back in. And in this case, we can see that hypothesis has found all of the optional keyword arguments
12:44
to JSON encoding and decoding, which I usually ignore. And it turns out there are many of them. So, I've put together just a short test file, which I edited down from that by hand. The important bit here is that I just cut out a bunch and said that the object
13:01
is JSON. And that's defined recursively: the base case for JSON is None, True or False, a number, or a string; or JSON can be lists of any JSON, or dictionaries of strings to any JSON. Sounds good? Anyone think this will pass, or will it fail? It fails.
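A sketch of the hand-edited test file described here; the ghostwriter starting point would be roughly `hypothesis write --roundtrip json.dumps json.loads`, cut down to the `allow_nan` keyword argument. As discussed next, this version fails:

```python
import json

from hypothesis import given, strategies as st

# JSON, defined recursively: base cases are None, booleans, numbers and
# strings; then lists of JSON, and dicts mapping strings to JSON.
json_values = st.recursive(
    st.none() | st.booleans() | st.integers() | st.floats() | st.text(),
    lambda inner: st.lists(inner) | st.dictionaries(st.text(), inner),
)

@given(value=json_values, allow_nan=st.booleans())
def test_json_roundtrip(value, allow_nan):
    assert value == json.loads(json.dumps(value, allow_nan=allow_nan))
```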
13:28
And here, hypothesis actually shows us two distinct failing examples with different causes, which I think is pretty cool. The first is that if we allow NaN, and then we pass NaN, our assertion fails because NaN is not equal to itself. Well, that's going to be easy
13:42
enough to fix. But the other one is that if allow_nan is False and our JSON object is infinity, then we get this ValueError. And when you go and dig into this, it turns out that the JSON spec doesn't actually allow non-finite numbers. And if you set the confusingly named allow_nan flag to False, then Python will reject non-finite numbers in
14:05
your JSON encoding. So, to fix that, we'll try using the hypothesis assume function. So, first of all, we'll say, like, okay, we just always allow NaN and infinity. And then we'll assume that the object is equal to itself. If this is false,
14:22
it's kind of like an assertion, but instead of an error it just tells hypothesis that that was a bad example: try something else. And if we rerun pytest on this one, what do you think we're going to see? It still fails. I was very surprised when I first found
14:42
this one, putting together a talk demo. It turns out that if you have a list containing NaN, it compares equal to itself, because lists have a number of performance optimizations in equality where they try to short-circuit things. So, if the list is the same object as the other list, then it will always compare equal
15:03
by equality as well as a performance optimization. And for each element, if the corresponding element is equal by identity, it won't bother comparing it by equality to save you on deeply nested comparisons. This is usually great, but when you have JSON and NaN, it gets pretty confusing. It turns out even if you call list on a list of NaN,
15:24
so you get a different object, the element is still the same by identity and they compare equal, unless you're round-tripping through JSON. It's kind of bizarre. So, the proper way to fix this is to tell hypothesis not to generate NaN. If we wanted to test the semantics of JSON round-tripping with NaN allowed,
15:42
we might write a more complicated test. And if we run this one, hypothesis passes. And considerably more quickly, because we're not trying to find that minimal counterexample.
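A sketch of the fixed test, with Hypothesis told not to generate non-finite floats:

```python
import json

from hypothesis import given, strategies as st

json_values = st.recursive(
    st.none()
    | st.booleans()
    | st.integers()
    | st.floats(allow_nan=False, allow_infinity=False)  # no NaN, no infinity
    | st.text(),
    lambda inner: st.lists(inner) | st.dictionaries(st.text(), inner),
)

@given(value=json_values)
def test_json_roundtrip(value):
    assert value == json.loads(json.dumps(value))
```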
16:05
All right. So much for the ghostwriter. You may, however, want to port some of your existing tests rather than just throwing out absolutely everything and starting over. So, I'm going to walk you through what it might look like to take tests for something like Git,
16:22
which I think of as kind of business logic-y. There's a lot of just weird arbitrary behaviors of Git, which match some kind of other model, but it deals with state, it deals with files, there's not like a clean algorithmic thing that you can really do for the user-facing part of Git. So, let's start with this test, which says that if you check out a new branch,
16:43
that makes it the active branch. We set up a directory, we initialize a repository, we check out a new branch, and that should be the active branch. Well, the first thing we could do to make this a little clearer is to pull out the new branch name as an argument to the test. We set it as a default value, so there's no semantic change,
17:01
but this does make it a little clearer to the reader that the specific value of the branch name shouldn't affect this test. And then if you want to start using Hypothesis, you could say, hey, hypothesis, generate a branch name, which will only ever be 'new-branch', and run the same test body. Still no semantic change. And then you could pull that strategy out into a function shared among your test suite.
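A sketch of that intermediate step, shelling out to git in a temporary directory (the exact test in the talk may be structured differently):

```python
import subprocess
import tempfile

from hypothesis import given, strategies as st

# Shared strategy for branch names; st.just() produces only this one value,
# so there is still no semantic change from the original hand-written test.
branch_names = st.just("new-branch")

def git(*args, cwd):
    # Small helper: run a git command and return its stdout.
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

@given(new_branch=branch_names)
def test_checked_out_branch_is_active(new_branch):
    with tempfile.TemporaryDirectory() as repo:
        git("init", cwd=repo)
        git("checkout", "-b", new_branch, cwd=repo)
        assert git("symbolic-ref", "--short", "HEAD", cwd=repo) == new_branch
```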
17:23
And if you've only got a single test, okay, this doesn't help much. But if you've got many tests, then this means that you can share improvements or discoveries about what kind of data should be valid for particular sorts of inputs or models between all of your tests. So you get a kind of M plus N rather than a multiplicative scaling problem when you change
17:42
things. And now we come to the tricky bit, where we actually have to think about what valid branch names are. Because if you run this, for example, you'll discover that the empty string is not a valid branch name, that a whitespace-only string is not a valid branch name, that Git branch names can't start or end with dashes, and a whole bunch of other
18:05
complicated constraints which you can check in the Git manual. So for simplicity, we could say that a branch name should consist of only ASCII letters between one and 95 characters long. If you go over 95 characters, then certain web hosting
18:21
things start to reject your branch name as being too long. That was a fun one to discover. And then we could come back and look at this, and ask: do we really mean that this should only ever be true of a newly created repository? So the final test that I'd be trying to migrate towards is something like this,
18:43
where we say that given any valid branch name and any repository, if we assume that the branch name is not already a branch in the repo, and we check it out, then that branch name should be the active branch of the repository. Sounds pretty good to me. And I actually find this test a lot easier to read, as well as a lot more rigorous, compared to the starting point that we found.
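And a sketch of that final test; `repositories()` and the repo methods are hypothetical stand-ins for a strategy and helpers you would write for your own code, and the branch-name constraint is the simplified one from above:

```python
from string import ascii_letters

from hypothesis import assume, given, strategies as st

# Simplified constraint from the talk: ASCII letters only, 1 to 95 characters.
branch_names = st.text(alphabet=ascii_letters, min_size=1, max_size=95)

# repositories() is a hypothetical strategy that builds an arbitrary repo
# (some history, some existing branches); writing it is the interesting part.
@given(new_branch=branch_names, repo=repositories())
def test_checked_out_branch_is_active(new_branch, repo):
    assume(new_branch not in repo.branches)   # skip examples where it already exists
    repo.checkout(new_branch, create=True)    # hypothetical wrapper around `git checkout -b`
    assert repo.active_branch == new_branch
```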
19:02
The final thing I want to talk through is coverage-guided fuzzing. And this is where we get a little smarter. The Hypothesis engine by default, as you would use it
19:20
when running your tests or in CI, has a combination of some feedback and some really good heuristics, plus a lot of random search. Coverage-guided fuzzing basically adds an evolutionary or genetic algorithm to that. And HypoFuzz is designed to kind of complement your CI-based workflow. So your CI workflow or your local tests can then be dedicated to
19:45
searching for regressions, and you can use this more powerful approach with extra feedback to search for new bugs. It's also got this nice feature where, because it uses the same database as Hypothesis to save all of the failing examples, reproducing anything it finds can be
20:02
as simple as literally just running your tests locally. And it can pull that out of the local file system or Redis or whatever else you want to use. Let me pull up a live version of that that I have. So here's the live HypoFuzz dashboard, where I've set that running on one of my
20:22
own projects. And if I just zoom in on the early part of this, you can see that there's this kind of classic pattern where we logarithmically approach whatever it is that our steady state seems to be. But if we really zoom in on one of these, we can see we're still discovering new behavior, new bits of branch coverage as we go. The big difference that coverage guided
20:43
fuzzing makes is that when we discover one of those very rare branches by chance, we can then try variations on whatever that thing was, because we noticed that something was different. If you look on a log axis, you can kind of see that you get this more or less straight line, plus the leveling off later. If we scroll down, this has been running for about
21:05
half an hour now, and it turns out that this was in fact sufficient to find a bug in one of my libraries. Yeah, I'll go fix that later. So that's my talk where I wanted to argue that
21:25
'Stop writing tests' doesn't actually mean stop testing. But it means that we can hand over much of the job of testing to better tools, better libraries, and spend perhaps a little less time while still getting a great deal more rigor. So thanks very much, and I'll
21:40
see you in the chat for Q&A.