
Property based testing with Hypothesis


Formal Metadata

Title
Property based testing with Hypothesis
Title of Series
Number of Parts
18
Author
Contributors
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
The hypothesis.works website of the Hypothesis project boldly asserts: Normal “automated” software testing is surprisingly manual. Every scenario the computer runs, someone had to write by hand. Hypothesis can fix this. While it's debatable whether property-based testing should fully replace the manual parametrization of tests with different inputs and outputs, there's no doubt that Hypothesis is a powerful tool for uncovering bugs nobody would even have considered looking for. In fact, during its development, the authors of Hypothesis accidentally discovered countless bugs in CPython and libraries, thus coining the term *"The Curse of Hypothesis"*. The framework, although incredibly powerful, might seem overwhelming at first. In this talk, I will demonstrate how even simply throwing random strings at functions can reveal surprising bugs. From there, we'll progress towards generating more complex data, which will be less daunting than it initially appears. You'll also see how Hypothesis seamlessly integrates with various ecosystems and can be a valuable tool in any developer's toolkit. --------------------- About the speaker(s): Florian Bruhin ("The Compiler") is a long-time contributor and maintainer of both the pytest framework and various plugins. He discovered pytest in 2015 - since then, he has given talks and conducted workshops about pytest at various conferences and companies. His primary project, qutebrowser (a keyboard-focused web browser), has grown from a hobby to a donation-funded part-time job.
Transcript: English (auto-generated)
Up next is Florian Bruhin. He's quite the regular at this conference: he's speaking, helping to organize, and now he's even sponsoring, and he's kind of doing all of it at the same time for this edition of the conference. He's a long-time contributor to and maintainer of pytest, and I guess most of you know
about pytest; when I write Python code I use it daily. He also offers consulting and workshops around it, and he will now present Hypothesis as a way to further automate things
when testing. So, yeah. As was already said, I'm Florian Bruhin, also known as The Compiler, and I started using Python when I did an art project with a friend where we modified an electric
typewriter to automatically write tweets from the website formerly known as Twitter in 2011. That's when I wanted to learn a real programming language: I looked at Ruby for a weekend, looked at Python for a weekend, and given that I'm standing here, I'm sure you know what the choice was in the end.
About 11ish years ago I then started working on qutebrowser, which is still a big project for me: a keyboard-driven web browser for power users, written in Python. With that I discovered pytest in 2015 and, as things go, in the same year did my
first training at EuroPython in Bilbao for pytest, ended up as one of the maintainers of pytest because I said yes in the wrong or right moment, and did my first company training for pytest. Then I studied IT here and I founded my own company, Bruhin Software, where I do those
company trainings and do donation-based open source work on qutebrowser and on pytest, among others. Now if you go to this GitHub repository or scan this QR code, you will find not only these slides but also a lot of example code.
And I want to start with a little warning: there be dragons. A 30-minute talk might or might not be the right format for those slides; I'm really not sure about that yet. This is actually a chapter from my company training, last-minute converted from a training
to a talk, because if you looked at the schedule a week ago or so, there was actually a different talk here, "Make testing great again", and unfortunately the speaker couldn't make it, so I was asked to provide a replacement, since it was about testing and apparently
people hear that I do things with Pytest. The downside of that is that it will require a little bit of context first and I'll show you a lot of code, so if you're ever ashamed of looking at your phone or getting your
laptop out during this talk, it's really okay; feel free to follow along and look at the source code in that repository. And we'll start with a little calc function, just a small calculator that gets two operands and an operator as a string (plus, minus, times, divided by) and returns the result.
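The slide itself is not part of the transcript, so the following is only a minimal sketch of what such a calc function could look like. The argument order and the ValueError for unknown operators are assumptions.

```python
def calc(a: float, b: float, op: str) -> float:
    """Apply the binary operator `op` to the operands `a` and `b`."""
    if op == "+":
        return a + b
    elif op == "-":
        return a - b
    elif op == "*":
        return a * b
    elif op == "/":
        return a / b  # raises ZeroDivisionError when b is zero
    # Assumption: anything else is rejected with a ValueError.
    raise ValueError(f"Invalid operator: {op!r}")


print(calc(1, 2, "+"))  # 3
```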
No surprises here. And before I get to Hypothesis, I will expand this little example a bit and turn it into a terminal RPN calculator. Could you raise your hands real quick if you have heard before what an RPN, reverse
Polish notation, calculator is? That's most of you, which is surprising because often people in my trainings have no idea. I mean, now I'm telling you, I should have asked before: it's a way of entering things into a calculator where you first enter the operands and then enter the operator.
And I'm using it as an example here because it's way easier to implement that compared to a calculator with operand, operator, operand. So it works in a way that to calculate 1 plus 2, you enter 1, you enter 2, you press plus
and you get the result. So what you need to do in code is basically you store all the operands on a stack or in case of Python just in a list. And when plus is pressed, you get the two last operands of the stack, add them to each other and store the result on the stack.
And since the result is stored on the stack, you can continue using it and do things like five times, for example, to get the term at the very top. So you first enter the thing inside the parenthesis and then you go on from there.
So now you can implement that in Python, and more or less the code will still fit on the slide. We have a class RPNCalculator, which stores the stack as a list. We have a run method which gets an input from the user, handles a couple of special
cases for quitting it or for printing the current data on the stack. And everything else is passed to an evaluate helper method. And this evaluate helper method will have a couple of bugs in it.
And you can try to spot them, but I'll also explain what they are on the next slide. So we check if a number was entered, and if so, we store it on the stack. Or if an operator was entered, we get the two last elements from the stack, call
our calc function from earlier, get the result, print that and store that on the stack again. And then all that's left is an if __name__ == "__main__" block, where we create such a calculator and call run on it, nothing special there.
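A sketch of version 1 as described here, reconstructed from the spoken description rather than from the actual slide. The "q" and "p" commands, the prompt and the condensed calc helper are assumptions, and the if __name__ == "__main__" block is left out so the snippet can be run non-interactively. The bugs the talk discusses next are deliberately left in.

```python
def calc(a, b, op):
    """Condensed calc helper (assumed to match the function from earlier)."""
    if op == "+":
        return a + b
    elif op == "-":
        return a - b
    elif op == "*":
        return a * b
    elif op == "/":
        return a / b
    raise ValueError(f"Invalid operator: {op!r}")


class RPNCalculator:
    def __init__(self):
        self.stack = []  # a plain list serves as the operand stack

    def run(self):
        while True:
            inp = input("> ")
            if inp == "q":  # assumed command for quitting
                return
            elif inp == "p":  # assumed command for printing the stack
                print(self.stack)
            else:
                self.evaluate(inp)

    def evaluate(self, inp):
        if inp.isdigit():
            self.stack.append(int(inp))
        elif inp in "+-*/":  # lazy substring matching, as in the talk
            y = self.stack.pop()
            x = self.stack.pop()
            res = calc(x, y, inp)
            print(res)
            self.stack.append(res)


# Computing 1 + 2 the RPN way: operands first, operator last.
rpn = RPNCalculator()
for token in ["1", "2", "+"]:
    rpn.evaluate(token)
```

In an interactive session one would create the calculator and call its run method instead of feeding tokens by hand.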
So now I said there were a couple of bugs in there. One issue for sure is that in this elif here, and that actually happened to me when I originally wrote this code. I was lazy and just used substring matching, but that means that if we enter plus minus,
we have a problem here because the condition will be true, but calc obviously won't support that. And then of course, calc can also raise a couple of other exceptions. It could raise a ZeroDivisionError. And if we don't have anything on the stack yet, we would get an IndexError from the
pop method. Those are kind of the more obvious ones. Then there is a problem if we enter something that's not valid, we don't get any output. And there is a couple more there.
But I won't tell all of them. We will see actually how hypothesis can find them for us. So that was version one of the calculator. Now I improved that, I fixed most of the bugs. I think one of them is still remaining for you or hypothesis to find.
And that's now version two, which you will also find in the same repository as a separate file. So we fixed another thing there: we fixed the isdigit condition to also allow for
negative numbers and for floating-point numbers rather than pure digits. We want to fix the plus-minus thing. We want to give a bit more feedback when something invalid is entered. And we want to handle the ZeroDivisionError and the IndexError. And those are exactly the changes from version one to version two.
We don't use this isdigit check anymore; we just try to convert the input to a float. And if that doesn't work, we catch and ignore the ValueError. Those two paradigms are also often known as "look before you leap" versus "it's easier to
ask for forgiveness than permission", so LBYL versus EAFP. And Rodrigo actually has a nice post on his blog about the differences between those two.
Now with that fixed, we still have this issue with the substring matching. So we are a little bit less lazy there and use a list, so that's fixed as well. And then we want to have a little bit of error handling, so we add an error convenience
method to print something to the error output. And then we just call that to display an error to the user: if there is an invalid input, if we don't have two operands we could use for the calculation,
and if there is a ZeroDivisionError. And now, with a third of my talk being over, you have all the context to actually look at this little bugger here, which is the logo of Hypothesis.
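Putting the version 2 changes described above together, the evaluate method might look roughly like the sketch below. The method name err, the error messages and the decision to drop the operands after a division by zero are assumptions; only the overall approach (EAFP float conversion, a list of operators, error reporting) comes from the talk.

```python
import operator
import sys

# Assumed helper table replacing the calc function, to keep the sketch short.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


class RPNCalculator:
    OPERATORS = ["+", "-", "*", "/"]  # a list, so "+-" no longer matches

    def __init__(self):
        self.stack = []

    def err(self, msg):
        """Convenience method: report a problem on the error output."""
        print(msg, file=sys.stderr)

    def evaluate(self, inp):
        # EAFP: just try the conversion instead of checking isdigit() first.
        try:
            self.stack.append(float(inp))
            return
        except ValueError:
            pass  # not a number, so it may be an operator

        if inp not in self.OPERATORS:
            self.err(f"Invalid input: {inp}")
            return
        if len(self.stack) < 2:
            self.err("Not enough operands")
            return

        y = self.stack.pop()
        x = self.stack.pop()
        try:
            res = OPS[inp](x, y)
        except ZeroDivisionError:
            self.err("Division by zero")  # operands are dropped in this sketch
            return
        print(res)
        self.stack.append(res)


rpn = RPNCalculator()
for token in ["-1.5", "2", "*", "+-", "+"]:
    rpn.evaluate(token)  # the last two tokens only produce error messages
```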
If you open the Hypothesis website, hypothesis.works, it starts with a rather bold claim, because it claims most testing is ineffective, since we call it automated software testing, but we still write all the test cases manually. So all those scenarios a human had to write by hand.
And they then claim Hypothesis can fix this. And the idea behind Hypothesis is a thing called property based testing, which is quite related to fuzzing, with the idea that you generate input data based on a strategy.
For example, generate random strings or generate integers between five and 100. Then you run your test case maybe 100 or 200 times with randomly generated input
data. And then of course, you can't just test for a return value, but you can check for certain properties or certain invariants that should hold true for all input data. And then if there is a failure, you minimize the input data to have a minimal example
that still fails your test. So Hypothesis won't give you a big random string or a total random number, it will try to find an example that is as minimal as possible that reproduces the same failure. So if we look at our buggy version one of our calculator again, if we just enter one
and plus on the command line in our calculator, we get this IndexError: pop from empty list. So let's write a test with Hypothesis.
We import a given decorator and in this given decorator, we can describe how the data we want to generate should look. So here we tell Hypothesis, just give me random strings, please. Then we create a calculator object and we call evaluate just with random strings and
we see if it explodes. That's the simplest test you can write with Hypothesis, yet it's an approach that has helped me find countless bugs in qutebrowser, especially if you have some sort of parsing code that parses some things from a string. For example, I have something where a user
can just provide a rectangle, so like the width, then an x, then the height, and then plus an offset, things like that. That kind of parsing code can trivially be tested by just throwing random data at it.
And indeed, Hypothesis tells us we got this IndexError: pop from empty list, and the most minimal example it could find to reproduce this issue is an empty string. And that was actually surprising to me. But if we look at this condition again, even an empty string triggers this if, because the empty string is a substring of every string. So
Hypothesis would already have helped us to find that bug, and we already
fixed it in version 2. So let's run the test again over version 2, and indeed it now passes. So we will need to get a little more clever and write a more sophisticated test for our
code. The next step we could do is to actually try some calculation. So we tell Hypothesis, hey, give me two integers, please. And then we call evaluate on both.
We call evaluate with the plus operator and then, of course, the sum of those two integers should be on the stack. And that was when I wrote the code on these slides for the company training. This was like the second thing I tried and I was like, yeah, that should pass now because version 2 obviously has no bugs left.
But maybe if you paid attention in the talks about floats, you already know what's about to come. Hypothesis told me if you enter this number here and add zero to it, you end up with a different number.
And that for me was quite surprising. I mean, I knew about accuracy issues with floating-point numbers, but I only really thought about them as a thing for fractional numbers. I never considered (and in retrospect it makes total sense, but I never
considered it, just being so used to arbitrary-sized integers in Python) that even for integer floats, so floats but with an integer value, at
some point, of course, you don't have any more space in those eight bytes where you store that float. And indeed, the first thing I did was open my calculator again and reproduce that, and it turns out that the most basic thing a calculator should do, once the numbers get a bit bigger,
it got that wrong. And that's because we said we wanted to support floats, so why not just convert everything that's entered to a float? And this for me was really a moment when I saw the power of Hypothesis, because this was
a bug that I would not have considered at all because I just wasn't aware of the entire category of bugs when writing this code. Then I took it one step further and said, okay, what about generating all the data
we could enter, all the valid inputs? And here you can see how you can combine those strategies with hypothesis.
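A sketch of such a combined strategy: lists whose elements are stringified integers, stringified floats, or one of the four operators. The evaluate function is a condensed stand-in for version 2 and the names are assumptions; lists, one_of, sampled_from and map are real Hypothesis strategy APIs.

```python
import operator

from hypothesis import given, strategies as st

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def evaluate(stack, inp):
    """Condensed stand-in for version 2 of the calculator."""
    try:
        stack.append(float(inp))
        return
    except ValueError:
        pass
    if inp not in OPS or len(stack) < 2:
        return  # version 2 would print an error message here
    y = stack.pop()
    x = stack.pop()
    try:
        stack.append(OPS[inp](x, y))
    except ZeroDivisionError:
        pass  # version 2 would print an error message here


# Every list element is one valid calculator input.
tokens = st.one_of(
    st.integers().map(str),
    st.floats().map(str),
    st.sampled_from(["+", "-", "*", "/"]),
)


@given(st.lists(tokens))
def test_valid_inputs_do_not_crash(inputs):
    stack = []
    for inp in inputs:
        evaluate(stack, inp)  # only checks that nothing raises
```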
So we can tell Hypothesis, please give me a list, and then, as an argument, I can pass another strategy that decides how to generate elements in that list. So I say, please give me lists, and every element of that list should either be an
integer but mapped to a string, or a float mapped to a string, or one of plus, minus, times and divided by. And then I just pass all those inputs to my calculator, and again for now just make
sure it doesn't explode. So hypothesis will try things like that, or things like that, and of course we will end up with funny error messages and such, because we're not like correctly using the
calculator, but we don't have any unhandled exception anymore. Now since I still have quite a bit of time, I want to show you another example
where hypothesis really shines. In case you haven't seen enough code yet, there is more. And what we will look at is an algorithm called run-length encoding, a very simple
compression algorithm, where instead of storing a string like this with repetitions in it, we just store the value and the count of it. So we say 5p, 3y, 7p, 4y.
And this exact, very simple compression algorithm was used for the Windows boot screen in Windows 3.1, and it was also used by black-and-white fax machines, because there you often have those runs of white pixels if you have a white page, so you can at least
store an entire line, or maybe multiple lines, instead of storing every single pixel. And again, I implemented that in Python, decoding it is of course very simple, I just go
through this list of tuples, and for every tuple in there, I just multiply the count by the character, which is something you can do in Python thankfully. And so I say 5 times p to get that back, 3 times y to get that back, and then I
just concatenate all those strings, and I end up with the original value again. Encoding will be a little bit more difficult, at least for a naive implementation, and I promise that it's the last slide that's full of code, hopefully.
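The decoding step described above fits in a few lines. This is a sketch assuming the encoded form is a list of (count, character) tuples, which matches the "5p, 3y" order used in the talk.

```python
def decode(lst: list[tuple[int, str]]) -> str:
    """Expand (count, character) pairs back into the original string."""
    # count * character repeats the character, e.g. 5 * "p" == "ppppp".
    return "".join(count * character for count, character in lst)


decoded = decode([(5, "p"), (3, "y"), (7, "p"), (4, "y")])
assert decoded == "p" * 5 + "y" * 3 + "p" * 7 + "y" * 4
```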
Now let's go through this real quick, step by step, before we get to a test and before we see what bugs are in there, or at least there is one bug in there.
So we set the count to 1, we keep track of the previous element, which is just an empty string for now, and we create a list that's used as a return value at the very end. Then we iterate through every character in the string, and see if the character changed,
or if it's still the same as the previous one. If it changed, we store the previous count and character tuple in our return value,
we reset the count to 1, and we know that we are now looking at the new character. If it stayed the same, we just increment the count by 1. And at the very end, we need to make sure that we also keep track of the last element in there.
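A sketch of the naive encoder following the steps just described. Variable names and the guard that skips the initial empty "previous" value are assumptions; the flaw that the talk uncovers later is deliberately kept.

```python
def encode(data: str) -> list[tuple[int, str]]:
    """Naive run-length encoder, reconstructed from the description."""
    count = 1
    prev = ""  # previous character; empty before the first one
    lst = []   # return value: a list of (count, character) tuples
    for character in data:
        if character != prev:
            if prev:  # assumed guard: nothing to store before the first run
                lst.append((count, prev))
            count = 1
            prev = character
        else:
            count += 1
    # Do not forget the run that was still being counted at the end.
    lst.append((count, character))
    return lst


print(encode("pyy"))  # [(1, 'p'), (2, 'y')]
```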
So if we look at the string here, we start with the p, we see that the character has changed because initially this is set to an empty string. So we set the count to 1, set this previous variable to p.
We get the second p here, so we increment the count. Third p, count set to 3, fourth, count 4, fifth, count 5, now we get a y here. Character changed, and now we can see, okay, we have 5p saved already, or like in those variables.
So we create this 5p entry, add it to the list. Count up again for those y's, then we get a p again, character changed, we save 3y, add it to the list. Same for 7p, and then we are at the end of the string, make sure here we also save
4y, and then we have the correct result. Now, did someone spot a bug already?
Can you think of an argument here which is a string, but would actually not work as intended? If so, maybe just raise your hand real quick, because I'm curious. That's maybe 10 people or so. So what you would do, you would start writing a couple of tests.
You would hopefully use pytest; if you aren't yet, you should. And use its parametrize mark, which is amazing; if you aren't using it yet, you should. You would parametrize that test with a couple of inputs and expected outputs. So say if I put pyy in, I want 1p2y back.
But as with my float example with the calculator, the big question is, do you even think of all possible corner cases? Maybe it's easier to think of certain invariants, of certain properties that always hold true for every input and output.
Now, if you think about those invariants here, we could say maybe, hey, it's a compression algorithm, so by some sort of metric, maybe like the bytes occupied in memory, the output should always be smaller than the input.
But that actually doesn't hold true. For a trivial input like a single character, the output is bigger than the input. Same if you create a zip file with an empty text document, it will occupy more space than it did before. So that won't work out. We could come up with something clever like saying,
the length of the string should be equal to the sum of those lengths in the output. That would be a nice invariant, that would always be true. But we just saw a decode and an encode method or a function.
So the simplest thing to do is just to check if those round-trip correctly. So whatever string we put into our test, whatever string Hypothesis gives us, if we encode it and we decode the result of that again,
we should get back the original string. And then Hypothesis finds exactly that bug. It tells us, here you try to access character,
but if I call this function with an empty string, that's an UnboundLocalError. Because if the string is empty, this for loop never runs, so character is never assigned a value. And again, Hypothesis found that bug for us.
I won't have time for a demo, but if you want to play around with it a bit, if you look at the code in the repository, you could fix this issue, you could rerun the tests, and then you could try introducing a more subtle issue. Maybe you break the resetting of the count
and make sure that Hypothesis finds it. And by the way, you can implement this in a much simpler way. By using the itertools module from the standard library,
you can say itertools.groupby, which will group runs of equal values in an iterable. And then you just have a list comprehension that returns the length of that group and the character. So it turned all that complex code into a one-liner.
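A sketch of that one-liner; the function name and the (count, character) tuple order are assumptions chosen to match the description.

```python
from itertools import groupby


def encode(data: str) -> list[tuple[int, str]]:
    # groupby yields (value, run) pairs for consecutive equal characters.
    return [(len(list(group)), character) for character, group in groupby(data)]


print(encode("pyy"))  # [(1, 'p'), (2, 'y')]
print(encode(""))     # []
```

Because groupby simply yields nothing for an empty input, this version has no special case to forget.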
And of course, we want to test with hypothesis, do those two things really work in the same way? So we can write a test that just compares the result to each other for any random inputs, and that test passes as well.
Hypothesis can do a whole lot. It can generate data based on those strategies. You can filter, you can combine those strategies for more complex data, also to generate instances of classes, for example. And it integrates with a lot of other projects. You can just pass a dataclass to it or a Pydantic model.
It integrates with Django models. It integrates with Lark, which is a parser for a language syntax grammar. So you can just tell it, please produce some source code matching this grammar, and all kinds of things,
even things like generate data matching a regex and such. Now, if you want to hear me talk more than 30 minutes, there is an open professional testing with Python training I do in collaboration with Python Academy next March,
or I also do custom company trainings. If you want more information about that, there are also little flyers near the coffee machines. That's all I have. Thank you very much for your attention, and have fun with Hypothesis.
Thanks a lot. Are there questions in the audience? I see there, there, and back. So, yeah. Hand it there first. Thanks a lot for your talk.
Very interesting. I wanted to ask how it's supposed to be used. I mean, do you use Hypothesis instead of parametrizing? Or, for example, you run it once, it finds some good test cases for you, and you just move them into your parametrized tests? How do you use it?
So, the idea behind Hypothesis is not long-running fuzzing. The idea with those 200 examples, or 100 by default, is that you still have a test that completes in under one second, or maybe two seconds, and it will complain to you if your test is too slow for that.
So, the idea is that it's a normal test and part of your CI pipeline, for example, and it has certain mechanisms to deal with the randomness behind it. So, it can, for example, give you a random seed, or it stores failures in a database locally, and then it makes sure that you can reproduce that failure again.
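As a sketch of how this looks in a test file: the number of examples can be set per test with the settings decorator, and a known failing input can be pinned with the example decorator so it is re-checked on every run. The encode and decode helpers are assumptions carried over from the run-length example.

```python
from itertools import groupby

from hypothesis import example, given, settings, strategies as st


def encode(data):
    return [(len(list(group)), character) for character, group in groupby(data)]


def decode(lst):
    return "".join(count * character for count, character in lst)


@settings(max_examples=200)  # the default is 100 examples per test
@example("")                 # always re-check this known corner case
@given(st.text())
def test_roundtrip(data):
    assert decode(encode(data)) == data
```

With the pytest plugin, a fixed seed can also be passed on the command line via --hypothesis-seed, and previously failing examples are replayed from the local .hypothesis directory.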
Things like that. But as part of your normal CI unit tests, normally. All right. Question on the left there. It's a related question, actually. So, I mean, in a code base, I mean, it's amazing, right?
So, you would put it everywhere, or but then that begs the question, is it in the unit test, or is it in the code base directly? No, in the tests. It's in the tests. And I could argue it would go into the code base, right? As you write the function, you would say this is the spec. It's just an open question.
But then how would you run it? You don't want Hypothesis to interfere. I don't know. It's a question. I mean, it's kind of test runner agnostic, so you can use it with unittest. It somewhat integrates with pytest by just providing custom command line arguments to pytest and such.
But the normal usage is with pytest, I would say. Or with another test runner, but why would you? Right. There's a question there. And raise your hand. Thank you. I think it was already answered. So my question was basically that she asked,
are you using a constant seed for pseudo-random inputs, or are you generating a random one? And you mentioned that you are basically, correct me if I'm wrong, saving the seed from each run so you can reproduce it. Does that mean that you have some kind of framework built just for this type of testing
where these results are always stored? And well, how do you manage this type of complexity? Is there anything more to it? I mean, that's all handled behind the scenes by Hypothesis. They did a lot of work on this. I think it was originally even some scientific paper kind of thing. The last thing I heard is that they wrote a kind of core functionality, Conjecture.
They wrote that in Rust, I believe, or C, I think it was Rust, with the idea of also expanding this concept to other languages. I think there is a Hypothesis for Java prototype and such.
There is a lot going on there, but a lot of things behind the scenes are handled by Hypothesis. It can even do things like ghostwriting tests for you, where it looks at functions named encode and decode, and it says: based on the names, this could be a nice pair to test, and proposes a test with Hypothesis.
So it does a lot of things. Time for one more question there in the back. Thank you for the presentation. I was just wondering if you do any kind of intelligent sampling or is it just a uniform sampling? So if you say you want to test for integers
or just randomly taking integers, or do you include edge cases like zero and very large and very small integers, or is there kind of intelligence in there? So what I know is that you can configure your strategies. You can say, for example, with floats, do you want those subnormal numbers or not?
I'm not sure. I know it does generate a random byte stream, as in binary, and the strategies then transform that byte stream to values. And that's also how the whole minimizing magic just works automatically, even for custom strategies, because it minimizes the byte stream input.
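Configuring strategies, as mentioned in the answer, might look like the sketch below. The bounds and test names are illustrative; min_value, max_value, allow_nan, allow_infinity and allow_subnormal are keyword arguments of the integers and floats strategies in recent Hypothesis releases.

```python
import math

from hypothesis import given, strategies as st


@given(st.integers(min_value=5, max_value=100))
def test_bounded_integers(n):
    # Only integers from the requested range are generated.
    assert 5 <= n <= 100


@given(st.floats(allow_nan=False, allow_infinity=False, allow_subnormal=False))
def test_finite_floats(x):
    # NaN and the infinities were excluded, so every value is finite.
    assert math.isfinite(x)
```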
Based on that, I would say probably, sorry for that, probably it doesn't know how to then produce special values especially, but it could be that it takes care of that. As mentioned, someone was obsessed over this for years at this point, so it does a lot of things very nicely.
But I don't know, is the short answer. All right, thanks a lot. So we'll have now another break. So we have time for questions. Go ahead directly. Thank you. Thanks.