
What's the fuzz all about? Randomized data generation for robust unit testing


Formal Metadata

Title
What's the fuzz all about? Randomized data generation for robust unit testing
Title of Series
Part Number
151
Number of Parts
173
Author
Moritz Gronbach
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language
English
Production Place
Bilbao, Euskadi, Spain

Content Metadata

Subject Area
Genre
Abstract
Moritz Gronbach - What's the fuzz all about? Randomized data generation for robust unit testing

In static unit testing, the output of a function is compared to a precomputed result. Even though such unit tests may apparently cover all the code in a function, they might cover only a small subset of behaviours of the function. This potentially allows bugs such as Heartbleed to stay undetected. Dynamic unit tests using fuzzing, which allows you to specify a data generation template, can make your test suite more robust. In this talk, we demonstrate fuzzing using the hypothesis library. Hypothesis is a Python library to automatically generate test data based on a template. Data is generated using a strategy. A strategy specifies how data is generated, and how falsifying examples can be simplified. Hypothesis provides strategies for Python's built-in data types, and is easily customizable. Since test data is generated automatically, we cannot compare against pre-computed results. Instead, tests are usually done on invariants of functions. We give an overview of such invariants. Finally, we demonstrate how we use fuzzing to test machine learning algorithms at Blue Yonder.
Keywords
Transcript: English (auto-generated)
Okay, so really sorry for the delay, but let's get started. My name is Moritz Gronbach. I work at Blue Yonder in Germany, and I'm going to talk about pretty much the same thing as Tom just did. I hope those who watch both talks
still will find some interesting things. First, a little bit about me and why I want to talk about this topic. My job is doing predictive analytics. That basically means predicting the future. But actually it doesn't have anything to do with this.
It looks more like this. We have some machine learning pipeline with data pre-processing external data sources and then some super secret algorithms which produce a machine learning model.
We work with big data. If you do an image search for big data, you will find that big data is shiny, clean, and most importantly, blue. From my experience, big data is often not shiny
and not clean. It can be very complex, dirty, and nasty to work with. So it's more like this. But it's definitely blue. So big data is really a breeding ground for weird edge cases and things that can go wrong.
And of course we have to find ways to protect ourselves against these things and discover potential problems before production. And randomized testing, or, as I call it, dynamic testing,
is one such tool to help us find problems before going into production, to reduce stress for us and for our customers. Okay, so what do I mean by dynamic testing? Basically I mean both property-based testing and fuzzing.
These are usually seen as separate things, which sometimes makes sense because they are often used for very different purposes. But it's sometimes also sad, because there is a lot of cool stuff in both fields, and often it doesn't make sense to separate them.
So by dynamic testing, I mean any testing where test cases are generated automatically. For example, either by fuzzing, which means the user gives some example input which is then mutated automatically.
Or by parameter templates, which means the user gives some generative model from which test cases are generated. And then the function behavior is checked for properties like does not crash, does not timeout,
or other universal properties like mathematical expressions that the function result should fulfill. In contrast to dynamic testing there is, of course, traditional static testing,
where test cases are provided by the user and the function behavior is also precisely defined by the user. I want to compare the two types of testing a bit. For this, I want to introduce two attributes of tests. The first is precision,
which means how closely the expected behavior of the function is defined. The other is case coverage, which is sometimes called input space coverage and means what proportion of the input space is covered by test cases.
I've had some discussions about whether case coverage matters. So for example, does it matter if we check five out of two to the power of 64 cases or 5,000 out of two to the power of 64 cases? Either way, it's still a really, really small proportion of all possible test cases.
But I think it does matter, and I'll give one example to illustrate this. So let's say you have implemented an algorithm, and there's a numerical instability in it that you don't know about, and you don't even know that there could be one.
Only one in 1,000 inputs is affected by it. So if you use five test cases, you have a probability of about 0.5% of detecting this instability. If you use 5,000 test cases,
you have a probability of nearly 100% of detecting it. There are other reasons too; like Tom said before, independent code path coverage can increase, which is much stronger than just branch coverage.
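As a quick back-of-the-envelope check of the numbers just mentioned (not from the talk, just the arithmetic for the one-in-1,000 example):

```python
# Probability of hitting at least one affected input when 1 in 1,000
# inputs triggers the instability, for 5 and for 5,000 random test cases.
p = 1 / 1000
for n in (5, 5000):
    print(n, 1 - (1 - p) ** n)
# 5    -> about 0.005 (roughly 0.5%)
# 5000 -> about 0.993 (nearly 100%)
```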
So in general, dynamic tests help you find classes of test cases that you didn't think about before. And usually it's like this: static tests have very high, or even perfect, precision but low case coverage,
and dynamic tests sometimes have low precision, because we have to define general properties, but somewhat higher case coverage. And sadly, you usually can't have both high precision and high case coverage.
So the best you can often do is to have just one at a time. So it makes sense to use both static and dynamic testing if you want to maximize robustness. Okay, now a bit more practical.
I'm going to show some examples of dynamic testing in Python. And I'm going to use Hypothesis, which is QuickCheck-style testing for Python. I really like to work with Hypothesis. It's stable, but still in steady development,
and it has a lot of innovative features, some of which I am going to show. Okay, first example. So your colleague went on holiday and he left you this function, which calculates the Fibonacci numbers.
Does anyone, does everyone know what the Fibonacci numbers are? I think yes. Okay, so in any case, if you look at this code, it's not really clear that this formula computes the Fibonacci numbers.
The Fibonacci numbers are integers, and here there is some stuff with square roots and then an inverse. So we are not really sure if this is correct, but our colleague also left us some tests. He's testing the base cases, fib(1) and fib(2) equal to one,
and then two more cases which are easy to compute in your head, and even another case, fib(50), which he probably looked up on Wolfram Alpha. We run these tests and everything is fine, but I'm still not really trusting this code.
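The function and its tests might have looked roughly like this. This is a sketch, not the code from the slides: it uses Binet's closed-form formula, which matches the square roots and inversion mentioned above, and a handful of static assertions like the ones described.

```python
import math

def fib(n):
    # Binet's closed-form formula: fib(n) = round(phi**n / sqrt(5)).
    phi = (1 + math.sqrt(5)) / 2
    return int(round(phi ** n / math.sqrt(5)))

def test_fib_static():
    # Base cases, a couple of values that are easy to check in your head,
    # and fib(50), which you would look up.
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(3) == 2
    assert fib(10) == 55
    assert fib(50) == 12586269025
```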
So I'm going to write a dynamic test. For this, I will first set up hypothesis. The most important piece of hypothesis from the user's perspective is the given decorator, which provides a test function with automatically generated test cases.
The other important pieces are strategies, which define how data is generated. So here we import a strategy for generating integers.
And then we import some settings to de-randomize Hypothesis, because for the presentation I don't want any random behavior, and we limit the number of iterations and set a timeout. Hypothesis has a built-in example database,
where it stores any examples that falsified an assertion before. Every time you run your tests, these examples are run again. Of course, that makes it non-deterministic.
So I'm disabling this for this presentation. Okay, this is my dynamic test. Let's look at the test function first. test_fib_recurrence takes one parameter, n, and we check that the function fulfills the recurrence
that defines the Fibonacci sequence, combined with the base cases. This recurrence should hold for all values of n of at least three. Let's run this. And we find that there is a falsifying example,
namely n equals 71. So the assertion fails for n equals 71. Let's take a closer look at what happened. So I've added the settings parameter to given
and increased the verbosity of the output. And we see that the first falsifying example is n equals 805. But hypothesis doesn't stop at this point. It tries to find the simplest falsifying example now.
And for integers, this means finding the smallest counterexample. So what it does, it starts with small values
n equals three, four, five, and then does pretty much a stochastic binary search to find the smallest counterexample. So it goes up to 67 here, which still works, and then 131, which doesn't work anymore.
So it goes back one step. So it tries to find the counterexample somewhere between 67 and 131 now. And after a few more iterations of this, it will find the example we have seen before, namely 71.
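Putting the pieces together, the setup and the dynamic test described above might look roughly like this. The settings names follow current Hypothesis (the talk-era API also had a timeout setting, omitted here), and the fib function is the one from the sketch above.

```python
from hypothesis import Verbosity, given, settings
from hypothesis.strategies import integers

@settings(derandomize=True,             # no random behaviour during the demo
          max_examples=500,             # limit the number of generated examples
          database=None,                # disable the built-in example database
          verbosity=Verbosity.verbose)  # print the examples that are tried
@given(integers(min_value=3))
def test_fib_recurrence(n):
    # The recurrence that, together with the base cases, defines the sequence.
    assert fib(n) == fib(n - 1) + fib(n - 2)
```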
Okay, to summarize this, the first step when hypothesis runs is sampling. It generates test cases until it finds a falsifying example or until the maximum number of iterations is reached. And when it finds a falsifying example,
it tries to shrink this example, meaning it tries to find the simplest falsifying example. For integers, this is the smallest example; for strings, it is the shortest string, for example. And we've seen the two key elements of Hypothesis:
given, which is the decorator that supplies test functions with data, and strategies, which describe how data is generated. And I think we have seen that dynamic testing is very easy to implement and often very useful
for math-heavy code, where you have nice, beautiful properties that describe your function's behavior. But let's do a toy example with some less mathematically elegant code.
This one. So it's really just a toy example: we want to test the quote function from the urllib package. The quote function prepares a string to be used in a URL. As you can see here, it encodes the space
as %20. We can run this static test and of course it passes. To write a dynamic test for this, we have to think about what we expect from the quote function.
What are properties that the quote function should fulfill for any input? And one thing we might expect is that when quoting and then unquoting, no information is lost. So if you quote a string and then unquote it again, you would expect to get the original string back.
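Such a round-trip test might look roughly like this (a sketch using Python 3's urllib.parse names; the talk used the urllib of the time):

```python
from urllib.parse import quote, unquote

from hypothesis import given
from hypothesis.strategies import text

@given(text())
def test_quote_roundtrip(s):
    # Quoting and then unquoting should give back the original string.
    assert unquote(quote(s)) == s
```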
Let's try this. And we find a falsifying example. So have we found a bug in urllib? Sadly not, or maybe not sadly. Anyway, we just weren't careful
about what kind of input we generate. The text strategy generates strings consisting of all possible Unicode characters, and the counterexample we found here is just a control character. So it doesn't really make sense to encode this in URLs.
We can easily fix this one. We import the string module and pass the alphabet parameter to the text strategy to tell it that we only want printable characters, as sketched below. And now it seems to work fine.
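A sketch of the fix, restricting the generated strings to printable characters:

```python
import string
from urllib.parse import quote, unquote

from hypothesis import given
from hypothesis.strategies import text

@given(text(alphabet=string.printable))  # only generate printable characters
def test_quote_roundtrip_printable(s):
    assert unquote(quote(s)) == s
```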
Okay, but in general, handling impure objects like strings and datetimes can be a bit challenging. Not only in data generation, but also in your assertions
because often the underlying functions have a lot of special cases that you somehow need to handle. I just want to give you a quick overview of what kinds of tools are available to assist you with this. Hypothesis has some things built in to help you, for example the alphabets for string generation.
For datetimes, there's the Hypothesis datetime package. Who knows what the fake-factory package does? Okay, two people. The fake-factory package is really cool.
You can generate all kinds of everyday fake data: random URLs, random phone numbers, random email addresses. And Hypothesis fake-factory exposes these random generators to Hypothesis.
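For a flavour of the kind of data involved, here is a minimal sketch using the Faker library (fake-factory was later renamed to Faker); the Hypothesis integration itself is not shown here:

```python
from faker import Faker  # the fake-factory project is nowadays called Faker

fake = Faker()
print(fake.url())           # a random URL
print(fake.email())         # a random email address
print(fake.phone_number())  # a random phone number
```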
So you can, for example, test functions that take a URL as a parameter. Hypothesis also has built-in support for custom strategies, so you can create your own strategies, for example by mapping
basic data types to your own custom data types. Yeah, that was quick. Just to summarize: dynamic tests increase case coverage at the cost of precision. Dynamic and static testing complement each other.
Neither is a replacement for the other. Sometimes it makes sense to use both; sometimes it's more pragmatic to just use one of them. For math-heavy code, dynamic tests are really great. If it gets less mathematical,
more real-world, it can sometimes be difficult, but there are tools available and they are steadily improving. And if you have a problem with any kind of object and you think maybe there should be some tool available for it, you should just ask on the Hypothesis mailing list,
for example, because they want to know this. Okay, yeah, thanks for listening. And you get out in time for lunch.
Okay, thank you. Do we have any questions? Hey, very interesting talk. Maybe this is covered by the custom strategies, but are there provisions for providing templates
for, for example, protocol parsing? No, I think that doesn't exist yet. Okay, thanks.
Thanks for the talk. Do any interesting bugs come to mind that you found in your production code base? Yes, we had one combinatorial optimization algorithm. As input,
it would take some distributions that we predicted, and it would only fail if there were two degenerate distributions and at least one non-degenerate distribution. In that case, it would enter an infinite loop. So it was data you weren't expecting, or could you have hard-coded that if you'd thought really hard about it as an example?
Sorry? Was that data you were expecting, or if you'd been hard-coding your examples, might you have come up with that? No, I think we would never have come up with that ourselves. Cool. Any more questions?
Okay, please join me in thanking our speaker once again.