Testing with two failure seeking missiles: fuzzing and property based testing
Formal Metadata

Title | Testing with two failure seeking missiles: fuzzing and property based testing
Title of Series | EuroPython 2015
Part Number | 152
Number of Parts | 173
Author | Tom Viner
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers | 10.5446/20214 (DOI)
Production Place | Bilbao, Euskadi, Spain
Transcript: English (auto-generated)
00:02
OK, yeah, so today I'm going to tell you about property-based testing, fuzzing, and I'm calling this, tongue-in-cheek, failure-seeking missiles. So I'm here to challenge how you test your code. Let's see where the audience is as a starting point.
00:22
Hands up if you've written a test. OK. Hands up if you've only used hard-coded values as example data to pass into your code. Fair enough. Hands up if you've used some sort of random input. Good.
00:40
Hands up if you've used property-based testing. And hands up if you've done any fuzzing. OK, good. So yeah, some good things to learn about today then. So I'm here to ask you, are you testing your code hard enough? Are you stretching it? Are you asking it questions it wasn't expecting?
01:01
Are you an aggressive interviewer? Are you a softball interviewer who asks the easy questions? She wasn't expecting that question. And maybe you should ask your code some questions it's not expecting. So what are we trying to achieve with our test input data?
01:21
Possible goals. Well, to start off with, you want to cover the happy path. That's the one that's going to earn you some money and get the job done. But that's the absolute minimum. You probably want to cover all of the code base. You want to cover, say, exception handling and the validation when the user passes you some data that you
01:40
understand not to be valid. But what about unhandled exceptions? By definition, you're not expecting them. So how are you going to think up examples for that data? If you knew what they were, you probably would have caught them already. And then not just covering each line of code, but you really want to cover the independent paths
02:00
through your code base. It's going to start sounding like a lot of work, isn't it? So I'd like to just ask you to take a moment to think about the data that you pass into the test that you've written. This is to help you contemplate. So which type of points do you pick when you're writing your example data?
02:21
And where would an adversarial approach take you? I'm not an artist, but Google image search is quite handy. So this is an artist's impression of the central parts of the input space here, maybe the obvious examples, jane@example.com.
02:40
But maybe you're under-testing some more difficult examples. I've certainly seen Unicode errors that would have been caught if the example data had included a Unicode snowman or something like that. Also, just passing in empty lists and empty strings is a good baseline for edge-case testing.
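Sketched in code, that baseline might look something like this; validate_email is a hypothetical function under test, not one from the talk:

    import pytest

    @pytest.mark.parametrize("address", [
        "jane@example.com",   # the obvious, central example
        "",                   # empty string
        "☃@example.com",      # Unicode snowman, the kind of input that shakes out encoding bugs
    ])
    def test_validate_email_does_not_crash(address):
        validate_email(address)  # hypothetical function under test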
03:04
So how do we create test data? We can write hard-coded values, like most of us have done. We can create purely random data, just with no feedback. So model_mommy does this. Just gives you something conveniently random
03:21
that will get the job done. Or by firing a failure-seeking missile. Boom. Hardcore. Let's take a closer look. So QuickCheck is a Haskell library. Don't do it just yet. So it's been around for a while. Hands up if you've heard of it.
03:42
Cool. So this is property-based testing. You specify a property of your code that must hold. And QuickCheck will quickly check if it can prove you wrong. It does its best to find a counterexample. So this is a little bit of Haskell.
04:00
So the basic thing to take away here is that two lists of integers go into a property. And a Boolean comes out that is whether this property holds or not. So it's about reversing lists of integers. So basically, imagine a list of four integers, and a second list of four integers
04:21
here. And it's saying you're going to join them and reverse them, as opposed to reversing them and joining them. So it's a slightly dubious proposition. But we're going to see if it can prove it wrong. And at the bottom there, you can see 0 and 1 as the inputs after some shrinking.
04:40
And it has proved it wrong. But this isn't Euro Haskell monad con. So let's not accidentally learn too much Haskell. Instead, let's find out what functional language developers think of our world. This is a direct quote. In an imperative language, you have no guarantee that a simple function that should just crunch
05:00
some numbers won't burn your house down, kidnap your dog, and scratch your car with a potato while crunching those numbers. Fair enough, but we like Python anyway. So Hypothesis is the Python version of QuickCheck. It's more of an update because it adds some new features. Let's delve into the kidnapped-dog world of Python.
05:21
So just as a reminder, this was the Haskell version. In Python, there isn't a function that reverses lists. I've made one here. And in a similar way, you can see the @given decorator here specifies two lists of integers
05:41
which map to the two inputs to the test function. And then at the bottom, you can see the property is defined just with a standard assert. We're running it with a pytest runner. And there's a little tiny hypothesis pytest plugin that just helps you see the output a bit clearer.
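A sketch of what that test looks like, reconstructed from the description above rather than copied from the slide, and assuming current Hypothesis imports:

    from hypothesis import given, strategies as st

    def reverse_list(xs):
        return list(reversed(xs))

    @given(st.lists(st.integers()), st.lists(st.integers()))
    def test_reverse_distributes_over_concatenation(xs, ys):
        # The deliberately dubious property: reversing the concatenation
        # should equal concatenating the two reversals.
        assert reverse_list(xs + ys) == reverse_list(xs) + reverse_list(ys)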
06:05
So it proves it wrong and actually comes up with the same counterexample that QuickCheck did. And we didn't have to think up any example data ourselves. So that was an improvement. So what's going on here? How could this be working?
06:23
So maybe it's doing a formal proof in the background. Maybe it's doing some sort of static analysis. Maybe it just passes a symbol into the top of the program, looks at all the manipulations, ends up with a formula and then solved that formula. Even for mathematicians, they haven't quite got there yet.
06:41
I tweeted a really interesting article called Will Computers Redefine the Roots of Math?, a Wired article. But no, they're not there yet. Mathematicians still have a job. And same in computer science. They haven't really managed it, especially not in Python. So that leaves us trying a crap ton of examples,
07:00
also known as fuzzing. That's what's really happening. Let's have a look at the dirtiness under the covers. Okay, so this is the first list of integers that Hypothesis is sending in. If I go over here. They're pretty nasty.
07:20
And it turns out the proposition is false, fine. And so basically it's proved it wrong on the first hit, but it doesn't want to show you that because that's kind of ugly, and I don't think that would pass code review if you tried to put that in as a hard-coded example. So it has a go at making it simpler. As we scroll down here, you can see the first list is getting shorter. So you've got three items now.
07:40
It's worked out that big numbers are more annoying than little numbers. So those numbers at the top there are getting shorter. Just two numbers, one number, keep on going. And the second list will get shorter as you get down. And it's getting it. It actually overshoots. You get something so simple that when you reverse it in each way, it's actually true.
08:01
So that's bad, it doesn't want to show you that. Simpler, simpler, try some empty lists. And this is the simplest one it could come up with. And then that's the one it ultimately shows you. So that's the one you copy-paste into your deterministic test suite.
08:23
As an aside, if you run that again, it won't go through the whole business of those massive list of integers because it's got a local database of successful examples. So that's mainly for a speed enhancement. But also, if it's searched really hard and maybe there's a bit of luck
08:42
that it's found a counterexample, it wants to keep onto that because it might not find it next time, theoretically. So what's going on here? It's generating random-esque input data. So it's not purely random. It's random but with a view to breaking your code. It runs the test repeatedly, so it's really worth bearing in mind
09:01
this is not like a standard unit test that just gets run once. That @given decorator means that that test actually gets run by default 200 times, or at least until a falsifying example has been found. If it finds a counterexample, it will then try and shrink that to the best of its abilities, just to give you the cleanest, simplest counterexample
09:22
to prove your property wrong. So let's have a look at that random-esque data. Where did the integers come from? So this was the decorator. So this is strategies.lists and strategies.integers.
09:41
And the integer strategy is made up of two parts. The random geometric int strategy is basically saying give me smallish numbers like maybe zero, maybe minus one, maybe 17 will break your code. And the other one, the wide range, says well, I'm gonna give you basically any random number from like anything
10:00
that your Python interpreter can handle. So massive integers and maybe that will upset it a bit. These are strategies to relentlessly, sorry, they are relentlessly devious plans to break your program. And the list strategy is to say you pass it the elements you want in your list
10:22
and it averages at 25 elements, but you can set a maximum size and a minimum size. It's got sensible defaults, but you can override them if you need to.
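A quick sketch of overriding those defaults; the min_size/max_size keywords here are from current Hypothesis versions, the 2015 release spelled some options differently:

    from hypothesis import strategies as st

    anything_goes = st.lists(st.integers())                        # sensible defaults
    short_nonempty = st.lists(st.integers(), min_size=1, max_size=5)

So Raymond Hettinger tweeted this. You might not know what the percent operator does.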
10:41
It's the remainder upon division. And he has suggested two properties that should hold: the result should have the same sign as y, and the absolute value of the result should be less than the absolute value of y. Okay, well let's check. This is how we would write that. So there's no list of integers,
11:01
there's just two integers here. And they relate to the x, y inputs to the test function. So we've got a new function here, assume. This is a way of giving feedback back to Hypothesis, and it says: if this assumption proves false, in other words, if you give me a y of zero,
11:21
stop the test, it's not appropriate, and don't give me any more like that. Okay, and so in that way, it's guiding itself to be more helpful, give you inputs that are more likely to be relevant. So we calculate the result, and I mean, I had to create a same sign function, but apart from that, it pretty much reads as English
11:40
or copy and paste from the tweet. Let's see what the answer was. It passed, okay, I should know better than doubting Raymond Hettinger. But I can and will property-test his tweets.
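Here's a hedged reconstruction of that test; same_sign is my own helper, just as the speaker wrote his own:

    from hypothesis import given, assume, strategies as st

    def same_sign(a, b):
        return a == 0 or (a > 0) == (b > 0)

    @given(st.integers(), st.integers())
    def test_modulo_properties(x, y):
        assume(y != 0)  # tell Hypothesis that a zero divisor is irrelevant
        result = x % y
        assert same_sign(result, y)
        assert abs(result) < abs(y)

How does it do it? So the data strategies are probability distributions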
12:00
to pick elements from within that distribution. There's guided feedback via assume. Shrinking of counterexamples to be clearer to read and easy to understand why they're breaking your property. And a database of failing examples for speed,
12:22
especially when you're doing TDD, if it finds something wrong with your code, you've broken a property, you can have a go at fixing it and it will try it straight away again until you make it pass. The internals of hypothesis are really interesting. I won't explain them. They use a parameter system. It's worth having a read up. He's got a good page on the documentation about that.
12:43
Let's look at one more strategy. So we've seen integers. The floats are a bit more complicated. If you claim that your function accepts floating point numbers, it's gonna do all these math-y sounding things, Gaussian, bounded, exponential, maybe just some integers. Ah, you weren't expecting that, were you?
13:02
And then some nasty floats are thrown in as well, say, you know, zero, minus zero, minus infinity, positive infinity, NaN. You can assume these away if you don't want them because if you're doing maths, they will probably break your code.
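Two hedged ways to keep those values out when they genuinely don't apply; the first uses assume as above, the second uses the allow_nan/allow_infinity keywords that current Hypothesis versions provide:

    import math
    from hypothesis import given, assume, strategies as st

    @given(st.floats())
    def test_square_is_non_negative(x):
        assume(not math.isnan(x))  # NaN compares false with everything
        assert x * x >= 0.0

    @given(st.floats(allow_nan=False, allow_infinity=False))
    def test_square_is_non_negative_no_nasties(x):
        assert x * x >= 0.0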
13:21
There's some great advanced features of hypothesis. It makes it very easy to take the built-in strategies and make your own strategy. So say you've got a function that accepts a comma separated list of integers. You could map a list of integers, have them joined by comma,
13:42
and then pass that into your code because, you know, you want your test data to be relevant to your test. It can't all just fail at the first hurdle because it's too random. So, you know, you might wanna build your own strategy like that.
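A sketch of that derived strategy, using the .map() method on strategies:

    from hypothesis import strategies as st

    # Lists of integers, rendered into the comma-separated strings
    # the function under test expects.
    csv_ints = st.lists(st.integers()).map(
        lambda xs: ",".join(str(x) for x in xs)
    )

    print(csv_ints.example())  # e.g. "-3,17,0"

There's plugins for Django, a bit like model_mommy or Factory Boy,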
14:02
and NumPy as well, that's a prototype. There's also a somewhat experimental stateful testing mode, where you give Hypothesis the controls to your program and it tries to find a sequence of actions which cause a test failure. So, I don't know, this sounds very interesting to me.
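A minimal sketch of that stateful mode, using Hypothesis's RuleBasedStateMachine API; the set-versus-list model here is my own toy example, not one from the talk:

    from hypothesis import strategies as st
    from hypothesis.stateful import RuleBasedStateMachine, rule

    class SetModel(RuleBasedStateMachine):
        def __init__(self):
            super().__init__()
            self.real = set()   # the implementation under test
            self.model = []     # a simple model to compare against

        @rule(x=st.integers())
        def add(self, x):
            self.real.add(x)
            if x not in self.model:
                self.model.append(x)

        @rule()
        def sizes_agree(self):
            assert len(self.real) == len(self.model)

    TestSetModel = SetModel.TestCase  # the test runner picks this up and drives it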
14:22
Then moving on. Let's look at another failure-seeking missile that's getting a lot of attention recently. So American Fuzzy Lop. This is a fuzzer, I guess second version.
14:42
The first one was called Bunny the Fuzzer. So I think Michał Zalewski likes rabbits, and they're certainly fuzzy. So it specializes in security and binary formats. Low-level libraries are essential to everything we do, whether it be accessing a database, image processing, encryption.
15:03
So we'll get onto the Python AFL in a minute, but just for a moment, let's think at the C level. So just to remind you, a fuzzer is something that fires data at your program, attempting to crash it. So we've kind of moved on from property-based testing. This is more about crashing your code
15:22
than specifying properties. So these are things that you wanna leave running for a good while, maybe on multiple cores. And speed is very important because the more ground you cover, the more likely you are to find some interesting inputs.
15:44
So fuzz testing has been around for a couple of decades or more. There's zzuf, that's "fuzz" backwards, that people have been using for a good while. And AFL is kind of a new style with some guiding going on. But traditional fuzzing is not dead.
16:01
This is very important. AFL might be the new cool thing that came out last year, but Google has been running fuzzing against FFmpeg for a couple of years and found 1,000 bugs. Literally 1,000 commits fixing those bugs. So not to be sniffed at.
16:22
So if you don't know, FFmpeg is a video processing library. It's in Chrome. It's probably in your local video player that you have on your Ubuntu desktop. So the strategy was take small video examples, small video files, mash them together with mutation algorithms,
16:40
maybe splice them together, you know, mutate some bits or bytes here and there. And then, admittedly, they had 500 and then eventually 2,000 cores over a period of two years, so maybe not just your laptop. But they found they made great progress.
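As a toy illustration of those mutation operations, nothing like FFmpeg's or AFL's real engines, just the flavour of bit flips and splicing:

    import random

    def mutate(data: bytes, n_flips: int = 8) -> bytes:
        buf = bytearray(data)
        if not buf:
            return data
        for _ in range(n_flips):
            i = random.randrange(len(buf))
            buf[i] ^= 1 << random.randrange(8)  # flip one bit of one byte
        return bytes(buf)

    def splice(a: bytes, b: bytes) -> bytes:
        # Crude splice: the head of one sample joined to the tail of another.
        return a[: len(a) // 2] + b[len(b) // 2 :]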
17:03
A lot of memory management bugs were found. Actually, I was speaking to one of the FFmpeg developers last night, and I can confirm it's not just because they've written awful, awful code. It's just because this is quite a hard thing to do. The video specifications could be 600 pages long,
17:20
and they have to write very, very fast code. That's why they don't write it in Python. They have to look after all their memory management, and it's very easy to not do that perfectly. Now, there's a quote on that blog post where they thought 20 to 30% of the problems found
17:41
could easily be exploitable, so there's 100 to 200 zero-day exploits that they found with that fuzzing. So something tells me that local security services and or hackers, you know, this would be a good approach they might be doing. Let's hope the good guys find these bugs first.
18:00
So AFL's goals, it does need to be very fast because it's got a lot of ground to cover. It needs to be reliable because if it breaks overnight, it's not gonna get much done. I think in the past, fuzzers took a lot of configuration, but this one tries to be very simple, not require much setup.
18:21
I'll show you in a minute. And it does the things traditional fuzzers do in terms of taking some sample inputs, mutating them, but it also has a little secret, which is it adds compile-time instrumentation. So this means that when you compile your C code,
18:40
at each branch point, it adds a little hook that records the path taken through the code. So this, like we saw earlier about code coverage and taking independent paths through the code, it's able to get feedback on where its test data travels through the code base.
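A Python model of that idea, based on AFL's documented design (AFL itself does this inside the compiled binary with a shared-memory byte map): every branch transition bumps one counter, so an input that touches a new cell has found a new path:

    MAP_SIZE = 1 << 16
    coverage = bytearray(MAP_SIZE)
    prev_loc = 0

    def on_branch(cur_loc: int) -> None:
        global prev_loc
        idx = (cur_loc ^ prev_loc) % MAP_SIZE  # hash the edge, not just the block
        if coverage[idx] < 255:
            coverage[idx] += 1
        prev_loc = cur_loc >> 1  # shift so A->B and B->A map to different cells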
19:05
And it's really just a drop-in replacement. So you might have been using GCC before; you just replace it with the AFL version. So here's a toy example. So we're literally just reading a hundred characters from standard in. And the bug we're simulating is that if the input is foo,
19:23
then we're gonna blow up. So it's the toy example. Let's just compile that. So there's no configure here. We're just compiling it. And when I echo into the program there,
19:41
I've got some print statements. So it said one, two, three, four, and it did a walk. Okay, so it works. Let's try fuzzing it. So I'm gonna set up the input directory. So I've got one sample input, which is literally just one file with one dot in it, just to kind of say, here's something to get going on. I'm not gonna tell you what the answer is.
20:01
See if you can get to the answer. There's an output directory where the results will end up. And it's worth saying, if you're on a laptop like me, you probably want to use a RAM disk because it's gonna do millions of writes and you might have your SSD stop working quicker than you thought. And it's just gonna fire this test data
20:21
into the standard input of our program. So this is the dashboard you get with AFL. So draw your attention. I've got a laser here. So up here, we've got total paths and unique crashes. So it's found four paths through the code base, which is those if statements.
20:42
The strategy yields down here. This gives you a sample of some of the mutation operations it's doing. So bit flips, byte flips, arithmetic. It's gonna try some integers. We haven't given it a dictionary here, but I'll show you one of those in a minute. My notes have disappeared. That's awful.
21:10
Secret. Okay, recovered. Right. And the other thing to show you,
21:22
it's done almost a million runs in, what's that, two and a half minutes. I don't want my notes to disappear.
21:41
That's not helpful. Okay, next slide. So within the findings directory, you get a queue of interesting inputs that it has found take a different path through the code base. So it started with that dot I mentioned,
22:01
which was my sample input. It's found that after manipulating that, it found one that started with an F. So it's clearly trying thousands and thousands of examples, but when it happened to find one that started with an F, it took a new code path. So it recorded that for reuse. Similarly FO, then FOO.
22:21
So it's kind of using, it's kind of stepping up, making it further through the code base each time. And in the crash directory, it's found an example input that crashes the code. So this is exactly what we'd be looking for. So let's just have a look at that crash file.
22:41
So it has a kind of long file name, but that's where it's recorded what's happened. So it told you signal 6 is a SIGABRT. We did the abort, so that's expected. It tells you that it's based on the third item in the queue. It's done some eight-bit arithmetic. In other words, it's replaced that Y with the umlaut with an exclamation mark.
23:02
So you can kind of see how it's working. You can see how it's manipulating previous inputs. So it's able to stand on the shoulders of what it's achieved so far and get one step further. So in the last year, it had a very, very impressive
23:22
trophy case here. This is about a third of the list, by the way. So there's security libraries, image libraries, SQL libraries, you name it, almost. But generally focusing on libraries that can take random binary input, though you've also got bash there.
23:43
You have to give it some more help when you're doing a kind of non-binary input, because if it just fires random characters at SQLite, it's not gonna get very far. So let's have a look at a specific example, SQLite. It's worth saying SQLite is a very, very
24:03
well respected library in terms of testing. It has already been fuzzed by a traditional fuzzer. So you might think there's not much low hanging fruit there. The approach taken was to start with a dictionary of SQL keywords. So you literally just put these kind of one per line
24:22
in a file. They grepped out some handwritten test cases from the SQLite test suite. And they found 22 crashing test cases. This is one of the simpler ones. So this ends up arriving in a function
24:41
with an argument not initialized or something like that, or zero length argument where it was expecting a list of one or more things. So these were able to be fixed. So how does it do this? Let's just see an overview. It is a great traditional fuzzer,
25:00
and you can use it without the instrumentation. It will search for inputs that span different code paths. And it uses genetic algorithms to mash together the examples it's seen so far, as well as just mutating those examples one at a time. But you can imagine it searching the input space. And it's got some help, it's got some guiding by the instrumentation,
25:21
but it's always gonna be a slow process because the input space can be massive. It certainly can't just go A, B, C, D and do like an exhaustive list of all inputs. That'll take forever. So let's have a look at fuzzing CPython. It's worth saying that obviously
25:41
the innards of Python are written in C. So this is a different proposition to fuzzing Python code, which we'll see in a minute. So you can download the Python source from Mercurial. You can compile it very similarly to how the Python docs tell you, but using the afl-clang compiler.
26:04
And you can start fuzzing it. So I've got a sample input, a sample target program here. So I'm not a ctypes expert, so I can't explain that line, the magic line that you need there, but that connects things up.
26:22
And you're literally just passing standard in to json.load. So it's treating it as a file. And we're not catching Python exceptions here, but we're looking for exceptions that happen in the C code.
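A hedged reconstruction of that harness as described; the ctypes line the speaker mentions isn't in the transcript, so it's left out here:

    import sys
    import json

    # Feed whatever AFL writes to stdin straight into the C-backed parser.
    # An uncaught ValueError is just a Python traceback; only a signal from
    # the C layer counts as a crash to AFL.
    json.load(sys.stdin)

So I ran this overnight the other day for eight hours.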
26:41
It didn't find any bugs yet. Maybe I shouldn't be so surprised it'd be a bit easy if it was that easy, but it did run 121,000 times. It is a lot slower than just running the toy example earlier because it's loading the whole Python interpreter, et cetera. But there's tools within AFL
27:00
to make this faster and easier. So they have a fork server, and they have various techniques you can use to make things faster. It also gives you some hints before you start about putting your operating system into performance mode instead of power saving mode. So you could say this is more ethical
27:20
than causing global warming by mining Bitcoin. But yeah, certainly it can cause laptops to get a bit hot and a bit CPU intensive. Okay, let's move on to Python AFL because this is not a Euro C null pointer exception conference.
27:40
So this uses Cython to connect the C layer and the Python layer. It connects the instrumentation that we mentioned to the Python interpreter via sys.settrace. So every scope that's entered will kind of log a little waypoint as it travels through the code base.
28:01
And your unhandled Python exceptions get converted to SIGUSR1, which AFL will recognize as a signal. And you can see here, py-afl-fuzz is basically just typing "py-" before "afl-fuzz". So it's literally just as easy to use.
28:27
So here's an example Alex Gaynor did of using this to fuzz the cryptography library. So it's pretty simple. You have a little afl.start hook there that connects things up.
28:42
And he's literally just passing standard input to decode a signature. He said it was fruitful, but I don't know if he listed any particular bugs that he found. But this is the general approach.
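A minimal harness in that style; parse() stands in for whichever cryptography function was being fuzzed, since the talk doesn't name it, and afl.start() is the hook mentioned above (newer python-afl spells it afl.init()):

    import sys
    import afl

    afl.start()  # hand control to the fuzzer before touching the input

    data = sys.stdin.buffer.read()
    parse(data)  # hypothetical function under test; an unhandled exception becomes SIGUSR1

So what are some interesting questions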
29:00
raised by these two libraries? So in default mode, hypothesis and AFL fuzz will give you new input data every time you run them. So this could be considered annoying to some people. They wanna know if a new commit fails their tests.
29:20
And if you've got different test data every time, well maybe it failed because it found a different test input rather than your commit causing it to fail. So consistent pass or fail some people insist on. On the other hand, you might find more bugs. So that would be handy as well. So I think the resolution between the two
29:42
is to do the non-deterministic testing, maybe not in your per commit testing, but to look for the counterexamples it pulls out and copy those into your deterministic test pack. Or just live with the non-determinism and find more bugs.
30:02
That's what the author of Hypothesis recommends. But you can put it in deterministic mode if you insist.
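A sketch of that deterministic mode, assuming the settings flag that current Hypothesis versions expose:

    from hypothesis import given, settings, strategies as st

    @settings(derandomize=True)  # same generated inputs on every run
    @given(st.lists(st.integers()))
    def test_sorting_is_idempotent(xs):
        assert sorted(sorted(xs)) == sorted(xs)

So we've been thinking about random input. One way to think about this is if things are too random, they won't even get through the starting gate. They have to be relevant to your code.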
30:22
On the other hand, you can't enumerate all the examples because there's too many. Your input space is often just too massive. And if you give a happy path sample input and you don't mutate it enough, then you're just gonna go straight through the maze and come out the other end and you're gonna think everything's fine.
30:42
So you wanna have that sweet spot where you're reaching your dead ends of your code base or your dead ends of the maze which represent all the paths through your code base. But not having them all fail at the first hurdle because they're too random. So which library should you use? Well, if you insist on just using a standard unit test,
31:02
you should at least try to be more adversarial. But I would say you can't expect the unexpected. So you should probably use Hypothesis if your input data are Python structures. And even if they're not built in data structures, you can build your own strategies. It's quite fun as well. And also it means you don't have to think up
31:22
the hard-coded examples. So people say that the test code base becomes easier to read, because they're not distracted by the specificity of bob@example.com. It just says this function takes all strings or all lists of integers.
31:41
And use something like AFL or Python AFL if your inputs are binary, or, as we saw, maybe you're parsing text input like an SQL library. So in conclusion, we've seen two styles of test data generation.
32:00
Humans are generally bad at picking random examples. Developers are bad at being adversarial to their own code bases, which they understandably love. Computers are fast, let them play with your code and find more bugs before your customer or the secret service does. Let me end by saying, don't interrogate your code base
32:21
like it's a fluffy bunny stuck up a tree. Fire a guided missile, blow the branches off the tree and clear up the mess. It's not just me saying it, celebrity endorsement, just a little reminder there. And also of interest, you don't even have to get up. The talk after this one, which will be more informative
32:44
and better presented by Moritz Gronbach, who I haven't talked to yet. I hope I didn't cover all your points, is in this room directly afterwards. And there we go, I've been Tom Viner, any questions?
33:05
Thank you, Tom. Thinking back to Guido's talks yesterday,
33:21
is Hypothesis Python 3 ready and could it be made to use the type annotations if they were available in the code? I think it is Python 3 compatible, but I don't know about the type hints, you'd have to ask the author of that library. Yeah, sounds interesting though.
33:46
Thanks for the interesting talk. How does Hypothesis handle if the code under test exhibits some randomness itself, either voluntarily, maybe a Monte Carlo algorithm
34:01
or involuntarily by mistake? Very good question. It will raise a flaky-code warning. You can suppress the warning, I think, but it basically tells you that if your code is non-deterministic,
34:22
then you're less likely to find helpful results. So you may wanna put your code into deterministic mode itself, or take another approach. I just wanted to point out, sorry,
34:43
if you want to help make libraries, especially C libraries, more secure, there is a project called the Fuzzing Project by Hanno Böck, which gives you some documentation to get started with AFL and fuzzing your favorite library.
35:01
So that's a very good point to get started if you want to make things more secure. Could you just repeat the name of that? The Fuzzing Project, I believe.
35:24
Thanks for the talk. You mentioned that Hypothesis iterates over its tests many times to produce its results. Do you run it as part of your standard unit test workflow or do you run it somewhere else in your testing workflow? I think I personally would bite the bullet
35:40
and use it in a TDD workflow. There was a talk the other day about Testmon, which uses coverage.py to only run the tests that are related to the code you're changing. So you could get a speed improvement from Testmon and then balance that with a slowdown from Hypothesis
36:00
and just maybe end up where you were before but with finding more bugs. I have a quick question. So because Hypothesis is generating random input, is it a bit strange to use on a project where you have multiple developers? Because then doesn't each developer
36:22
have different input into the test? Yeah, so the inputs are non-deterministic to start with. So even before you had a local database of examples, everyone's getting different test inputs. But this idea of sharing the database of found examples
36:42
I think is still a work in progress. I think the developer of Hypothesis is still trying to think through whether it makes sense to, for example, have that on your CI server or whether that's a bit of a non-starter. You can add another decorator to tests to give them specific examples, to force it to give a specific example.
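That decorator is @example; a short sketch, where process is a hypothetical function under test:

    from hypothesis import given, example, strategies as st

    @given(st.text())
    @example("jane@example.com")  # a pinned input, run on top of the generated ones
    def test_process_accepts_any_string(s):
        process(s)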
37:02
So if some developer found a certain input that was helpful, they could do a commit that hard-codes it so it's always present. All right, please join me in thanking Tom once again.