Adventures in not writing tests
Formal Metadata
Title: Adventures in not writing tests
Title of Series: EuroPython 2024
Number of Parts: 131
Author: Fundinger, Andy
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.
Identifiers: 10.5446/69399 (DOI)
Language: English
EuroPython 2024, 51 / 131
Transcript: English (auto-generated)
00:04
Good morning. We're going to be talking about adventures in not writing tests. I'm Andy Fundinger, and I'm from Bloomberg. I find it's easier to tell people how long I've been writing Python by version number.
00:21
So: the first book I got was Learning Python 2.2, newly revised for Python 2.3, and what I actually had to work on was Python 2.4. I've worked on this for a little while. I've worked with Plone. I've worked with Twisted. I've done some things with virtual worlds.
00:42
I've worked in the financial industry. I've been at Bloomberg since 2017. About 2019, I switched into our system reliability engineering teams. And in 2020, I moved to a department called Data Services
01:01
where I work on our data gateway. Essentially, that's if one system in Bloomberg needs market data from another system in Bloomberg, they ask our data gateway right in the middle to go figure out how to get that and send it back. So we're in the middle of a lot of Bloomberg workflows, but the code on either side, the code that actually knows the data
01:23
and the code that wants the data belongs to other people. So the job is: make it work, make it reliable, provide enough capacity, but you don't own the code that's really going to cause the problems, for billions of requests per day. And there's literally millions
01:40
of semi-independent code paths going off into, really, the ether. That means a lot of what I do is pull in some telemetry data and start pulling it apart programmatically, to try to understand what's going on at a very high level.
02:04
So we'll see some of that in this talk. Now, about this talk, can I get a show of hands, who read only the title and decided to come to this talk? Okay. You're going to have a slightly different experience from those who read the description.
02:23
Because this talk is about how to not write tests, as in the title, or at least how to reduce the time we take to do it. Despite the title, we will be creating tests, but we will get some help writing them.
02:41
And our motivating example mostly is going to be what I call analysis code. This is that code that starts with a question. Oh, what does our latency look like? What services do we rely on? We don't initially know the answer. We may only need to do it once.
03:02
Usually we don't even know what the question means when it starts. And it may be quite reasonable not to write tests, since you don't know what you're doing, or what you're going to find, or how you're going to do it. I know some people write them anyway; no shade on that.
03:23
But you might just think you need the answer once. You run it, you share it with your team, you share it with your management, and they go, that's great. Can you run it again with today's data? Can you run it again every month, every week, every day, twice a day?
03:43
Now suddenly you have code that really ought to have tests, but you didn't write any tests. Another way to get into this situation is if you write a very good proof of concept, and it's pretty good, people start using it,
04:01
and now your proof of concept has gotten its way into production. None of this means we know what the future looks like. We never do. And especially when working on data, we don't know what the future data is going to look like.
04:20
So I'm going to be using an example, somewhat made up. Imagine we have a system, and we've got about 10,000 requests of sample data for it. And we have one of these loosely defined questions: how would this system perform at various capacities? Oh yeah, and there are some odd features to the system.
04:42
Some of the clients are what I'm calling timely; some of them are batch. If it's a batch client, nobody cares when the response gets back. That's not true, but it is for this example. And oh, another little question: how long is the interval of 100% load
05:00
in whatever capacity you're running at? With that, let me introduce the star of the show, Hypothesis. Directly from their website: Hypothesis is a Python library for creating unit tests that are simpler to write and more powerful when run, finding edge cases in your code you wouldn't have thought to look for.
05:21
in your code you wouldn't have thought to look for. It is stable, powerful, and easy to add to any existing test suite or any non-existing test suite that you've just gotten yourself into. It works by letting you write tests that assert something, should be true for every case,
05:40
not just the ones you happen to think of. So if we go ahead and think of a test as: set up some data, do some stuff, assert some things about what should be true, then a Hypothesis test is almost the same.
06:00
You specify what some data should look like, do some stuff to it, and assert some things about what should be true. This is called property-based testing. You may have also seen it in Haskell, in the QuickCheck library. So how does this help? Well, we're going towards not writing tests.
06:23
This relieves us of creating test data. So instead of creating test data, sourcing test data, and deciding whether the test data is representative, we just have to describe the data that could possibly be valid and let Hypothesis work it out from there.
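As a minimal sketch of that shape (not taken from the talk's slides): describe the valid input, then assert a property that should hold for every example Hypothesis generates.

```python
from hypothesis import given, strategies as st

# Specify what the data should look like (any list of integers),
# do some stuff to it, assert what should be true for every example.
@given(xs=st.lists(st.integers()))
def test_reversing_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```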
06:43
For example, let's say we work with a floating point number. There's a floating point number, 1.7977 times 10 to the 308th. It's not necessarily the first one I would think of, but it's absolutely a floating point number.
07:02
And if you said you were going to work with floating point numbers, that ought to work. A second one, 1.2 times 10 to the negative 7. Oh, and it's negative. Also not necessarily the first one I would think of, but again, it should work.
07:21
Let's go ahead with a third. Oh yeah: not-a-number, NaN, is also a floating point number, and your code probably breaks when it gets it. I can just tell you that based on experience. So unfortunately, we are still writing tests at this point.
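If you want to see these values for yourself, you can ask a strategy for samples interactively; .example() is meant for exploration, not for use inside tests:

```python
from hypothesis import strategies as st

# Interactively sample the kinds of values st.floats() considers fair game.
for _ in range(5):
    print(st.floats().example())
# Possible output: 1.7976931348623157e+308, -1.2e-07, nan, -inf, 0.0
```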
07:41
I've just shown you how to create data. So let's enter the ghostwriter. When you install Hypothesis, it installs a command line tool. The command line tool has a command called write. It writes tests. So if we have it write a test for the sorted function,
08:02
in the standard library, it will go ahead and write this. sorted takes an iterable; it is some iterable of integers, text, or floats. The key function? It's not going to create a key function; let's be reasonable here.
08:20
And it will take some booleans. And then it goes ahead and writes all of this code, including the test. Now note there are no assertions. But we'll see a few times that even without assertions, if your code is crashing before it gets out, you found a bug already.
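The ghostwriter is invoked from the shell; the exact output varies by Hypothesis version, but it looks roughly like this:

```python
# Generated by `hypothesis write sorted` (abridged; details vary by version).
# Note there are no assertions: a crash is already a found bug.
import hypothesis.strategies as st
from hypothesis import given


@given(
    iterable=st.one_of(
        st.iterables(st.integers()),
        st.iterables(st.text()),
        st.iterables(st.floats()),
    ),
    key=st.none(),
    reverse=st.booleans(),
)
def test_fuzz_sorted(iterable, key, reverse):
    sorted(iterable, key=key, reverse=reverse)
```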
08:44
So we'll try another one: the ceiling function. If you try to have Hypothesis write a test for the ceiling function, you will get this. st.nothing() is telling you
09:01
that you have insufficient type hinting; it doesn't know what the type is there. The hint isn't in the standard library; maybe it is in the newest version, or maybe that's something I can work on in the sprints. But if you just tell it that this should be a floating point number,
09:21
then you will have a test. And we'll go ahead and add a simple assertion: if I take the ceiling of a number, I should get a number that's equal to or greater than the number I started with. Pytest will pick this up. Pytest will run it. Pytest will find that it is not true.
09:41
It turns out that if you give NaN to ceiling, it cannot be converted into an int; that is a ValueError. If you give an infinity to ceiling, which is also a float, you will get an OverflowError; there's no integer quite big enough to handle that.
10:02
So these are the defined, proper behaviors of ceiling, and what we really should do is separate that out. So we tell Hypothesis in this test that we're not using NaN and we're not using infinity, and then we'll write separate tests that the proper errors happen for NaN and for infinity.
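A sketch of what that separation might look like, assuming pytest as the runner:

```python
import math

import pytest
from hypothesis import given, strategies as st


@given(x=st.floats(allow_nan=False, allow_infinity=False))
def test_ceil_is_at_least_its_input(x):
    # For every finite float, the ceiling is at least the number itself.
    assert math.ceil(x) >= x


def test_ceil_rejects_nan():
    with pytest.raises(ValueError):
        math.ceil(math.nan)


def test_ceil_rejects_infinity():
    with pytest.raises(OverflowError):
        math.ceil(math.inf)
```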
10:21
Let's do one more. groupby would look like this: again, it creates some iterables, and it creates a key; that one is type-hinted. Now let's go ahead and use some quote-unquote real code. This is almost certainly not the best way
10:41
to do what it is doing, but it's good because it has bugs, and I'm talking about testing. What it does is find the longest run of falses or trues in the provided booleans. It calls the longest run of falses the maximum available time, and the longest run
11:02
of trues the maximum loaded time. And we can go ahead and write a test for this. Hypothesis took my type hints and said: oh, you take iterables of booleans? Well, let me tell you about all the kinds of iterables of booleans.
11:22
Don't worry, I'll test those for you. And then it writes a test, and the test has no assertions. Okay, fine. If we run this, we get a KeyError, because if you provide it
11:40
with an empty sequence, there are no falses and there are no trues; we were indexing by False and True, and that's a KeyError. So we can fix the code. In our domain, in our problem, if there were no times when the system was fully loaded, there are zero times
12:01
when the system was fully loaded. It's not necessarily mathematically true, but it is true for our problem, and the same goes for times when it wasn't fully loaded. So we can give the right answer for our system even on data that might be mathematically ambiguous.
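The function itself isn't reproduced in the transcript; a sketch consistent with that description, including the empty-input fix, might look like:

```python
import itertools


def longest_runs(loaded):
    """Longest run of False (available) and of True (loaded) in `loaded`.

    The original code indexed a dict by False/True and raised KeyError on
    empty input; seeding both keys with zero gives our domain's answer.
    """
    longest = {False: 0, True: 0}
    for value, run in itertools.groupby(loaded):
        longest[value] = max(longest[value], sum(1 for _ in run))
    return {"max_available": longest[False], "max_loaded": longest[True]}
```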
12:21
But we can take up a more complex type. I'll do two; we'll do a data frame first. What if we want to do a data frame? Well, it turns out it will do data frames. There you go: it will create a data frame. It will create any type it can resolve with from_type,
12:42
and it will find an error with a data frame. It turns out that if you have a client and you have an empty data frame, the code behind this will have a problem with that. We'll see the code in a bit. It doesn't really matter. We're talking about the tests.
13:01
So obviously, if we are indexing our data frame by batch clients, and batch clients has something in it, but that something isn't in the columns of the data frame because the frame is empty, we're going to have a problem. That's wrong. But what is this create-a-data-frame strategy really doing?
13:20
Because we didn't tell it very much. We'll go ahead and get that strategy and take a look. The first example is an empty data frame. The second example is an empty data frame. All the examples are an empty data frame. This will only find one kind of problem and you could write that test very easily.
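You can pull the inferred strategy out and inspect it yourself; as described here, every example comes back empty:

```python
import pandas as pd
from hypothesis import strategies as st

# What does the inferred DataFrame strategy actually produce?
strategy = st.from_type(pd.DataFrame)
for _ in range(3):
    print(strategy.example())
# Every example: Empty DataFrame, Columns: [], Index: []
```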
13:42
Fine then. We can make our own data frame. With columns. And data. So in this case I've just changed the test. I've said okay, give me some client names.
14:00
It's a list; it's text. Give me some usage values: it's a list, it's floats, I'll arrange that later. Give me a list of booleans; I'll work off of that. And then I stitch it all together. It winds up being kind of a lot of code, but it will create a data frame full of lovely test data.
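That stitched-together version isn't shown in the transcript; a sketch of the hand-rolled approach, with illustrative names, might look like this:

```python
import pandas as pd
from hypothesis import given, strategies as st


@given(
    clients=st.lists(st.text(min_size=1), max_size=20, unique=True),
    usage=st.lists(st.floats(), max_size=20),
    batch=st.lists(st.booleans(), max_size=20),
)
def test_usage_report(clients, usage, batch):
    # Stitch the independently drawn lists into one frame, truncating
    # to the shortest list so the columns line up.
    n = min(len(clients), len(usage), len(batch))
    frame = pd.DataFrame(
        {"client": clients[:n], "usage": usage[:n], "batch": batch[:n]}
    )
    # No assertions about the report yet; even getting this far
    # without crashing tells us something.
    assert len(frame) == n
```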
14:23
That data will indicate all kinds of things about what we might have done wrong. We're testing Unicode; we're testing kanji. We've got some infinities, positive and negative. We've got some NaNs. We've got some small numbers.
14:40
We've got some big numbers. We can write some assertions on this. So we're going to assert that the sum of everything is equal to the sum of everything. That's a safe assertion. We're going to assert that the biggest number in total is as big or bigger than the biggest number in the other columns.
15:00
That seems safe. It's not. In this case, our friend NaN has popped up: if your usage values are infinity and negative infinity, when you total those up, you get NaN. Who knew?
15:21
Okay, mathematicians figured that one out. But you will get that. And then NaN is not equal to NaN, because it's not a number; it can't be equal. It's also not greater than or equal to NaN,
15:41
which you can get; in this case, the empty case isn't handled yet. So we can go ahead and change our input for usage values. We can say that infinite usage is not really sensible.
16:02
And we can go ahead and change our code. This is handling the empty data frame case because it's saying if you listed extra clients in the input, wherever it came from, whatever happened, just don't do the math on them. They're not there. Again, this is specific to our problem.
16:21
Empty extra clients are not a problem; we'll just ignore them. There's nothing to add up. But we need to write the code for that. We do leave one error behind: negative numbers. If you have a negative number in one column and you add the column up,
16:41
the other column has a zero, which is greater than that max. So we'll go ahead and take out the negative numbers. Again, we're claiming negative usage doesn't make sense; we should enforce that somewhere else in our code if we're going to depend on it. But we're really not going to keep doing this.
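Both constraints end up expressed in the strategy; as a one-line sketch:

```python
from hypothesis import strategies as st

# Usage values constrained to what the domain allows: finite and
# non-negative. If the code depends on this, enforce it there too.
usage_values = st.floats(min_value=0, allow_nan=False, allow_infinity=False)
```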
17:01
This is a lot of work and not much return. Hypothesis has a pandas extension that will let us create data frames, and that looks fairly reasonable. You go into the extension's data_frames function, you give it some columns, and you define each column with a name and a data type, or with all the other definitions we were
17:23
looking at earlier, and now you have a test. I've switched methods here; this one takes a data frame and computes some estimated capacity. And of course, it contains a bug: again, the empty case.
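A sketch of the extension-based version; the capacity function itself is not shown in the transcript, so the test body is left as a placeholder:

```python
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import column, data_frames


@given(
    df=data_frames(
        columns=[
            column("client", elements=st.text()),
            column("usage", elements=st.floats()),
            column("batch", elements=st.booleans()),
        ]
    )
)
def test_estimated_capacity(df):
    # Call the (hypothetical) capacity estimator under test here; the
    # extension happily generates the empty frame that trips it up.
    ...
```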
17:42
You can't take a percentile of an empty frame. However, we can say that the needed capacity, if you have no usage, is zero. So we'll go ahead and fix that as we go. But we can actually do one better: if we have a type alias,
18:05
then we can put type hints on our function with that type alias, in this case UsageDataFrame. When we run it through the ghostwriter, it will go ahead and write the test with that type alias.
18:24
And then we just have to provide a strategy for creating that. So if we have this data frame used a few places, or of course any other object, we can just say, for this type, whenever you're trying to create this type, use this strategy.
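A sketch of that registration; the talk uses a type alias, and a marker subclass is one way to spell it so there is a distinct type to hook the strategy onto:

```python
import pandas as pd
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames


class UsageDataFrame(pd.DataFrame):
    """Hypothetical marker type for frames of client usage data."""


usage_frames = data_frames(
    columns=[
        column("client", elements=st.text()),
        column(
            "usage",
            elements=st.floats(min_value=0, allow_nan=False, allow_infinity=False),
        ),
        column("batch", elements=st.booleans()),
    ]
)

# Whenever Hypothesis needs a UsageDataFrame, including in ghostwritten
# tests, it now uses this strategy.
st.register_type_strategy(UsageDataFrame, usage_frames.map(UsageDataFrame))
```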
18:41
And this is the same strategy we were using before. Now, if we're very lucky, we don't even need to do this. It turns out that if your classes are fully type-hinted, you can just say: build an instance of this,
19:06
exactly as done here with the builds strategy. And it will go ahead and work out all those types and build the whole thing. Unfortunately, that is sometimes what you want
19:20
and sometimes not. Sometimes you do want to provide code, because you want the data to make at least a certain amount of sense. And for that, we go to a composite strategy. That's a function that takes two arguments: draw, which is a special object, and cls, the type that it's creating a strategy for.
19:45
And then with draw, you can feed it any strategy you want or need, and it will produce a value of it. Of course, this will be all those pathological values that we've been dealing with right along. So that's perfect.
20:01
In this case, I draw a float for scale. I decide whether I'm going to have a burst. If I create the burst, then you can see that the min value, the max value, and the other parameters make sense relative to what's already been drawn. And then I just register that strategy.
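A sketch of such a composite, with LoadProfile and its parameters as hypothetical stand-ins for the talk's class:

```python
from dataclasses import dataclass
from typing import Optional

from hypothesis import strategies as st


@dataclass
class LoadProfile:
    """Hypothetical stand-in for the class being generated."""

    scale: float
    burst: Optional[float]


@st.composite
def load_profiles(draw, cls=LoadProfile):
    # draw() turns any strategy into a concrete, often pathological, value.
    scale = draw(st.floats(min_value=0.0, max_value=1e6, allow_nan=False))
    burst = None
    if draw(st.booleans()):
        # Keep the burst consistent with the scale we already drew.
        burst = draw(st.floats(min_value=scale, max_value=scale * 10))
    return cls(scale=scale, burst=burst)


st.register_type_strategy(LoadProfile, load_profiles())
```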
20:25
Another thing you might very well find is that some data does not make sense. After all the strategies have been done, you've created nonsense. The hypothesis assume function will let you assume just
20:40
like an assert: assume this is true, and if it's not true, the example will be skipped. So in this case, I'm again creating a data frame. If the usage is more than 110% of the capacity, we've decided that's going to look like any other overloaded case, so we won't go all the way out to 2,000%; the assume takes care of it.
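A sketch of that filter, with a fixed capacity assumed purely for illustration:

```python
from hypothesis import assume, given, strategies as st
from hypothesis.extra.pandas import column, data_frames

CAPACITY = 100.0  # hypothetical fixed capacity for the sketch

usage_col = column(
    "usage", elements=st.floats(min_value=0, allow_nan=False, allow_infinity=False)
)


@given(df=data_frames(columns=[usage_col]))
def test_overload_handling(df):
    # Usage beyond 110% of capacity behaves like any other overload, so
    # discard the wilder draws rather than testing 2,000% separately.
    assume((df["usage"] <= CAPACITY * 1.1).all())
    ...
```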
21:02
Okay. There are a few things that I have not demonstrated, but that I do want to highlight.
21:21
target and example. target will let you say: this would make a good test. Just feed in an expression that gets larger when the test is better, and Hypothesis will steer its way there. example will say: please always test this, which sort of defeats the purpose, but it will keep you
21:41
in a certain amount of consistency. There are also settings for timing and randomization. With timing, I usually get the error that the test took too long, and I go: yes, I'll wait for it, I value this test. I read the error, it says change this setting, it's a decorator, I put it on there, and the test takes longer, but I'm happy.
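A sketch combining all three; the computation here is a stand-in, not the talk's code:

```python
from hypothesis import example, given, settings, strategies as st, target


@settings(deadline=None)  # "yes, I'll wait for it; I value this test"
@given(st.floats(allow_nan=False, allow_infinity=False))
@example(0.0)  # please always test this exact case
def test_latency_model(x):
    result = abs(x)  # stand-in for the real computation
    target(result)  # steer towards examples that make this larger
    assert result >= 0
```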
22:03
You could also improve the data. Randomization, again, usually I leave it alone. It will randomly test things. It keeps a local cache. It tests different things every time. Unless something is broken, in which case it will not let you off the hook,
22:22
it will keep giving you that test case until you fix it. The downside is in a CI system, this means occasionally it will randomly test something different and fail. If you don't want that behavior, if you don't want to tolerate that behavior, if you can't convince your
22:40
teammates to tolerate that behavior, you can set a seed and have CI at least test the same thing every run.
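One way to pin CI down, offered as an assumption rather than the talk's exact code:

```python
from hypothesis import given, settings, strategies as st


# derandomize=True makes Hypothesis replay the same examples every run,
# trading some bug-finding breadth for reproducible CI builds.
@settings(derandomize=True)
@given(st.floats())
def test_stable_in_ci(x):
    ...
```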
23:00
So, conclusions: do write tests. Do use test-driven development where appropriate, but use Hypothesis rather than, or in addition to, hand-selected examples, especially when you find yourself in a hole. If you don't know what data you might get, Hypothesis lets you describe all the valid data and test with that. Sometimes it will even find new cases
23:22
of stunningly invalid data, which you can then describe as invalid or add to your data cleaning. And with that, we'll go to questions and answers. Yes, we're hiring. I have Hypothesis stickers that Zach Hatfield-Dodds,
23:41
the maintainer of Hypothesis, gave me to give out. So, if you want them, catch me after. Brilliant. Thank you, Andy. That was amazing. Hopefully the anxiety levels came
24:01
down after we now know we have to write tests. And we have a question. Yeah, so I saw that you were defining the types and everything else for the tests, but like, does it often happen that sometimes you run the test, they generate a specific set of values for floats,
24:20
integers, whatever, and then the next time you run it, it generates a different set of numbers? How do you avoid flaky tests? Like, I run it now, it works; it goes in the PR, and in the validation pipeline it fails. If you want to avoid flaky tests, you can take the randomization setting and set it to have a seed.
24:44
Now, I make the argument to my team that if there was something broken there, yes, it is unfortunate that you found it on your PR that didn't cause it, but it is very fortunate that we found it. Feel free to go either way.
25:01
I understand it's very debatable. Another good practice would be on your local environment, let it free range, but on CI, maybe pin it down if it's going to be disruptive. Okay, makes sense. Thanks. Thank you. Next one, please. Yes, how big is the range of values that's being tested
25:23
by Hypothesis? So, Hypothesis concentrates as much as possible on edge cases. As you saw when we were pulling floats, it tried a large one, it tried a negative, it tried a small one, it tried a negative one,
25:40
it tried a special value, it'll also do infinitesimals, it'll try zero, it'll try a few more. It does that in every data type it knows. So, if you tell it to generate strings, it'll go, okay, how about an empty string, how about a kanji, how about a bell character?
26:01
It tests, I believe, a few hundred examples by default, and it will test some fairly normal values, but it especially goes for edge cases. So, it wouldn't lead to slow tests, or? It will take time to do; I can't deny that.
26:21
It has a time limit; I think it's a second or a few seconds per test by default. As I said, my habit is to just bump that up, because I'd rather find the bug, but yes, it will take time if you're doing intensive calculation. If you're doing something truly intensive,
26:42
I would test the smaller functions with Hypothesis, and for the larger ones, maybe worry more about the orchestration. Yeah. Okay, thank you. Thank you. Next question, please. Just yesterday, this Hypothesis library was recommended to me for my production code by one guy here,
27:02
and today there was your wonderful talk. But I need the capabilities not for testing, but for production in my specific code. Is it a good idea to use the Hypothesis library, or specifically the ghostwriter, to generate random numbers
27:25
with many constraints for my code in production and not for tests? I would not generally recommend that because as I was explaining
27:40
to one of the other questions, Hypothesis concentrates on strange values. So, unless you want your values to tend out towards edge cases, which is probably only useful in testing, you might not actually want the values it's going to produce.
28:04
There are other faker libraries that I think you could just Google for that do concentrate on producing fake data for production use. Again, unless you want the sort of edge case focus,
28:23
it's not a 100% focus, but I might consider those. Thank you. Thank you. Is there another question? Yes. Yeah, hello. Thank you for your talk. Just as a curiosity, the data, sorry,
28:42
the data frame generator, does it also support datetime-indexed data frames? Yeah, you can set the indexes, you can set the types on it, and in the worst case, you could use one of those composite strategies to come up with something truly unique.
29:04
But it does support most of the basic things. I honestly usually use the data frame type approach, and I've not had to actually apply a composite strategy on it in my cases.
29:22
I see. Thank you. Thank you. So we don't have any more questions, but Andy, you wanted to say something in those last two minutes? Oh, no, no, no. I just wanted to make sure that I gave away Zach's stickers. Yeah, so give it up for Andy.