Streamlining Testing in a Large Python Codebase
Formal Metadata
Title: Streamlining Testing in a Large Python Codebase
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/69504 (DOI)
EuroPython 2024 (20 / 131)
Transcript: English (auto-generated)
00:04
So in today's talk, I will cover Python testing, test coverage, and continuous integration, and the slow-test challenge we faced in a large Python code base. I will also talk about the optimization strategies we used to solve the problem.
00:26
At the end, I will share some results, recap, and do Q&A. We are a procurement software-as-a-service startup that helps our customers make purchasing easier and more cost-effective.
00:43
Our customers include businesses in technology, banking, and many other areas. We have a large Python code base with a lot of developers, and we are still hiring fast. We have more than 2.5 million lines of Python code, and the code base size doubles every year.
01:13
With this growth, the number of tests and tech debt are also increasing fast over time.
01:22
Why do we need tests? We found that tests help developers ensure the quality of their code changes. Tests also empower developers to do future refactoring more safely and with higher confidence.
01:45
Test cases can also serve as documentation, providing examples of how the code should be used. There are some common metrics that help us measure our tests.
02:02
First, test execution time measures the duration from the start of the test run to its completion. Reliability measures the frequency of passed tests versus failed tests.
02:22
Test coverage measures the proportion of the code exercised by tests, which is useful for gauging the quality of the tests. So let's see how we can write tests easily.
02:43
In the following slides, I will share some best practices using open-source tools. A web link for each tool can be found at the top left of the slide. Pytest is a popular test framework for Python that makes it very easy to write simple and scalable test cases for your code.
03:06
For example, say you implement a Python function is_even that checks whether the input parameter is an even number or not and returns a boolean value as the result. Now we want to test it.
03:23
To use pytest, we simply create a test file with the test_ prefix and implement the test cases as Python functions, also with the test_ prefix.
03:40
Then we can use the pytest command to run the tests. After installing pytest, we just run the command and the output shows the results. The -vv option is useful for showing details such as which test cases are executed and what their results are.
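As a minimal sketch of what this might look like (the module and file names here are illustrative, not taken from the talk):

```python
# my_math.py -- the code under test (illustrative module name)
def is_even(number: int) -> bool:
    """Return True if the number is even."""
    return number % 2 == 0


# test_my_math.py -- pytest collects files and functions with the test_ prefix
from my_math import is_even

def test_is_even_returns_true_for_even_number():
    assert is_even(4) is True
```

Running `pip install pytest` and then `pytest -vv` in that directory prints each collected test and its pass/fail status.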
04:07
In this output, we can see the overall test execution time, 0.03 seconds, and the number of passed tests. To measure the test coverage, we can use the pytest-cov plugin.
04:26
After installing the plugin, we can just pass the --cov option to pytest, and the output will show the test coverage information for each source file
04:40
with the number of covered lines and the coverage ratio of the file. At the end, it also shows the overall test coverage, which is 91% in this case. To increase the test coverage further, in this example we can add a new test case for odd numbers.
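Continuing the sketch above (still with illustrative names), the extra odd-number case and the coverage invocation might look like this:

```python
# test_my_math.py -- add a case that exercises the odd branch as well
from my_math import is_even

def test_is_even_returns_false_for_odd_number():
    assert is_even(3) is False

# After `pip install pytest-cov`, run:
#   pytest --cov=my_math -vv
# to print per-file line coverage and the overall coverage percentage.
```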
05:12
To ensure software quality with frequent development, we usually also implement the continuous integration best practice.
05:22
The idea is to continuously merge changes into the shared code base while ensuring the quality of those changes. In this practice, developers submit a pull request for review when their code changes are ready.
05:42
A pipeline will run the tests to verify the code changes. We only merge the pull request after all the tests pass and after it is approved in code review. That way, we can ensure the test reliability and test coverage meet the required thresholds.
06:08
So, let's look into how to implement continuous integration using GitHub workflows. We can simply define a config file under the .github/workflows directory in the code base.
06:24
Let's say a ci.yaml file. We specify the trigger events. Here, we use the pull_request event to run the tests on all pull request code changes. We also use the push event to run the tests when changes are merged into the main branch.
06:50
Then we can add a job, like a run-pytest job, with the following steps: check out the code base, set up the Python interpreter with a specific version,
07:04
install the Python dependencies using pip install, and then run pytest. This implements a simple continuous integration job. Now, let's talk about the challenges we faced.
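A minimal sketch of such a workflow file (the job name, Python version, and requirements file are assumptions):

```yaml
# .github/workflows/ci.yaml
name: CI
on:
  pull_request:            # run on every pull request
  push:
    branches: [main]       # and again when changes land on main
jobs:
  run-pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest -vv
```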
07:20
We faced test execution times that keep increasing in our large code base, because the number of tests keeps increasing. With more than 10,000 tests, the test execution time is very long. Second, the code base size also doubles every year,
07:47
which also increases the test coverage measurement overhead. And the number of dependencies of our application also increases.
08:00
That causes the tests to start up more slowly. We have several strategies to help us solve the problem. The first strategy is parallel execution. We can use the pytest-xdist plugin to run tests in parallel on multiple CPU cores.
08:26
After installing the plugin, for example, we can use -n 8 to indicate we want to use eight worker processes. Then pytest will distribute the test cases to the different worker processes to run in parallel.
08:46
We can also use -n auto to automatically use all available CPU cores. Given n CPUs, we can speed up the test execution time by roughly a factor of n.
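A sketch of the commands (assuming pytest-xdist is installed in the environment):

```bash
pip install pytest-xdist

pytest -n 8      # distribute tests across eight worker processes
pytest -n auto   # or let pytest-xdist use all available CPU cores
```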
09:05
However, at a scale like 10,000 tests, eight CPU cores are still not fast enough. So we can also use another plugin, pytest-split,
09:25
to run tests in parallel on multiple runners. After installing the extension, we can use the --splits option with a value like 10,
09:41
which means we want to split the tests into 10 different parts. If we want to run the first part on the current runner, we specify --group 1. With this approach, we can run the different parts on different runners.
10:03
Let's say we have m runners, each with n CPUs. We can increase the parallelism to m times n. By default, the plugin assumes all the tests have the same execution time
10:25
when distributing the test cases to different runners. But in reality, the execution times of different test cases differ, and that causes unbalanced runner execution times.
10:41
That means we have to wait until the last runner finishes in order to collect the full test results. To fix this issue, we can collect the test durations by using the --store-durations option.
11:01
Pytest will produce a .test_durations file at the end of the test run. Then, when we use pytest-split, we can just provide that durations file using the --durations-path option.
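A sketch of the pytest-split usage (the split count and file name follow the talk; the exact invocation is an assumption):

```bash
pip install pytest-split

# One-off (or periodic) run that records per-test durations
pytest --store-durations --durations-path .test_durations

# On runner 1 of 10: run only the first duration-balanced group
pytest --splits 10 --group 1 --durations-path .test_durations
```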
11:22
To implement this in a GitHub workflow, we can use the matrix strategy to provide a list of groups. Here we provision 10 runners, and each will receive a different matrix.group value.
11:42
We just pass that variable value to the --group option of pytest. That way we run a different group of tests on each runner. So if each runner has eight cores, we will have 80 concurrent test worker processes.
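A sketch of how the matrix might be wired up in the workflow (the group count and step details are illustrative):

```yaml
jobs:
  run-pytest:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        group: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # ten parallel runners
    steps:
      # ... checkout, Python setup, and dependency install as before ...
      - run: >
          pytest -n auto
          --splits 10
          --group ${{ matrix.group }}
          --durations-path .test_durations
```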
12:07
The second strategy is to use caching. Before we can run the tests, we need to install the Python dependencies using pip. And pip can be slow when resolving the dependency versions,
12:22
downloading, and installing the dependencies. To speed this up, we can cache the dependencies, because we don't have to reinstall them when they have not been updated. To do that in a GitHub workflow,
12:41
we can use the cache action. We provide the hash of the requirements file as the cache key. That way we only rebuild the cache when some dependencies are updated. When the dependencies have not changed,
13:02
we just reuse the cached installed dependencies. When the requirements file is updated, the workflow runs the pip install commands again to rebuild the cache. With this approach, we can start the test run faster.
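A sketch of the caching step, slotted into the job's steps list (the cached path and key prefix are assumptions and depend on how dependencies are installed):

```yaml
      - uses: actions/cache@v4
        with:
          path: ~/.cache/pip                       # or a virtualenv directory
          key: pip-${{ hashFiles('requirements.txt') }}
      # Fast when the cache is warm; rebuilds only when requirements.txt changes
      - run: pip install -r requirements.txt
```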
13:27
In a large code base, this approach can save 5 to 10 minutes if you have a lot of Python dependencies. And to make the installation even faster,
13:42
we can use another tool, uv. uv uses a fast dependency resolution algorithm and installs the dependencies much faster. By default, uv assumes a virtual environment (venv) is used. So if no venv is used,
14:01
simply provide the --system parameter.
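A sketch of the uv commands (assuming no virtual environment on the CI runner):

```bash
pip install uv
uv pip install --system -r requirements.txt   # install into the system interpreter
```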
14:22
We can also cache the non-Python dependencies. For example, we have to install not just the Python interpreter but also a database like Postgres, or some system packages like the protobuf compiler, graphviz, and more. Another example is browsers, if you have end-to-end tests using the Playwright framework.
14:41
To pre-install those, we can use a Docker image. We define the commands to install those dependencies in a Dockerfile. In this example, we use the apt-get install command to install Postgres and the protobuf compiler.
15:01
Then we can build the image and publish it to a registry like Docker Hub. In the GitHub workflow, we can specify that image as the container for the workflow job, so we get all the pre-installed non-Python dependencies.
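A sketch of the two pieces (the base image and the myorg/ci-base image name are assumptions):

```dockerfile
# Dockerfile for a CI base image with system dependencies pre-installed
FROM python:3.12-slim
RUN apt-get update && \
    apt-get install -y --no-install-recommends postgresql protobuf-compiler && \
    rm -rf /var/lib/apt/lists/*
```

After pushing the image to a registry, the workflow job can run inside it:

```yaml
jobs:
  run-pytest:
    runs-on: ubuntu-latest
    container:
      image: myorg/ci-base:latest   # illustrative image name on Docker Hub
```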
15:28
This approach could save 10 minutes or more if you have a lot of non-Python dependencies that need to be installed. The next strategy is to skip unnecessary computation.
15:46
We can skip unnecessary tests, and even entire runs, if we inspect the specific code changes and only run the relevant tests.
16:00
For example, your code base may have front-end and back-end code, and you may only want to run the back-end tests when there are back-end changes. So in the GitHub workflow,
16:20
we can use the changed-files action and specify a pattern to detect all the Python file changes. We can also export the result of this action as an output value, has-py-changes, that can be reused by all the other test jobs.
16:46
For example, in our run-pytest job, we can specify that it needs the result of the changed-files job, and we only run pytest when has-py-changes is true.
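A sketch of this wiring, using the tj-actions/changed-files action as one possible implementation (the job, step, and output names are assumptions):

```yaml
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      has-py-changes: ${{ steps.changed.outputs.any_changed }}
    steps:
      - uses: actions/checkout@v4
      - id: changed
        uses: tj-actions/changed-files@v45   # pin to a current release
        with:
          files: "**/*.py"                   # flag any changed Python files
  run-pytest:
    needs: detect-changes
    if: needs.detect-changes.outputs.has-py-changes == 'true'
    runs-on: ubuntu-latest
    steps:
      # ... checkout, dependency install, and so on as before ...
      - run: pytest -vv
```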
17:02
With this, we can skip unnecessary test runs entirely. We can further extend this idea. For example, for some linters, we can run them only on the updated files
17:22
if the code changes will not impact the non-updated files, as with flake8 or other linters. That can make those linters run even faster. Extending the idea further,
17:42
we can also try to modularize the code base if we have a monolith. That way, we can use a build system to run even fewer tests when the code change is within an isolated module.
18:07
Next, we can skip unnecessary code coverage analysis. As we mentioned, the coverage analysis overhead can be large in a large code base.
18:23
By default, --cov measures the coverage for all the files in the project, and that can be slow. We can instead provide --cov=<updated path> for each of the updated paths
18:42
on a pull request in order to only measure the updated files. That reduces the measurement overhead a lot. By doing this, we saved more than one minute in our code base.
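A sketch of the difference (the package paths are illustrative; --cov can be passed multiple times):

```bash
# On the main branch: measure coverage for the whole project
pytest --cov

# On a pull request: measure only the paths that were actually updated
pytest --cov=app/billing --cov=app/invoices
```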
19:03
The last but not least strategy is to use modern runners. One interesting finding about GitHub-hosted runners was that they are slow and expensive when you have large-scale tests.
19:25
We found there are several third-party hosted runner providers. They offer runners with newer-generation CPUs and memory to run your tests faster and cheaper.
19:41
Here are some examples: Namespace, BuildJet, Actuate, and more. For us, we wanted full control and customized runners, so we use self-hosted runners with auto-scaling.
20:00
We used the Actions Runner Controller open-source tool to deploy auto-scaling runners on our Kubernetes cluster. We use AWS EC2 with custom hardware specifications based on our traffic load. We chose the memory size and CPU generation carefully
20:27
to ensure performance. As a result, we achieved a 5x cost saving and 2x faster test speed compared to GitHub-hosted runners.
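Once Actions Runner Controller is deployed, jobs opt in by label; a sketch (the runner label is an assumption and depends on how the runner scale set is named):

```yaml
jobs:
  run-pytest:
    runs-on: my-autoscaling-runner-set   # label of the self-hosted ARC runner group
```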
20:45
So with all the different strategies and optimizations, we were able to continue optimizing our CI pipeline to improve our test execution time and enhance the developer experience.
21:02
As you can see from this chart, the test execution time was initially over 30 minutes, and we were able to optimize it to under 15 minutes. You can see the chart goes up and down,
21:25
because the test execution time becomes slower over time as developers add more tests. So we must continuously apply our strategies to optimize it in order to maintain a good experience.
21:45
In the meantime, we also increased the test coverage from 50% to 60% over the past year. That ensured better code quality in our code base.
22:03
Let's do a quick recap. We shared our four strategies for tackling slow tests in a large code base: parallel execution, caching, skipping unnecessary computation, and modernizing runners.
22:24
By thinking through those strategies, we were able to continuously find more opportunities over time to optimize our CI pipeline.
22:43
Thank you for your attention. If you are interested in our other engineering work, you can find our engineering blog. We also have some job opportunities; if you are interested, you can follow those links.
23:02
Now we have some time for Q&A. Okay, thank you, Jimmy. Now I would like to invite anyone from the crowd who has a question to line up behind any of those two awesome microphones and ask away.
23:24
In the meantime, I do have a question. So you were talking about the coverage, and there's a percentage. What's the point, what's the percentage where you start to feel comfortable in your code base and safe?
23:43
Yeah, as you can see, we have reached 86% coverage at this moment, and we do that by requiring at least 75% code coverage on each pull request's code changes.
24:01
That's the number we feel comfortable with. Over time, we are also doing a lot of automated refactoring to tackle the tech debt, and we found those tests are very useful for detecting code change failures.
24:21
So I would suggest 75%, or even better, 80% as a threshold if you want to enforce a check on your pull requests. Brilliant, thank you, and we have a question from the crowd. Yeah, very nice talk, thank you.
24:40
I understood that for speeding up the testing, you just check the coverage for the modified files. In doing that, I guess this is kind of only checking the coverage of these files. I guess you are also doing an overall code testing,
25:03
where you do a full test run with a full coverage check. Is it like this? Yeah, so the tests are run when the pull request is submitted.
25:25
That means we have some updated files, and we only measure the code coverage on those updated files so we can provide a faster signal to our developers.
25:42
After the code change is reviewed and has passed the tests, when we merge it into the main branch, we run the test coverage measurement on the entire code base in order to keep track of the overall test coverage over time as a quality metric.
26:05
You may also want to publish the full test coverage reports, because that's useful for the code owners to understand the coverage of the code
26:24
they own and plan some quality improvements. Okay, so in that case, when you do the full test run, how long does it take for the full code base with 10,000 tests and so on?
26:46
You must have an idea. Thank you very much. Thank you. Yeah, thank you. I do have one more question. So if you have expensive tests, for example, in your test code base, what is the best way to exclude them from the usual run
27:04
if you want to run them only occasionally, like marking them or offloading them to a specific file? Sorry, let me try to understand your question.
27:20
Are you asking about tests that you only want to run occasionally? Yes, like they're heavy, expensive. Yeah, so ideally you want to run the tests when the corresponding code is updated.
27:42
Yes, and so I wouldn't suggest skipping them just because they are too expensive, unless the tests may be affected by some external factor.
28:01
For example, they are testing the integration with third-party systems, or some configs may change over time. In that case you could potentially set up some periodic tests. I have also seen, in some other code bases,
28:22
tests that are just too expensive, too slow, and the developers cannot afford to run them on each pull request change. In that code base, they set up a recurring job to run the tests every few hours
28:48
and then used the results to find the failures. But I think this approach is less efficient, because you will only find a test failure after the code is merged into the main branch,
29:06
and the developer will require extra effort to fix the errors. So I think the most efficient way is still to run the tests for the updated code
29:22
whenever you submit a change. Okay, thank you very much. We have maybe one question left, but a really quick one. Thank you for the talk. I wanted to ask if you have some tests which cannot run in parallel and how you tackle them, for example, those which are in conflict
29:40
or share some object which can get messed up if two tests access that same object in parallel. Yeah, so we used the pytest-xdist and pytest-split plugins to run tests in parallel,
30:02
and with that approach we do find some tests that share state, where the execution order can cause test failures. Those are flaky tests. As the number of tests grows over time, there are more flaky tests.
30:24
So we actually have some retry steps in our test pipeline to verify the flakiness of the tests. And we have a workflow to report the flaky tests to the code owners
30:42
to get their attention and get them fixed so they don't block the developers. We also automatically quarantine flaky tests when they are blocking code changes. Thank you very much. That's all the time we have, so a big hand for Jimmy.
31:03
Thank you for your attention.