Regression tests without false positives
This package is buggy, examples are only partially tested, and the API is under active development.
Set up your package directory like this:
```
MyPackage
├── Project.toml
├── src
│   └── MyPackage.jl
├── test
│   └── runtests.jl
└── bench
    └── runbenchmarks.jl
```
Put this in your `MyPackage.jl` file:

```julia
module MyPackage

function compute()
    return sum(rand(100))
end

end
```
And commit your changes. This is our baseline.
Now, let's add regression tests. Put this in your `test/runtests.jl` file:

```julia
import RegressionTests
RegressionTests.test()
```
And put this in your `bench/runbenchmarks.jl` file:

```julia
using RegressionTests, Chairmarks, MyPackage

@track (@b MyPackage.compute() seconds=.01).time
```
The `@b` macro, from Chairmarks, will benchmark the `compute` function, and the `@track` macro from RegressionTests will track the result of that benchmark.

Then run your package tests with `]test`. The tests should pass and report that no regressions were found.
Now, let's introduce a 10% regression. Change the `compute` function to this:

```julia
function compute()
    return sum(rand(110))
end
```

And rerun `]test`. The tests should fail and display the result of the regression test.
Any time RegressionTests.jl is loaded, you can use `]bench` to run your benchmarks and report the results, which you can then revisit later by accessing `RegressionTests.RESULTS`. To make the most of this feature, you can add `using RegressionTests` to your `startup.jl` file.
All the various ways of running benchmarks with this package funnel through a `runbenchmarks` function which performs a randomized controlled trial comparing two versions of a package. Each datapoint is the result of restarting Julia, loading a randomly chosen version of the target package, and recording the tracked values. The results are then compared in a value-independent manner that makes no assumptions about the actual values of the targets (other than that they are real numbers).
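The trial procedure above can be sketched roughly as follows. This is a simplified illustration, not the package's implementation: the hypothetical `benchmark_v1`/`benchmark_v2` functions stand in for restarting Julia with each version of the target package and collecting the `@track`ed values.

```julia
# Simplified sketch of the randomized controlled trial (illustration only).
# The real runbenchmarks restarts Julia for every datapoint; these two
# hypothetical functions stand in for the two package versions.
benchmark_v1() = sum(rand(100))  # stand-in for the baseline version
benchmark_v2() = sum(rand(110))  # stand-in for the candidate version

function run_trial(n_samples)
    results = (Float64[], Float64[])  # tracked values per version
    for _ in 1:n_samples
        version = rand(1:2)  # each datapoint gets a randomly chosen version
        value = version == 1 ? benchmark_v1() : benchmark_v2()
        push!(results[version], value)
    end
    return results  # later compared in a value-independent manner
end

v1, v2 = run_trial(10_000)
```

Randomizing the version per datapoint (rather than measuring one version, then the other) is what makes the comparison robust to drift in machine state over the course of the trial.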
We make the following statistical claims for each tracked value `t`:

- If the distribution of `t` is independent of the version being tested, then this will report a change with probability approximately `1e-10`.
- If the distributions of `t` on the two tested versions differ[^1] by at least `k ≥ .05`, then this will report a change with probability `≥ 0.95`[^2].
- All reported changes are tagged as either increases, decreases, or both.
- If all percentiles of `t` on the primary version are greater than or equal to their corresponding values on the comparison version, then `t` will be incorrectly reported as a decrease with probability `≤ 1e-5` (and vice versa).
- If there is an increase with significance[^1] `k ≥ .05`, then that increase will be reported with probability `≥ 0.95`.
Note: the numbers in these statistical claims are based on empirical data. They are likely accurate, but we're still looking for proofs and closed forms.
| Julia version | Linux | MacOS | Windows | Other |
|---|---|---|---|---|
| ≤0.7 | ❌ | ❌ | ❌ | ❌ |
| 1.0 | | | | |
| [1.1, 1.5] | | | | |
| 1.6 | | | | |
| [1.7, 1.8] | | | | |
| 1.9 | ✅+ | ✅+ | ? | |
| [1.10, stable) | ✅ | ✅ | ? | |
| stable | ✅+ | ✅+ | ? | |
| nightly | ?+ | ?+ | ?+ | ? |
- ❌ Not supported
- `RegressionTests.test(skip_unsupported_platforms=true)` works
- ✅ Supported
- ? Unknown and subject to change at any time
- + Tested in CI
While this package claims to report almost all significant changes, to have no false positives, and to never report anything that is unchanged, we make no claims about insignificant but nonzero changes. If the distributions of possible outcomes differ by some `0 < k < .05`, then we may or may not report a change, with no probability guarantees. Consequently, if you run repeated tests and find that some runs report changes and others do not, you may conclude with certainty both that there is a change and that it is not a significant change from a statistical perspective.
[^1]: Significance is measured by the integral from 0 to 1 of `(cdf(g)(invcdf(f)(x)) - x)^2`. This can be thought of as the squared area of deviation from x = y in the cdf/cdf plot. When referring to increases or decreases, we only count area on one side of the x = y line. The gist of this is that we report a positive result for anything that can be efficiently detected with low false positive rates.

[^2]: More generally, for any `k > .025`, recall loss is, according to empirical estimation, at most `max(1e-4, 20^(1-k/.025))`. So, for example, a regression with `k = .1` will escape detection at most 1 out of 8000 times.
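As a concrete illustration of the significance metric in footnote 1, the integral can be evaluated numerically. This sketch is not part of the package; it picks two hypothetical uniform distributions so that both the cdf and the inverse cdf have closed forms:

```julia
# Numeric sketch of the significance metric from footnote 1 (illustration
# only), for f = Uniform(0, 1) and g = Uniform(0, 1.1).
cdf_g(x) = x / 1.1  # cdf of Uniform(0, 1.1)
invcdf_f(p) = p     # inverse cdf (quantile function) of Uniform(0, 1)

# Midpoint Riemann sum for the integral from 0 to 1 of
# (cdf(g)(invcdf(f)(x)) - x)^2.
n = 100_000
significance = sum(1:n) do i
    x = (i - 0.5) / n
    (cdf_g(invcdf_f(x)) - x)^2
end / n

# Closed form for this pair: (1/1.1 - 1)^2 / 3 ≈ 0.00275
```

For these two distributions the metric comes out well below the `k ≥ .05` threshold, so under the claims above a shift of this shape carries no detection guarantee.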