Fighting Fragility With Property-Based Testing

Previously published on DZone.
However long you work in software, you always feel late to the party. You encounter some seemingly cutting-edge new tool only to learn it has been around for decades, sometimes inspired by research papers from 1970. Still, you can’t keep up with everything and have a life. Property-based testing (PBT) is such an established technology, and it deserves more attention. Java has the jqwik library, Scala has ScalaCheck, and Python has Hypothesis.

Check the links at the end for some good tutorials. Here I want to skip the nitty-gritty and focus in detail on its killer feature, which is the ability to warn when some change to production code is no longer sufficiently covered by a test suite.

For the uninitiated: PBT validates so-called properties. I like the definition from ScalaCheck best: A property is a high-level specification of behavior that should hold for a range of data points. For example, a property might state that the size of a list returned from a method should always be greater than or equal to the size of the list passed to that method. This property should hold no matter what list is passed.

To test properties a PBT framework creates data that is arbitrary – not random! – and constrained within a specific range, for example, integers between 10 and 100. Jqwik has an extensive API to create such arbitrary data and clever logic to try out as many scenarios as needed to break a test. What’s the point of trying out so many possible inputs when usually only a few are relevant? Bear with me.
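To make the idea of constrained inputs concrete, here is a deliberately naive sketch in plain Java. It is not how jqwik works internally: the class and method names are hypothetical, the generator is just a seeded java.util.Random plus the two range edges, and it assumes a small range so that max - min + 1 does not overflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.IntPredicate;

// Hypothetical toy property checker, for illustration only (not jqwik).
class TinyPropertyCheck {

    // Candidate inputs within [min, max]: both edges first,
    // followed by 'tries' seeded pseudo-random values.
    static List<Integer> candidates(int min, int max, int tries, long seed) {
        List<Integer> values = new ArrayList<>(List.of(min, max));
        Random random = new Random(seed);
        for (int i = 0; i < tries; i++) {
            values.add(min + random.nextInt(max - min + 1));
        }
        return values;
    }

    // Returns the first input that falsifies the property, or null if none does.
    static Integer firstFalsifier(List<Integer> inputs, IntPredicate property) {
        for (int value : inputs) {
            if (!property.test(value)) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Integer> inputs = candidates(10, 100, 1000, 42L);
        // Holds for every integer between 10 and 100, so no falsifier is found.
        System.out.println(firstFalsifier(inputs, v -> v * v >= 100));
        // Deliberately wrong property: the upper edge 100 falsifies it.
        System.out.println(firstFalsifier(inputs, v -> v < 100));
    }
}
```

A real framework adds a great deal on top of this loop, such as smarter generators, reporting, and shrinking of failing samples.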

Finding the Relevant Edge Cases

For practical purposes, the range of valid inputs to a method under test is often infinite, especially with strings and large numeric types. PBT cannot try them all, and why should it? If results are predictable for a range of input values, we traditionally need only validate the significant edge cases. If, say, a function that squares an integer returns 25 for 5 and 100 for 10, we trust that inputs 6 to 9 hold no surprises. Testing 10, -10, and 0 should be enough. Traditional unit testing relies on handpicked scenarios because it doesn’t make sense to test everything. Or so you think.

Imagine an embarrassingly simple function that calculates some monetary amount based on a person’s age. Here’s the specification:

  • A valid age must lie between zero and 125. Let’s be on the safe side.
  • Only people 18 years and older are eligible.
  • An eligible age returns 200 euros.
  • A non-eligible age returns zero.

It seems we have three significant numbers: 0, 18, and 125.

public class BenefitCalculator {
    int calculateBenefitForAge(int age) {
        if (age < 0 || age > 125)
            throw new IllegalArgumentException("Age is out of range [0-125]: " + age);
        return age < 18 ? 0 : 200;
    }
}

If you rephrase the rules as general statements about a range of values, you get the properties:

  • any input less than zero is not acceptable
  • any input greater than 125 is not acceptable
  • any input between 0 and 17 returns zero
  • any input between 18 and 125 returns 200

In traditional scenario-based unit testing, we set up our parameters around the edges of significant values. For numbers we take the nearest neighbor that produces a different result than said value:

  • 0 does not throw, but -1 does.
  • 125 does not throw, but 126 does. 
  • 18 returns 200 while 17 returns 0.

JUnit’s parameterized tests are an elegant mechanism to test the valid age range and the returned benefit amount without too much duplication.

// The test runs once per array entry. The comma-separated values must match the test method parameters in size and type.
@ParameterizedTest
@CsvSource({"0,-1", "125,126"})
public void test_valid_age_range(int inRange, int outOfRange) {
    assertThatCode(() -> calculator.calculateBenefitForAge(inRange)).doesNotThrowAnyException();
    assertThatThrownBy(() -> calculator.calculateBenefitForAge(outOfRange)).hasMessageStartingWith("Age is out of range");
}
@ParameterizedTest
@CsvSource({"17,0", "18,200"})
public void validate_benefit_amount_for_age(int age, int benefit) {
    assertThat(calculator.calculateBenefitForAge(age)).isEqualTo(benefit);
}

Don’t Stop at the Edges

PBT, on the other hand, does not stop at the edges. It explores the entire range. Here are our four properties expressed in code:

@Property
public void for_every_input_greater_than_125_the_function_throws(@ForAll @IntRange(min = 126) int age) {
    assertThatThrownBy(() -> calculator.calculateBenefitForAge(age));
}
@Property
public void for_every_input_less_than_zero_the_function_throws(@ForAll @Negative int age) {
    assertThatThrownBy(() -> calculator.calculateBenefitForAge(age));
}
@Property
public boolean any_input_between_0_and_17_returns_0(@ForAll @IntRange(max = 17) int age) {
    return calculator.calculateBenefitForAge(age) == 0;
}
@Property
public boolean any_input_between_18_and_125_returns_200(@ForAll @IntRange(min = 18, max = 125) int age) {
    return calculator.calculateBenefitForAge(age) == 200;
}

@Property marks the method as a jqwik test. Nothing is needed at the class level. @ForAll instructs the framework to try random values of the age parameter it annotates. @IntRange adds a necessary constraint. A property fails when it returns false or throws.

Maybe I did not win you over yet. The parameterized approach is less verbose (two versus four methods) than PBT and arguably more readable. It is certainly more performant: the default number of tries in jqwik is a thousand, or until all combinations have been exhausted. All this checking seems excessive. There’s no scenario between 18 and 125 after all where the code would suddenly behave differently.

But on a point of principle: traditional unit tests do not (in)validate properties, they only check handpicked examples. Suppose we augment the logic and squeeze in a special case for persons between 40 and 64 years old. We add the following property:

  • any input between 40 and 64 returns 300

But we’re not done! The existing properties must be adjusted and augmented:

  • any input less than zero is not acceptable
  • any input greater than 125 is not acceptable
  • any input between 0 and 17 returns zero
  • any input between 18 and 125 returns 200 now becomes “any input between 18 and 39 returns 200”
  • any input between 40 and 64 returns 300 (new)
  • any input between 65 and 125 returns 200 (new)

public int calculateBenefitForAge(int age) {
    if (age < 0 || age > 125) {
        throw new IllegalArgumentException("Age is out of range [0-125]: " + age);
    } else if (age < 18) {
        return 0;
    } else if (age >= 40 && age < 65) {
        return 300;
    } else {
        return 200;
    }
}

This code change has invalidated property 4, but the two original parameterized tests still succeed. That is because the test is strictly correct. The parameters 17 and 18 of the edge case still hold. But it tacitly suggests that every valid age of 18 and over yields the same output, and that is no longer true. 17 and 18 are no longer the whole truth. The test turns a blind eye to numbers 40 and 65, which constitute new edge-case scenarios. That doesn’t feel right. This is a simple code change, with possibly big repercussions. After a change in behavior like this one, you would expect at least one test to fail.
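The gap can be shown without any test framework. The following is a minimal plain-Java sketch: the class and helper names are hypothetical, and the brute-force loop over 18 to 125 merely stands in for what a property would check. It runs the original handpicked examples and the original fourth property against the changed logic.

```java
import java.util.OptionalInt;
import java.util.stream.IntStream;

// Hypothetical stand-alone demo; not the article's test class and not jqwik.
class StaleExamplesDemo {

    // The changed production logic with the special case for ages 40 to 64.
    static int calculateBenefitForAge(int age) {
        if (age < 0 || age > 125) {
            throw new IllegalArgumentException("Age is out of range [0-125]: " + age);
        } else if (age < 18) {
            return 0;
        } else if (age >= 40 && age < 65) {
            return 300;
        } else {
            return 200;
        }
    }

    // The original handpicked examples: 17 yields 0 and 18 yields 200.
    static boolean handpickedExamplesPass() {
        return calculateBenefitForAge(17) == 0 && calculateBenefitForAge(18) == 200;
    }

    // The original fourth property, checked over the whole range:
    // the first age in [18, 125] that does not yield 200, if there is one.
    static OptionalInt firstCounterexampleToProperty4() {
        return IntStream.rangeClosed(18, 125)
                .filter(age -> calculateBenefitForAge(age) != 200)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println("handpicked examples pass: " + handpickedExamplesPass());
        System.out.println("property 4 counterexample: " + firstCounterexampleToProperty4());
    }
}
```

The handpicked check still reports true, while the range check surfaces 40 as the first age that breaks the old assumption.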

We Practice Strict TDD, Except on Friday Afternoon

Now the above is never a problem because as TDD adepts we develop our tests and production code strictly in tandem, right? You even add the extra test cases before you touch the production code. Unless of course, it was time for Friday afternoon drinks, in which case you can add the test after your holidays, obviously.

So much for sarcasm. Proper PBT is a little more verbose, as you write a separate method for each property. But that is a much better safeguard to ensure that the suite is in sync with the production code. A code change is more likely to break a property than a parameterized test. The property that guaranteed return value 200 for inputs 18 to 125 now fails instantly. You tell me if adding these four extra edge cases is more user-friendly than the PBT approach.

@CsvSource({"17,0", "18,200", "39,200", "40,300", "64,300", "65,200"})

More Unknown Unknowns

PBT is also great at other ‘unknown unknowns’. Imagine some calculation where a non-obvious input leads to a division by zero further down the line. Or consider something far less intricate, for the more mathematically challenged like yours truly.

public static int square(int input) {
    return input * input;
}

This won’t work for every input: for any value greater than 46340 or less than -46340, the square is too large for a Java int. It’s easy to miss, but jqwik will tell you so:

@Property
public boolean all_int_ranges_are_valid(@ForAll int input) {
    return NumberUtils.square(input) >= 0;
}

Property [SquarePropertyTest:all int ranges are valid] failed with sample {0=46341}

Jqwik doesn’t arrive at this edge case by accident. After the first failure, it zooms in to find the failing value that is closest to a passing value in a process called shrinking. PBT forces you to think and refine your properties and assists in the process. You would probably use a long for the return type and add an explicit range check at the top of the method.
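Here is one way those two remedies could look, as a sketch rather than the definitive fix. The class name, method names, and the MAX_INT_ROOT constant are my own assumptions for illustration. Note that once the result is a long, every int input is safe, so the explicit range check only matters if you keep the int return type.

```java
// Hypothetical helper class; names are illustrative, not from an existing library.
class SafeSquare {

    // Largest absolute value whose square still fits in an int:
    // 46340 * 46340 = 2147395600, while 46341 * 46341 = 2147488281 exceeds Integer.MAX_VALUE.
    static final int MAX_INT_ROOT = 46340;

    // Remedy 1: widen to long before multiplying. The square of any int fits in a long.
    static long squareAsLong(int input) {
        return (long) input * input;
    }

    // Remedy 2: keep the int return type, but reject inputs that would overflow.
    static int squareChecked(int input) {
        if (input > MAX_INT_ROOT || input < -MAX_INT_ROOT) {
            throw new IllegalArgumentException("Square does not fit in an int: " + input);
        }
        return input * input;
    }

    public static void main(String[] args) {
        int wrapped = 46341 * 46341;                 // silently wraps around to a negative number
        System.out.println(wrapped < 0);             // true
        System.out.println(squareAsLong(46341));     // 2147488281
    }
}
```

With either variant, the property that the result is never negative holds for every input that the method accepts.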

Who Watches the Watchmen?

Although very different technologies, PBT shares a trait with mutation testing. They add a touch of controlled randomness to improve the robustness of your tests. Mutation testing does so by purposely messing up your byte code (introducing so-called mutants), the idea being that these changes should cause existing tests to fail, called “killing the mutants”. PBT tries your tests across a wide range of values that you assume should yield a predictable result. It puts those very assumptions to the test. The results will often catch you off guard and lead to more robust tests and production code.
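To illustrate how the two techniques meet, here is a hand-written stand-in for a mutant. Real mutation testing tools generate such changes automatically; this sketch only assumes one typical mutation, a shifted boundary in the original age rule, and all names in it are hypothetical. Examples picked far from the boundary let the mutant survive, while the range-wide properties kill it.

```java
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

// Hypothetical demo of a surviving versus a killed mutant; not output of a real mutation tool.
class MutantDemo {

    // The original rule for valid ages, and a mutant with the boundary shifted by one.
    static final IntUnaryOperator ORIGINAL = age -> age < 18 ? 0 : 200;
    static final IntUnaryOperator MUTANT = age -> age <= 18 ? 0 : 200;

    // A weak example-based test that stays away from the boundary.
    static boolean weakExamplesPass(IntUnaryOperator rule) {
        return rule.applyAsInt(10) == 0 && rule.applyAsInt(30) == 200;
    }

    // The two value properties, checked over their full ranges.
    static boolean propertiesHold(IntUnaryOperator rule) {
        return IntStream.rangeClosed(0, 17).allMatch(age -> rule.applyAsInt(age) == 0)
                && IntStream.rangeClosed(18, 125).allMatch(age -> rule.applyAsInt(age) == 200);
    }

    public static void main(String[] args) {
        System.out.println("weak examples pass on mutant: " + weakExamplesPass(MUTANT));
        System.out.println("properties hold on original: " + propertiesHold(ORIGINAL));
        System.out.println("properties hold on mutant: " + propertiesHold(MUTANT));
    }
}
```

The weak examples pass on both versions, so the mutant survives them. The properties hold only on the original.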

Further Reading