Teaching and Assessing Property-Based Testing

Posted on 10 January 2021.

Property-Based Testing (PBT) sees increasing use in industry, but lags significantly behind in education. Many academics have never even heard of it. This isn’t surprising; computing education still hasn’t come to terms with even basic software testing, even when it can address pedagogic problems. So this lag is predictable.

The Problem of Examples

But even people who want to use it often struggle to find good examples of it. Reversing a list drives people to drink, and math examples are hard to relate to. This is a problem from several respects. Without compelling examples, nobody will want to teach it. Even if they do, unless the examples are compelling, students will not pay attention to it. And if the students don’t, they won’t recognize opportunities to use it later in their careers.

This loses much more than a testing technique. We consider PBT a gateway to formal specification. Like a formal spec, it’s an abstract statement about behaviors. Unlike a formal spec, it doesn’t require learning a new language or mathematical formalism, it’s executable, and it produces concrete counter-examples. We therefore use it, in Brown’s Logic for Systems course, as the starting point to more formal specifications. (If they learn nothing about formal specification but simply become better testers, we’d still consider that a win.)

Therefore, for the past 10 years, and with growing emphasis, we’ve been teaching PBT: starting in our accelerated introductory course, then in Logic for Systems, and gradually in other courses as well. But how do we motivate the concept?

Relational Problems

We motivate PBT through what we call relational problems. What are those?

Think about your typical unit test. You write an input-output pair: f(x) is y. Let’s say it fails:

Usually, the function f is wrong. Congrats, you’ve just caught a bug!
Sometimes, the test is wrong: f(x) is not, in fact, y. This can take some reflection, and possibly reveals a misunderstanding of the problem.

That’s usually where the unit-testing story ends. However, there is one more possibility:

Neither is “wrong”. f(x) has multiple legal results, w, y, and z; your test chose y, but this particular implementation happened to return z or w instead.

We call these “relational” because f is clearly more a relation than a function.

Some Examples

So far, so abstract. But many problems in computing actually have a relational flavor:

Consider computing shortest paths in a graph or network; there can be many shortest paths, not just one. If we write a test to check for one particular path, we could easily run into the problem above.
Many other graph algorithms are also relational. There are many legal answers, and the implementation happens to pick just one of them.
Non-deterministic data structures inspire relational behavior.
Various kinds of matching problems—e.g., the stable-marriage problem—are relational.
Combinatorial optimization problems are relational.
Even sorting, when done over non-atomic data, is relational.

In short, computing is full of relational problems. While they are not at all the only context in which PBT makes sense, they certainly provide a rich collection of problems that students already study that can be used to expose this idea in a non-trivial setting.

Assessing Student Performance

Okay, so we’ve been having students write PBT for several years now. But how well do they do? How do we go about measuring such a question? (Course grades are far too coarse, and even assignment grades may include various criteria—like code style—that are not strictly germane to this question.) Naturally, their core product is a binary classifier—it labels a purported implementation as valid or invalid—so we could compute precision and recall. However, these measures still fail to offer any semantic insight into how students did and what they missed.

We therefore created a new framework for assessing this. To wit, we took each problem’s abstract property statement (viewed as a formal specification), and sub-divided it into a set of sub-properties whose conjunction is the original property. Each sub-property was then turned into a test suite, which accepted those validators that enforced the property and rejected those that did not. This let us get a more fine-grained understanding of how students did, and what kinds of mistakes they made.

Want to Learn More?

If you’re interested in this, and in the outcomes, please see our paper.

What’s Next?

The results in this paper are interesting but preliminary. Our follow-up work describes limitations to the approach presented here, thereby improving the quality of evaluation, and also innovates in the generation of classifying tests. Check it out!