The Examplar Project: A Summary

Posted on 01 January 2024.

For the past several years, we have worked on a project called Examplar. This article summarizes the goals and methods of the project and provides pointers to more detailed articles describing it.

Context

When faced with a programming problem, computer science students all too often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission, depriving them of the intended learning goal of the assignment.

How can we help them with this? A common practice, used across disciplines, is to tell students to "explain the problem in your own words". This is a fine strategy, except that it demands far too much of the student. Any educator who has tried it knows that most students understandably stumble through the exercise, usually because they don't have any better words than the ones already in the problem statement. So what was meant to be a comprehension exercise becomes a literary one; even if a student can restate the problem articulately, that may reflect verbal skill rather than genuine understanding. And for complex problems, the whole exercise is somewhat futile. It is all made even more difficult when students are not working in their native language.

So we have the kernel of a good idea—asking students to read back their understanding—but words are a poor medium for it.

Examples

Our idea is that writing examples, in the syntax of test cases, is a great way to express understanding:

  • Examples are concrete.

  • It’s hard to be vague.

  • Difficulty writing down examples is usually indicative of broader difficulties with the problem, and a great, concrete way to initiate a discussion with course staff.

  • Writing examples gives students a head start on writing tests, which too many computing curricula undervalue (if they tackle testing at all).

Best of all, because the examples are executable, they can be run against implementations so that students get immediate feedback.

Types of Feedback

We want to give two kinds of feedback:

  • Correctness: Are they even correct? Do they match the problem specification?

  • Thoroughness: How much of the problem space do they cover? Do they rule out misconceptions?

Consider (for example!) the median function. Here are two examples, in the syntax of Pyret:

check:
  median([list: 1, 2, 3]) is 2
  median([list: 1, 3, 5]) is 3
end

These are both correct, as running them against a correct implementation of median will confirm, but are they thorough?

A student could, for instance, easily mistake median for mean. Note that the two examples above do not distinguish between the two functions! So giving students a thumbs-up at this point may still send them down the wrong path: they haven’t expressed enough of an understanding of the problem.

Evaluating Examples

For this reason, Examplar runs the student's examples against multiple implementations. One is a correct implementation (which we call the wheat). (For technical reasons, it can be useful to have more than one correct implementation; see the readings below.) There are also several buggy implementations (called chaffs). Each example is first run against the wheat, to make sure it conforms to the problem specification. It is then run against each of the chaffs. Here's what a recent version looks like:

[Screenshot of Examplar's successor]

Every example is a classifier: its job is to classify an implementation as correct or incorrect, i.e., to separate the wheat from the chaff. Of course, a particular buggy implementation may not be buggy in a way that a particular example catches. But across the board, the collection of examples should do a fairly good job of catching the buggy implementations.
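One way to picture this classification (a sketch of the idea, not Examplar's actual machinery; the function name passes-examples is ours) is to view the example suite as a predicate over candidate implementations of median:

# The two examples from above, viewed as a classifier over implementations.
fun passes-examples(candidate :: (List<Number> -> Number)) -> Boolean:
  (candidate([list: 1, 2, 3]) == 2) and (candidate([list: 1, 3, 5]) == 3)
end

A wheat should satisfy this predicate; a thorough suite is one that most chaffs fail to satisfy.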

Thus, for instance, one of our buggy implementations of median would be mean. Because the two examples above are consistent with mean as well, they would (incorrectly) pass mean instead of signaling an error. If we had no other examples in our suite, we would fail to catch mean as buggy. That translates directly into "the student has not yet confirmed that they understand the difference between the two functions". We would want students to add examples like

  median([list: 1, 3, 7]) is 3

that pass median but not mean to demonstrate that they have that understanding.
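To make this concrete, here is a sketch of what such a chaff might look like in Pyret (the helper name my-sum is ours, and this is not necessarily the chaff any course actually ships):

# Helper: sum of a list of numbers (name ours).
fun my-sum(l :: List<Number>) -> Number:
  cases (List) l:
    | empty => 0
    | link(f, r) => f + my-sum(r)
  end
end

# A chaff sketch: claims to be median, but actually computes the mean.
fun median(l :: List<Number>) -> Number:
  my-sum(l) / l.length()
end

check:
  # The two original examples cannot tell this chaff from the wheat:
  median([list: 1, 2, 3]) is 2
  median([list: 1, 3, 5]) is 3
  # The distinguishing example would fail against it, flagging it as buggy,
  # since (1 + 3 + 7) / 3 is not 3.
end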

Answering Questions

Examplar is also useful as a "24-hour TA". Consider this example:

  median([list: 1, 2, 3, 4]) is ???

What is the answer? There are three possible answers: the left-median (2), the right-median (3), and the mean-median (2.5). A student could post on a forum and wait for a course staffer to read and answer. Or they could simply formulate the question as a test: e.g.,

  median([list: 1, 2, 3, 4]) is 2.5

Exactly one of these three will pass the wheat. That tells the student the definition being used in this course, which may not have been fully specified in the problem statement. Similarly:

  median([list: ]) is ???  # the empty list
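In practice, a student can bundle several candidate answers into a single check block and let the wheat adjudicate. Here is a sketch for the four-element case; only the candidate that matches the course's definition will pass, and the other two will be reported as failing against the wheat:

check:
  median([list: 1, 2, 3, 4]) is 2    # left-median
  median([list: 1, 2, 3, 4]) is 3    # right-median
  median([list: 1, 2, 3, 4]) is 2.5  # mean-median
end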

Indeed, we see students coming to course staff with questions like, “I see that Examplar said that …, and I wanted to know why this is the answer”, which is a fantastic kind of question to hear.

Whence Chaffs?

It’s easy to see where to get the wheat: it’s just a correct implementation of the problem. But how do we get chaffs?

An astute reader will have noticed that we are practicing a form of mutation testing. Therefore, it might be tempting to use mutation testing libraries to generate chaffs. This would be a mistake because it misunderstands the point of Examplar.

We want students to use Examplar before they start programming, and as a warm-up activity to get their minds into the right problem space. That is not the time to be developing a test suite so extensive that it can capture every strange kind of error that might arise. Rather, we think of Examplar as performing what we call conceptual mutation testing: we only want to make sure they have the right conception of the problem, and avoid misconceptions about it. Therefore, chaffs should correspond to high-level conceptual mistakes (like confusing median and mean), not low-level programming errors.
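To make the contrast concrete, here is a sketch (the function names are ours, and neither is drawn from an actual assignment): a machine-generated mutant typically perturbs a low-level detail, while a conceptual chaff encodes a plausible misunderstanding of the problem itself.

# Low-level mutant: the kind a mutation-testing tool might generate.
# The logic is right, but the index is off by one.
fun median-mutant(l :: List<Number>) -> Number:
  l.sort().get(num-floor(l.length() / 2) - 1)
end

# Conceptual chaff: the kind Examplar wants. It forgets that the median is
# defined over the sorted list, a misunderstanding of the problem itself.
fun median-chaff(l :: List<Number>) -> Number:
  l.get(num-floor(l.length() / 2))
end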

Whence Misconceptions?

There are many ways to find out what misconceptions students have. One is by studying the errors we ourselves make while formulating or solving the problem. Another is by seeing what kinds of questions they ask course staff and what corrections we need to make. But there’s one more interesting and subtle source: Examplar itself!

Remember how we said we want examples to first pass the wheat? The failing ones are obviously…well, they’re wrong, but they may be wrong for an interesting reason. For instance, suppose we’ve defined median to produce the mean-median. Now, if a student writes

  median([list: 1, 2, 3, 4]) is 3

they are essentially expressing their belief that they need to solve the right-median problem. Thus, by harvesting these “errors”, filtering, and then clustering them, we can determine what misconceptions students have because they told us—in their own words!

Readings

  • Why you might want more than one wheat: paper

  • Introducing Examplar: blog; paper

  • Student use without coercion (in other words, gamified interfaces help…sometimes a bit too much): blog; paper

  • What help do students need that Examplar can’t provide? blog; paper

  • Turning wheat failures into misconceptions: blog; paper

  • From misconceptions to chaffs: blog; paper

But this work has much earlier origins:

  • How to Design Programs has students write tests before programs. However, these tests are inert (i.e., there’s nothing to run them against), so students see little value in doing so. Examplar eliminates this inertness, and goes further.

  • Before Examplar, we had students provide peer reviews of tests, and found this had several benefits: paper. We also built a (no longer maintained) programming environment to support peer review: paper.

  • One thing we learned is that these test suites can grow too large to generate useful feedback. This caused us to focus on the essential test cases, out of which the idea of examples (as opposed to tests) evolved: paper.