Articles by tag: Testing
Conceptual Mutation Testing
Generating Programs Trivially: Student Use of Large Language Models
Identifying Problem Misconceptions
Automated, Targeted Testing of Property-Based Testing Predicates
Student Help-Seeking for (Un)Specified Behaviors
Teaching and Assessing Property-Based Testing
Students Testing Without Coercion
Combating Misconceptions by Encouraging Example-Writing
The Hidden Perils of Automated Assessment
The Examplar Project: A Summary
Tags: Education, Misconceptions, Testing, Tools
Posted on 01 January 2024. For the past several years, we have worked on a project called Examplar. This article summarizes the goals and methods of the project and provides pointers to more detailed articles describing it.
Context
When faced with a programming problem, computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission, depriving them of the intended learning goal of the assignment.
How can we help them with this? A common practice—used across disciplines—is to tell students, “explain the problem in your own words”. This is a fine strategy, except that it demands far too much of the student. Any educator who has tried this knows that most students understandably stumble through the exercise, usually because they don’t have any better words than those already in the problem statement. So what was meant to be a comprehension exercise becomes a literary one: even if a student restates the problem very articulately, that may reflect verbal skill rather than genuine understanding. For complex problems, the whole exercise is somewhat futile, and it becomes even harder for students who are not working in their native language.
So we have the kernel of a good idea—asking students to read back their understanding—but words are a poor medium for it.
Examples
Our idea is that writing examples—using the syntax of test cases—is a great way to express understanding:
- Examples are concrete.
- It’s hard to be vague.
- Difficulty writing down examples is usually indicative of broader difficulties with the problem, and a great, concrete way to initiate a discussion with course staff.
- It gets a head-start on writing tests, which too many computing curricula undervalue (if they tackle it at all).
Best of all, because the examples are executable, they can be run against implementations so that students get immediate feedback.
Types of Feedback
We want to give two kinds of feedback:
- Correctness: Are they even correct? Do they match the problem specification?
- Thoroughness: How much of the problem space do they cover? Do they dodge misconceptions?
Consider (for example!) the median function. Here are two examples, in the syntax of Pyret:
check:
median([list: 1, 2, 3]) is 2
median([list: 1, 3, 5]) is 3
end
These are both correct, as running them against a correct implementation of median will confirm, but are they thorough?
A student could, for instance, easily mistake median for mean. Note that the two examples above do not distinguish between the two functions! So giving students a thumbs-up at this point may still send them down the wrong path: they haven’t expressed enough of an understanding of the problem.
Evaluating Examples
For this reason, Examplar runs the student’s examples against multiple implementations. One is a correct implementation (which we call the wheat). (For technical reasons, it can be useful to have more than one correct implementation; see more below.) There are also several buggy implementations (called chaffs). Each example is first run against the wheat, to make sure it conforms to the problem specification. It is then run against each of the chaffs. Here’s what a recent version looks like:
Every example is a classifier: its job is to classify an implementation as correct or incorrect, i.e., to separate the wheat from the chaff. Of course, a particular buggy implementation may not be buggy in a way that a particular example catches. But across the board, the collection of examples should do a fairly good job of catching the buggy implementations.
Thus, for instance, one of our buggy implementations of median would be mean. Because the two examples above are consistent with mean as well, they would (incorrectly) pass mean instead of signaling an error. If we had no other examples in our suite, we would fail to catch mean as buggy. That translates directly into “the student has not yet confirmed that they understand the difference between the two functions”. We would want students to add examples like
median([list: 1, 3, 7]) is 3
that pass median but not mean to demonstrate that they have that understanding.
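To make the wheat/chaff mechanism concrete, here is a minimal sketch (written in Python rather than Pyret, purely for illustration); the chaff is hypothetical and simply computes the mean:

```
# A hypothetical chaff for the median problem: it computes the mean instead.
def median_chaff(nums):
    return sum(nums) / len(nums)  # conceptual bug: mean, not median

# The two examples from the text are consistent with this chaff...
assert median_chaff([1, 2, 3]) == 2
assert median_chaff([1, 3, 5]) == 3
# ...but the added example rejects it: the mean of [1, 3, 7] is not 3.
assert median_chaff([1, 3, 7]) != 3
```

Only the third example separates this chaff from the wheat, which is exactly the signal Examplar uses to tell students that their examples have not yet ruled out the mean misconception.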
Answering Questions
Examplar is also useful as a “24 hour TA”. Consider this example:
median([list: 1, 2, 3, 4]) is ???
What is the answer? There are three possible answers: the left-median (2), the right-median (3), and the mean-median (2.5). A student could post on a forum and wait for a course staffer to read and answer. Or they could simply formulate the question as a test, e.g.,
median([list: 1, 2, 3, 4]) is 2.5
One of these three will pass the wheat. That tells the student the definition being used for this course, which may not have been fully specified in the problem statement. Similarly:
median([list: ]) is ??? # the empty list
Indeed, we see students coming to course staff with questions like, “I see that Examplar said that …, and I wanted to know why this is the answer”, which is a fantastic kind of question to hear.
Whence Chaffs?
It’s easy to see where to get the wheat: it’s just a correct implementation of the problem. But how do we get chaffs?
An astute reader will have noticed that we are practicing a form of mutation testing. Therefore, it might be tempting to use mutation testing libraries to generate chaffs. This would be a mistake because it misunderstands the point of Examplar.
We want students to use Examplar before they start programming, and as a warm-up activity to get their minds into the right problem space. That is not the time to be developing a test suite so extensive that it can capture every strange kind of error that might arise. Rather, we think of Examplar as performing what we call conceptual mutation testing: we only want to make sure they have the right conception of the problem, and avoid misconceptions about it. Therefore, chaffs should correspond to high-level conceptual mistakes (like confusing median and mean), not low-level programming errors.
Whence Misconceptions?
There are many ways to find out what misconceptions students have. One is by studying the errors we ourselves make while formulating or solving the problem. Another is by seeing what kinds of questions they ask course staff and what corrections we need to make. But there’s one more interesting and subtle source: Examplar itself!
Remember how we said we want examples to first pass the wheat? The failing ones are obviously…well, they’re wrong, but they may be wrong for an interesting reason. For instance, suppose we’ve defined median to produce the mean-median. Now, if a student writes
median([list: 1, 2, 3, 4]) is 3
they are essentially expressing their belief that they need to solve the right-median problem. Thus, by harvesting these “errors”, filtering, and then clustering them, we can determine what misconceptions students have because they told us—in their own words!
Readings
- Why you might want more than one wheat: paper
- Student use without coercion (in other words, gamified interfaces help…sometimes a bit too much): blog; paper
- What help do students need that Examplar can’t provide? blog; paper

But this work has much earlier origins:

- How to Design Programs has students write tests before programs. However, these tests are inert (i.e., there’s nothing to run them against), so students see little value in doing so. Examplar eliminates this inertness, and goes further.
- Before Examplar, we had students provide peer-review on tests, and found it had several benefits: paper. We also built a (no longer maintained) programming environment to support peer review: paper.
- One thing we learned is that these test suites can grow too large to generate useful feedback. This caused us to focus on the essential test cases, out of which the idea of examples (as opposed to tests) evolved: paper.
Conceptual Mutation Testing
Tags: Education, Misconceptions, Testing, Tools, User Studies
Posted on 31 October 2023. Here’s a summary of the full arc, including later work, of the Examplar project.
The Examplar system is designed to address the documented phenomenon that students often misunderstand a programming problem statement and hence “solve” the wrong problem. It does so by asking students to begin by writing input/output examples of program behavior, and evaluating them against correct and buggy implementations of the solution. Students refine and demonstrate their understanding of the problem by writing examples that correctly classify these implementations.
It is, however, very difficult to come up with good buggy candidates. These programs must correspond to the problem misconceptions students are most likely to have. Otherwise, students can end up spending too much time catching them, and not enough time on the actual programming task. Additionally, a small number of very effective buggy implementations is far more useful than either a large number of them or ineffective ones (much less both).
Our previous research has shown that student-authored examples that fail the correct implementation often correspond to student misconceptions. Buggy implementations based on these misconceptions circumvent many of the pitfalls of expert-generated equivalents (most notably the ‘expert blind spot’). That work, however, leaves unanswered the crucial question of how to operationalize class-sourced misconceptions. Even a modestly sized class can generate thousands of failing examples per assignment. It takes a huge amount of manual effort, expertise, and time to extract misconceptions from this sea of failing examples.
The key is to cluster these failing examples. The obvious clustering method – syntactic – fails miserably: small syntactic differences can result in large semantic differences, and vice versa (as the paper shows). Instead, we need a clustering technique that is based on the semantics of the problem.
This paper presents a conceptual clustering technique based on key characteristics of each programming assignment. These clusters dramatically shrink the space that must be examined by course staff, and naturally suggest techniques for choosing buggy implementation suites. We demonstrate that these curated buggy implementations better reflect student misunderstandings than those generated purely by course staff. Finally, the paper suggests further avenues for operationalizing student misconceptions, including the generation of targeted hints.
You can learn more about the work from the paper.
Generating Programs Trivially: Student Use of Large Language Models
Tags: Large Language Models, Education, Formal Methods, Properties, Testing, User Studies
Posted on 19 September 2023. The advent of large language models like GPT-3 has led to growing concern from educators about how these models can be used and abused by students to help with their homework. In computer science, much of this concern centers on how LLMs can automatically generate programs in response to textual prompts. Some institutions have gone as far as instituting wholesale bans on the use of such tools. Despite all the alarm, however, little is known about whether and how students actually use these tools.
In order to better understand the issue, we gave students in an upper-level formal methods course access to GPT-3 via a Visual Studio Code extension, and explicitly granted them permission to use the tool for course work. In order to mitigate any equity issues around access, we allocated $2500 in OpenAI credit for the course, enabling free access to the latest and greatest OpenAI models.
Can you guess the total dollar value of OpenAI credit used by students?
We then analyzed the outcomes of this intervention: how and why students actually did and did not use the LLM.
Which of these graphs do you think best represents student use of GPT models over the semester?
When surveyed, students overwhelmingly expressed concerns about using GPT to help with their homework. Dominant themes included:
- Fear that using LLMs would detract from learning.
- Unfamiliarity with LLMs and issues with output correctness.
- Fear of breaking course rules, despite being granted explicit permission to use GPT.
Much ink has been spilt on the effect of LLMs in education. While our experiment focuses only on a single course offering, we believe it can help re-balance the public narrative about such tools. Student use of LLMs may be influenced by two opposing forces. On one hand, competition for jobs may cause students to feel they must have “perfect” transcripts, which can be aided by leaning on an LLM. On the other, students may realize that getting an attractive job is hard, and decide they need to learn more in order to pass interviews and perform well to retain their positions.
You can learn more about the work from the paper.
Identifying Problem Misconceptions
Tags: Education, Misconceptions, Testing, Tools
Posted on 15 October 2022. Here’s a summary of the full arc, including later work, of the Examplar project.
Our recent work is built on the documented research that students often misunderstand a programming problem statement and hence “solve” the wrong problem. This not only creates frustration and wastes time, it also robs them of whatever learning objective motivated the task.
To address this, the Examplar system asks students to first write examples. These examples are evaluated against wheats (correct implementations) and chaffs (buggy implementations). Examples must pass the wheat, and correctly identify as wrong as many chaffs as possible. Prior work explores this and shows that it is quite effective.
However, there’s a problem with chaffs. Students can end up spending too much time catching them, and not enough time on the actual programming task. Therefore, you want chaffs that correspond to the problem misconceptions students are most likely to have. Having a small number of very effective chaffs is far more useful than either a large number or ineffective ones (much less both). But the open question has always been, how do we obtain chaffs?
Previously, chaffs were created by hand by experts. This was problematic because it forced experts to imagine the kinds of problems students might have; this is not only hard, it is bound to run into expert blind spots. What other method do we have?
This work is based on a very simple, clever observation: any time an example fails a wheat, it may correspond to a student misconception. Of course, not all wheat failures are misconceptions! A failure could be a typo, a basic logical error, or even an attempt to game the wheat-chaff system. Do we know in what ratio these occur, and can we use the ones that are misconceptions?
This paper makes two main contributions:
- It shows that many wheat failures really are misconceptions.
- It uses these misconceptions to formulate new chaffs, and shows that they compare very favorably to expert-generated chaffs.
Furthermore, the work spans two kinds of courses: one is an accelerated introductory programming class, while the other is an upper-level formal methods course. We show that there is value in both settings.
This is just a first step in this direction; a lot of manual work went into this research, which needs to be automated; we also need to measure the direct impact on students. But it’s a very promising direction in a few ways:
- It presents a novel method for finding misconceptions.
- It naturally works around expert blind-spots.
- With more automation, it can be made lightweight.
In particular, if we can make it lightweight, we can apply it to settings—even individual homework problems—that also manifest misconceptions that need fixing, but could never afford heavyweight concept-inventory-like methods for identifying them.
You can learn more about the work from the paper.
Automated, Targeted Testing of Property-Based Testing Predicates
Tags: Formal Methods, Properties, Testing, Verification
Posted on 24 November 2021. Property-Based Testing (PBT) is not only a valuable software quality improvement method in its own right, it’s a critical bridge between traditional software development practices (like unit testing) and formal specification. We discuss this in our previous work on assessing student performance on PBT. In particular, we introduce a mechanism to investigate how students do by decomposing the property into a collection of independent sub-properties. This gives us semantic insight into how students perform: rather than a binary scale, we can identify specific sub-properties that they may have difficulty with.
While this preliminary work was very useful, it suffered from several problems, some of which are not surprising while others became clear to us only in retrospect. In light of that, our new work makes several improvements.
- The previous work expected each of the sub-properties to be independent. However, this is too strong a requirement. For one, it masks problems that can lurk in the conjunction of sub-properties. The other problem is more subtle: when you see a surprising or intriguing student error, you want to add a sub-property that would catch that error, so you can generate statistics on it. However, there’s no reason the new property will be independent; in fact, it almost certainly won’t be.
- Our tests were being generated by hand, with one exception that was so subtle, we employed Alloy to find the test. Why only once? Why not use Alloy to generate tests in all situations? And while we’re at it, why not also use a generator from a PBT framework (specifically, Hypothesis)?
- And if we’re going to use both value-based and SAT-based example generators, why not compare them?
This new paper does all of the above. First, it results in a much more flexible, useful tool for assessing student PBT performance. Second, it revisits our previous findings about student performance. Third, it lays out architectures for PBT evaluation using SAT and a PBT generator (specifically Hypothesis); in the process it explains various engineering issues we needed to address. Fourth, it compares the two approaches; it also compares how the two approaches did relative to hand-curated test suites.
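As a rough illustration of the sub-property idea (a sketch only: the sorting property and all names below are ours, not the paper’s), a property can be split into conjuncts that are checked separately. In Python with Hypothesis:

```
from hypothesis import given, strategies as st

def is_sorted(out):
    # sub-property 1: the output is non-decreasing
    return all(a <= b for a, b in zip(out, out[1:]))

def is_permutation(inp, out):
    # sub-property 2: the output has exactly the input's elements
    return sorted(inp) == sorted(out)

def full_property(inp, out):
    # the full property is the conjunction of the sub-properties
    return is_sorted(out) and is_permutation(inp, out)

@given(st.lists(st.integers()))
def test_property_holds_for_correct_sort(xs):
    # generated check: the full property must hold of genuinely sorted output
    assert full_property(xs, sorted(xs))
```

Checking a student-written predicate against each sub-property separately, rather than only against the conjunction, is what yields the finer-grained, semantic picture of where students struggle.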
You can read about all this in our paper.
Student Help-Seeking for (Un)Specified Behaviors
Tags: Education, Misconceptions, Testing, Tools, User Studies
Posted on 02 October 2021. Here’s a summary of the full arc, including later work, of the Examplar project.
Over the years we have done a lot of work on Examplar, our system for helping students understand the problem before they start implementing it. Given that students even use it voluntarily (perhaps a bit too much), it would seem to be a success story.
However, a full and fair scientific account of Examplar should also examine where it fails. To that end, we conducted an extensive investigation of all the posts students made on our course help forum for a whole semester, identified which posts had to do with problem specification (and under-specification!), and categorized how helpful or unhelpful Examplar was.
The good news is that we saw several cases where Examplar had been directly helpful to students. These should indeed be considered a lower bound, because the point of Examplar is to “answer” many questions directly, so those questions would never even make it onto the help forum.
But there is also bad news. To wit:
- Students sometimes simply fail to use Examplar’s feedback; is this a shortcoming of the UI, of the training, or something inherent to how students interact with such systems?
- Students tend to overly focus on inputs, which are only a part of the suite of examples.
- Students do not transfer lessons from earlier assignments to later ones.
- Students have various preconceptions about problem statements, such as imagining functionality not asked for or constraints not imposed.
- Students enlarge the specification beyond what was written.
- Students sometimes just don’t understand Examplar.
These serve to spur future research in this field, and may also point to the limits of automated assistance.
To learn more about this work, and in particular to get the points above fleshed out, see our paper!
Teaching and Assessing Property-Based Testing
Tags: Education, Formal Methods, Properties, Testing, User Studies
Posted on 10 January 2021. Property-Based Testing (PBT) sees increasing use in industry, but lags significantly behind in education. Many academics have never even heard of it. This isn’t surprising: computing education still hasn’t come to terms with even basic software testing, even when testing can address pedagogic problems. So this lag is predictable.
The Problem of Examples
But even people who want to use it often struggle to find good examples of it. Reversing a list drives people to drink, and math examples are hard to relate to. This is a problem in several respects. Without compelling examples, nobody will want to teach PBT; even if they do, students will not pay attention to it; and if students don’t pay attention, they won’t recognize opportunities to use it later in their careers.
This loses much more than a testing technique. We consider PBT a gateway to formal specification. Like a formal spec, it’s an abstract statement about behaviors. Unlike a formal spec, it doesn’t require learning a new language or mathematical formalism, it’s executable, and it produces concrete counter-examples. We therefore use it, in Brown’s Logic for Systems course, as the starting point to more formal specifications. (If they learn nothing about formal specification but simply become better testers, we’d still consider that a win.)
Therefore, for the past 10 years, and with growing emphasis, we’ve been teaching PBT: starting in our accelerated introductory course, then in Logic for Systems, and gradually in other courses as well. But how do we motivate the concept?
Relational Problems
We motivate PBT through what we call relational problems. What are those?
Think about your typical unit test. You write an input–output pair: f(x) is y. Let’s say it fails:

- Usually, the function f is wrong. Congrats, you’ve just caught a bug!
- Sometimes, the test is wrong: f(x) is not, in fact, y. This can take some reflection, and possibly reveals a misunderstanding of the problem.

That’s usually where the unit-testing story ends. However, there is one more possibility:

- Neither is “wrong”. f(x) has multiple legal results, w, y, and z; your test chose y, but this particular implementation happened to return z or w instead.

We call these “relational” because f is clearly more a relation than a function.
Some Examples
So far, so abstract. But many problems in computing actually have a relational flavor:
- Consider computing shortest paths in a graph or network; there can be many shortest paths, not just one. If we write a test to check for one particular path, we could easily run into the problem above.
- Many other graph algorithms are also relational. There are many legal answers, and the implementation happens to pick just one of them.
- Non-deterministic data structures inspire relational behavior.
- Various kinds of matching problems—e.g., the stable-marriage problem—are relational.
- Combinatorial optimization problems are relational.
- Even sorting, when done over non-atomic data, is relational.
In short, computing is full of relational problems. While they are not at all the only context in which PBT makes sense, they certainly provide a rich collection of problems that students already study that can be used to expose this idea in a non-trivial setting.
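For instance, sorting records by a single key is relational: when keys tie, several outputs are legal. A property-based test can accept all of them, where a single input–output unit test cannot. Here is a minimal sketch in Python with Hypothesis (sort_by_age is a stand-in for whatever implementation is under test):

```
from hypothesis import given, strategies as st

def sort_by_age(people):
    # stand-in implementation under test: people are (name, age) pairs
    return sorted(people, key=lambda p: p[1])

@given(st.lists(st.tuples(st.text(), st.integers(min_value=0, max_value=120))))
def test_sort_by_age(people):
    out = sort_by_age(people)
    ages = [age for _, age in out]
    assert ages == sorted(ages)           # property: ages are non-decreasing
    assert sorted(out) == sorted(people)  # property: same multiset of records
```

Any ordering of people with equal ages passes this test, whereas a unit test that pins down one particular output would spuriously reject a different, but equally correct, implementation.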
Assessing Student Performance
Okay, so we’ve been having students write property-based tests for several years now. But how well do they do? How do we go about answering such a question? (Course grades are far too coarse, and even assignment grades may include various criteria—like code style—that are not strictly germane to this question.) Naturally, their core product is a binary classifier—it labels a purported implementation as valid or invalid—so we could compute precision and recall. However, these measures still fail to offer any semantic insight into how students did and what they missed.
We therefore created a new framework for assessing this. To wit, we took each problem’s abstract property statement (viewed as a formal specification), and sub-divided it into a set of sub-properties whose conjunction is the original property. Each sub-property was then turned into a test suite, which accepted those validators that enforced the property and rejected those that did not. This let us get a more fine-grained understanding of how students did, and what kinds of mistakes they made.
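Concretely, each sub-property’s test suite feeds a student’s validator inputs that violate only that sub-property and checks that the validator rejects them. A minimal sketch, assuming (purely for illustration; these names are not from the paper) that a validator is a predicate over an input/output pair for a sorting problem:

```
def enforces_sortedness(validator):
    # [2, 1] is a permutation of the input but is not sorted, so a
    # validator that enforces sortedness must reject it
    return not validator([2, 1], [2, 1])

def enforces_permutation(validator):
    # [1, 1] is sorted but has the wrong elements, so a validator that
    # enforces the permutation requirement must reject it
    return not validator([2, 1], [1, 1])

def student_validator(inp, out):
    # an example submission that checks sortedness but forgets permutation
    return all(a <= b for a, b in zip(out, out[1:]))

print(enforces_sortedness(student_validator))   # True
print(enforces_permutation(student_validator))  # False: missed sub-property
```

The pattern of which sub-property suites a validator passes is the fine-grained signal described above.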
Want to Learn More?
If you’re interested in this, and in the outcomes, please see our paper.
What’s Next?
The results in this paper are interesting but preliminary. Our follow-up work describes limitations to the approach presented here, thereby improving the quality of evaluation, and also innovates in the generation of classifying tests. Check it out!
Students Testing Without Coercion
Tags: Education, Testing, Tools, User Studies
Posted on 20 October 2020. Here’s a summary of the full arc, including later work, of the Examplar project.
It is practically a trope of computing education that students are over-eager to implement, yet woefully under-eager to confirm they understand the problem they are tasked with, or that their implementation matches their expectations. We’ve heard this stereotype couched in various degrees of cynicism, ranging from “students can’t test” to “students won’t test”. We aren’t convinced, and have, for several years now, experimented with nudging students towards early example writing and thorough testing.
We’ve blogged previously about our prototype IDE, Examplar — our experiment in encouraging students to write illustrative examples for their homework assignments before they actually dig into their implementation. Examplar started life as a separate tool, complementary to Pyret’s usual programming environment, providing a buffer just for the purpose of developing a test suite. Clicking Run in Examplar runs the student’s test suite against our implementations (not theirs) and then reports the degree to which the suite was valid and thorough (i.e., good at catching our buggy implementations). With this tool, students could catch their misconceptions before implementing them.
Although usage of this tool was voluntary for all but the first assignment, students relied on it extensively throughout the semester, and the quality of final submissions improved drastically compared to prior offerings of the course. Our positive experience with this prototype encouraged us to fully integrate Examplar’s feedback into students’ development environment. Examplar’s successor provides a unified environment for the development of both tests and implementation:
This new environment — which provides Examplar-like feedback on every Run — no longer requires that students have the self-awareness to periodically switch to a separate tool. The environment also requires students to first click a “Begin Implementation” button before showing the tab in which they write their implementation.
This unified environment enabled us to study, for the first time, whether students wrote examples early, relative to their implementation progress. We tracked the maximum test thoroughness students had achieved prior to each edit-and-run of their implementations file. Since the IDE notified students of their test thoroughness upon each run, and since students could only increase their thoroughness via edits to their tests file, the mean of these scores summarizes how thoroughly a student explored the problem with tests before fully implementing it.
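One way to compute that summary, sketched under an assumed event-log shape (this is our illustration, not the paper’s actual data format):

```
def tests_before_implementation_score(events):
    # events: a chronological list of ("tests", thoroughness) entries for
    # test-file runs and ("impl", None) entries for implementation-file runs
    best = 0.0
    per_impl_run = []
    for kind, thoroughness in events:
        if kind == "tests":
            best = max(best, thoroughness)
        else:  # an edit-and-run of the implementation file
            per_impl_run.append(best)
    # mean of the best thoroughness achieved before each implementation run
    return sum(per_impl_run) / len(per_impl_run) if per_impl_run else None
```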
We find that nearly every student on nearly every assignment achieves some level of thoroughness before their implementation work:
To read more about our design of this environment, its pedagogic context, and our evaluation of students’ development process, check out the full paper here.
Combating Misconceptions by Encouraging Example-Writing
Tags: Education, Misconceptions, Testing, Tools
Posted on 11 January 2020. Here’s a summary of the full arc, including later work, of the Examplar project.
When faced with an unfamiliar programming problem, undergraduate computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission—depriving them of the intended learning goal of the assignment.
Educators must therefore provide students with some mechanism by which they can evaluate their own understanding of a problem—before they waste time implementing some misconceived variation of that problem. To this end, we provide students with Examplar: an IDE for writing input–output examples that provides on-demand feedback on whether the examples are:
- valid (consistent with the problem), and
- thorough (explore the conceptually interesting corners of the problem).
For a demonstration, watch this brief video!
With its gamification, we believed students would find Examplar compelling to use. Moreover, we believed its feedback would be helpful. Both of these hypotheses were confirmed. We found that students used Examplar extensively—even when they were not required to use it, and even for assignments for which they were not required to submit test cases. The quality of students’ final submissions generally improved over previous years, too. For more information, read the full paper here!
The Hidden Perils of Automated Assessment
Posted on 26 July 2018.
We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).
To Test a Test Suite
The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.
In this group, we take the perspective that test suites are classifiers of implementations. You give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can use standard metrics for classifiers: true-positive rate and true-negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.
|                    | Ground Truth: Correct | Ground Truth: Faulty |
|--------------------|-----------------------|----------------------|
| Test Suite Accepts | True Negative         | False Negative       |
| Test Suite Rejects | False Positive        | True Positive        |
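Here is a minimal sketch of the metric computation, modeling a test suite as a function that accepts or rejects an implementation (the data shapes are ours, for illustration):

```
def classifier_rates(test_suite, labeled_implementations):
    # test_suite(impl) -> True if the suite accepts impl, False if it rejects
    # labeled_implementations: list of (impl, is_faulty) ground-truth pairs
    tp = tn = fp = fn = 0
    for impl, is_faulty in labeled_implementations:
        accepted = test_suite(impl)
        if is_faulty and not accepted:
            tp += 1   # correctly rejected a faulty implementation
        elif is_faulty and accepted:
            fn += 1   # wrongly accepted a faulty implementation
        elif not is_faulty and accepted:
            tn += 1   # correctly accepted a correct implementation
        else:
            fp += 1   # wrongly rejected a correct implementation
    true_positive_rate = tp / (tp + fn) if (tp + fn) else None
    true_negative_rate = tn / (tn + fp) if (tn + fp) else None
    return true_positive_rate, true_negative_rate
```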
A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Additionally, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.
There are basically two ways of doing this in the literature; let’s see how they fare.
The Axiomatic Model
In the first method, the instructor writes a test suite, and that test suite’s judgments are used as the ground truth; e.g., if the instructor’s test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
The Algorithmic Model
The second method does this by taking every test suite you have (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the ones that pass it into one big mega test suite that is used to establish ground truth.
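Modeling a test suite, as above, as an accept/reject function, the two models can be sketched side by side (a simplified rendering of the descriptions above, not the paper’s code):

```
def axiomatic_ground_truth(instructor_suite, impl):
    # the instructor suite's verdict simply is the ground truth
    return instructor_suite(impl)  # True = "correct", False = "faulty"

def algorithmic_ground_truth(all_suites, known_correct_impl, impl):
    # keep every suite (instructor's and students') that accepts the
    # known-correct implementation, glue them into one mega suite, and
    # call an implementation correct iff every kept suite accepts it
    mega_suite = [s for s in all_suites if s(known_correct_impl)]
    return all(s(impl) for s in mega_suite)
```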
A Tale of Two Assessments
We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.
The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):
The Axiomatic Model:
Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.
The Algorithmic Model:
Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.
Towards Robust Assessments
So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.
We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.
