Students Testing Without Coercion

Posted on 20 October 2020.

It is practically a trope of computing education that students are over-eager to implement, yet woefully under-eager to confirm they understand the problem they are tasked with, or that their implementation matches their expectations. We’ve heard this stereotype couched in various degrees of cynicism, ranging from “students can’t test” to “students won’t test”. We aren’t convinced, and have, for several years now, experimented with nudging students towards early example writing and thorough testing.

We’ve blogged previously about our prototype IDE, Examplar — our experiment in encouraging students to write illustrative examples for their homework assignments before they actually dig into their implementation. Examplar started life as a separate tool, complementary to Pyret’s usual programming environment, that provides a buffer just for developing a test suite. Clicking Run in Examplar runs the student’s test suite against our implementations (not theirs) and then reports the degree to which the suite is valid and thorough (i.e., good at catching our buggy implementations). With this tool, students could catch their misconceptions before implementing them.
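
For intuition, here is a minimal sketch (in Python, not Examplar’s actual code) of how such feedback can be computed, assuming each test suite is modeled as a function that accepts an implementation exactly when all of its tests pass; the names below are hypothetical:

    # A minimal sketch of Examplar-style feedback; not the tool's actual code.
    # A "suite" is modeled as a function impl -> bool that returns True when
    # every test in the suite passes on the given implementation.

    def examplar_feedback(student_suite, correct_impl, buggy_impls):
        # Valid: the suite's expectations are consistent with the problem,
        # i.e., every test passes on the known-correct implementation.
        valid = student_suite(correct_impl)

        # Thorough: the suite catches (rejects) the buggy implementations.
        caught = sum(1 for impl in buggy_impls if not student_suite(impl))
        thoroughness = caught / len(buggy_impls) if buggy_impls else 0.0

        return valid, thoroughness

A suite that encodes a misconception typically fails the validity check immediately, before the student has written any implementation code.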

Although usage of this tool was voluntary for all but the first assignment, students relied on it extensively throughout the semester, and the quality of final submissions improved drastically compared to prior offerings of the course. Our positive experience with this prototype encouraged us to fully integrate Examplar’s feedback into students’ development environment. Examplar’s successor provides a unified environment for the development of both tests and implementation:

Screenshot of Examplar's Successor

This new environment — which provides Examplar-like feedback on every Run — no longer requires that students have the self-awareness to periodically switch to a separate tool. The environment also requires students to click a “Begin Implementation” button before it reveals the tab in which they write their implementation.

This unified environment enabled us to study, for the first time, whether students wrote examples early, relative to their implementation progress. We tracked the maximum test thoroughness students had achieved prior to each edit-and-run of their implementation file. Since the IDE notified students of their test thoroughness upon each run, and since students could only increase their thoroughness via edits to their tests file, the mean of these scores summarizes how thoroughly a student explored the problem with tests before fully implementing it.
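
A sketch of this metric (the log format below is hypothetical, not our actual analysis code): walk the chronological log of runs, track the best thoroughness reached so far via test edits, record that value at each implementation run, and average the recorded values.

    # Sketch of "Mean Implementation-Interval Thoroughness".  Each run is a
    # (file, thoroughness) pair in chronological order, where file is either
    # "tests" or "implementation".

    def mean_implementation_interval_thoroughness(runs):
        best_so_far = 0.0          # highest thoroughness reached via test edits
        scores_at_impl_runs = []   # thoroughness "banked" before each impl run

        for file, thoroughness in runs:
            if file == "tests":
                best_so_far = max(best_so_far, thoroughness)
            else:
                scores_at_impl_runs.append(best_so_far)

        if not scores_at_impl_runs:
            return None  # the student never ran their implementation
        return sum(scores_at_impl_runs) / len(scores_at_impl_runs)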

We find that nearly every student on nearly every assignment achieves some level of thoroughness before their implementation work:

The "Mean Implementation-Interval Thoroughness" of each student, on various assignments.

To read more about our design of this environment, its pedagogic context, and our evaluation of students’ development process, check out the full paper here.

Using Design Alternatives to Learn About Data Organizations

Posted on 27 June 2020.

A large number of computer science education papers focus on data structures. By this, they mean the canon: lists, queues, stacks, heaps, and so on. These are certainly vital to the design of most programs.

However, there is another kind of “data structure” programmers routinely contend with: how to represent the world your program is about. Suppose, for instance, you’re trying to represent a family’s genealogy. You could:

  • Represent each person as a node and have references to their two biological parents, who in turn have references to their biological parents, and so on. The tree “bottoms out” when we get to people about whom we have no more information.

  • Represent each person as a node and have references to their children instead (a list, say, if we want to preserve their birth order). This tree bottoms out at people who have no children.

  • Represent each coupling as a node, and have references to their children (or issue, as genealogists like to say). Now you may have one kind of node for children and another for couplings.

And so on. There are numerous possibilities. Which one should we pick? It depends on (1) what information we even have, (2) what operations we want to perform, and (3) what complexity we need different operations to take.
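
As a sketch (in Python, purely for illustration; the names are ours), the first two organizations above might look like this:

    # Two alternative data organizations for the same genealogy domain.
    from dataclasses import dataclass, field
    from typing import List, Optional

    # Organization 1: each person points "up" to their biological parents.
    @dataclass
    class AncestorNode:
        name: str
        mother: Optional["AncestorNode"] = None  # None = no further information
        father: Optional["AncestorNode"] = None

    # Organization 2: each person points "down" to their children, in birth order.
    @dataclass
    class DescendantNode:
        name: str
        children: List["DescendantNode"] = field(default_factory=list)

The choice is consequential: listing all known ancestors of a person is a direct walk in the first organization but requires a search from the roots in the second, and the reverse holds for listing descendants.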

Unfortunately, computing education research doesn’t talk about this problem very much at all; in fact, we don’t seem to even have terminology to talk about this issue. In a sense, this is also very much a matter of data structure, though of a different kind: whereas the purely abstract data structures of computer science we might call computational data structures, these — which center around directly representing real-world information — we might instead call representational data structures. That could get pretty confusing, though, so we’ve adopted the term data organization to refer to the latter.

Learning to think about data organization is an essential computing skill. But how early can we teach it? How well can students wrestle with it? What methods should we use? Do they need to be sophisticated programmers before they can engage in reasoning about representations?

Good news: we can begin this quite early, and students don’t need to be sophisticated computer scientists: they can just think about the world, and their experiences living in it, to reason about data organizations. Representational data structures probably do a far better job of drawing on their lived experience than computational ones do! (Unless they’ve previously lived as a computer.)

There are several ways we could introduce this topic. We chose to expose students to pairs of representations for the same domain, and have them compare the two. This is related to theories of perception. Read the paper to learn more!

Somewhat subtly, this also adds a dimension to “computational thinking” that is usually missing from standard discussions of it. The approach described in this paper generates new and engaging activities that many students can participate in. Indeed, computing background does not seem to matter much in our data, and a more diverse group of students is likely to make a much richer set of judgments—thereby enabling students in traditionally underrepresented groups to contribute based on their unique experiences, and also feel more valued.

What Help Do Students Seek in TA Office Hours?

Posted on 20 May 2020.

In computer science, a large number of students get help from teaching assistants (TAs). A great deal of their real education happens in these hours. While TA hours are an excellent resource, they are also rather opaque to the instructors, who do not really know what happens in them.

How do we construct a mechanism to study what happens in hours? It’s actually not obvious at all:

  • We could set up cameras to record all the interactions in hours. While this would provide a lot of information, it significantly changes the nature of hours. For many students, hours are private time with a TA, where they can freely speak about their discomfort and get help from a peer; they might ask personal questions; they might also complain about the instructor. One does not install cameras in confessionals.

  • We could ask TAs to write extensive notes (redacting private information) after the student has left. This also has various flaws:

    • Their memory may be faulty.

    • Their recollection may be biased by their own beliefs.

    • It would slow down the processing of students, who already confront overly long lines and waits.

What do we want instead? A process that is non-intrusive, lightweight, and yet informative. We also have to give up on perfect knowledge and focus on information that is actually useful to the instructor.

Part of the problem is that we as a community lack a systematic method to help students in the first place. If students have no structure to how they approach help-seeking, then it’s hard to find patterns and make sense of what they actually do.

However, this is exactly the kind of problem that the How to Design Programs Design Recipe was designed to solve. It provides a systematic way for students to structure their problem-solving and help-seeking. TAs are instructed to focus on the steps of the Design Recipe in order, not addressing later steps until students have successfully completed the earlier ones. This provides an “early warning” diagnostic, addressing root causes rather than their (far-removed) manifestations.
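
As a sketch of how this ordering can double as a lightweight logging instrument (the step names below paraphrase the Design Recipe, and the record format is hypothetical, not our actual protocol), a TA need only note the earliest step the student has not yet completed:

    # Hypothetical: characterize one office-hours interaction by the earliest
    # Design Recipe step the student has not yet completed -- the step the TA
    # should focus on first.

    DESIGN_RECIPE_STEPS = [
        "data definition",
        "signature and purpose",
        "examples",
        "template",
        "function body",
        "tests",
    ]

    def first_incomplete_step(completed_steps):
        """completed_steps: set of step names the student has finished."""
        for step in DESIGN_RECIPE_STEPS:
            if step not in completed_steps:
                return step
        return None  # all steps done; the question lies elsewhere

    # Example: a student with a signature but no examples gets help with
    # examples before any discussion of code.
    first_incomplete_step({"data definition", "signature and purpose"})  # -> "examples"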

Therefore, we decided to use the Design Recipe steps as a lens for obtaining insight into TA hours. We argue that this provides a preliminary tool that addresses our needs: it is lightweight, non-intrusive, and yet useful to the instructor. Read the paper to learn more!

Combating Misconceptions by Encouraging Example-Writing

Posted on 11 January 2020.

When faced with an unfamiliar programming problem, undergraduate computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission—depriving them of the intended learning goal of the assignment.

Educators must therefore provide students with some mechanism by which they can evaluate their own understanding of a problem—before they waste time implementing some misconceived variation of it. To this end, we provide students with Examplar: an IDE for writing input–output examples that provides on-demand feedback on whether the examples are:

  1. valid (consistent with the problem), and
  2. thorough (explore the conceptually interesting corners of the problem).
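
To make the two criteria concrete, here is a hypothetical example suite (written as Python assertions purely for illustration; students write theirs in Pyret) for a function uniquify that is specified to remove duplicates from a list while preserving first-occurrence order:

    # A reference implementation, included only so the examples below run.
    def uniquify(lst):
        seen, out = set(), []
        for x in lst:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    # Valid examples: consistent with the specification.
    assert uniquify([3, 1, 3, 2, 1]) == [3, 1, 2]
    assert uniquify([]) == []                      # corner case: empty input
    assert uniquify([4, 4, 4]) == [4]              # corner case: all duplicates

    # An INVALID example, encoding a misconception (expecting sorted output):
    # assert uniquify([3, 1, 3, 2, 1]) == [1, 2, 3]
    # Examplar flags it because it fails on the known-correct implementation.
    # A suite is thorough to the degree that its examples also distinguish the
    # correct implementation from buggy ones (e.g., one that keeps the wrong
    # duplicate or reorders the list).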

For a demonstration, watch this brief video!

With its gamification, we believed students would find Examplar compelling to use. Moreover, we believed its feedback would be helpful. Both of these hypotheses were confirmed. We found that students used Examplar extensively—even when they were not required to use it, and even for assignments for which they were not required to submit test cases. The quality of students’ final submissions generally improved over previous years, too. For more information, read the full paper here!

The Hidden Perils of Automated Assessment

Posted on 26 July 2018.

We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).

To Test a Test Suite

The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.

In this group, we take the perspective that test suites are classifiers of implementations. You give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can use standard metrics for classifiers: true-positive rate and true-negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.

                          Ground Truth
                     Correct             Faulty
Test     Accept      True Negative       False Negative
Suite    Reject      False Positive      True Positive
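
Given such ground-truth labels, scoring one student suite is mechanical; a minimal sketch (hypothetical data shapes, not our analysis code), modeling a test suite as a function that accepts or rejects an implementation:

    # Score a test suite as a classifier of implementations.
    # suite: callable(impl) -> bool (True = accept, False = reject)
    # labeled_impls: list of (impl, label) pairs, label in {"correct", "faulty"}

    def classifier_rates(suite, labeled_impls):
        tp = tn = fp = fn = 0
        for impl, label in labeled_impls:
            accepted = suite(impl)
            if label == "correct" and accepted:
                tn += 1   # accepted a correct implementation
            elif label == "correct":
                fp += 1   # rejected a correct implementation
            elif accepted:
                fn += 1   # accepted (missed) a faulty implementation
            else:
                tp += 1   # rejected (caught) a faulty implementation

        tpr = tp / (tp + fn) if (tp + fn) else None  # share of faulty impls caught
        tnr = tn / (tn + fp) if (tn + fp) else None  # share of correct impls accepted
        return tpr, tnr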

A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Fortunately, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.

There are basically two ways of doing this in the literature; let’s see how they fare.

The Axiomatic Model

In the first method, the instructor writes a test suite, and that test suite’s judgments are used as the ground truth; e.g., if the instructor test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
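
A sketch of this model, under the same modeling of a suite as an accept/reject predicate (hypothetical names):

    # Axiomatic model: the instructor suite's verdicts define ground truth.
    def axiomatic_ground_truth(instructor_suite, implementations):
        return [(impl, "correct" if instructor_suite(impl) else "faulty")
                for impl in implementations]

Note that any buggy implementation the instructor suite happens to accept is, by definition, labeled correct.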

The Algorithmic Model

The second method establishes ground truth by taking every test suite you have (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the ones that pass into one big mega test suite; that combined suite’s judgments are then used as the ground truth.
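
A sketch of this model, again with hypothetical names:

    # Algorithmic model: pool every suite that accepts a known-correct
    # implementation, and let the pooled suite's verdicts define ground truth.
    def algorithmic_ground_truth(all_suites, known_correct_impl, implementations):
        admitted = [s for s in all_suites if s(known_correct_impl)]

        def mega_suite(impl):
            # The glued-together suite accepts an implementation only if
            # every admitted constituent suite accepts it.
            return all(s(impl) for s in admitted)

        return [(impl, "correct" if mega_suite(impl) else "faulty")
                for impl in implementations]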

A Tale of Two Assessments

We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.

The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):

The Axiomatic Model:

Plot of test suite performance atop axiomatic classification.

Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.

The Algorithmic Model:

Plot of test suite performance atop algorithmic classification.

Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.

Towards Robust Assessments

So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.

We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.