The Hidden Perils of Automated Assessment

Posted on 26 July 2018.

We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, they may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that relies on these assessments (e.g., research that correlates other measures against assessment scores).

To Test a Test Suite

The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.

In this group, we take the perspective that test suites are classifiers of implementations: you give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can use standard metrics for classifiers, the true-positive rate and the true-negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.

                        Ground Truth
                    Correct            Faulty
Test     Accept     True Negative      False Negative
Suite    Reject     False Positive     True Positive
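
As a concrete, simplified sketch of this view (the encoding and helper names below are ours, purely for illustration), treat an implementation as a function and a test suite as a list of input/expected-output pairs; the two rates then fall out directly:

```python
# Illustrative sketch only: a "test suite" is a list of (input, expected) pairs,
# an "implementation" is a function, and correct/faulty labels are assumed given.

def suite_accepts(suite, impl):
    """A suite accepts an implementation iff every one of its tests passes."""
    return all(impl(arg) == expected for arg, expected in suite)

def classification_rates(suite, labeled_impls):
    """labeled_impls: pairs (impl, is_correct) with known ground truth.

    Returns (true_positive_rate, true_negative_rate), where a "positive" is
    rejecting a faulty implementation and a "negative" is accepting a correct one.
    """
    tp = fn = tn = fp = 0
    for impl, is_correct in labeled_impls:
        accepted = suite_accepts(suite, impl)
        if is_correct and accepted:
            tn += 1
        elif is_correct:
            fp += 1
        elif accepted:
            fn += 1
        else:
            tp += 1
    return tp / (tp + fn), tn / (tn + fp)

# Tiny example: a two-test suite for abs, scored against one correct and one
# faulty implementation.
suite = [(3, 3), (-3, 3)]
print(classification_rates(suite, [(abs, True), (lambda x: x, False)]))  # (1.0, 1.0)
```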

A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Additionally, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.

There are basically two ways of doing this in the literature; let’s see how they fare.

The Axiomatic Model

In the first method, the instructor writes a test suite, and that test suite’s judgments are used as the ground truth; e.g., if the instructor test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
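
In the toy encoding from the sketch above, the axiomatic model amounts to a one-liner: the instructor suite’s verdict simply becomes the label.

```python
def axiomatic_labels(instructor_suite, implementations):
    """Axiomatic ground truth: an implementation counts as correct exactly when
    the instructor's test suite accepts it (reusing suite_accepts from above)."""
    return [(impl, suite_accepts(instructor_suite, impl)) for impl in implementations]
```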

The Algorithmic Model

The second method establishes ground truth by taking every test suite available (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the suites that pass into one big mega test suite; that mega suite’s judgments are then used as ground truth.
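
Under the same toy encoding, a sketch of the algorithmic model: keep each suite that the known-correct implementation passes, glue those suites together, and let the resulting mega suite supply the labels.

```python
def algorithmic_labels(all_suites, known_correct_impl, implementations):
    """Algorithmic ground truth (reusing suite_accepts from the earlier sketch)."""
    # Keep only the suites that accept the known-correct implementation...
    validated = [s for s in all_suites if suite_accepts(s, known_correct_impl)]
    # ...glue them into one mega test suite...
    mega_suite = [test for s in validated for test in s]
    # ...and let the mega suite's verdicts label every implementation.
    return [(impl, suite_accepts(mega_suite, impl)) for impl in implementations]
```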

A Tale of Two Assessments

We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.

The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):

The Axiomatic Model:

[Plot: test-suite performance under the axiomatic classification model]

Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.

The Algorithmic Model:

[Plot: test-suite performance under the algorithmic classification model]

Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.

Towards Robust Assessments

So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.

We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.

Mystery Languages

Tags: Programming Languages, Education

Posted on 05 July 2018.

How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?

This skill—learning about something by poking at it—is frequently used but almost never taught. However, we have tried to teach it via what we dub mystery languages, and this post is about how we went about it.

Mystery Languages

A mystery language is exactly what it sounds like: a programming language whose behavior is a mystery. Each assignment comes with some vague documentation, and an editor in which you can write programs. However, when you run a program it will produce multiple answers! This is because there are actually multiple languages, with the same syntax but different semantics. The answers you get are the results of running your program in all of the different languages. The goal of the assignment is to find programs that tell the languages apart.

As an example, maybe the languages have the syntax a[i], and the documentation says that this “accesses an array a at index i”. That totally specifies the behavior of this syntax, right?

Not even close. For example, what happens if the index is out of bounds? Does it raise an error (like Java), or return undefined (like JavaScript), or produce nonsense (like C), or wrap the index to a valid one (like Python)? And what happens if the index is 2.5, or "2", or 2.0, or "two"?

(EDIT: Specifically, Python wraps negative indices whose magnitude does not exceed the length of the array.)
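
To make the Python case concrete, here is a small snippet (CPython 3.x behavior) that probes those same edge cases on an actual list:

```python
# Probing Python's list-indexing edge cases.
a = [10, 20, 30]

print(a[-1])            # 30: a negative index counts back from the end
try:
    a[5]                # past the end: IndexError, not nonsense or undefined
except IndexError as e:
    print("IndexError:", e)
try:
    a[-4]               # "too negative" is out of bounds as well
except IndexError as e:
    print("IndexError:", e)
for bad_index in (2.5, 2.0, "2", "two"):
    try:
        a[bad_index]    # non-integer indices are rejected outright
    except TypeError as e:
        print("TypeError:", e)
```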

Students engage with the mystery languages in three ways:

  1. The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a classifier).
  2. The second part of the assignment is to describe a theory that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
  3. Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.

This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.

Why Mystery Languages?

We hope to teach a number of skills through mystery languages:

  • Evaluating Design Decisions: The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
  • Syntax vs. Semantics: We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
  • Adversarial Thinking: Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
  • Experimental Thinking: In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to explain the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.

Adopting Mystery Languages

If you want to use Mystery Languages in your course, email us and we’ll help you get started!

There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.

Learn More

You can learn more about mystery languages in our SNAPL’17 paper, or dive in and try them yourself (see the assignments prefixed with “ML:”).

Resugaring Type Rules

Tags: Programming Languages, Types

Posted on 19 June 2018.

This is the final post in a series about resugaring. It focuses on resugaring type rules. See also our posts on resugaring evaluation steps and resugaring scope rules.


No one should have to see the insides of your macros. Yet type errors often reveal them. For example, here is a very simple and macro in Rust (of course you should just use && instead, but we’ll use this as a simple working example):

and the type error message you get if you misuse it:

You can see that it shows you the definition of this macro. In this case it’s not so bad, but other macros can get messy, and you might not want to see their guts. Plus, in principle, a type error should only show the erroneous code, not correct code that it happened to call. You wouldn’t be very happy with a type checker that sometimes threw an error deep in the body of a (type-correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?

The reason Rust does is that it does not know the type of and. As a result, it can only type check after and has been desugared (a.k.a., expanded), and so the error occurs in the desugared code.

But what if Rust could automatically infer a type rule for checking and, using only its definition? Then the error could be found in the original program that you wrote (rather than its expansion), and presented as such. This is exactly what we did—albeit for simpler type systems than Rust’s—in our recent PLDI’18 paper Inferring Type Rules for Syntactic Sugar.

We call this process resugaring type rules; akin to our previous work on resugaring evaluation steps and resugaring scope rules. Let’s walk through the resugaring of a type rule for and:

We want to automatically derive a type rule for and, and we want it to be correct. But what does it mean for it to be correct? Well, the meaning of and is defined by its desugaring: α and β is synonymous with if α then β else false. Thus they should have the same type (here the fancy D means “desugar”):

(⊢surf means “in the surface language type system”, ⊢core means “in the core language type system”, and the fancy D means “desugar”.)
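
Using that notation (and writing D as a function from surface terms to core terms), one way to state the requirement is:

```latex
\vdash_{\mathrm{surf}} \alpha\ \mathtt{and}\ \beta : \tau
\quad\Longleftrightarrow\quad
\vdash_{\mathrm{core}} \mathcal{D}(\alpha\ \mathtt{and}\ \beta) : \tau,
\qquad
\text{where}\ \mathcal{D}(\alpha\ \mathtt{and}\ \beta)
  = \mathtt{if}\ \mathcal{D}(\alpha)\ \mathtt{then}\ \mathcal{D}(\beta)\ \mathtt{else}\ \mathtt{false}
```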

How can we achieve this? The most straightforward approach is to capture the iff with two type rules, one for the forward implication and one for the reverse:
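
In inference-rule form, a sketch of those two rules (the forward rule, t-and, on the left; the reverse on the right):

```latex
\dfrac{\vdash_{\mathrm{core}} \mathtt{if}\ \mathcal{D}(\alpha)\ \mathtt{then}\ \mathcal{D}(\beta)\ \mathtt{else}\ \mathtt{false} : \tau}
      {\vdash_{\mathrm{surf}} \alpha\ \mathtt{and}\ \beta : \tau}\ \textsc{t-and}
\qquad\qquad
\dfrac{\vdash_{\mathrm{surf}} \alpha\ \mathtt{and}\ \beta : \tau}
      {\vdash_{\mathrm{core}} \mathtt{if}\ \mathcal{D}(\alpha)\ \mathtt{then}\ \mathcal{D}(\beta)\ \mathtt{else}\ \mathtt{false} : \tau}
```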

The first type rule is useful because it says how to type check and in terms of its desugaring. For example, here’s a derivation that true and false has type Bool:
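
A sketch of that derivation (recall that true and false desugars to if true then false else false):

```latex
\dfrac{\dfrac{\vdash_{\mathrm{core}} \mathtt{true} : \mathtt{Bool}
        \qquad \vdash_{\mathrm{core}} \mathtt{false} : \mathtt{Bool}
        \qquad \vdash_{\mathrm{core}} \mathtt{false} : \mathtt{Bool}}
       {\vdash_{\mathrm{core}} \mathtt{if\ true\ then\ false\ else\ false} : \mathtt{Bool}}}
      {\vdash_{\mathrm{surf}} \mathtt{true\ and\ false} : \mathtt{Bool}}\ \textsc{t-and}
```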

However, while this t-and rule is accurate, it’s not the canonical type rule for and that you’d expect. And worse, it mentions if, so it’s leaking the implementation of the macro!

However, we can automatically construct a better type rule. The trick is to look at a more general derivation. Here’s a generic type derivation for any term α and β:
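
Sketched out, with Dα and Dβ marking the holes:

```latex
\dfrac{\dfrac{\overset{D_\alpha}{\vdash_{\mathrm{core}} \mathcal{D}(\alpha) : \mathtt{Bool}}
        \qquad \overset{D_\beta}{\vdash_{\mathrm{core}} \mathcal{D}(\beta) : \mathtt{Bool}}
        \qquad \vdash_{\mathrm{core}} \mathtt{false} : \mathtt{Bool}}
       {\vdash_{\mathrm{core}} \mathtt{if}\ \mathcal{D}(\alpha)\ \mathtt{then}\ \mathcal{D}(\beta)\ \mathtt{else}\ \mathtt{false} : \mathtt{Bool}}}
      {\vdash_{\mathrm{surf}} \alpha\ \mathtt{and}\ \beta : \mathtt{Bool}}\ \textsc{t-and}
```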

Notice Dα and Dβ: these are holes in the derivation, which ought to be filled in with sub-derivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule whose premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:
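
Written out (in the same informal notation), that rule is:

```latex
\dfrac{\alpha : \mathtt{Bool} \qquad \beta : \mathtt{Bool}}
      {\alpha\ \mathtt{and}\ \beta : \mathtt{Bool}}
```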

And so we’ve inferred the canonical type rule for and! Notice that (i) it doesn’t mention if at all, so it’s not leaking the inside of the macro, and (ii) it’s guaranteed to be correct, so it’s a good starting point for fixing up a type system to present errors at the right level of abstraction. This was a simple example for illustrative purposes, but we’ve tested the approach on a variety of sugars and type systems.

You can read more in our paper, or play with the implementation.

Picking Colors for Pyret Error Messages

Tags: Pyret

Posted on 11 June 2018.

Pyret has beautiful error messages, like this one:

[Image: a sample Pyret error message]

Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. It is useful to keep all of the colors distinct because it provides a very intuitive one-to-one mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance that they were mentioning the same thing.

(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)

How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.

Except that there might be multiple error messages. In fact, if you have failing test cases, then you’ll have one failure message per failing test case, each with its own highlight, so there is no upper bound on how many colors we need. (Pyret will only show one of these highlights at a time—whichever one you have selected—but even so it’s nice for them to all have different colors.) Thus we’ll need to be able to generate a set of colors on demand.

Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.

Except that it’s difficult to tell how many colors we need beforehand. In fact, Pyret has a REPL, where users can evaluate expressions, which might throw more errors. Thus it’s impossible to know how many colors we’ll need beforehand, because the user can always produce more errors in the REPL.

Therefore, however we pick colors, it must satisfy these two properties:

  • Distinctness: all of the colors in all of the highlights should be as visually different from each other as possible.
  • Streaming: we must always be able to pick new colors.

Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:

  • Uniformity: all of the colors should have the same saturation (i.e. colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.

The Phillotactic Color-Picking Algorithm

Now let’s talk about the algorithm we use!

(Note that this is “phillotactic”, not “phyllotactic”. It has nothing to do with plants.)

To keep uniformity, it makes sense to pick colors from a rainbow. This is a circle in color space, with constant saturation and lightness and varying hue. Which color space should we use? We should not use RGB, because that space doesn’t agree well with how colors actually appear. For example, if we used a rainbow in RGB space, then green would appear far too bright and blue would appear far too dark. Instead, we should use a color space that agrees with how people actually perceive colors. The CIELAB color space is better. It was designed so that if you take the distance between two colors in it, that distance approximately agrees with how different the colors seem when you look at them. (It’s only approximate because—among other things—perceptual color space is non-Euclidean.)

Therefore we’ll pick colors from a circle in CIELAB space. This space has three coordinates: L, for lightness, A for green-red, and B for blue-yellow (hence the LAB). We determined by experimentation that a good lightness to use was 73 out of 100. Given this lightness, we picked the largest saturation possible, using A^2 + B^2 = 40^2.

Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.

Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.

Thus the phillotactic algorithm keeps track of the number of colors generated so far, and assigns the n’th color a hue of n times the golden angle. So the first color will have a hue of 0 degrees. The second color will have a hue of 137 degrees. The third will have a hue of 137 * 2 = 274 degrees. The fourth will have a hue of 137 * 3 = 411 degrees, which wraps around to 51 degrees. This is a little close to the first color. But even if we knew we’d have four colors total, they’d be at most 90 degrees apart, so 51 isn’t too bad. This trend continues: as we pick more and more colors, they never end up much closer to one another than is necessary.
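
Here is a minimal sketch of that scheme (our own rendering in Python, not Pyret’s actual implementation; the function name and defaults are made up): each color keeps L* = 73, sits on the radius-40 circle in the A-B plane, and advances the hue by one golden angle.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
GOLDEN_ANGLE = 2 * math.pi / PHI**2   # 1/phi^2 of a circle, about 137.5 degrees

def phillotactic_lab_colors(lightness=73.0, radius=40.0):
    """Yield an endless stream of CIELAB (L*, a*, b*) triples: fixed lightness,
    fixed chroma, hue advancing by the golden angle each time."""
    n = 0
    while True:
        hue = n * GOLDEN_ANGLE
        yield (lightness, radius * math.cos(hue), radius * math.sin(hue))
        n += 1

# The first four highlight colors (still in LAB; converting to sRGB for display
# is a separate step).
stream = phillotactic_lab_colors()
for _ in range(4):
    print(next(stream))
```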

There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:

Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.

Let’s measure angles in turns, or fractions of 360 degrees. The n’th color’s hue is, by definition, n/ϕ^2 % 1 turns. The 0’th hue is 0. So if these colors are similar, then n/ϕ^2 % 1 ~= 0 (using ~= for “approximately equals”). We can then reason as follows, using in the third step the fact that ϕ^2 - ϕ - 1 = 0 so ϕ^2 = ϕ + 1:

n/ϕ^2 % 1 ~= 0
n/ϕ^2 ~= k    for some integer k
ϕ^2 ~= n/k
1 + ϕ ~= n/k
ϕ ~= (n-k)/k

Now, if n is small, then k is small (because n/k ~= ϕ^2), so (n-k)/k is a fraction with a small denominator. But ϕ is difficult to approximate with fractions, and the smaller the denominator the worse the approximation, so ϕ actually isn’t very close to (n-k)/k, so n/ϕ^2 % 1 actually isn’t very close to 0, so the n’th color actually isn’t very similar to the 0’th color.
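
As a quick numerical sanity check of that argument (a throwaway script, not part of Pyret), we can measure how close any two of the first N hues ever get; the smallest gap stays within a small constant factor of the best possible 360/N spacing instead of collapsing early:

```python
import math

PHI = (1 + math.sqrt(5)) / 2

def min_hue_gap(n_colors):
    """Smallest circular distance, in turns, between any two of the first n hues."""
    hues = [(i / PHI**2) % 1.0 for i in range(n_colors)]
    gap = 1.0
    for i in range(n_colors):
        for j in range(i + 1, n_colors):
            d = abs(hues[i] - hues[j])
            gap = min(gap, d, 1.0 - d)
    return gap

for n in (4, 8, 16, 32, 64):
    print(n, round(360 * min_hue_gap(n), 1), "degrees")
```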

And that’s why the phillotactic colors work.

Can We Crowdsource Language Design?

Tags: Programming Languages, User Studies, Semantics

Posted on 06 July 2017.

Programming languages are user interfaces. There are several ways of making decisions when designing user interfaces, including:

  • a small number of designers make all the decisions, or
  • user studies and feedback are used to make decisions.

Most programming languages have been designed by a Benevolent Dictator for Life or a committee, which corresponds to the first model. What happens if we try out the second?

We decided to explore this question. To get a large enough number of answers (and to engage in rapid experimentation), we decided to conduct surveys on Amazon Mechanical Turk, a forum known to have many technically literate people. We studied a wide range of common programming languages features, from numbers to scope to objects to overloading behavior.

We applied two concrete measures to the survey results:

  • Consistency: whether individuals answer similar questions the same way, and
  • Consensus: whether we find similar answers across individuals.

Observe that a high value of either one has clear implications for language design, and if both are high, that suggests we have zeroed in on a “most natural” language.

As Betteridge’s Law suggests, we found neither. Indeed,

  • A surprising percentage of workers expected some kind of dynamic scope (83.9%).

  • Some workers thought that record access would distribute over the field name expression.

  • Some workers ignored type annotations on functions.

  • Over the field and method questions we asked on objects, no worker expected Java’s semantics across all three.

These and other findings are explored in detail in our paper.