## Combating Misconceptions by Encouraging Example-Writing

*Posted on 11 January 2020.*

When faced with an unfamiliar programming problem, undergraduate computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission—depriving them of the intended learning goal of the assignment.

Educators must therefore provide students with some mechanism by which they can evaluate their own understanding of a problem—*before* they waste time implementing some misconceived variation of that problem. To this end, we provide students with Examplar: an IDE for writing input–output examples that provides on-demand feedback on whether the examples are:

- *valid* (consistent with the problem), and
- *thorough* (exploring the conceptually interesting corners of the problem).

For a demonstration, watch this brief video!

With its gamification, we believed students would find Examplar compelling to use. Moreover, we believed its feedback would be helpful. Both of these hypotheses were confirmed. We found that students used Examplar extensively—even when they were not required to use it, and even for assignments for which they were not required to submit test cases. The quality of students’ final submissions generally improved over previous years, too. For more information, read the full paper here!

## The Hidden Perils of Automated Assessment

*Posted on 26 July 2018.*

We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform *flawed* assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).

### To Test a Test Suite

The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.

In this group, we take the perspective that test suites are *classifiers* of implementations: you give a test suite an implementation, and it either *accepts* or *rejects* it. To measure the quality of a test suite, then, we can use the standard metrics for classifiers, *true positive rate* and *true negative rate*. However, to actually do this, we need a set of implementations that we know, *a priori*, to be correct or faulty.

|  | Ground Truth: Correct | Ground Truth: Faulty |
|---|---|---|
| **Test Suite: Accept** | True Negative | False Negative |
| **Test Suite: Reject** | False Positive | True Positive |
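To make the framing concrete, here is a minimal sketch in Python (all names and toy data are illustrative, not the paper's tooling), where a test suite is a list of checks and "positive" means the suite rejects an implementation:

```python
# Sketch: a test suite as a binary classifier of implementations.
# All names and data here are illustrative.

def rates(test_suite, correct_impls, faulty_impls):
    """Compute true-negative and true-positive rates for one suite.

    "Positive" means the suite rejects (flags) an implementation, so a
    true positive is a rejected faulty implementation and a true
    negative is an accepted correct one.
    """
    def accepts(impl):
        return all(test(impl) for test in test_suite)
    tn = sum(1 for impl in correct_impls if accepts(impl))
    tp = sum(1 for impl in faulty_impls if not accepts(impl))
    return tn / len(correct_impls), tp / len(faulty_impls)

# Toy "implementations" of absolute value, as plain functions:
good = [abs, lambda x: -x if x < 0 else x]
bad = [lambda x: x, lambda x: -x]  # each is wrong on some input

# A test suite is a list of checks, each mapping an implementation
# to pass/fail:
suite = [lambda impl: impl(-3) == 3, lambda impl: impl(2) == 2]

tnr, tpr = rates(suite, good, bad)
print(tnr, tpr)  # 1.0 1.0: this suite classifies all four correctly
```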

A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Fortunately, we can leverage all of the implementations that students submit—we just need to determine which are correct and which are faulty.

There are basically two ways of doing this in the literature; let’s see how they fare.

### The Axiomatic Model

In the first method, the instructor writes a test suite, and that test suite’s judgments are used as the ground truth; e.g., if the instructor’s test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
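A minimal sketch of this model (all names and toy data illustrative), treating a test as a function from an implementation to pass/fail:

```python
# Sketch of the axiomatic model: the instructor suite's verdict is
# simply taken as ground truth. Names here are illustrative.

def axiomatic_label(instructor_suite, impl):
    """Return True ("correct") iff the instructor suite accepts impl."""
    return all(test(impl) for test in instructor_suite)

# A weak instructor suite can mislabel: this "absolute value" is wrong,
# but it passes the instructor's only test, so it is labeled correct.
instructor = [lambda impl: impl(-3) == 3]
sloppy = lambda x: 3  # always returns 3
print(axiomatic_label(instructor, sloppy))  # True
print(axiomatic_label(instructor, abs))    # True
```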

### The Algorithmic Model

The second method takes every test suite available (i.e., both the instructor’s and the students’), runs them all against a known-correct implementation, and glues all the ones that pass into one big mega test suite, which is then used to establish ground truth.
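A minimal sketch of this construction (names and toy data illustrative):

```python
# Sketch of the algorithmic model: pool every test that a known-correct
# implementation passes into one "mega" suite, then use that suite's
# verdicts as ground truth. Names here are illustrative.

def build_mega_suite(all_suites, known_correct_impl):
    # Keep only the tests that the known-correct implementation passes;
    # a test that fails a correct implementation is itself buggy.
    return [test
            for suite in all_suites
            for test in suite
            if test(known_correct_impl)]

def ground_truth(mega_suite, impl):
    # An implementation is deemed correct iff the mega suite accepts it.
    return all(test(impl) for test in mega_suite)

# Toy example: test suites for absolute value.
instructor = [lambda impl: impl(-3) == 3]
student_a = [lambda impl: impl(2) == 2]
student_b = [lambda impl: impl(0) == 1]  # buggy test: abs(0) is 0

mega = build_mega_suite([instructor, student_a, student_b], abs)
print(len(mega))                        # 2: the buggy test is filtered out
print(ground_truth(mega, abs))          # True
print(ground_truth(mega, lambda x: x))  # False: identity fails abs(-3)
```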

### A Tale of Two Assessments

We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.

The choice of underlying implementation classification model *substantially* impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):

#### The Axiomatic Model:

Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.

#### The Algorithmic Model:

Judging by this plot, students performed astoundingly *poorly* at detecting buggy implementations, but quite well at identifying correct ones.

### Towards Robust Assessments

So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.

We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of *every* assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.

## Mystery Languages

Tags: Programming Languages, Education

*Posted on 05 July 2018.*

How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?

This skill—learning about something by poking at it—is frequently used
but almost never taught. However, we have tried to teach it via what
we dub *mystery languages*, and this post is about how we went about it.

### Mystery Languages

A *mystery language* is exactly what it sounds like: a programming
language whose behavior is a mystery. Each assignment comes with some
vague documentation, and an editor in which you can write programs.
However, when you run a program it will produce multiple answers! This
is because there are actually *multiple languages*, with the same
syntax but different semantics. The answers you get are the results of
running your program in all of the different languages. The goal of
the assignment is to find programs that tell the languages apart.

As an example, maybe the languages have the syntax `a[i]`, and the
documentation says that this “accesses an array `a` at index `i`”.
That totally specifies the behavior of this syntax, right?

Not even close. For example, what happens if the index is out of bounds?
Does it raise an error (like Java), or return `undefined` (like
JavaScript), or produce nonsense (like C), or wrap the index to a
valid one (like Python)? And what happens if the index is `2.5`, or
`"2"`, or `2.0`, or `"two"`?

(EDIT: Specifically, Python wraps only negative indices whose magnitude is no larger than the length of the array.)
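These design choices are easy to simulate side by side. A quick Python sketch of four possible out-of-bounds semantics (the function names and the garbage value are illustrative; this is not the mystery-language implementation):

```python
# Sketch: four semantics for a[i] with an out-of-bounds integer index.
# These loosely mimic the behaviors named above.

def access_java(a, i):
    # Java: out-of-bounds access raises an exception.
    if not (0 <= i < len(a)):
        raise IndexError("array index out of bounds")
    return a[i]

def access_js(a, i):
    # JavaScript: out-of-bounds access yields undefined (None here).
    return a[i] if 0 <= i < len(a) else None

def access_c(a, i):
    # C: out-of-bounds access reads whatever happens to be in memory;
    # modeled here as an arbitrary garbage value.
    return a[i] if 0 <= i < len(a) else 0xDEADBEEF

def access_python(a, i):
    # Python: negative indices down to -len(a) wrap around to the end.
    return a[i]

a = [10, 20, 30]
print(access_js(a, 5))       # None
print(access_python(a, -1))  # 30 (wraps to the last element)
```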

Students engage with the mystery languages in three ways:

- The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a *classifier*).
- The second part of the assignment is to describe a *theory* that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
- Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.

This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.

### Why Mystery Languages?

We hope to teach a number of skills through mystery languages:

- **Evaluating Design Decisions:** The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
- **Syntax vs. Semantics:** We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
- **Adversarial Thinking:** Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs, then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
- **Experimental Thinking:** In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to *explain* the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.

### Adopting Mystery Languages

If you want to use Mystery Languages in your course, email us and we’ll help you get started!

There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.

### Learn More

You can learn more about mystery languages in our SNAPL’17 paper, or dive in and try them yourself (see the assignments prefixed with “ML:”).

## Resugaring Type Rules

Tags: Programming Languages, Types

*Posted on 19 June 2018.*

*This is the final post in a series about resugaring.
It focuses on resugaring type rules.
See also our posts on
resugaring evaluation steps
and resugaring scope rules.*

No one should have to see the insides of your macros. Yet type errors
often reveal them. For example, here is a very simple `and` macro in
Rust (of course you should just use `&&` instead, but we’ll use this
as a simple working example):

and the type error message you get if you misuse it:

You can see that it shows you the definition of this macro. In this
case it’s not so bad, but other macros can get *messy*, and you might
not want to see their guts. Besides, in principle a type error should
only show the erroneous code, not correct code that it happened to call.
You wouldn’t be very happy with a type checker that sometimes threw an
error deep in the body of a (type-correct) library function that you
called, just because you used it the wrong way. Why put up with one
that does the same thing for macros?

The reason Rust does is that it does not know the type of `and`.
As a result, it can only type check *after* `and` has been desugared
(a.k.a. expanded), and so the error occurs in the desugared code.

But what if Rust could automatically infer a type rule for checking
`and`, using only its definition? Then the error could be found in
the original program that you wrote (rather than its expansion), and
presented as such. This is exactly what we did—albeit for simpler type
systems than Rust’s—in our recent PLDI’18 paper
Inferring Type Rules for Syntactic Sugar.

We call this process *resugaring type rules*, akin to our previous work
on resugaring evaluation steps and resugaring scope rules.
Let’s walk through the resugaring of a type rule for `and`:

We want to automatically derive a type rule for `and`, and we want it
to be *correct*. But what does it mean for it to be correct? Well, the
meaning of `and` is defined by its desugaring: `α and β` is synonymous
with `if α then β else false`. Thus they should have the same type
(here the fancy D means “desugar”):

(`Isurf ||-` means “in the surface language type system”, and
`Icore ||-` means “in the core language type system”.)

How can we achieve this? The most straightforward way is to capture
the `iff` with two type rules, one for the forward implication and
one for the reverse:

The first type rule is useful because it says how to type check `and`
in terms of its desugaring. For example, here’s a derivation that
`true and false` has type `Bool`:

However, while this `t-and`→ rule is accurate, it’s not the canonical
type rule for `and` that you’d expect. And worse, it mentions `if`, so
it’s leaking the implementation of the macro!

Fortunately, we can automatically construct a better type rule. The trick
is to look at a more general derivation. Here’s a generic type
derivation for *any* term `α and β`:

Notice D_{α} and D_{β}: these are *holes* in the derivation, which ought to
be filled in with sub-derivations proving that α has type Bool and β
has type Bool. Thus, “α : Bool” and “β : Bool” are *assumptions* we
must make for the type derivation to work. However, *if* these
assumptions hold, *then* the conclusion of the derivation must hold.
We can capture this fact in a type rule whose premises are these
assumptions, and whose conclusion is the conclusion of the whole
derivation:
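Written out in standard inference-rule notation, the resulting rule would read roughly as follows (an ASCII sketch based on the description above, not the paper's rendered figure):

```
α : Bool    β : Bool
--------------------   (t-and)
   α and β : Bool
```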

And so we’ve inferred the canonical type rule for `and`! Notice that
(i) it doesn’t mention `if` at all, so it’s not leaking the inside of
the macro, and (ii) it’s guaranteed to be correct, so it’s a good
starting point for fixing up a type system to present errors at the
right level of abstraction. This was a simple example for illustrative
purposes, but we’ve tested the approach on a variety of sugars and
type systems.

You can read more in our paper, or play with the implementation.

## Picking Colors for Pyret Error Messages

Tags: Pyret

*Posted on 11 June 2018.*

Pyret has beautiful error messages, like this one:

Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. It is useful to keep all of the colors distinct because it provides a very intuitive one-to-one mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance that they were mentioning the same thing.

(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)

How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.

Except that there might be *multiple* error messages. In fact, if you
have failing test cases, then you’ll have one failure message per
failing test case, each with its own highlight, so there is no upper
bound on how many colors we need. (Pyret will only show one of these
highlights at a time—whichever one you have selected—but even so
it’s nice for them to all have different colors.) Thus we’ll need to
be able to *generate* a set of colors on demand.

Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.

Except that it’s difficult to tell how many colors we need beforehand.
In fact, Pyret has a REPL, where users can evaluate expressions, which
might throw *more* errors. Thus it’s *impossible* to know how many
colors we’ll need beforehand, because the user can always produce
*more* errors in the REPL.

Therefore, however we pick colors, it must satisfy these two properties:

- **Distinctness**: all of the colors in all of the highlights should be as visually different from each other as possible.
- **Streaming**: we must always be able to pick new colors.

Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:

- **Uniformity**: all of the colors should have the same saturation (i.e., colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.

### The *Phillotactic* Color-Picking Algorithm

Now let’s talk about the algorithm we use!

(Note that this is “ph**i**llotactic”, not “ph**y**llotactic”. It has
nothing to do with plants.)

To keep uniformity, it makes sense to pick colors from a rainbow. This
is a circle in color space, with constant saturation and lightness and
varying hue. Which color space should we use? We should *not* use RGB,
because that space doesn’t agree well with how colors actually appear.
For example, if we used a rainbow in RGB space, then green would
appear far too bright and blue would appear far too dark. Instead, we
should use a color space that agrees with how people actually
perceive colors. The CIELAB color space is better. It was designed so
that the distance between two colors in it approximately agrees with
how different the colors seem when you look at them. (It’s only
approximate because—among other things—perceptual color space is
non-Euclidean.)

Therefore we’ll pick colors from a circle in CIELAB space. This space
has three coordinates: L for lightness, A for green-red, and B for
blue-yellow (hence the LAB). We determined by experimentation that a
good lightness to use was 73 out of 100. Given this lightness, we
picked the largest saturation possible, using `A^2 + B^2 = 40^2`.

Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.

Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.

Thus the phillotactic algorithm keeps track of the number of colors
generated so far, and assigns the n’th color a hue of n times the
golden angle. So the first color will have a hue of 0
degrees. The second color will have a hue of 137 degrees. The third will have
a hue of 137 * 2 = 274 degrees. The fourth will be 137 * 3 = 411 = 51
degrees. This is a *little* close to the first color. But even if we
knew we’d have four colors total, they’d be at most 90 degrees
apart, so 51 isn’t too bad. This trend continues: as we pick more and
more colors, they never end up much closer to one another than is
necessary.
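The whole generator fits in a few lines. A sketch in Python, using the constants from above (this is illustrative, not Pyret's actual implementation):

```python
import itertools
import math

# 360 / phi^2: the golden angle, about 137.5 degrees.
GOLDEN_ANGLE = 360 / ((1 + math.sqrt(5)) / 2) ** 2

def phillotactic_colors():
    """Yield an endless stream of CIELAB (L, A, B) colors on the circle
    L = 73, A^2 + B^2 = 40^2, with consecutive hues a golden angle apart."""
    for n in itertools.count():
        hue = (n * GOLDEN_ANGLE) % 360
        yield (73,
               40 * math.cos(math.radians(hue)),
               40 * math.sin(math.radians(hue)))

colors = phillotactic_colors()
hues = [(n * GOLDEN_ANGLE) % 360 for n in range(4)]
print([round(h) for h in hues])  # [0, 138, 275, 53]
```

(The exact golden angle is about 137.5 degrees, which is why these hues differ slightly from the rounded 137/274/51 figures above.)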

There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:

Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.

Let’s measure angles in *turns*, or fractions of 360 degrees. The n’th
color’s hue is, by definition, `n/ϕ^2 % 1` turns. The 0’th hue is 0.
So if these colors are similar, then `n/ϕ^2 % 1 ~= 0` (using `~=`
for “approximately equals”). We can then reason as follows, using in
the third step the fact that `ϕ^2 - ϕ - 1 = 0`, so `ϕ^2 = ϕ + 1`:

```
n/ϕ^2 % 1 ~= 0
n/ϕ^2 ~= k for some integer k
ϕ^2 ~= n/k
1 + ϕ ~= n/k
ϕ ~= (n-k)/k
```

Now, if n is small, then k is small (because `n/k ~= ϕ^2`), so
`(n-k)/k` is a fraction with a small denominator. But ϕ is difficult
to approximate with fractions, and the smaller the denominator the
worse the approximation, so `ϕ` actually isn’t very close to
`(n-k)/k`. Therefore `n/ϕ^2 % 1` actually isn’t very close to `0`, and
so the n’th color actually isn’t very similar to the 0’th color.
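This argument can be sanity-checked numerically. A quick sketch (illustrative, not part of Pyret): among the first 50 colors, the hue that comes back closest to color 0's is the 34th, and even it stays over a hundredth of a turn (about 4.7 degrees) away.

```python
import math

phi = (1 + math.sqrt(5)) / 2

def hue_gap(n):
    """Distance, in turns, from the n'th hue to the 0'th hue."""
    t = (n / phi**2) % 1
    return min(t, 1 - t)  # distance to the nearest whole turn

closest = min(range(1, 51), key=hue_gap)
print(closest, round(hue_gap(closest), 4))  # 34 0.0132
```

(It is no accident that 34 is a Fibonacci number: Fibonacci ratios are the best rational approximations to ϕ.)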

And that’s why the phillotactic colors work.