Mystery Languages

Tags: Programming Languages, Education

Posted on 05 July 2018.

How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?

This skill—learning about something by poking at it—is frequently used but almost never taught. However, we have tried to teach it via what we dub mystery languages, and this post is about how we went about it.

Mystery Languages

A mystery language is exactly what it sounds like: a programming language whose behavior is a mystery. Each assignment comes with some vague documentation, and an editor in which you can write programs. However, when you run a program it will produce multiple answers! This is because there are actually multiple languages, with the same syntax but different semantics. The answers you get are the results of running your program in all of the different languages. The goal of the assignment is to find programs that tell the languages apart.

As an example, maybe the languages have the syntax a[i], and the documentation says that this “accesses an array a at index i”. That totally specifies the behavior of this syntax, right?

Not even close. For example, what happens if the index is out of bounds? Does it raise an error (like Java), or return undefined (like JavaScript), or produce nonsense (like C), or wrap the index to a valid one (like Python)? And what happens if the index is 2.5, or "2", or 2.0, or "two"?

(EDIT: Specifically, Python wraps negative indices that are smaller than the length of the array.)
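To make these questions concrete, here is how one more real language answers a couple of them. The sketch below is in Rust (which is not one of the mystery languages); it is only meant to show the kind of edge-case probing these assignments call for.

```rust
// Probing array-access edge cases in Rust, for comparison with the
// languages mentioned above.
fn main() {
    let a = vec![10, 20, 30];

    // Rust's checked accessor: an out-of-bounds index yields None.
    println!("{:?}", a.get(5)); // prints "None"

    // A non-integer index is rejected before the program even runs:
    // println!("{}", a[2.5]); // compile error: a Vec can't be indexed by a float

    // Plain indexing out of bounds panics at runtime, rather than returning
    // undefined (JavaScript) or reading past the end of the array (C).
    println!("{}", a[5]); // panics: index out of bounds
}
```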

Students engage with the mystery languages in three ways:

  1. The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a classifier).
  2. The second part of the assignment is to describe a theory that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
  3. Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.

This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.

Why Mystery Languages?

We hope to teach a number of skills through mystery languages:

  • Evaluating Design Decisions: The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
  • Syntax vs. Semantics: We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
  • Adversarial Thinking: Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
  • Experimental Thinking: In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to explain the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.

Adopting Mystery Languages

If you want to use Mystery Languages in your course, email us and we’ll help you get started!

There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.

Learn More

You can learn more about mystery languages in our SNAPL’17 paper, or dive in and try them yourself (see the assignments prefixed with “ML:”).

Resugaring Type Rules

Tags: Programming Languages, Types

Posted on 19 June 2018.

This is the final post in a series about resugaring. It focuses on resugaring type rules. See also our posts on resugaring evaluation steps and resugaring scope rules.


No one should have to see the insides of your macros. Yet type errors often reveal them. For example, here is a very simple and macro in Rust (of course you should just use the built-in && instead, but and makes a simple working example):
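A rough sketch of such a macro (its expansion follows the desugaring used later in this post, where α and β becomes if α then β else false):

```rust
// A sketch of an `and` macro: each use expands to an if expression.
macro_rules! and {
    ($a:expr, $b:expr) => {
        if $a { $b } else { false }
    };
}

fn main() {
    // A correct use: both arguments are booleans.
    println!("{}", and!(true, false)); // prints "false"
}
```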

and the type error message you get if you misuse it:

That error message shows you the definition of this macro. In this case it's not so bad, but other macros can get messy, and you might not want to see their guts. Plus, in principle, a type error should only show the erroneous code, not correct code that it happened to call. You wouldn't be very happy with a type checker that sometimes threw an error deep in the body of a (type correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?

The reason Rust does is that it does not know the type of and. As a result, it can only type check after and has been desugared (a.k.a. expanded), and so the error occurs in the desugared code.

But what if Rust could automatically infer a type rule for checking and, using only its definition? Then the error could be found in the original program that you wrote (rather than its expansion), and presented as such. This is exactly what we did—albeit for simpler type systems than Rust’s—in our recent PLDI’18 paper Inferring Type Rules for Syntactic Sugar.

We call this process resugaring type rules, akin to our previous work on resugaring evaluation steps and resugaring scope rules. Let’s walk through the resugaring of a type rule for and:

We want to automatically derive a type rule for and, and we want it to be correct. But what does it mean for it to be correct? Well, the meaning of and is defined by its desugaring: α and β is synonymous with if α then β else false. Thus they should have the same type (here the fancy D means “desugar”):
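In symbols, the property is roughly:

    Isurf ||- α and β : τ    iff    Icore ||- D(α and β) : τ

where D(α and β) is if α then β else false.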

(Isurf ||- means “in the surface language type system”, Icore ||- means “in the core language type system”, and the fancy D means “desugar”.)

How can we achieve this? The most straightforward thing to do is to capture the iff with two type rules, one for the forward implication and one for the reverse:
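Sketched in the same notation:

    Icore ||- if α then β else false : τ
    ------------------------------------   (t-and)
    Isurf ||- α and β : τ

    Isurf ||- α and β : τ
    ------------------------------------
    Icore ||- if α then β else false : τ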

The first type rule is useful because it says how to type check and in terms of its desugaring. For example, here’s a derivation that true and false has type Bool:
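A sketch of that derivation:

    true : Bool    false : Bool    false : Bool
    -----------------------------------------------
    Icore ||- if true then false else false : Bool
    -----------------------------------------------   (t-and)
    Isurf ||- true and false : Bool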

However, while this t-and rule is accurate, it’s not the canonical type rule for and that you’d expect. And worse, it mentions if, so it’s leaking the implementation of the macro!

Fortunately, we can automatically construct a better type rule. The trick is to look at a more general derivation. Here’s a generic type derivation for any term α and β:
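A sketch of such a generic derivation:

        Dα           Dβ
     α : Bool     β : Bool     false : Bool
    -----------------------------------------
    Icore ||- if α then β else false : Bool
    -----------------------------------------   (t-and)
    Isurf ||- α and β : Bool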

Notice Dα and Dβ: these are holes in the derivation, which ought to be filled in with sub-derivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule, whose premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:
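In sketch form:

    α : Bool        β : Bool
    --------------------------
    Isurf ||- α and β : Bool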

And so we’ve inferred the canonical type rule for and! Notice that (i) it doesn’t mention if at all, so it’s not leaking the inside of the macro, and (ii) it’s guaranteed to be correct, so it’s a good starting point for fixing up a type system to present errors at the right level of abstraction. This was a simple example for illustrative purposes, but we’ve tested the approach on a variety of sugars and type systems.

You can read more in our paper, or play with the implementation.

Picking Colors for Pyret Error Messages

Tags: Pyret

Posted on 11 June 2018.

Pyret has beautiful error messages, like this one:

[Sample Pyret error message]

Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. Keeping all of the colors distinct gives an intuitive one-to-one mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance as if they were mentioning the same thing.

(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)

How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.

Except that there might be multiple error messages. In fact, if you have failing test cases, then you’ll have one failure message per failing test case, each with its own highlight, so there is no upper bound on how many colors we need. (Pyret will only show one of these highlights at a time—whichever one you have selected—but even so it’s nice for them to all have different colors.) Thus we’ll need to be able to generate a set of colors on demand.

Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.

Except that it’s difficult to tell how many colors we need beforehand. In fact, Pyret has a REPL, where users can evaluate expressions, which might throw more errors. Thus it’s impossible to know how many colors we’ll need beforehand, because the user can always produce more errors in the REPL.

Therefore, however we pick colors, it must satisfy these two properties:

  • Distinctness: all of the colors in all of the highlights should be as visually different from each other as possible.
  • Streaming: we must always be able to pick new colors.

Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:

  • Uniformity: all of the colors should have the same saturation (i.e. colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.

The Phillotactic Color-Picking Algorithm

Now let’s talk about the algorithm we use!

(Note that this is “phillotactic”, not “phyllotactic”. It has nothing to do with plants.)

To keep uniformity, it makes sense to pick colors from a rainbow. This is a circle in color space, with constant saturation and lightness and varying hue. Which color space should we use? We should not use RGB, because that space doesn’t agree well with how colors actually appear. For example, if we used a rainbow in RGB space, then green would appear far too bright and blue would appear far too dark. Instead, we should use a color space that agrees with how people actually perceive colors. The CIELAB color space is better. It was designed so that if you take the distance between two colors in it, that distance approximately agrees with how different the colors seem when you look at them. (It’s only approximate because—among other things—perceptual color space is non-Euclidean.)

Therefore we’ll pick colors from a circle in CIELAB space. This space has three coordinates: L, for lightness, A for green-red, and B for blue-yellow (hence the LAB). We determined by experimentation that a good lightness to use was 73 out of 100. Given this lightness, we picked the largest saturation possible, using A^2 + B^2 = 40^2.

Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.

Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.

Thus the phillotactic algorithm keeps track of the number of colors generated so far, and assigns the n’th color a hue of n times the golden angle. So the first color will have a hue of 0 degrees. The second color will have a hue of 137 degrees. The third will have a hue of 137 * 2 = 274 degrees. The fourth will have a hue of 137 * 3 = 411 degrees, which wraps around to 51 degrees. This is a little close to the first color. But even if we knew we’d have four colors total, they’d be at most 90 degrees apart, so 51 isn’t too bad. This trend continues: as we pick more and more colors, they never end up much closer to one another than is necessary.
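To make this concrete, here is a small sketch of the color picker in Rust (an illustration only, not Pyret’s actual implementation):

```rust
// A sketch of golden-angle color picking in CIELAB space.

const LIGHTNESS: f64 = 73.0; // the L coordinate: lightness 73 out of 100
const RADIUS: f64 = 40.0;    // so that A^2 + B^2 = 40^2

/// CIELAB coordinates (L, A, B) of the n'th highlight color.
fn nth_color_lab(n: u32) -> (f64, f64, f64) {
    // The golden angle is 1/phi^2 of a full turn (about 137.5 degrees).
    let phi = (1.0 + 5.0_f64.sqrt()) / 2.0;
    let golden_angle = std::f64::consts::TAU / (phi * phi);
    let hue = n as f64 * golden_angle; // radians; cos/sin wrap it for us
    (LIGHTNESS, RADIUS * hue.cos(), RADIUS * hue.sin())
}

fn main() {
    for n in 0..5 {
        let (l, a, b) = nth_color_lab(n);
        println!("color {}: L = {:.1}, A = {:.1}, B = {:.1}", n, l, a, b);
    }
}
```

Converting these CIELAB coordinates into the RGB values a browser needs is a separate (and fiddlier) step, omitted here.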

There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:

Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.

Let’s measure angles in turns, or fractions of 360 degrees. The n’th color’s hue is, by definition, n/ϕ^2 % 1 turns. The 0’th hue is 0. So if these colors are similar, then n/ϕ^2 % 1 ~= 0 (using ~= for “approximately equals”). We can then reason as follows, using in the third step the fact that ϕ^2 - ϕ - 1 = 0 so ϕ^2 = ϕ + 1:

n/ϕ^2 % 1 ~= 0
n/ϕ^2 ~= k    for some integer k
ϕ^2 ~= n/k
1 + ϕ ~= n/k
ϕ ~= (n-k)/k

Now, if n is small, then k is small (because n/k ~= ϕ^2), so (n-k)/k is a fraction with a small denominator. But ϕ is difficult to approximate with fractions, and the smaller the denominator the worse the approximation, so ϕ actually isn’t very close to (n-k)/k, so n/ϕ^2 % 1 actually isn’t very close to 0, so the n’th color actually isn’t very similar to the 0’th color.

And that’s why the phillotactic colors work.

Can We Crowdsource Language Design?

Tags: Programming Languages, User Studies, Semantics

Posted on 06 July 2017.

Programming languages are user interfaces. There are several ways of making decisions when designing user interfaces, including:

  • a small number of designers make all the decisions, or
  • user studies and feedback are used to make decisions.

Most programming languages have been designed by a Benevolent Dictator for Life or a committee, which corresponds to the first model. What happens if we try out the second?

We decided to explore this question. To get a large enough number of answers (and to engage in rapid experimentation), we decided to conduct surveys on Amazon Mechanical Turk, a forum known to have many technically literate people. We studied a wide range of common programming languages features, from numbers to scope to objects to overloading behavior.

We applied two concrete measures to the survey results:

  • Consistency: whether individuals answer similar questions the same way, and
  • Consensus: whether we find similar answers across individuals.

Observe that a high value of either one has clear implications for language design, and if both are high, that suggests we have zeroed in on a “most natural” language.

As Betteridge’s Law suggests, we found neither. Indeed,

  • A surprising percentage of workers expected some kind of dynamic scope (83.9%).

  • Some workers thought that record access would distribute over the field name expression.

  • Some workers ignored type annotations on functions.

  • Over the field and method questions we asked on objects, no worker expected Java’s semantics across all three.

These and other findings are explored in detail in our paper.

Crowdsourcing User Studies for Formal Methods

Tags: Verification, User Studies

Posted on 03 July 2017.

For decades, we have neglected performing serious user studies of formal-methods tools. This is now starting to change. A recent post introduces our new work in this area.

That study works with students in an upper-level class, who are a fairly good proxy for some developers (and are anyway an audience we have good access to). Unfortunately, student populations are problematic for several reasons:

  • There are only so many students in a class. There may not be enough to obtain statistical strength, especially on designs that require A/B testing and the like.

  • The class is offered only so often. It may take a whole year between studies. (This is a common problem in computing education research.)

  • As students progress through a class, it’s hard to “rewind” them and study their responses at an earlier stage in their learning.

And so on. It would be helpful if we could obtain large numbers of users quickly, relatively cheaply, and repeatedly.

This naturally suggests crowdsourcing. Unfortunately, the tasks we are examining involve using tools based on formal logic, not identifying birds or picking Web site colors (or solving CAPTCHAs…). That would seem to greatly limit the utility of crowd-workers on popular sites like Mechanical Turk.

In reality, this depends on how the problem is phrased. If we view it as “Can we find lots of Turkers with knowledge of Promela (or Alloy or …)?”, the answer is pretty negative. If, however, we can rework the problems somewhat so the question is “Can we get people to work on a puzzle?”, we can find many, many more workers. That is, sometimes the problem is more one of vocabulary (and in particular, the use of specific formal methods languages) than of raw ability.

Concretely, we have taken the following steps:

  • Adapt problems from being questions about Alloy specifications to being phrased as logic “puzzles”.

  • Provide an initial training phase to make sure workers understand what we’re after.

  • Follow that with an evaluation phase to ensure that they “got the idea”. Only consider responses from those workers who score at a high enough threshold on evaluation.

  • Only then conduct the actual study.

Observe that even if we don’t want to trust the final results obtained from crowdsourcing, there are still uses for this process. Designing a good study requires several rounds of prototyping: even simple wording choices can have huge and unforeseen (negative) consequences. The more rounds we get to test a study, the better it will come out. Therefore, the crowd is useful at least to prototype and refine a study before unleashing it on a more qualified, harder-to-find audience — a group that, almost by definition, you do not want to waste on a first-round study prototype.

For more information, see our paper. We find fairly useful results using workers on Mechanical Turk. In many cases the findings there correspond with those we found with class students.