How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?
This skill—learning about something by poking at it—is frequently used but almost never taught. However, we have tried to teach it via what we dub mystery languages, and this post is about how we went about it.
A mystery language is exactly what it sounds like: a programming language whose behavior is a mystery. Each assignment comes with some vague documentation, and an editor in which you can write programs. However, when you run a program it will produce multiple answers! This is because there are actually multiple languages, with the same syntax but different semantics. The answers you get are the results of running your program in all of the different languages. The goal of the assignment is to find programs that tell the languages apart.
As an example, maybe the languages have the syntax
a[i], and the
documentation says that this “accesses an array
a at index
That totally specifies the behavior of this syntax, right?
Not even close. For example, what happens if the index is out of bounds?
Does it raise an error (like Java), or return
valid one (like Python)? And what happens if the index is
(EDIT: Specifically, Python wraps negative indices that are smaller than the length of the array.)
Students engage with the mystery languages in three ways:
- The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a classifier).
- The second part of the assignment is to describe a theory that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
- Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.
This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.
Why Mystery Languages?
We hope to teach a number of skills through mystery languages:
- Evaluating Design Decisions: The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
- Syntax vs. Semantics: We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
- Adversarial Thinking: Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
- Experimental Thinking: In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to explain the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.
Adopting Mystery Languages
If you want to use Mystery Languages in your course, email us and we’ll help you get started!
There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.
No one should have to see the insides of your macros. Yet type errors
often reveal them. For example, here is a very simple
and macro in
Rust (of course you should just use
&& instead, but we’ll use this
as a simple working example):
and the type error message you get if you misuse it:
You can see that it shows you the definition of this macro. In this case it’s not so bad, but other macros can get messy, and you might not want to see their guts. Plus in principle, a type error should only show the erronous code, not correct code that it happend to call. You wouldn’t be very happy with a type checker that sometimes threw an error deep in the body of a (type correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?
The reason Rust does is that that it does not know the type of
As a result, it can only type check after
and has been desugared
(a.k.a., expanded), and so the error occurs in the desugared code.
But what if Rust could automatically infer a type rule for checking
and, using only the its definition? Then the error could be found in
the original program that you wrote (rather than its expansion), and
presented as such. This is exactly what we did—albeit for simpler type
systems than Rust’s—in our recent PLDI’18 paper
Inferring Type Rules for Syntactic Sugar.
We want to automatically derive a type rule for
and, and we want it
to be correct. But what does it mean for it to be correct? Well, the
and is defined by its desugaring:
α and β is synonymous
if α then β else false. Thus they should have the same type
(here the fancy D means “desugar”):
Isurf ||- means “in the surface language type system”,
||- means “in the core language type system”, and the fancy D means
How can we achieve this? The most straightforward to do is to capture
iff with two type rules, one for the forward implication, and
one for the reverse:
The first type rule is useful because it says how to type check
in terms of its desugaring. For example, here’s a derivation that
true and false has type
However, while this
t-and→ rule is accurate, it’s not the canonical
type rule for
and that you’d expect. And worse, it mentions
it’s leaking the implementation of the macro!
However, we can automatically construct a better type rule. The trick
is to look at a more general derivation. Here’s a generic type
derivation for any term
α and β:
Notice Dα and Dβ: these are holes in the derivation, which ought to be filled in with sub-derivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule, whole premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:
And so we’ve inferred the canonical type rule for
and! Notice that
(i) it doesn’t mention
if at all, so it’s not leaking the inside of
the macro, and (ii) it’s guaranteed to be correct, so it’s a good
starting point for fixing up a type system to present errors at the
right level of abstraction. This was a simple example for illustrative
purposes, but we’ve tested the approach on a variety of sugars and
Tags: PyretPosted on 11 June 2018.
Pyret has beautiful error messages, like this one:
Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. It is useful to keep all of the colors distinct because it provides a very intuitive one-to-one mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance that they were mentioning the same thing.
(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)
How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.
Except that there might be multiple error messages. In fact, if you have failing test cases, then you’ll have one failure message per failing test case, each with its own highlight, so there is no upper bound on how many colors we need. (Pyret will only show one of these highlights at a time—whichever one you have selected—but even so it’s nice for them to all have different colors.) Thus we’ll need to be able to generate a set of colors on demand.
Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.
Except that it’s difficult to tell how many colors we need beforehand. In fact, Pyret has a REPL, where users can evaluate expressions, which might throw more errors. Thus it’s impossible to know how many colors we’ll need beforehand, because the user can always produce more errors in the REPL.
Therefore, however we pick colors, it must satisfy these two properties:
- Distinctness: all of the colors in all of the highlights should be as visually different from each other as possible.
- Streaming: we must always be able to pick new colors.
Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:
- Uniformity: all of the colors should have the same saturation (i.e. colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.
The Phillotactic Color-Picking Algorithm
Now let’s talk about the algorithm we use!
(Note that this is “phillotactic”, not “phyllotactic”. It has nothing to do with plants.)
To keep uniformity, it makes sense to pick colors from a rainbow. This is a circle in color space, with constant saturation and lightness and varying hue. Which color space should we use? We should not use RGB, because that space doesn’t agree well with how colors actually appear. For example, if we used a rainbow in RGB space, then green would appear far too bright and blue would appear far too dark. Instead, we should use a color space that agrees with how people actually perceieve colors. The CIELAB color space is better. It was designed so that if you take the distance between two colors in it, that distance approximately agrees with how different the colors seem when you look at them. (It’s only approximate because—among other things—perceptual color space is non-Euclidean.)
Therefore we’ll pick colors from a circle in CIELAB space. This space
has three coordinates: L, for lightness, A for green-red, and B for
blue-yellow (hence the LAB). We determined by experimentation that a
good lightness to use was 73 out of 100. Given this lightness, we
picked the largest saturation possible, using
A^2 + B^2 = 40^2.
Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.
Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.
Thus the phillogenic algorithm keeps track of the number of colors generated so far, and assigns the n’th color a hue of n times the golden angle. So the first color will have a hue of 0 degrees. The second color will have a hue of 137 degrees. The third will have a hue of 137 * 2 = 274 degrees. The fourth will be 137 * 3 = 411 = 51 degrees. This is a little close to the first color. But even if we knew we’d have four colors total, they’d be at most 90 degrees apart, so 51 isn’t too bad. This trend continues: as we pick more and more colors, they never end up much closer to one another than is necessary.
There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:
Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.
Let’s measure angles in turns, or fractions of 360 degrees. The n’th
color’s hue is, by definition,
n/ϕ^2 % 1 turns. The 0’th hue is 0.
So if these colors are similar, then
n/ϕ^2 % 1 ~= 0 (using
for “approximately equals”). We can then reason as follows, using in
the third step the fact that
ϕ^2 - ϕ - 1 = 0 so
ϕ^2 = ϕ + 1:
n/ϕ^2 % 1 ~= 0 n/ϕ^2 ~= k for some integer k ϕ^2 ~= n/k 1 + ϕ ~= n/k ϕ ~= (n-k)/k
Now, if n is small, then k is small (because
n/k ~= ϕ^2), so
(n-k)/k is a fraction with a small denominator. But ϕ is difficult
to approximate with fractions, and the smaller the denominator the
worse the approximation, so
ϕ actually isn’t very close to
n/ϕ^2 % 1 actually isn’t very close to
0, so the
n’th color actually isn’t very similar to the 0’th color.
And that’s why the phillotactic colors work.
Programming languages are user interfaces. There are several ways of making decisions when designing user interfaces, including:
- a small number of designers make all the decisions, or
- user studies and feedback are used to make decisions.
Most programming languages have been designed by a Benevolent Dictator for Life or a committee, which corresponds to the first model. What happens if we try out the second?
We decided to explore this question. To get a large enough number of answers (and to engage in rapid experimentation), we decided to conduct surveys on Amazon Mechanical Turk, a forum known to have many technically literate people. We studied a wide range of common programming languages features, from numbers to scope to objects to overloading behavior.
We applied two concrete measures to the survey results:
- Consistency: whether individuals answer similar questions the same way, and
- Consensus: whether we find similar answers across individuals.
Observe that a high value of either one has clear implications for language design, and if both are high, that suggests we have zeroed in on a “most natural” language.
As Betteridge’s Law suggests, we found neither. Indeed,
A surprising percentage of workers expected some kind of dynamic scope (83.9%).
Some workers thought that record access would distribute over the field name expression.
Some workers ignored type annotations on functions.
Over the field and method questions we asked on objects, no worker expected Java’s semantics across all three.
These and other findings are explored in detail in our paper.
For decades, we have neglected performing serious user studies of formal-methods tools. This is now starting to change. A recent post introduces our new work in this area.
That study works with students in an upper-level class, who are a fairly good proxy for some developers (and are anyway an audience we have good access to). Unfortunately, student populations are problematic for several reasons:
There are only so many students in a class. There may not be enough to obtain statistical strength, especially on designs that require A/B testing and the like.
The class is offered only so often. It may take a whole year between studies. (This is a common problem in computing education research.)
As students progress through a class, it’s hard to “rewind” them and study their responses at an earlier stage in their learning.
And so on. It would be helpful if we could obtain large numbers of users quickly, relatively cheaply, and repeatedly.
This naturally suggests crowdsourcing. Unfortunately, the tasks we are examining involve using tools based on formal logic, not identifying birds or picking Web site colors (or solving CAPTCHAs…). That would seem to greatly limit the utility of crowd-workers on popular sites like Mechanical Turk.
In reality, this depends on how the problem is phrased. If we view it as “Can we find lots of Turkers with knowledge of Promela (or Alloy or …)?”, the answer is pretty negative. If, however, we can rework the problems somewhat so the question is “Can we get people to work on a puzzle?”, we can find many, many more workers. That is, sometimes the problem is one of vocabulary (and in particular, the use of specific formal methods languages) than of raw ability.
Concretely, we have taken the following steps:
Adapt problems from being questions about Alloy specifications to being phrased as logic “puzzles”.
Provide an initial training phase to make sure workers understand what we’re after.
Follow that with an evaluation phase to ensure that they “got the idea”. Only consider responses from those workers who score at a high enough threshold on evaluation.
Only then conduct the actual study.
Observe that even if we don’t want to trust the final results obtained from crowdsourcing, there are still uses for this process. Designing a good study requires several rounds of prototyping: even simple wording choices can have huge and unforeseen (negative) consequences. The more rounds we get to test a study, the better it will come out. Therefore, the crowd is useful at least to prototype and refine a study before unleashing it on a more qualified, harder-to-find audience — a group that, almost by definition, you do not want to waste on a first-round study prototype.
For more information, see our paper. We find fairly useful results using workers on Mechanical Turk. In many cases the findings there correspond with those we found with class students.