Resugaring Type Rules
Tags: Programming Languages, Types
Posted on 19 June 2018.This is the final post in a series about resugaring. It focuses on resugaring type rules. See also our posts on resugaring evaluation steps and resugaring scope rules.
No one should have to see the insides of your macros. Yet type errors
often reveal them. For example, here is a very simple and
macro in
Rust (of course you should just use &&
instead, but we’ll use this
as a simple working example):
and the type error message you get if you misuse it:
You can see that it shows you the definition of this macro. In this case it’s not so bad, but other macros can get messy, and you might not want to see their guts. Plus in principle, a type error should only show the erronous code, not correct code that it happend to call. You wouldn’t be very happy with a type checker that sometimes threw an error deep in the body of a (type correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?
The reason Rust does is that that it does not know the type of and
.
As a result, it can only type check after and
has been desugared
(a.k.a., expanded), and so the error occurs in the desugared code.
But what if Rust could automatically infer a type rule for checking
and
, using only the its definition? Then the error could be found in
the original program that you wrote (rather than its expansion), and
presented as such. This is exactly what we did—albeit for simpler type
systems than Rust’s—in our recent PLDI’18 paper
Inferring Type Rules for Syntactic Sugar.
We call this process resugaring type rules; akin to our previous work
on
resugaring evaluation steps
and
resugaring scope rules.
Let’s walk through the resugaring of a type rule for and
:
We want to automatically derive a type rule for and
, and we want it
to be correct. But what does it mean for it to be correct? Well, the
meaning of and
is defined by its desugaring: α and β
is synonymous
with if α then β else false
. Thus they should have the same type
(here the fancy D means “desugar”):
(Isurf 
means “in the surface language type system”, Icore

means “in the core language type system”, and the fancy D means
“desugar”.)
How can we achieve this? The most straightforward to do is to capture
the iff
with two type rules, one for the forward implication, and
one for the reverse:
The first type rule is useful because it says how to type check and
in terms of its desugaring. For example, here’s a derivation that
true and false
has type Bool
:
However, while this tand
^{→} rule is accurate, it’s not the canonical
type rule for and
that you’d expect. And worse, it mentions if
, so
it’s leaking the implementation of the macro!
However, we can automatically construct a better type rule. The trick
is to look at a more general derivation. Here’s a generic type
derivation for any term α and β
:
Notice D_{α} and D_{β}: these are holes in the derivation, which ought to be filled in with subderivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule, whole premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:
And so we’ve inferred the canonical type rule for and
! Notice that
(i) it doesn’t mention if
at all, so it’s not leaking the inside of
the macro, and (ii) it’s guaranteed to be correct, so it’s a good
starting point for fixing up a type system to present errors at the
right level of abstraction. This was a simple example for illustrative
purposes, but we’ve tested the approach on a variety of sugars and
type systems.
You can read more in our paper, or play with the implementation.
Picking Colors for Pyret Error Messages
Tags: Pyret
Posted on 11 June 2018.Pyret has beautiful error messages, like this one:
Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. It is useful to keep all of the colors distinct because it provides a very intuitive onetoone mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance that they were mentioning the same thing.
(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)
How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.
Except that there might be multiple error messages. In fact, if you have failing test cases, then you’ll have one failure message per failing test case, each with its own highlight, so there is no upper bound on how many colors we need. (Pyret will only show one of these highlights at a time—whichever one you have selected—but even so it’s nice for them to all have different colors.) Thus we’ll need to be able to generate a set of colors on demand.
Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.
Except that it’s difficult to tell how many colors we need beforehand. In fact, Pyret has a REPL, where users can evaluate expressions, which might throw more errors. Thus it’s impossible to know how many colors we’ll need beforehand, because the user can always produce more errors in the REPL.
Therefore, however we pick colors, it must satisfy these two properties:
 Distinctness: all of the colors in all of the highlights should be as visually different from each other as possible.
 Streaming: we must always be able to pick new colors.
Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:
 Uniformity: all of the colors should have the same saturation (i.e. colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.
The Phillotactic ColorPicking Algorithm
Now let’s talk about the algorithm we use!
(Note that this is “phillotactic”, not “phyllotactic”. It has nothing to do with plants.)
To keep uniformity, it makes sense to pick colors from a rainbow. This is a circle in color space, with constant saturation and lightness and varying hue. Which color space should we use? We should not use RGB, because that space doesn’t agree well with how colors actually appear. For example, if we used a rainbow in RGB space, then green would appear far too bright and blue would appear far too dark. Instead, we should use a color space that agrees with how people actually perceieve colors. The CIELAB color space is better. It was designed so that if you take the distance between two colors in it, that distance approximately agrees with how different the colors seem when you look at them. (It’s only approximate because—among other things—perceptual color space is nonEuclidean.)
Therefore we’ll pick colors from a circle in CIELAB space. This space
has three coordinates: L, for lightness, A for greenred, and B for
blueyellow (hence the LAB). We determined by experimentation that a
good lightness to use was 73 out of 100. Given this lightness, we
picked the largest saturation possible, using A^2 + B^2 = 40^2
.
Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.
Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.
Thus the phillogenic algorithm keeps track of the number of colors generated so far, and assigns the n’th color a hue of n times the golden angle. So the first color will have a hue of 0 degrees. The second color will have a hue of 137 degrees. The third will have a hue of 137 * 2 = 274 degrees. The fourth will be 137 * 3 = 411 = 51 degrees. This is a little close to the first color. But even if we knew we’d have four colors total, they’d be at most 90 degrees apart, so 51 isn’t too bad. This trend continues: as we pick more and more colors, they never end up much closer to one another than is necessary.
There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:
Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.
Let’s measure angles in turns, or fractions of 360 degrees. The n’th
color’s hue is, by definition, n/ϕ^2 % 1
turns. The 0’th hue is 0.
So if these colors are similar, then n/ϕ^2 % 1 ~= 0
(using ~=
for “approximately equals”). We can then reason as follows, using in
the third step the fact that ϕ^2  ϕ  1 = 0
so ϕ^2 = ϕ + 1
:
n/ϕ^2 % 1 ~= 0
n/ϕ^2 ~= k for some integer k
ϕ^2 ~= n/k
1 + ϕ ~= n/k
ϕ ~= (nk)/k
Now, if n is small, then k is small (because n/k ~= ϕ^2
), so
(nk)/k
is a fraction with a small denominator. But ϕ is difficult
to approximate with fractions, and the smaller the denominator the
worse the approximation, so ϕ
actually isn’t very close to
(nk)/k
, so n/ϕ^2 % 1
actually isn’t very close to 0
, so the
n’th color actually isn’t very similar to the 0’th color.
And that’s why the phillotactic colors work.
Can We Crowdsource Language Design?
Tags: Programming Languages, User Studies, Semantics
Posted on 06 July 2017.Programming languages are user interfaces. There are several ways of making decisions when designing user interfaces, including:
 a small number of designers make all the decisions, or
 user studies and feedback are used to make decisions.
Most programming languages have been designed by a Benevolent Dictator for Life or a committee, which corresponds to the first model. What happens if we try out the second?
We decided to explore this question. To get a large enough number of answers (and to engage in rapid experimentation), we decided to conduct surveys on Amazon Mechanical Turk, a forum known to have many technically literate people. We studied a wide range of common programming languages features, from numbers to scope to objects to overloading behavior.
We applied two concrete measures to the survey results:
 Consistency: whether individuals answer similar questions the same way, and
 Consensus: whether we find similar answers across individuals.
Observe that a high value of either one has clear implications for language design, and if both are high, that suggests we have zeroed in on a “most natural” language.
As Betteridge’s Law suggests, we found neither. Indeed,

A surprising percentage of workers expected some kind of dynamic scope (83.9%).

Some workers thought that record access would distribute over the field name expression.

Some workers ignored type annotations on functions.

Over the field and method questions we asked on objects, no worker expected Java’s semantics across all three.
These and other findings are explored in detail in our paper.
Crowdsourcing User Studies for Formal Methods
Tags: Verification, User Studies
Posted on 03 July 2017.For decades, we have neglected performing serious user studies of formalmethods tools. This is now starting to change. A recent post introduces our new work in this area.
That study works with students in an upperlevel class, who are a fairly good proxy for some developers (and are anyway an audience we have good access to). Unfortunately, student populations are problematic for several reasons:

There are only so many students in a class. There may not be enough to obtain statistical strength, especially on designs that require A/B testing and the like.

The class is offered only so often. It may take a whole year between studies. (This is a common problem in computing education research.)

As students progress through a class, it’s hard to “rewind” them and study their responses at an earlier stage in their learning.
And so on. It would be helpful if we could obtain large numbers of users quickly, relatively cheaply, and repeatedly.
This naturally suggests crowdsourcing. Unfortunately, the tasks we are examining involve using tools based on formal logic, not identifying birds or picking Web site colors (or solving CAPTCHAs…). That would seem to greatly limit the utility of crowdworkers on popular sites like Mechanical Turk.
In reality, this depends on how the problem is phrased. If we view it as “Can we find lots of Turkers with knowledge of Promela (or Alloy or …)?”, the answer is pretty negative. If, however, we can rework the problems somewhat so the question is “Can we get people to work on a puzzle?”, we can find many, many more workers. That is, sometimes the problem is one of vocabulary (and in particular, the use of specific formal methods languages) than of raw ability.
Concretely, we have taken the following steps:

Adapt problems from being questions about Alloy specifications to being phrased as logic “puzzles”.

Provide an initial training phase to make sure workers understand what we’re after.

Follow that with an evaluation phase to ensure that they “got the idea”. Only consider responses from those workers who score at a high enough threshold on evaluation.

Only then conduct the actual study.
Observe that even if we don’t want to trust the final results obtained from crowdsourcing, there are still uses for this process. Designing a good study requires several rounds of prototyping: even simple wording choices can have huge and unforeseen (negative) consequences. The more rounds we get to test a study, the better it will come out. Therefore, the crowd is useful at least to prototype and refine a study before unleashing it on a more qualified, hardertofind audience — a group that, almost by definition, you do not want to waste on a firstround study prototype.
For more information, see our paper. We find fairly useful results using workers on Mechanical Turk. In many cases the findings there correspond with those we found with class students.
User Studies of Principled Model Finder Output
Tags: Verification, User Studies
Posted on 01 July 2017.For decades, formalmethods tools have largely been evaluated on their correctness, completeness, and mathematical foundations while sidestepping or handwaving questions of usability. As a result, tools like model checkers, model finders, and proof assistants can require years of expertise to negotiate, leaving knowledgeable but uninitiated potential users at a loss. This state of affairs must change!
One class of formal tool, model finders, provides concrete instances of a specification, which can guide a user’s intuition or witness the failure of desired properties. But are the examples produced actually helpful? Which examples ought to be shown first? How should they be presented, and what supplementary information can aid comprehension? Indeed, could they even hinder understanding?
We’ve set out to answer these questions via disciplined userstudies. Where can we find participants for these studies? Ideally, we would survey experts. Unfortunately, it has been challenging to do so in the quantities needed for statistical power. As an alternative, we have begun to use formal methods students in Brown’s upperlevel Logic for Systems class. The course begins with Alloy, a popular modelfinding tool, so students are well suited to participate in basic studies. With this population, we have found some surprising results that call into question some intuitively appealing answers to (e.g.) the exampleselection question.
For more information, see our paper.
Okay, that’s student populations. But there are only so many students in a class, and they take the class only so often, and it’s hard to “rewind” them to an earlier point in a course. Are there audiences we can use that don’t have these problems? Stay tuned for our next post.