The Brown PLT Blog

LTL Tutor

2024-08-08T00:00:00+00:00

We have been engaged in a multi-year project to improve education in Linear Temporal Logic (LTL) [Blog Post 1, Blog Post 2]. In particular, we have arrived at a detailed understanding of typical misconceptions that learners and even experts have. One useful outcome of our studies is a set of instruments (think “quizzes”) that instructors can deploy in their classes to understand how well their students understand the logic and what weaknesses they have.

However, as educators, we recognize that it isn’t always easy to add new materials to classes. Furthermore, suppose your students make certain mistakes: now what? They need explanations of what went wrong, additional drill problems, and a way to check whether they got those right. It’s hard for an educator to make time for all that. And independent learners don’t even have access to such educators.

Recognizing these practical difficulties, we have distilled our group’s expertise in LTL into a free online tutor:

https://ltl-tutor.xyz

We have leveraged insights from our studies to create a tool designed to be used by learners with minimal prior instruction. All you need is a brief introduction to LTL (or even just propositional logic) to get started. As an instructor, you can deliver your usual lecture on the topic (using your preferred framing), and then have your students use the tool to grow and to reinforce their learning.

In contrast to traditional tutoring systems, which are often tied to a specific curriculum or course, our tutor adaptively generates multiple-choice question-sets in the context of common LTL misconceptions.

The tutor provides students who get a question wrong with feedback in terms of their answer’s relationship to the correct answer. Feedback can take the form of visual metaphors, counterexamples, or an interactive trace-stepper that shows the evaluation of an LTL formula across time.

In this example of LTL tutor feedback, a diagram is used to show students that their answer is logically more permissive than the correct answer to the question. The learner is also shown an example of an LTL trace that satisfies their answer but not the correct answer.

In this example of interactive stepper usage, the user examines the satisfaction of the formula (G (z <-> X(a))) in the third state of a trace. While the overall formula is not satisfied at this moment in time, the sub-formula (X a) is satisfied. This allows learners to explore where their understanding of a formula may have diverged from the correct answer.

If learners consistently demonstrate the same misconception, the tutor provides them further insight in the form of tailored text grounded in our previous research.

Here, a student who consistently assumes the presence of the `Globally` operator even when it is not present is given further insight into the pertinent semantics of the operator.

Once it has a history of misconceptions exhibited by the student, the tutor generates novel, personalized question sets designed to drill students on their specific weaknesses. As students use the tutor, the system updates its understanding of their evolving needs, generating question sets to address newly uncovered or pertinent areas of difficulty.

We also designed the LTL Tutor with practical instructor needs in mind:

Curriculum Agnostic: The LTL Tutor is flexible and not tied to any specific curriculum. You can seamlessly integrate it into your existing course without making significant changes. It both generates exercises for students and allows you to import your own problem sets.
Detailed Reporting: To track your class’s progress effectively, you can create a unique course code for your students to enter, so you can get detailed insights into their performance.
Self-Hostable: If you prefer to have full control over your data, the LTL Tutor can easily be self-hosted.

Misconceptions In Finite-Trace and Infinite-Trace Linear Temporal Logic

2024-07-07T00:00:00+00:00

Over the past three years and with a multi-national group of collaborators, we have been digging deeper into misconceptions in LTL (Linear Temporal Logic) and studying misconceptions in LTLf, a promising variant of LTL restricted to finite traces. Why LTL and LTLf? Because beyond their traditional uses in verification and now robot synthesis, they support even more applications, from image processing to web-page testing to process-rule mining. Why study misconceptions? Because ultimately, human users need to fully understand what a formula says before they can safely apply synthesis tools and the like.

Our original post on LTL misconceptions gives more background and motivation. It also explains the main types of questions we use: translations between English specifications and formal specifications.

So what’s new this time around?

First, we provide two test instruments that have been field tested with several audiences:

One instrument [PDF] focuses on the delta between LTL and LTLf. If you know LTL but not LTLf, give it a try! You’ll come away with hands-on experience of the special constraints that finite traces bring.
The other instrument [PDF] is for LTL beginners — to see what preconceptions they bring to the table. It assumes basic awareness of G (“always”), F (“eventually”), and X (“next state”). It does not test the U (“until”) operator. Live survey here.

Second, we find evidence for several concrete misconceptions in the data. Some misconceptions were identified in prior work and are confirmed here. Others are new to this work.

For example, consider the LTLf formula: G(red => X(X(red))). What finite traces satisfy it?

In particular, can any finite traces that have red true at some point satisfy the formula?

Answer: No, because whenever red is true it must be true again two states later, but every finite trace will eventually run out of states.
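To make the argument concrete, here is a minimal sketch of a brute-force check. It is not how the instruments or any LTLf tool work; the tuple encoding of formulas and the function names are our own assumptions, and X is read as the strong "next" of LTLf (false in the last state).

```python
from itertools import product

def holds(formula, trace, i=0):
    """Evaluate an LTLf formula at position i of a finite trace (a list of sets of atoms)."""
    op = formula[0]
    if op == "ap":
        return formula[1] in trace[i]
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "X":  # strong next: a next state must exist
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "G":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "F":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")

RED = ("ap", "red")
FORMULA = ("G", ("implies", RED, ("X", ("X", RED))))  # G(red => X(X(red)))

def satisfying_traces(max_len):
    """Enumerate every non-empty trace over {red} up to max_len; keep those satisfying FORMULA."""
    result = []
    for n in range(1, max_len + 1):
        for bits in product([False, True], repeat=n):
            trace = [{"red"} if b else set() for b in bits]
            if holds(FORMULA, trace):
                result.append(bits)
    return result

sat = satisfying_traces(6)
# Every satisfying trace has red false in all states.
print(all(not any(bits) for bits in sat))
```

Up to length 6, the only traces that survive are the ones where red never holds, one per length.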

Now consider the LTL formula F(X(X(red))). Is it true for an infinite trace where red is true exactly once?

Answer: Yes. But interestingly, some of our LTL beginners said no on the grounds that X(X(red)) ought to "spread out" and constrain three states in a row.
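The "spread out" reading can be checked mechanically. The sketch below evaluates LTL over an infinite trace written as a finite prefix followed by a loop that repeats forever. The encoding is our own assumption, as is the choice to put the single red state at index 3 (the question does not say where it occurs).

```python
def holds(formula, prefix, loop, i=0):
    """Evaluate an LTL formula at position i of the infinite trace prefix + loop + loop + ...
    The loop must be non-empty."""
    states = prefix + loop
    n, loop_start = len(states), len(prefix)

    def successor(j):
        # After the last listed state, the trace jumps back to the start of the loop.
        return j + 1 if j + 1 < n else loop_start

    def ev(f, j):
        op = f[0]
        if op == "ap":
            return f[1] in states[j]
        if op == "X":
            return ev(f[1], successor(j))
        if op in ("F", "G"):
            # Positions reachable from j: the rest of the listed states plus the whole loop.
            future = range(min(j, loop_start), n)
            results = (ev(f[1], k) for k in future)
            return any(results) if op == "F" else all(results)
        raise ValueError(f"unknown operator: {op}")

    return ev(formula, i)

RED = ("ap", "red")
XX_RED = ("X", ("X", RED))
prefix = [set(), set(), set(), {"red"}]   # red holds only at index 3
loop = [set()]                            # and never again

f_xx_red = holds(("F", XX_RED), prefix, loop)
# Positions at which X(X(red)) itself holds.
witnesses = [k for k in range(len(prefix) + len(loop)) if holds(XX_RED, prefix, loop, k)]
```

Here `f_xx_red` is true, and `witnesses` contains a single position: X(X(red)) looks two states ahead and constrains only that one state, not three in a row.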

Third, we provide a code book of misconceptions and how to identify them in new data [PDF].

For more details, see the paper.

See also our LTL Tutor (traditional LTL only, not finite-trace).

Iterative Student Program Planning using Transformer-Driven Feedback

2024-07-05T00:00:00+00:00

We’ve had a few projects now that address this idea of teaching students to plan out solutions to programming problems. A thus-far missing but critical piece is feedback on this planning process. Ideally we want to give students feedback on their plans before they commit to any code details. Our early studies had students express their plans in a semi-formalized way which would’ve allowed us to automatically generate feedback based on formal structure. However, our most recent project highlighted a strong preference towards more freedom in notation, with plans expressed in far less structured language. This presents a challenge when designing automated feedback.

So how do we interpret plans written with little to no restrictions on notation or structure, in order to still give feedback? We throw it at an LLM, right?

It’s never that simple. We first tried direct LLM feedback, handing the student plan to an LLM with instructions of what kinds of feedback to give. Preliminary feedback results ranged from helpful to useless to incorrect. Even worse, we couldn’t prevent the LLM from directly including a correct answer in its response.

So we built a different kind of feedback system. Student plans, expressed mostly in English, are translated into code via an LLM. (We do not allow the LLM to access the problem statement— otherwise it would silently correct student misconceptions when translating into code.) The resulting code is run against an instructor test suite, and the test suite results are shown to the student as feedback.
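The shape of this pipeline can be sketched as follows. This is an illustration under stated assumptions, not our deployed system: `translate` stands in for the LLM call, `fake_translate` is a hard-coded stub, and the problem and tests are made up.

```python
from typing import Callable

def plan_feedback(plan: str,
                  translate: Callable[[str], str],
                  tests: list,
                  func_name: str) -> list:
    """Translate a plan into code, run the code on instructor tests, report pass/fail per test."""
    code = translate(plan)        # the translator sees only the plan, never the problem statement
    namespace: dict = {}
    exec(code, namespace)         # a real deployment would need to sandbox this step
    func = namespace[func_name]
    results = []
    for args, expected in tests:
        try:
            results.append(func(*args) == expected)
        except Exception:
            results.append(False)  # a crash counts as a failed test
    return results

def fake_translate(plan: str) -> str:
    """Hypothetical stand-in for the LLM: returns code that follows the plan literally."""
    return "def total(xs):\n    return sum(x for x in xs if x > 0)\n"

# Hypothetical instructor tests for a problem that asks for the sum of ALL numbers.
tests = [(([1, 2, 3],), 6), (([1, -2, 3],), 2)]
print(plan_feedback("add up the positive numbers", fake_translate, tests, "total"))
# prints [True, False]
```

Because the translator follows the plan rather than the problem statement, the plan's misreading survives into the code and the second test exposes it.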

When we deployed this system, we found that the results from running the LLM-generated code against our instructor test suite seemed to serve as a useful proxy for student plan correctness. However, many issues from the LLM still caused a great deal of student frustration, especially from the LLM not having access to details from the problem statement.

LLMs are good at presenting correct code solutions and correcting errors, and there is clear incentive for these behaviors to improve. But these behaviors are sometimes counterproductive to student feedback. Creating LLM-based feedback systems still requires careful thought in both its design and presentation to students.

For more detail on our design and results, read here.

Differential Analysis: A Summary

2024-06-27T00:00:00+00:00

For multiple decades we have worked on the problem of differential analysis. This post explains where it comes from, what it means, and what its consequences are.

Context: Verification

For decades, numerous researchers have worked on the problem of verification. To a first approximation, we can describe this problem as follows:

P ⊧ ɸ

That is, checking whether some program P satisfies some property ɸ. There have been many productive discussions about exactly what methods we should use, but this remains the fundamental question.

Around 2004, we started to build tools to verify a variety of interesting system descriptions (the Ps), starting with access-control policies. They could also be security policies, network configurations, and more. We especially recognized that many of these system descriptions are (sufficiently) sub-Turing-complete, which means we can apply rich methods to precisely answer questions about them. That is, there is a rich set of problems we can verify.

The Problem

Attractive as this idea is, it runs into a significant problem in practice. When you speak to practitioners, you find that they are not short of system descriptions (Ps), but they are severely lacking in properties (ɸs). The problem is not what some might imagine — that they can’t express their properties in some precise logic (which is a separate problem!) — but rather that they struggle to express non-trivial properties at all. A typical conversation might go like:

We have a tool for verification!
Nice, what does it do?
It consumes system descriptions of the kind you produce!
Oh, that’s great!
You just need to specify your properties.
My what?
Your properties! What you want your system to do!
I don’t have any properties!
But what do you want the system to do?
… Work correctly?

This is not to mock practitioners: not at all. Quite the contrary! Even formal methods experts would struggle to precisely describe the expected behavior of complex systems. In fact, talk to formal methods researchers long enough and they’ll admit knowing that this is a problem. It’s just not something we like to think about.

An Alternative

In fact, the “practitioner” answer shows us the path forward. What does it mean for a system to “work”? How does a system’s maintainer know that it “works”?

As a practical matter, many things help us confirm that a system is working well enough. We might have some test suites; we might monitor its execution; we observe it run and use it; lots of people use it every day; we might even have verified a few properties of a few parts! The net result is that we have confidence in a system.

And then, things happen! Typically, one of two things:

We find a bug, and need to fix it.
We modify an existing feature or add a new one (or — all too rarely — remove one!).

So the problem we run into is the following:

How do we transfer the confidence we had
in the old version to the new one?

Put differently, the core of formal methods is checking for compatibility between two artifacts. Traditionally, we have a system description (P) and a property (ɸ); these are meant to be expressed independent of one another, so that compatibility gives us confidence and incompatibility indicates an error. But now we have two different artifacts: an old system (call it P) and a new one (call it P’). These are obviously not going to be the same (except in rare cases; see below), but broadly, we want to know, how are they different?

P - P'

Of course, what we care about is not the syntactic difference, but the semantic change. Large syntactic differences may have small semantic ones and vice versa.

Defining the Difference

Computing the semantic difference is often not easy. There is a long line of work on computing the difference of programs. However, it is difficult for Turing-complete languages; it is also not always clear what the type of the difference should be. Computing it is a lot easier when the language is sub-Turing-complete (as our papers show). Exactly what a “difference” is, is also an interesting question.

Many of the system description languages we have worked with tend to be of the form Request ⇒ Response. For instance, an access control policy might have the type:

Request ⇒ {Accept, Deny}

(In practice, policy languages can be much richer, but this suffices for illustration.) So what is the type of the difference? It maps every request to the cross-product of responses:

Request ⇒ {Accept↦Accept, Deny↦Deny, Accept↦Deny, Deny↦Accept}

That is: some requests that used to be accepted still are; some that were denied still are; but some that were accepted are now denied, and some that were denied are now accepted. (This assumes the domains are exactly the same; otherwise some requests that previously produced a decision no longer do, and vice versa. We’ll assume you can work out the details of domains with bottom values.) The difference is of course the requests whose outcomes change: in this case,

Request ⇒ {Accept↦Deny, Deny↦Accept}
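As a small illustration of this type, the sketch below computes the difference of two toy policies by enumerating a tiny request space. The policies, roles, and actions are invented for the example, and enumeration is only a stand-in: it says nothing about how the tools in our papers compute differences.

```python
from itertools import product

ACCEPT, DENY = "Accept", "Deny"

def old_policy(role, action):
    """Hypothetical old policy: admins can do anything, everyone can read."""
    return ACCEPT if role == "admin" or action == "read" else DENY

def new_policy(role, action):
    """Hypothetical new policy: editors gain full access, guests lose all access."""
    if role == "guest":
        return DENY
    return ACCEPT if role in ("admin", "editor") or action == "read" else DENY

def semantic_diff(p_old, p_new, requests):
    """Map each request whose decision changed to its (old, new) pair."""
    diff = {}
    for req in requests:
        before, after = p_old(*req), p_new(*req)
        if before != after:
            diff[req] = (before, after)
    return diff

requests = list(product(["admin", "editor", "guest"], ["read", "write"]))
diff = semantic_diff(old_policy, new_policy, requests)
for req, change in diff.items():
    print(req, change)
```

Of the six requests, only two change: editors writing goes from Deny to Accept, and guests reading goes from Accept to Deny. The unchanged requests do not appear in the difference at all.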

Using the Difference

Depending on how we compute the difference, we can treat the difference essentially as a database. That is, it is a set of pairs of request and change-of-response. The database perspective is very productive, because we can do many things with this database:

Queries: What is the set of requests whose decisions go from Deny↦Accept? These are places where we might look for data leaks.
Views: What are all the requests whose decisions go from Accept↦Deny? These are all the requests that lost access. We might then want to perform queries over this smaller database: e.g., who are the entities whose requests fall in it?

And perhaps most surprisingly:

Verification: Confirm that as a result of this change, certain entities did not gain access.

In our experience, administrators who would not be able to describe properties of their system can define properties of their changes. Indeed, classical verification even has a name for some such properties: they’re called “frame conditions”, and mean, in effect, “and nothing else changed”. Here, we can actually check for these. It’s worth noting that these properties are not true of either the old or new systems! For instance, certain individuals may have had privileges before and will have them after the alteration; all we’re checking is that their privileges did not change.
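The database view can be sketched in a few lines. The difference below is a made-up literal, and the helper names are our own; the point is only that queries, views, and change properties are all simple operations over the same set of pairs.

```python
# A hypothetical difference: (entity, resource) -> (old decision, new decision)
diff = {
    ("alice", "grades"): ("Deny", "Accept"),
    ("bob", "grades"): ("Accept", "Deny"),
    ("bob", "roster"): ("Accept", "Deny"),
}

def query(diff, old, new):
    """Query: the set of requests whose decision went from `old` to `new`."""
    return {req for req, change in diff.items() if change == (old, new)}

def view(diff, old, new):
    """View: the smaller database restricted to one kind of change."""
    return {req: change for req, change in diff.items() if change == (old, new)}

def did_not_gain_access(diff, entities):
    """Verification: none of `entities` has a request that flipped from Deny to Accept."""
    return all(entity not in entities for entity, _ in query(diff, "Deny", "Accept"))

gained = query(diff, "Deny", "Accept")              # where we might look for leaks
lost = view(diff, "Accept", "Deny")                 # requests that lost access
entities_that_lost = {entity for entity, _ in lost} # a query over the view

print(did_not_gain_access(diff, {"bob"}))    # True
print(did_not_gain_access(diff, {"alice"}))  # False
```

Note that `did_not_gain_access` is a property of the change, not of either policy: it says nothing about what bob could do before or can do now.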

Uses

Having a general “semantic difference engine”, and being able to query it (interactively), is very powerful. We can perform all the operations we have described above. We can use it to check the consequences of an intended edit. In some rare cases, we expect the difference to be empty: e.g., when we refactor the policy to clean it up syntactically, but expect that the refactoring had no semantic impact. Finally, a semantic differencing engine is also useful as an oracle when performing mutation testing, as Martin and Xie demonstrated.

A Cognitive Perspective

We think there are a few different, useful framings of differential analysis.

One might sound like a truism: we’re not very good at thinking about the things that we didn’t think about. That is, when we make a change to the system, there was some intent behind the change; but it can be very difficult to determine all the consequences of that change. Our focus on the intended change can easily blind us to its other consequences. Overcoming these blind spots is very difficult for humans. A semantic differential analysis engine lays them bare.

Another is that we lack good tools to figure out what the properties of a system should be. Model-exploration tools (such as Alloy, or our derivative of it, Forge) are useful at prompting people to think about how they expect systems to behave and not behave. Differential output can also be such a spur: in articulating why something should or should not happen with a change, we learn more about the system itself.

Finally, it’s worth distinguishing the different conditions that lead to system changes. When working on features, we can often do so with some degree of flexibility. But when fixing bugs, we’re often in a hurry: we need to make a change to immediately block, say, a data leakage. If we’re being principled, we might add some tests to check for the intended behavior (and perhaps also to avoid regression); but at that moment, we are likely to be in an especially poor shape to think through unintended consequences. Differential analysis helps keep the fix for one problem from introducing another.

Readings

Here are some of our papers describing differential analysis (which we have also called “change-impact analysis” in the past):

For access-control policies: paper
For obligations: paper
For firewalls: paper
For SDNs: blog; paper

Forge: A Tool to Teach Formal Methods

2024-04-21T00:00:00+00:00

For the past decade we have been studying how best to get students into formal methods (FM). Our focus is not on the 10% or so of students who will automatically gravitate towards it, but on the “other 90%” who don’t view it as a fundamental part of their existence (or of the universe). In particular, we decided to infuse FM thinking into the students who go off to build systems. Hence the course, Logic for Systems.

The bulk of the course focuses on solver-based formal methods. In particular, we began by using Alloy. Alloy comes with numerous benefits: it feels like a programming language, it can “Run” code like an IDE, it can be used for both verification and state-exploration, it comes with a nice visualizer, and it allows lightweight exploration with gradual refinement.

Unfortunately, over the years we have also run into various issues with Alloy, a full catalog of which is in the paper. In response, we have built a new FM tool called Forge. Forge is distinguished by the following three features:

Rather than plunging students into the full complexity of Alloy’s language, we instead layer it into a series of language levels.
We use the Sterling visualizer by default, which you can think of as a better version of Alloy’s visualizer. But there’s much more! Sterling allows you to craft custom visualizations. We use this to create domain-specific visualizations. As we show in the paper, the default visualization can produce unhelpful, confusing, or even outright misleading images. Custom visualization takes care of these.
In the past, we have explored property-based testing as a way to get students on the road from programming to FM. In turn, we are asking the question, “What does testing look like in this FM setting?” Forge provides preliminary answers, with more to come.

Just to whet your appetite, here is an example of what a default Sterling output looks like (Alloy’s visualizer would produce something similar, with fewer distinct colors, making it arguably even harder to see):

Here’s what custom visualization shows:

See the difference?

For more details, see the paper. And please try out Forge!

Acknowledgements

We are grateful for support from the U.S. National Science Foundation (award #2208731).

Finding and Fixing Standard Misconceptions About Program Behavior

2024-04-12T00:00:00+00:00

A large number of modern languages — from Java and C# to Python and JavaScript to Racket and OCaml — share a common semantic core:

variables are lexically scoped
scope can be nested
evaluation is eager
evaluation is sequential (per “thread”)
variables are mutable, but first-order
structures (e.g., vectors/arrays and objects) are mutable, and first-class
functions can be higher-order, and close over lexical bindings
memory is managed automatically (e.g., garbage collection)

We call this the Standard Model of Languages (SMoL).
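Several of these features show up in even a very short program. The sketch below uses Python, one of the languages named above; the program itself is our own illustration, not one taken from the Tutor.

```python
def make_counter():
    count = 0                  # a mutable variable in a nested lexical scope
    def increment():
        nonlocal count         # the closure refers to the variable itself, not a copy of its value
        count += 1
        return count
    return increment

c1 = make_counter()
c2 = make_counter()            # each call to make_counter creates a fresh binding of count
first, second, other = c1(), c1(), c2()   # 1, 2, 1

v = [1, 2, 3]
alias = v                      # structures are first-class: both names refer to the same vector
alias[0] = 99                  # so the mutation is visible through v as well

x = 10
y = x                          # variables are first-order: y gets x's value, not x itself
x = 20                         # so y is still 10
```

The same program can be transliterated into the other SMoL languages with the same results.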

SMoL potentially has huge pedagogic benefit:

If students master SMoL, they have a good handle on the core of several of these languages.
Students may find it easier to port their knowledge between languages: instead of being lost in a sea of different syntax, they can find familiar signposts in the common semantic features. This may also make it easier to learn new languages.
The differences between the languages are thrown into sharper contrast.
Students can see that, by going beyond syntax, there are several big semantic ideas that underlie all these languages, many of which we consider “best practices” in programming language design.

We have therefore spent the past four years working on the pedagogy of SMoL:

Finding errors in the understanding of SMoL program behavior.
Finding the (mis)conceptions behind these errors.
Collating these into clean instruments that are easy to deploy.
Building a tutor to help students correct these misconceptions.
Validating all of the above.

We are now ready to present a checkpoint of this effort. We have distilled the essence of this work into a tool:

The SMoL Tutor

It identifies and tries to fix student misconceptions. The Tutor assumes users have a baseline of programming knowledge typically found after 1–2 courses: variables, assignments, structured values (like vectors/arrays), functions, and higher-order functions (lambdas). Unlike most tutors, instead of teaching these concepts, it investigates how well the user actually understands them. Wherever the user makes a mistake, the tutor uses an educational device called a refutation text to help them understand where they went wrong and to correct their conception. The Tutor lets the user switch between multiple syntaxes, both so they can work with whichever they find most comfortable (so that syntactic unfamiliarity or discomfort does not itself become a source of errors), and so they can see the semantic unity beneath these languages.

Along the way, to better classify student responses, we invent a concept called the misinterpreter. A misinterpreter is an intentionally incorrect interpreter. Concretely, for each misconception, we create a corresponding misinterpreter: one that has the same semantics as SMoL except on that one feature, where it implements the misconception instead of the correct concept. By making misconceptions executable, we can mechanically check whether student responses correspond to a misconception.
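To illustrate the idea, here is a minimal sketch of an interpreter for a tiny made-up language together with one misinterpreter. The language, the `copy_on_bind` flag, and the choice of misconception (binding a vector copies it) are assumptions for the example; the Tutor's actual misinterpreters are described in the paper.

```python
def run(program, copy_on_bind=False):
    """Interpret a tiny SMoL-like program and return the last shown value.
    With copy_on_bind=True this is a misinterpreter: binding a vector to a
    name copies the vector instead of aliasing it."""
    env = {}

    def ev(expr):
        kind = expr[0]
        if kind == "num":
            return expr[1]
        if kind == "vec":                 # ("vec", [expr, ...])
            return [ev(e) for e in expr[1]]
        if kind == "var":
            return env[expr[1]]
        if kind == "ref":                 # ("ref", vector_expr, index)
            return ev(expr[1])[expr[2]]
        raise ValueError(kind)

    result = None
    for stmt in program:
        kind = stmt[0]
        if kind == "let":                 # ("let", name, expr)
            value = ev(stmt[2])
            if copy_on_bind and isinstance(value, list):
                value = list(value)       # the misconception, made executable
            env[stmt[1]] = value
        elif kind == "vset":              # ("vset", name, index, expr)
            env[stmt[1]][stmt[2]] = ev(stmt[3])
        elif kind == "show":              # ("show", expr)
            result = ev(stmt[1])
        else:
            raise ValueError(kind)
    return result

# x = [1, 2]; y = x; y[0] = 99; show x[0]
PROGRAM = [
    ("let", "x", ("vec", [("num", 1), ("num", 2)])),
    ("let", "y", ("var", "x")),
    ("vset", "y", 0, ("num", 99)),
    ("show", ("ref", ("var", "x"), 0)),
]

def classify(answer):
    """Mechanically match a student's predicted output against each interpreter."""
    if answer == run(PROGRAM):
        return "correct"
    if answer == run(PROGRAM, copy_on_bind=True):
        return "copy-on-bind misconception"
    return "unclassified"
```

A student who predicts 99 is correct, a student who predicts 1 matches the misinterpreter, and any other answer is left unclassified for a human to inspect.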

There are many interesting lessons here:

Many of the problematic programs are likely to be startlingly simple to experts.
The combination of state, aliasing, and functions is complicated for students. (Yet most introductory courses plunge students into this maelstrom of topics without a second thought or care.)
Misinterpreters are an interesting concept in their own right, and are likely to have value independent of the above use.

In addition, we have not directly studied the following claims but believe they are well warranted based on observations from this work and from experience teaching and discussing programming languages:

In SMoL languages, local and top-level bindings behave the same as the binding induced by a function call. However, students often do not realize that these have a uniform semantics. In part this may be caused by our focus on the “call by” terminology, which focuses on calls (and makes them seem special). We believe it would be an improvement to replace these with “bind by”.
We also believe that the terms “call-by-value” and “call-by-reference” are so hopelessly muddled at this point (between students, instructors, blogs, the Web…) that finding better terminology overall would be helpful.
The way we informally talk about programming concepts (like “pass a variable”), and the syntactic choices our languages make (like return x), are almost certainly also sources of confusion. The former can naturally lead students to believe variables are being aliased, and the latter can lead them to believe the variable, rather than its value, is being returned.

For more details about the work, see the paper. The paper is based on an old version of the Tutor, where all programs were presented in parenthetical syntax. The Tutor now supports multiple syntaxes, so you don’t have to worry about being constrained by that. Indeed, it’s being used right now in a course that uses Scala 3.

Most of all, the SMoL Tutor is free to use! We welcome and encourage instructors of programming courses to consider using it — you may be surprised by the mistakes your students make on these seemingly very simple programs. But we also welcome learners of all stripes to give it a try!

Privacy-Respecting Type Error Telemetry at Scale

2024-02-02T00:00:00+00:00

Thesis: Programming languages would benefit hugely from telemetry. It would be extremely useful to know what people write, how they edit, what problems they encounter, etc. Problem: Observing programmers is very problematic. For students, it may cause anxiety and thereby hurt their learning (and grades). For professionals too it may cause anxiety, but it can also leak trade secrets.

One very partial solution is to perform these observations in controlled settings, such as a lab study. The downside is that it is hard to get diverse populations of users, it’s hard to retain them for very long, it’s hard to pay for these studies, etc. Furthermore, the activities they perform in a lab study may be very different from what they would do in real use: i.e., lab studies especially lack ecological validity when compared against many real-world programming settings.

We decided to instead study a large number of programmers doing their normal work—but in a privacy-respecting way. We collaborated with the Roblox Studio team on this project. Roblox is a widely-used platform for programming and deploying games, and it has lots of users of all kinds of ages and qualifications. They range from people writing their first programs to developers working for game studios building professional games.

In particular, we wanted to study a specific phenomenon: the uptake of types in Luau. Luau is an extension of Lua that powers Roblox. It supports classic Lua programs, but also lets programmers gradually add types to detect bugs at compile time. We specifically wanted to see what kinds of type errors people make when using Luau, with the goal of improving their experience and thereby hopefully increasing uptake of the language.

Privacy-respecting telemetry sounds wonderful in theory but is very thorny in practice. We want our telemetry to have several properties:

It must not transmit any private information. This may be more subtle than it sounds. Error messages can, for instance, contain the names of functions. But these names may contain a trade secret.
It must be fast on the client-side so that the programmer experience is not disrupted.
It must transmit only a small amount of data, so as not to overload the database servers.

(The latter two are not specific to privacy, but are necessary when working at scale.)
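To give a feel for how these three properties interact, here is a minimal sketch. It is not the scheme used in the paper: the message formats are hypothetical, and the approach (blank out everything in quotes, then send only counts) is just one conservative possibility.

```python
import re
from collections import Counter

# Assumption: names and types appear in single quotes inside error messages.
QUOTED = re.compile(r"'[^']*'")

def scrub(message: str) -> str:
    """Replace every quoted name with a placeholder, so no identifier leaves the client."""
    return QUOTED.sub("'_'", message)

def summarize(messages):
    """Aggregate scrubbed messages into counts, so only a small record is transmitted."""
    return Counter(scrub(m) for m in messages)

# Made-up errors; the function and variable names are the potential trade secrets.
errors = [
    "Type 'SecretBossAI' could not be converted into 'number'",
    "Type 'PlayerWallet' could not be converted into 'number'",
    "Unknown global 'launchCodes'",
]
report = summarize(errors)
for shape, count in report.items():
    print(count, shape)
```

The cost of this caution is visible immediately: the scrubbed report can no longer distinguish built-in types like `number` from user-defined ones, which is exactly the kind of question that becomes hard to pose.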

Our earlier, pioneering work on error message analysis was able to obtain a large amount of insight from logs. As a result of the above constraints, we cannot even pose many of the same questions in our setting.

Nevertheless, we were still able to learn several useful things about Luau. For more details, see the paper. But to us, this project is at least as interesting for the questions it inspires as for the particular solution or the insights gained from it. We hope to see many more languages incorporate privacy-respecting telemetry. (We’re pleased to see the Go team also thinking about these issues, as summarized in Russ Cox’s transparent telemetry notes. While there are some differences between our approaches, our overarching goals and constraints are very much in harmony.)

Profiling Programming Language Learning

2024-02-01T00:00:00+00:00

Programmers profile programs. They use profiling when they suspect a program is not being as effective (performant) as they would like. Profiling helps them track down what is working well and what needs more work, and how best to use their time to make the program more effective.

Programming language creators want their languages to be adopted. To that end, they create documentation, such as a book or similar written document. These are written with the best of intentions and following the best practices they know of. But are they effective? Are they engaging? How do they know? These are questions very much in the Socio-PLT mould proposed by Meyerovich and Rabkin.

In addition, their goal in writing such documentation is that people learn about their language. But many people do not enjoy a passive reading experience or, even if they do, they won’t learn much from it.

These two issues mesh together well. Books should be more interactive. Books should periodically stop readers and encourage them to think about what they’re reading:

If a listener nods his head when you’re explaining your program, wake him up.

—Alan Perlis

Books should give readers feedback on how well they have been reading so far. And authors should use this information to drive the content of the book.

We have been doing this in our Rust Book experiment. There, the focus was on a single topic: ownership. But in fact we have been doing this across the whole book, and in doing so we have learned a great deal.

We analyzed the trajectory of readers and showed that many drop out when faced with difficult language concepts like Rust’s ownership types. This suggests either revising how those concepts are presented, moving them later in the book, or splitting them into two parts, a gentle introduction that retains readers and a more detailed, more technical chapter later in the book once readers are more thoroughly invested.
We used both classical test theory and item response theory to analyze the characteristics of quiz questions. We found that better questions are more conceptual in nature, such as asking why a program does not compile versus whether a program compiles.
We performed 12 interventions into the book to help readers with difficult questions. We evaluated how well an intervention worked by comparing the performance of readers pre- and post-intervention on the question being targeted.

In other words, the profiler analogy holds: it helps us understand the behavior of “the program” (namely, users going through the book), suggests ways to improve it, and helps us analyze specific attempts at improvement and shows that they are indeed helpful.
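For readers unfamiliar with classical test theory, the sketch below computes two of its standard per-question statistics: difficulty (the fraction of readers who answer correctly) and discrimination (how well the question tracks performance on the rest of the quiz). The data is made up, and this is not the paper's analysis code.

```python
from statistics import mean, pstdev

def item_difficulty(responses, item):
    """Fraction of readers who answered `item` correctly (higher means easier)."""
    return mean(r[item] for r in responses)

def item_discrimination(responses, item):
    """Correlation between getting `item` right and the score on all other items."""
    right = [r[item] for r in responses]
    rest = [sum(r) - r[item] for r in responses]
    sx, sy = pstdev(right), pstdev(rest)
    if sx == 0 or sy == 0:
        return 0.0                      # no variation, so no information
    mx, my = mean(right), mean(rest)
    cov = mean((x - mx) * (y - my) for x, y in zip(right, rest))
    return cov / (sx * sy)

# Rows are readers, columns are questions; 1 = correct, 0 = incorrect (made-up data).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
for q in range(3):
    print(q, item_difficulty(responses, q), round(item_discrimination(responses, q), 2))
```

A question with low or negative discrimination is a candidate for revision: readers who do well overall are no more likely to get it right than readers who do poorly.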

However, we did all this with a book for which, over 13 months, 62,526 readers answered questions 1,140,202 times. This is of no help to new languages that might struggle to get more than dozens of users! Therefore, we sampled to simulate how well we would have fared on much smaller subsets of users. We show that for some of our analyses even 100 users would suffice, while others require around 1,000. These numbers, especially 100, are very much attainable for young languages.
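The general idea of such a simulation can be sketched as follows. This is an assumption-laden illustration rather than the procedure from the paper: the population is synthetic, and the statistic (fraction correct on one question) is the simplest one available.

```python
import random
from statistics import mean, pstdev

def subsample_spread(correct_flags, sample_size, trials=200, seed=0):
    """Re-estimate a question's fraction-correct on random subsets of readers,
    and return the standard deviation of those estimates."""
    rng = random.Random(seed)
    estimates = [mean(rng.sample(correct_flags, sample_size)) for _ in range(trials)]
    return pstdev(estimates)

# Made-up population: 10,000 readers, 60% of whom answer the question correctly.
population = [1] * 6000 + [0] * 4000
for n in (10, 100, 1000):
    print(n, round(subsample_spread(population, n), 3))
```

The spread shrinks as the subset grows. Deciding how small a spread is "good enough" depends on the analysis, which is why different analyses need different numbers of users.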

Languages are designed for adoption, but mere design is usually insufficient to enable it, as the Socio-PLT paper demonstrated. We hope work along these lines can help language designers get their interesting work into many more hands and minds.

For more details, see the paper. Most of all, you can do this with your materials, too! The library for adding these quizzes is available here.

The Examplar Project: A Summary

2024-01-01T00:00:00+00:00

For the past several years, we have worked on a project called Examplar. This article summarizes the goals and methods of the project and provides pointers to more detailed articles describing it.

Context

When faced with a programming problem, computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission, depriving them of the intended learning goal of the assignment.

How can we help them with this? A common practice, used across disciplines, is to tell students, “explain the problem in your own words”. This is a fine strategy, except it demands far too much of the student. Any educator who has done this knows that most students rightly stumble through this exercise, usually because they don’t have any better words than are already in the problem statement. So what was meant to be a comprehension exercise becomes a literary one; even if they can restate the problem very articulately, that may reflect verbal skill rather than good understanding. And for complex problems, the whole exercise is somewhat futile. It’s all made even more difficult when students are not working in their native language.

So we have the kernel of a good idea—asking students to read back their understanding—but words are a poor medium for it.

Examples

Our idea is that writing examples, using the syntax of test cases, is a great way to express understanding:

Examples are concrete.
It’s hard to be vague.
Difficulty writing down examples is usually indicative of broader difficulties with the problem, and a great, concrete way to initiate a discussion with course staff.
It gets a head-start on writing tests, which too many computing curricula undervalue (if they tackle it at all).

Best of all, because the examples are executable, they can be run against implementations so that students get immediate feedback.

Types of Feedback

We want to give two kinds of feedback:

Correctness: Are they even correct? Do they match the problem specification?
Thoroughness: How much of the problem space do they cover? Do they dodge misconceptions?

Consider (for example!) the median function. Here are two examples, in the syntax of Pyret:

check:
  median([list: 1, 2, 3]) is 2
  median([list: 1, 3, 5]) is 3
end

These are both correct, as running them against a correct implementation of median will confirm, but are they thorough?

A student could, for instance, easily mistake median for mean. Note that the two examples above do not distinguish between the two functions! So giving students a thumbs-up at this point may still send them down the wrong path: they haven’t expressed enough of an understanding of the problem.
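
The same point can be checked mechanically. The following is a small Python sketch (not Pyret, and not Examplar's own code) that uses the standard library's `statistics.median` and `statistics.mean` as stand-ins for the two functions; `passes_all` is a hypothetical helper.

```python
from statistics import mean, median

# The two examples from the text, as (input, expected output) pairs.
examples = [([1, 2, 3], 2), ([1, 3, 5], 3)]

def passes_all(implementation, examples):
    """Return True when every example agrees with the given implementation."""
    return all(implementation(xs) == expected for xs, expected in examples)

print(passes_all(median, examples))  # True: the examples are correct
print(passes_all(mean, examples))    # True: but they also accept mean
```

Both calls succeed, so these examples alone cannot tell a student (or a grader) which of the two functions the student has in mind.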

Evaluating Examples

For this reason, Examplar runs each example against multiple implementations. One is a correct implementation (which we call the wheat). (For technical reasons, it can be useful to have more than one correct implementation; see more below.) There are also several buggy implementations (called chaffs). Each example is first run against the wheat, to make sure it conforms to the problem specification. It is then run against each of the chaffs. Here’s what a recent version looks like:

Every example is a classifier: its job is to classify a program as correct or incorrect, i.e., to separate the wheat from the chaff. Of course, a particular buggy implementation may not be buggy in a way that a particular example catches. But across the board, the collection of examples should do a fairly good job of catching the buggy implementations.

Thus, for instance, one of our buggy implementations of median would be mean. Because the two examples above are consistent with mean as well, they would (incorrectly) pass mean instead of signaling an error. If we had no other examples in our suite, we would fail to catch mean as buggy. That reflects directly as “the student has not yet confirmed that they understand the difference between the two functions”. We would want students to add examples like

  median([list: 1, 3, 7]) is 3

that pass median but not mean to demonstrate that they have that understanding.
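
The wheat-then-chaff evaluation described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Examplar's implementation: `evaluate` is a hypothetical function, the wheat is the standard library's `statistics.median`, and the only chaff is `statistics.mean`.

```python
from statistics import mean, median

def evaluate(examples, wheat, chaffs):
    """Check examples against the wheat, then report which chaffs they catch."""
    # Validity: every example must agree with the correct implementation.
    invalid = [(xs, expected) for xs, expected in examples if wheat(xs) != expected]
    if invalid:
        return {"valid": False, "invalid": invalid, "caught": []}
    # Thoroughness: a chaff is caught when at least one example disagrees with it.
    caught = [name for name, chaff in chaffs.items()
              if any(chaff(xs) != expected for xs, expected in examples)]
    return {"valid": True, "invalid": [], "caught": caught}

chaffs = {"mean": mean}
weak = [([1, 2, 3], 2), ([1, 3, 5], 3)]
strong = weak + [([1, 3, 7], 3)]

print(evaluate(weak, median, chaffs)["caught"])    # []
print(evaluate(strong, median, chaffs)["caught"])  # ['mean']
```

The third example is what separates the two suites: the mean of 1, 3, 7 is not 3, so the `mean` chaff fails it.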

Answering Questions

Examplar is also useful as a “24 hour TA”. Consider this example:

  median([list: 1, 2, 3, 4]) is ???

What is the answer? There are three possible answers: the left-median (2), right-median (3), and mean-median (2.5). A student could post on a forum and wait for a course staffer to read and answer. Or they can simply formulate the question as a test: e.g.,

  median([list: 1, 2, 3, 4]) is 2.5

One of these three will pass the wheat. That tells the student the definition being used for this course, which may not have been fully specified in the problem statement. Similarly:

  median([list: ]) is ???  # the empty list

Indeed, we see students coming to course staff with questions like, “I see that Examplar said that …, and I wanted to know why this is the answer”, which is a fantastic kind of question to hear.
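
As a rough Python illustration of "asking the wheat", suppose (purely as an assumption) that the course's wheat behaves like `statistics.median`, which averages the two middle elements. The hypothetical helper `ask` tries each candidate answer as an example and keeps the ones the wheat accepts.

```python
from statistics import median

wheat = median  # assumption: this course uses the mean-median

def ask(xs, candidates):
    """Return the candidate answers that the wheat accepts for input xs."""
    return [c for c in candidates if wheat(xs) == c]

# Left-median, right-median, and mean-median of [1, 2, 3, 4].
print(ask([1, 2, 3, 4], [2, 3, 2.5]))  # [2.5]
```

With a different wheat (say, one that picks the left-median), the same question would come back with a different answer, which is exactly the information the student is after.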

Whence Chaffs?

It’s easy to see where to get the wheat: it’s just a correct implementation of the problem. But how do we get chaffs?

An astute reader will have noticed that we are practicing a form of mutation testing. Therefore, it might be tempting to use mutation testing libraries to generate chaffs. This would be a mistake because it misunderstands the point of Examplar.

We want students to use Examplar before they start programming, and as a warm-up activity to get their minds into the right problem space. That is not the time to be developing a test suite so extensive that it can capture every strange kind of error that might arise. Rather, we think of Examplar as performing what we call conceptual mutation testing: we only want to make sure they have the right conception of the problem, and avoid misconceptions about it. Therefore, chaffs should correspond to high-level conceptual mistakes (like confusing median and mean), not low-level programming errors.

Whence Misconceptions?

There are many ways to find out what misconceptions students have. One is by studying the errors we ourselves make while formulating or solving the problem. Another is by seeing what kinds of questions they ask course staff and what corrections we need to make. But there’s one more interesting and subtle source: Examplar itself!

Remember how we said we want examples to first pass the wheat? The failing ones are obviously…well, they’re wrong, but they may be wrong for an interesting reason. For instance, suppose we’ve defined median to produce the mean-median. Now, if a student writes

  median([list: 1, 2, 3, 4]) is 3

they are essentially expressing their belief that they need to solve the right-median problem. Thus, by harvesting these “errors”, filtering, and then clustering them, we can determine what misconceptions students have because they told us—in their own words!
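
Here is a minimal sketch of the harvesting idea, assuming a mean-median wheat and a hand-picked, hypothetical set of alternative interpretations. The actual filtering and clustering used in the project are described in the linked papers; this only shows the shape of the computation.

```python
from collections import defaultdict
from statistics import mean, median, median_high, median_low

wheat = median  # assumption: the assignment defines median as the mean-median

# Hypothetical alternative readings of the problem that a student might hold.
interpretations = {
    "left-median": median_low,
    "right-median": median_high,
    "mean": mean,
}

def cluster_failures(examples):
    """Group wheat-failing examples by the interpretations they agree with."""
    clusters = defaultdict(list)
    for xs, expected in examples:
        if wheat(xs) == expected:
            continue  # passes the wheat, so it tells us nothing about misconceptions
        matches = tuple(name for name, f in interpretations.items()
                        if f(xs) == expected)
        clusters[matches or ("unexplained",)].append((xs, expected))
    return dict(clusters)

submitted = [([1, 2, 3, 4], 3), ([1, 2, 3, 4], 2), ([5, 6], 6),
             ([1, 2, 3, 4], 7), ([1, 3, 7], 3)]
for key, group in cluster_failures(submitted).items():
    print(key, group)
```

Failures that land in the "unexplained" group are candidates for typos or for misconceptions nobody has named yet.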

Readings

Why you might want more than one wheat: paper
Introducing Examplar: blog; paper
Student use without coercion (in other words, gamified interfaces help…sometimes a bit too much): blog; paper
What help do students need that Examplar can’t provide? blog; paper
Turning wheat failures into misconceptions: blog; paper
From misconceptions to chaffs: blog; paper

But this work has much earlier origins:

How to Design Programs has students write tests before programs. However, these tests are inert (i.e., there’s nothing to run them against), so students see little value in doing so. Examplar eliminates this inertness, and goes further.
Before Examplar, we had students provide peer-review on tests, and found it had several benefits: paper. We also built a (no longer maintained) programming environment to support peer review: paper.
One thing we learned is that these test suites can grow too large to generate useful feedback. This caused us to focus on the essential test cases, out of which the idea of examples (as opposed to tests) evolved: paper.

A Core Calculus for Documents

2023-12-28T00:00:00+00:00

Document languages like Markdown, LaTeX, PHP, and Liquid are widely used to generate digital documents like PDFs and web pages. Document languages often come with programming features like variables and macros to help authors write complex documents. However, these document programming features are often designed or implemented in problematic ways, especially by the standards of a modern programming language. For example:

LaTeX’s macro system is not hygienic, which can cause macro arguments to conflict with a macro body.
PHP’s templating system relies on mutating a global output buffer, which means document fragments are not first-class values.
Liquid’s variable system provides a small and fixed set of data types, which limits the expressiveness of computations.

These problems could all be addressed in more carefully designed document languages (we are personally fans of Scribble and Typst). But those languages, too, have their own failure modes. Therein lies a deeper problem: there are no theoretical tools for reasoning about the design of a document language. Programming language theorists can use the lambda calculus to reason about the design of general-purpose programming languages. No such formal model exists for document languages.

Our work addresses this issue by providing a document calculus, or a formal model of the programmatic aspects of document languages. We designed the document calculus in a series of levels that each correspond to a family of existing document languages. Each level consists of a domain, or the thing produced by the language, and a constructor, or the feature used to construct elements of the domain. The levels are summarized in this table:

Domain	Constructor	Example Languages	Example Syntax
String	Literal	Text files, quoted strings	`"Hello World"`
	Program	PLs with string APIs, such as Javascript	`"Hello" + " World"`
	Template Literal	C `printf`, Python f-strings, Javascript template literals, Perl interpolated strings	let world = "World"; `Hello ${world}`
	Template Program	C preprocessor, PHP, LaTeX, Jinja (Python), Liquid (Ruby), Handlebars (JS)	`{% set world = "World" %} Hello {{ world }}`
Article	Literal	CommonMark Markdown, Pandoc Markdown, HTML, XML	`- Hello World`
	Program	PLs with document APIs, such as Javascript	`var ul = document.createElement("ul"); // ...`
	Template Literal	JSX Javascript, Scala 2, VB.NET, Scribble Racket, MDX Markdown, Lisp quasiquotes	`@(define world "World") @itemlist{@item{ Hello @bold{@world}}}`
	Template Program	Typst, Razor C#, Svelte Javascript, Markdoc Markdown	`#let world = [World] - Hello #world`

The document calculus can inform the design of document languages in a few ways:

The levels form a taxonomy which can help designers identify different points in the document language design space.
The calculus provides a reference semantics for a clean way to desugar key features like variables and loops.
The type system shows how to type-check a document prior to desugaring, including a proof of syntactic type safety.

For more details, see the paper.

Observations on the Design of Program Planning Notations for Students

2023-12-27T00:00:00+00:00

In two recent projects we’ve tried to make progress on the long-dormant topic of teaching students how to plan programs. Concretely, we chose higher-order functions as our driving metaphor to address the problem of “What language shall we use to express plans?” We showed that this was a good medium for students. We also built some quite nice tool support atop Snap!. Finally, we were making progress on this long open issue!

Not so fast.

We tried to replicate our previous finding with a new population of students and somewhat (but not entirely) different problems. It didn’t work well at all. Students made extensive complaints about the tooling and, when given a choice, voted with their feet by not using it.

We then tried again, allowing them freedom in what notation they used, but suggesting two: one was diagrammatic (essentially representing dataflow), and the other was linear prose akin to a todo-list or recipe. Students largely chose the latter, and also did a better job with planning.

Overall, this is a sobering result. It diminishes some of our earlier success. At the same time, it sheds more light on the notations students prefer. In particular, it returns to our earlier problem: planning needs a vocabulary, and we are still far from establishing one that students find comfortable and can use successfully. But it also highlights deeper issues, such as the need to better support students with composition. Critically, composition serves as a bridge between more plan-oriented students and bricoleurs, making it especially worthy of more study, no matter your position on how students should or do design programs.

For more details, see the paper.

Conceptual Mutation Testing

2023-10-31T00:00:00+00:00

Here’s a summary of the full arc, including later work, of the Examplar project.

The Examplar system is designed to solve the documented phenomenon that students often misunderstand a programming problem statement and hence “solve” the wrong problem. It does so by asking students to begin by writing input/output examples of program behavior, and evaluating them against correct and buggy implementations of the solution. Students refine and demonstrate their understanding of the problem by writing examples that correctly classify these implementations.

Student-authored input-output examples have to both be consistent with the assignment, and also catch (not be consistent with) buggy implementations.

It is, however, very difficult to come up with good buggy candidates. These programs must correspond to the problem misconceptions students are most likely to have. Students can, otherwise, end up spending too much time catching them, and not enough time on the actual programming task. Additionally, a small number of very effective buggy implementations is far more useful than either a large number or ineffective ones (much less both).

Our previous research has shown that student-authored examples that fail the correct implementation often correspond to student misconceptions. Buggy implementations based on these misconceptions circumvent many of the pitfalls of expert-generated equivalents (most notably the ‘expert blind spot’). That work, however, leaves unanswered the crucial question of how to operationalize class-sourced misconceptions. Even a modestly sized class can generate thousands of failing examples per assignment. It takes a huge amount of manual effort, expertise, and time to extract misconceptions from this sea of failing examples.

The key is to cluster these failing examples. The obvious clustering method – syntactic – fails miserably: small syntactic differences can result in large semantic differences, and vice versa (as the paper shows). Instead, we need a clustering technique that is based on the semantics of the problem.

This paper instead presents a conceptual clustering technique based on key characteristics of each programming assignment. These clusters dramatically shrink the space that must be examined by course staff, and naturally suggest techniques for choosing buggy implementation suites. We demonstrate that these curated buggy implementations better reflect student misunderstandings than those generated purely by course staff. Finally, the paper suggests further avenues for operationalizing student misconceptions, including the generation of targeted hints.

You can learn more about the work from the paper.

Generating Programs Trivially: Student Use of Large Language Models

2023-09-19T00:00:00+00:00

The advent of large language models like GPT-3 has led to growing concern from educators about how these models can be used and abused by students in order to help with their homework. In computer science, much of this concern centers on how LLMs automatically generate programs in response to textual prompts. Some institutions have gone as far as instituting wholesale bans on the use of the tool. Despite all the alarm, however, little is known about whether and how students actually use these tools.

In order to better understand the issue, we gave students in an upper-level formal methods course access to GPT-3 via a Visual Studio Code extension, and explicitly granted them permission to use the tool for course work. In order to mitigate any equity issues around access, we allocated $2500 in OpenAI credit for the course, enabling free access to the latest and greatest OpenAI models.

Can you guess the total dollar value of OpenAI credit used by students?

We then analyzed the outcomes of this intervention, how and why students actually did and did not use the LLM.

Which of these graphs do you think best represents student use of GPT models over the semester?

When surveyed, students overwhelmingly expressed concerns about using GPT to help with their homework. Dominant themes included:

Fear that using LLMs would detract from learning.
Unfamiliarity with LLMs and issues with output correctness.
Fear of breaking course rules, despite being granted explicit permission to use GPT.

Much ink has been spilt on the effect of LLMs in education. While our experiment focuses only on a single course offering, we believe it can help re-balance the public narrative about such tools. Student use of LLMs may be influenced by two opposing forces. On one hand, competition for jobs may cause students to feel they must have “perfect” transcripts, which can be aided by leaning on an LLM. On the other, students may realize that getting an attractive job is hard, and decide they need to learn more in order to pass interviews and perform well to retain their positions.

You can learn more about the work from the paper.

A Grounded Conceptual Model for Ownership Types in Rust

2023-09-17T00:00:00+00:00

Rust is establishing itself as the safe alternative to C and C++, making it an essential component for building a future software universe that is correct, reliable, and secure. Rust achieves this in part through the use of a sophisticated type system based on the concept of ownership. Unfortunately, ownership is unfamiliar to most conventionally-trained programmers. Surveys suggest that this central concept is also one of Rust’s most difficult, making it a chokepoint in software progress.

We have spent over a year studying how ownership is currently taught and in what ways this proves insufficient for programmers, and looking for ways to improve their understanding. When confronted with a program containing an ownership violation, we found that Rust learners could generally predict the surface reason given by the compiler for rejecting the program. However, learners often could not relate the surface reason to the underlying issues of memory safety and undefined behavior. This lack of understanding caused learners to struggle to idiomatically fix ownership errors.

To address this, we created a new conceptual model for Rust ownership, grounded in these studies. We then translated this model into two new visualizations: one to explain how the type-system works, the other to illustrate the impact on run-time behavior. Crucially, we configure the compiler to ignore borrow-checker errors. Through this, we are able to essentially run counterfactuals, and thereby illustrate the ensuing undefined behavior.

Here is an example of the type-system visualization:

And here is an example of the run-time visualization:

We incorporated these diagrams into an experimental version of The Rust Programming Language by Klabnik and Nichols. The authors graciously permitted us to create this fork and publicize it, and also provided a link to it from the official edition. As a result, we were able to test our tools on readers, and demonstrate that they do actually improve Rust learning.

The full details are in the paper. Our view is that the new tools are preliminary, and other researchers may come up with much better, more creative, and more effective versions of them. Rather, the main contribution is an understanding of how programmers do and don’t understand ownership, and in particular its relationship to undefined behavior. It is therefore possible that new pedagogies that make that connection clear may obviate the need for some of these tools entirely.

What Happens When Students Switch (Functional) Languages

2023-07-16T00:00:00+00:00

What happens when students learn a second programming language after having gotten comfortable with one? This was a question of some interest in the 1980s and 1990s, but interest in it diminished. Recent work by Ethel Tshukudu and her collaborators has revived interest in this question.

Unfortunately, none of this work has really considered the role of functional programming. This is especially worth considering in the framework that Tshukudu’s work lays out, which is to separate syntax and semantics. That is the issue we tackle.

Specifically, we try to study two conditions:

different syntax, similar semantics
similar syntax, different semantics

For the same semantics, any two sufficiently syntactically different functional languages would do. The parenthetical syntax of the Lisp family gives us a syntax that is clearly different from the infix syntaxes of most other languages. In our particular case, we use Racket and Pyret.

The second case is trickier. For a controlled lab study, one could do this with very controlled artificial languages. However, we are interested in student experiences, which require curricula and materials that made-up languages usually cannot provide.

Instead, we find a compromise. The Pyret syntax was inspired by that of Python, though it does have some differences. It comes with all the curricular support we need. Therefore, we can compare it against an imperative curriculum in Python.

You can read the details in the paper. The work is less interesting for its answers than for its setup. As a community we know very little about this topic. We hope the paper will inspire other educators both through the questions we have asked and the materials we have designed.

Typed-Untyped Interactions: A Comparative Analysis

2023-02-06T00:00:00+00:00

Dozens of languages today support gradual typing in some form or another. What lessons can the designer of a new language learn from their experiences?

As a starting point, let’s say that we want to maximize interoperability. For most types in the language, untyped code should be able to make a value that works for that type: type Number should accept untyped numbers, type List(Number) should accept untyped lists, and so on.

(This may seem like an obvious requirement for a type system that gets added on top of an untyped language. But there are good reasons to break it — see our earlier post on Static Python.)

The question now becomes: what sort of validation strategy should typed code use when an untyped value tries to cross a type boundary? For example, when a Number -> Number type receives an untyped function there are at least three viable options:

Wrap the function in a proxy to make sure its behavior matches the type.
Leave the function unwrapped, but put checks at its call sites.
Do nothing! Trust the function.
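
The three options can be sketched in Python, which has no built-in gradual type boundaries, so everything here is a hand-rolled illustration: `check_number`, `wrap`, `call_checked`, and `trust` are hypothetical names, not part of any real gradual typing system.

```python
def check_number(value, where):
    """Raise TypeError unless value is a number; otherwise return it."""
    if not isinstance(value, (int, float)):
        raise TypeError(f"{where}: expected a number, got {value!r}")
    return value

def wrap(f):
    """Option 1: a proxy that checks the argument and the result on every call."""
    def proxy(x):
        check_number(x, "argument")
        return check_number(f(x), "result")
    return proxy

def call_checked(f, x):
    """Option 2: leave f unwrapped; the typed call site checks what comes back."""
    return check_number(f(x), "call site")

def trust(f):
    """Option 3: do nothing and hope the untyped function behaves."""
    return f

def untyped_bad(x):
    # Untyped code that claims to be Number -> Number but returns a string.
    return "oops"

for label, attempt in [("wrap", lambda: wrap(untyped_bad)(1)),
                       ("call site", lambda: call_checked(untyped_bad, 1)),
                       ("trust", lambda: trust(untyped_bad)(1))]:
    try:
        print(label, "->", repr(attempt()))
    except TypeError as err:
        print(label, "-> TypeError:", err)
```

One difference this sketch hints at: the proxy travels with the function wherever it goes, whereas call-site checks only protect the typed code that performs them.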

Both the research literature and the realm of production-ready languages are full of ideas on this front. What was unclear (until now!) is how the validation strategies implicit across the landscape relate to one another.

With this paper, we introduce a toolbox of formal properties to pin down the guarantees that gradual types can provide:

type soundness (generalized) for local type guarantees,
complete monitoring for compositional type guarantees,
blame soundness to judge the accuracy of validation errors, and
blame completeness to judge the precision of validation errors.

We also use an error preorder to rank strategies by how early they can detect a type validation mismatch.

The upshot of all this is a positive characterization of the landscape. There is no clear winner because there are other factors at play, such as performance costs and the usefulness of types for debugging. What we do gain is a solid theoretical foundation to inform language designs.

It’s all in the paper.

Little Tricky Logics

2022-11-05T00:00:00+00:00

We also have followup work that continues to explore LTL and now also studies finite-trace LTL.

LTL (Linear Temporal Logic) has long been central in computer-aided verification and synthesis. Lately, it’s also been making significant inroads into areas like planning for robots. LTL is powerful, beautiful, and concise. What’s not to love?

However, any logic used in these settings must also satisfy the central goal of being understandable by its users. Especially in a field like synthesis, there is no second line of defense: a synthesizer does exactly what the specification says. If the specification is wrong, the output will be wrong in the same way.

Therefore, we need to understand how people comprehend these logics. Unfortunately, the human factors of logics have seen almost no attention in the research community. Indeed, if anything, the literature is rife with claims about what is “easy” or “intuitive” without any rigorous justification for such claims.

With this paper, we hope to change that conversation. We bring to bear on this problem several techniques from diverse areas—but primarily from education and other social sciences (with tooling provided by computer science)—to understand the misconceptions people have with logics. Misconceptions are not merely mistakes; they are validated understanding difficulties (i.e., having the wrong concept), and hence demand much greater attention. We are especially inspired by work in physics education on the creation of concept inventories, which are validated instruments for rapidly identifying misconceptions in a population, and take steps towards the creation of one.

Concretely, we focus on LTL (given its widespread use) and study the problem of LTL understanding from three different perspectives:

LTL to English: Given an LTL formula, can a reader accurately translate it into English? This is similar to what a person does when reading a specification, e.g., when code-reviewing work or studying a paper.
English to LTL: Given an English statement, can a reader accurately express it in LTL? This skill is essential for specification and verification.

Furthermore, “understanding LTL” needs to be divided into two parts: syntax and semantics. Therefore, we study a third issue:

Trace satisfaction: Given an LTL formula and a trace (sequence of states), can a reader accurately label the trace as satisfying or violating? Such questions directly test knowledge of LTL semantics.
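
To show what a trace-satisfaction question asks of the reader, here is a small Python evaluator. It is a sketch under assumptions, not the instrument used in the studies: formulas are nested tuples, atomic propositions are strings, and the infinite trace is given as a "lasso", i.e., a finite list of states that loops back to index `loop_start` forever.

```python
def holds(formula, trace, loop_start, i=0):
    """Evaluate an LTL formula at position i of a lasso-shaped trace.

    trace is a list of sets of true propositions; after the last state,
    the trace continues from index loop_start and repeats forever.
    """
    n = len(trace)
    if isinstance(formula, str):  # atomic proposition
        return formula in trace[i]
    op, *args = formula

    def ev(f, j):
        return holds(f, trace, loop_start, j)

    def succ(j):
        return j + 1 if j + 1 < n else loop_start

    # Every position that occurs at or after i on the infinite trace.
    reachable = range(min(i, loop_start), n)
    if op == "not":
        return not ev(args[0], i)
    if op == "and":
        return ev(args[0], i) and ev(args[1], i)
    if op == "or":
        return ev(args[0], i) or ev(args[1], i)
    if op == "X":  # next
        return ev(args[0], succ(i))
    if op == "F":  # eventually
        return any(ev(args[0], j) for j in reachable)
    if op == "G":  # always
        return all(ev(args[0], j) for j in reachable)
    if op == "U":  # until: walk forward; n steps visit every reachable position
        j = i
        for _ in range(n):
            if ev(args[1], j):
                return True
            if not ev(args[0], j):
                return False
            j = succ(j)
        return False
    raise ValueError(f"unknown operator: {op}")

# The trace: p holds first, then (nothing, q) repeats forever.
trace = [{"p"}, set(), {"q"}]
print(holds(("G", ("F", "q")), trace, 1))  # True: q recurs forever
print(holds(("F", ("G", "q")), trace, 1))  # False: q never holds permanently
print(holds(("U", "p", "q"), trace, 1))    # False: p stops before q starts
```

The first two formulas differ only in the order of their operators, which is the kind of small change that trace questions are well suited to probing.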

Our studies were conducted over multiple years, with multiple audiences, and using multiple methods, with both formative and confirmatory phases. The net result is that we find numerous misconceptions in the understanding of LTL in all three categories. Notably, our studies are based on small formulas and traces, so we expect the set of issues will only grow as the instruments contain larger artifacts.

Ultimately, in addition to

finding concrete misconceptions,

we also:

create a codebook of misconceptions that LTL users have, and
provide instruments for finding these misconceptions.

We believe all three will be of immediate use to different communities, such as students, educators, tool-builders, and designers of new logic-based languages.

For more details, see the paper.

Identifying Problem Misconceptions

2022-10-15T00:00:00+00:00

Here’s a summary of the full arc, including later work, of the Examplar project.

Our recent work is built on the documented research that students often misunderstand a programming problem statement and hence “solve” the wrong problem. This not only creates frustration and wastes time, it also robs them of whatever learning objective motivated the task.

To address this, the Examplar system asks students to first write examples. These examples are evaluated against wheats (correct implementations) and chaffs (buggy implementations). Examples must pass the wheat, and correctly identify as wrong as many chaffs as possible. Prior work explores this and shows that it is quite effective.

However, there’s a problem with chaffs. Students can end up spending too much time catching them, and not enough time on the actual programming task. Therefore, you want chaffs that correspond to the problem misconceptions students are most likely to have. Having a small number of very effective chaffs is far more useful than either a large number or ineffective ones (much less both). But the open question has always been, how do we obtain chaffs?

Previously, chaffs were created by hand by experts. This was problematic because it forces experts to imagine the kinds of problems students might have; this is not only hard, it is bound to run into expert blind spots. What other method do we have?

This work is based on a very simple, clever observation: any time an example fails a wheat, it may correspond to a student misconception. Of course, not all wheat failures are misconceptions! It could be a typo, it could be a basic logical error, or it could even be an attempt to game the wheat-chaff system. Do we know in what ratio these occur, and can we use the ones that are misconceptions?

This paper makes two main contributions:

It shows that many wheat failures really are misconceptions.
It uses these misconceptions to formulate new chaffs, and shows that they compare very favorably to expert-generated chaffs.

Furthermore, the work spans two kinds of courses: one is an accelerated introductory programming class, while the other is an upper-level formal methods course. We show that there is value in both settings.

This is just a first step in this direction; a lot of manual work went into this research, which needs to be automated; we also need to measure the direct impact on students. But it’s a very promising direction in a few ways:

It presents a novel method for finding misconceptions.
It naturally works around expert blind-spots.
With more automation, it can be made lightweight.

In particular, if we can make it lightweight, we can apply it to settings—even individual homework problems—that also manifest misconceptions that need fixing, but could never afford heavyweight concept-inventory-like methods for identifying them.

You can learn more about the work from the paper.

Performance Preconceptions

2022-10-10T00:00:00+00:00

What do computer science students entering post-secondary (collegiate) education think “performance” means?

Who or what shapes these views?

How accurate are these views?

And how correctable are their mistakes?

These questions are not merely an idle curiosity. How students perceive performance impacts how they think about program design (e.g., they may think a particular design is better but still not use it because they think it’s less performant). It also affects their receptiveness to new programming languages and styles (“paradigms”) of programming. Anecdotally, we have seen exactly these phenomena at play in our courses.

We are especially interested in students who have had prior computer science (in secondary school), such as students taking the AP Computer Science exam in the US. These students often have significant prior computing, but we have studied relatively little about the downstream consequences of these courses. Indeed, performance considerations are manifest in material as early as the age 4–8 curriculum from Code.org!

This paper takes a first step in examining these issues. We find that students have high confidence in incorrect answers on material they should have little confidence about. To address these problems, we try multiple known techniques from the psychology and education literature — the Illusion of Explanatory Depth, and Refutation Texts — that have been found to work in several other domains. We see that they have little impact here.

This work has numerous potential confounds based on the study design and location of performance. Therefore, we don’t view this as a definitive result, but rather as a spur to start an urgently-needed conversation about factors that affect post-secondary computer science education. Concretely, as we discuss in the discussion sections, we also believe there is very little we know about how students conceive of “performance”, and question whether our classical methods for approaching it are effective.

The paper is split into a short paper, which summarizes the results, and an extensive appendix, which provides all the details and justifies the summary. Both are available online.

Structural Versus Pipeline Composition of Higher-Order Functions

2022-08-16T00:00:00+00:00

Building on our prior work on behavioral conceptions of higher-order functions (HOFs), we have been looking now at their composition. In designing a study, we kept running into tricky problems in designing HOF composition problems. Eventually, we set out to study that question directly.

We’re going to give you a quiz. Imagine you have a set of standard HOFs (map, filter, sort, andmap, ormap, take-while). You are given two ways to think about composing them (where “funarg” is short for the parameter that is a function):

Type A: HOF_A(<some funarg>, HOF_B(<some funarg>, L))

Type B: HOF_C((lambda (inner) HOF_D(<some funarg>, inner)), L)

Which of these would you consider “easier” for students to understand and use?
How would you rate their relative expressive power in terms of problems they can solve?

Don’t go on until you’ve committed to answers to these questions.

Rather than A and B, here are better names: we’ll refer to A as pipeline, because it corresponds to a traditional data-processing/Unix pipeline composition (HOF_B L | HOF_A), and we’ll refer to B as structural. If you’re like many other people we’ve asked, you likely think that pipeline is easier, but you’re less certain about how to answer the second question.
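To make the two shapes concrete, here is a minimal Python sketch. The funargs (squaring, an evenness test) and the sample data are illustrative assumptions, not problems taken from the paper.

```python
# Pipeline (Type A): the output of one HOF feeds the next HOF.
def pipeline(numbers):
    evens = filter(lambda n: n % 2 == 0, numbers)   # plays the role of HOF_B
    return list(map(lambda n: n * n, evens))         # plays the role of HOF_A

# Structural (Type B): the inner HOF runs inside the funarg of the outer HOF,
# so the composition follows the nesting of the data (a list of lists).
def structural(rows):
    return list(map(lambda inner: list(filter(lambda n: n % 2 == 0, inner)), rows))

pipeline_result = pipeline([1, 2, 3, 4])               # [4, 16]
structural_result = structural([[1, 2], [3, 4, 6]])    # [[2], [4, 6]]
```

Note that the pipeline version consumes a flat list, while the structural version needs nested data for the inner HOF to have something to work on.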

Alright, now you’re ready to read the paper! We think the structural/pipeline distinction is similar to the structural/generative recursion distinction in HtDP, and similarly has consequences for how we order HOF composition in our pedagogy. We discuss all this, and more, in the document.

Plan Composition Using Higher-Order Functions

2022-07-09T00:00:00+00:00

There is a long history of wanting to examine planning in computing education research, but relatively little work on it. One problem you run into when trying to do this seriously is: “What language shall we use to express plans?” A lot hinges on this language.

The programming language itself is too low-level: there are too many administrative details that get in the way and might distract the student; failures may then reflect these distractions, not an inability to plan.
Plain English may be too high-level. It is difficult to give any useful (automated) feedback about, and it may also require too much interpretation. In particular, an expert may interpret student utterances in ways the student didn’t mean, thereby giving the student an OK signal when in fact the student is not on the right path.

Separately, in prior work, we looked at whether students are able to understand higher-order functions (HOFs) from a behavioral perspective: i.e., as atomic units of behavior without reference to their underlying implementation. For our population, we found that they generally did quite well.

You may now see how these dovetail. Once students have a behavioral understanding of individual HOFs, you can use them as a starting vocabulary for planning. Or to think in more mechanical terms, we want to study how well students understand the composition of HOFs. That is the subject of this work.

Concretely, we start by confirming our previous result—that they understand the building blocks—and can also articulate many of the features that we previously handed out to them. This latter step is important because any failures at composition may lie in their insufficiently rich understanding of the functions. Fortunately, we see that this is again not a problem with our population.

We then focus on the main question: can they compose these HOFs? We do this in two ways:

(1) We give them input-output examples and ask them to identify which compositions of functions would have produced those results. This is akin to having a dataset you need to transform and knowing what you would like the result to look like, and figuring out what steps will take it there.
(2) We give them programming problems to solve, and ask them to first provide high-level plans of their solutions.
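The first kind of task can be sketched in Python as follows. The candidate compositions, helper names, and example data here are hypothetical; they are not the items used in the study.

```python
def add_one(x):
    return x + 1

def is_even(x):
    return x % 2 == 0

# Hypothetical candidate compositions a student might choose between.
CANDIDATES = {
    "filter(is_even, map(add_one, L))": lambda L: list(filter(is_even, map(add_one, L))),
    "map(add_one, filter(is_even, L))": lambda L: list(map(add_one, filter(is_even, L))),
    "sorted(map(add_one, L))": lambda L: sorted(map(add_one, L)),
}

def matching(example_in, example_out):
    """Names of the candidates that turn example_in into example_out."""
    return [name for name, compose in CANDIDATES.items()
            if compose(example_in) == example_out]

answer = matching([1, 2, 3, 4], [3, 5])
# ['map(add_one, filter(is_even, L))']
```

A single input-output pair may match several compositions, so in practice one would want examples chosen to tell the candidates apart.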

What we find is that students don’t do superbly on (1), but do extremely well on (2). Indeed, our goal had been to study what changes between the planning and programming phase (e.g., if they planned incorrectly but programmed correctly; or vice versa), but our students unfortunately did too well on both to give us any useful data!

Of particular interest is how we got them to state plans. While HOFs are the “semantics”, we still need a “syntax” for writing them. Conventional textual programming has various bad affordances. Instead, we created a custom palette of operations in Snap!. In keeping with the point of this paper, the operations were HOFs. There are numerous advantages to this use of Snap!:

Drag-and-drop construction avoids getting bogged down in the vagaries of textual syntax.
Changing plans is much easier, because you can drag whole blocks and (again) not get caught up in messy textual details. This means students are hopefully more willing to change around their plans.
The planning operations focus on the operations we care about, and students can ignore irrelevant details.
Most subtly: the blanks can be filled in with text. That is, you get “operations on the outside, text on the inside”: at the point where things get too detailed, students can focus on presenting their ideas rather than on low-level details. This is, in other words, a hybrid of the two methods we suggested at the beginning.

Critically, these aren’t programs! Because of the text, they can’t be executed. But that’s okay! They’re only meant to help students think through their plans before starting to write the program. In particular, given students’ reluctance to change their programs much once they start coding, it seems especially important to give them a fluid medium—where switching costs are low—in which to plan things before they start to write a line of code. So one of the best things about this paper, beyond the main result, is actually our discovery of Snap!’s likely utility in this setting.

For more details, see the paper!

Towards a Notional Machine for Runtime Stacks and Scope

2022-07-07T00:00:00+00:00

Stacks are central to our understanding of program behavior; so is scope. These concepts become ever more important as ever more programming languages embrace concepts like closures and advanced control (like generators). Furthermore, stacks and scope interact in an interesting way, and these features really exercise their intersection.

Over the years we’ve seen students exhibit several problematic conceptions about stacks (and scope). For instance, consider a program like this:

def f(x):
  return g(x + 1)

def g(y):
  return y + x

f(3)

What is its value? You want an error: that x is not bound. But think about your canonical stack diagram for this program. You have a frame for g atop that for f, and you have been told that you “look down the stack”. (Or vice versa, depending on how your stacks grow.) So it’s very reasonable to conclude that this program produces 7, the result produced by dynamic scope.

We see students thinking exactly this.
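The contrast can be checked directly. The sketch below runs the program under Python's actual (static) scoping, then simulates the "look down the stack" reading with a hypothetical helper, lookup_dynamic, over hand-built frames. The helper and the frame representation are illustrative assumptions, not part of the paper's tooling.

```python
def f(x):
    return g(x + 1)

def g(y):
    return y + x  # x is neither a local of g nor a global, so this lookup fails

try:
    f(3)
    static_outcome = "no error"
except NameError:
    static_outcome = "NameError"

# Simulated dynamic scope: search frames from the most recent call downward.
def lookup_dynamic(name, stack):
    for frame in reversed(stack):
        if name in frame:
            return frame[name]
    raise NameError(name)

stack = [{"x": 3}, {"y": 4}]  # frame for f(3), with the frame for g(4) on top
dynamic_result = lookup_dynamic("y", stack) + lookup_dynamic("x", stack)  # 7
```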

Consider this program:

def f(x):
  return lambda y: x + y

p = f(3)

p(4)

This one, conversely, should produce 7. But students who have been taught a conventional notion of call-and-return assume that f’s stack frame has been removed after the call completed (correct!), so p(4) must result in an error that x is not bound.

We see students thinking exactly this, too.
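Running the program shows why the call-and-return picture alone is not enough: the function returned by f keeps the binding of x alive after f has returned. The sketch below also peeks at CPython's closure cells to show where the value lives.

```python
def f(x):
    return lambda y: x + y

p = f(3)

result = p(4)                                # 7
free_names = p.__code__.co_freevars          # ('x',)
captured = p.__closure__[0].cell_contents    # 3, still reachable after f returned
```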

The paper sets out to do several things.

First, we try to understand the conceptions of stacks that students have coming into an upper-level programming languages course. (It’s not great, y’all.)

Second, we create some tooling to help students learn more about stacks. More on that below. The tooling seems to work well for students who get some practice using it.

Third, we find that even after several rounds of direct instruction and practice, some misconceptions remain. In particular, students do not properly understand how environments chain to get scope right.

Fourth, in a class that had various interventions including interpreters, students did much better than in a class where students learned from interpreters alone. Though we love interpreters and think they have various valuable uses in programming languages education, our results make us question some of the community’s beliefs about the benefits of using interpreters. In particular, some notions of transfer that we would have liked to see do not occur. We therefore believe that the use of interpreters needs much more investigation.

As for the tooling: One of the things we learned from our initial study is that students simply do not have a standardized way of presenting stacks. What goes in them, and how, were all over the map. We conjecture there are many reasons: students mostly see stacks and are rarely asked to draw them; and when they do, they have no standard tools for doing so. So they invent various ad hoc notations, which in turn don’t necessarily reinforce all the aspects that a stack should represent.

We therefore created a small tool for drawing stacks. What we did was repurpose Snap! to create a palette of stack, environment, and heap blocks. It’s important to understand these aren’t runnable programs: these are just static representations of program states. But Snap! is fine with that. This gave us a consistent notation that we could use everywhere: in class, in the textbook, and in homeworks. The ability to make stacks very quickly with drag-and-drop was clearly convenient to students who gained experience with the tool, because many used it voluntarily; it was also a huge benefit for in-class instruction over a more conventional drawing tool. An unexpected success for block syntaxes!

For more details, see the paper.

Gradual Soundness: Lessons from Static Python

2022-06-28T00:00:00+00:00

It is now virtually a truism that every dynamic language adopts a gradual static type system. However, the space of gradual typing is vast, with numerous tradeoffs between expressiveness, efficiency, interoperability, migration effort, and more.

This work focuses on the Static Python language built by the Instagram team at Meta. Static Python is an interesting language in this space for several reasons:

It is designed to be sound.
It is meant to run fast.
It is reasonably expressive.

The static type system is a combination of what the literature calls concrete and transient types. Concrete types provide full soundness and low performance overhead, but impose nonlocal constraints. Transient types are sound in a shallow sense and easier to use; they help to bridge the gap between untyped code and typed concrete code. The net result is a language that is in active use, with high performance, inside Meta.
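As a rough illustration of the difference, consider the sketch below. It is not Static Python's implementation: shallow_check_list and IntList are hypothetical names, and the checks are ordinary runtime code written only to show what "shallow" versus "full" checking means.

```python
def shallow_check_list(value):
    """Transient-style check: only the outermost shape is inspected."""
    if not isinstance(value, list):
        raise TypeError("expected a list")
    return value

class IntList:
    """Concrete-style container: every element is checked on the way in."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.append(item)

    def append(self, item):
        if not isinstance(item, int):
            raise TypeError("expected an int element")
        self._items.append(item)

    def __getitem__(self, index):
        return self._items[index]

mixed = [1, "two", 3]

# The shallow check passes: the bad element inside the list goes unnoticed.
shallow_ok = shallow_check_list(mixed) is mixed

# The concrete container rejects the same data up front.
try:
    IntList(mixed)
    concrete_outcome = "accepted"
except TypeError:
    concrete_outcome = "rejected"
```

The price of the stronger guarantee is visible even in this toy: callers must build an IntList rather than pass a plain list, which is one way to read the "nonlocal constraints" mentioned above.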

Our work here is to both formalize and assess the language. We investigate the language’s soundness claims and report on its performance on both micro-benchmarks and in production systems. In particular, we find that the design holds up the intent of soundness well, but the act of modeling it uncovered several bugs (including one that produced a segmentation fault), all of which have now been fixed.

We believe Static Python is a particularly interesting language for the gradual typing community to study. It is not based on virtuoso theoretical advances; rather, it takes solid ideas (e.g., from Muehlboeck and Tate’s Nom language), combines them well, and pays attention to the various affordances needed by practicing programmers. It therefore offers useful advice to the designers of other gradually-typed languages, in particular those who confront large code-bases and would like incremental approaches to transition from dynamic safety to static soundness. You can read much more about this in the paper.

As an aside, this paper is a collaboration that was born entirely thanks to Twitter and most probably would never have occurred without it. An aggrieved-sounding post by Guido van Rossum led to an exchange between Carl Meyer at Meta and Shriram, which continued some time later here. Eventually they moved to Direct Messaging. Shriram pointed out that Ben Greenman could investigate this further, and from that, this collaboration was born.

Applying Cognitive Principles to Model-Finding Output

2022-04-26T00:00:00+00:00

Model-finders produce output to help users understand the specifications they have written. They therefore effectively make assumptions about how these will be processed cognitively, but are usually unaware that they are doing so. What if we apply known principles from cognitive science to try to improve the output of model-finders?

Model Finding and Specification Exploration

Model-finding is everywhere. SAT and SMT solvers are the canonical model-finders: given a logical specification, they generate a satisfying instance (a “model”) or report that it’s impossible. Their speed and generality have embedded them in numerous back-ends. They are also used directly for analysis and verification, e.g., through systems like Alloy.
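For intuition, a model-finder for tiny propositional specifications can be written by brute force. This is only a sketch of the idea; real SAT and SMT solvers use far more sophisticated search, and the find_models helper and the example specification are assumptions made for illustration.

```python
from itertools import product

def find_models(variables, spec):
    """Return every assignment (a dict of booleans) that satisfies spec."""
    models = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if spec(assignment):
            models.append(assignment)
    return models

# Specification: exactly one of a and b holds.
def exactly_one(m):
    return (m["a"] or m["b"]) and not (m["a"] and m["b"])

models = find_models(["a", "b"], exactly_one)
# [{'a': False, 'b': True}, {'a': True, 'b': False}]

# An unsatisfiable specification has no models at all.
no_models = find_models(["a"], lambda m: m["a"] and not m["a"])  # []
```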

One powerful modality enabled by tools like Alloy is the exploration of specifications. Usually, model-finders are used for verification: you have a specification and some properties about it, and a verifier tells you whether the properties are satisfied or not. However, we often don’t have properties; we just want to understand the consequences of a design. While a conventional verifier is useless in this setting, model-finders have no problem with it: they will generate models of the specification that show different possible ways in which it can be realized.

Presenting Exploration

The models generated by exploration (or even by verification, where they are typically counterexamples) can be presented in several ways. For many users, the most convenient output is visual. Here, for instance, is a typical image generated using the Sterling visualizer:

As of this writing, Alloy will let you sequentially view one model at a time.

Exploration for Understanding

The purpose of showing these models is to gain understanding. It is therefore reasonable to ask what forms of presentation would be most useful to enable the most understanding. In earlier work we studied details of how each model is shown. That work is orthogonal to what we do here.

Here, we are interested in how many models, and of what kind, should be displayed. We draw on a rich body of literature in perceptual psychology going back to seminal work by Gibson and Gibson in 1955. A long line of work since then has explored several dimensions of this, resulting in a modern understanding of contrasting cases. In this work, you don’t show a single result; rather, you show a set of similar examples, to better help people build models of what they are seeing. Since our goal is to help people understand a specification through visual output, it was natural to ask whether any of this literature could help in our setting.

Our Study

We concretely studied numerous experimental conditions involving different kinds of contrasting cases, where we show multiple models on screen at once. Critically, we looked at the use of both positive and negative models. Positive models are what you expect: models of the specification. In contrast, “negative” models are ones that don’t model the specification.

There can, of course, be an infinite number of negative models, most of which are of no use whatsoever: if I write a specification of a leader-election protocol, a whale or a sandwich are legitimate negative models. What we are interested in is “near miss” models, i.e., ones that could almost have been models but for a small difference. Our theory was that showing these models would help a user better understand the “space” of their model. (In this, we were inspired by prior work by Montaghami and Rayside.)
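One simple way to make "near miss" precise, for a toy propositional specification, is to take the non-models that differ from some model in exactly one variable. The sketch below uses that definition; both the specification and the distance measure are illustrative assumptions rather than the construction used in the paper.

```python
from itertools import product

VARIABLES = ["leader_a", "leader_b", "leader_c"]

def spec(m):
    # Hypothetical toy specification: exactly one node is the leader.
    return sum(m[v] for v in VARIABLES) == 1

def all_assignments():
    for values in product([False, True], repeat=len(VARIABLES)):
        yield dict(zip(VARIABLES, values))

def distance(m1, m2):
    """Number of variables on which two assignments disagree."""
    return sum(m1[v] != m2[v] for v in VARIABLES)

positives = [m for m in all_assignments() if spec(m)]
negatives = [m for m in all_assignments() if not spec(m)]

# A near miss is a negative model one flip away from some positive model.
near_misses = [n for n in negatives
               if any(distance(n, p) == 1 for p in positives)]
```

Here the assignment with no leader and the three assignments with two leaders are near misses, while the assignment with three leaders is a negative model that is further away.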

Our Findings

We study these questions through both crowdsourced and talkaloud studies, and using both quantitative and qualitative methods. We find that in this setting, the use of multiple models does not seem to have been a big win. (Had it been, we would still have to confront the problem of how to fit all that information onto a screen in the general case.) The use of negative instances does seem to be helpful. We also constructed novel modes of output such as where a user can flip between positive and negative instances, and these seem especially promising.

Of course, our findings come with numerous caveats. Rather than think of our results as in any way definitive, we view this as formative work for a much longer line of research at the intersection of formal methods and human-computer interaction. We especially believe there is enormous potential to apply cognitive science principles in this space, and our paper provides some very rough, preliminary ideas of how one might do so.

For More Details

You can read about all this in our paper. Be warned, the paper is a bit of heavy going! There are a lot of conditions and lots of experiments and data. But hopefully you can get the gist of it without too much trouble.

Automated, Targeted Testing of Property-Based Testing Predicates

2021-11-24T00:00:00+00:00

Property-Based Testing (PBT) is not only a valuable software quality improvement method in its own right, it’s also a critical bridge between traditional software development practices (like unit testing) and formal specification. We discuss this in our previous work on assessing student performance on PBT. In particular, we introduce a mechanism to investigate how students do by decomposing the property into a collection of independent sub-properties. This gives us semantic insight into how students perform: rather than a binary scale, we can identify specific sub-properties that they may have difficulty with.
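The idea of decomposing a property can be sketched as follows. The problem (validating the output of a sort), the sub-property names, and the student predicate are all hypothetical; they are chosen only to show how a targeted test exposes one specific gap.

```python
def student_is_valid_sort(inp, out):
    # Hypothetical student predicate: checks ordering but forgets the elements.
    return all(a <= b for a, b in zip(out, out[1:]))

def reference_is_valid_sort(inp, out):
    return sorted(inp) == out

# Each (input, output) pair violates exactly one sub-property of
# "out is a sorted version of inp", so a correct predicate must reject it.
TARGETED_TESTS = {
    "output is ordered": ([3, 1, 2], [3, 1, 2]),
    "output has the same elements as input": ([3, 1, 2], [1, 2, 4]),
}

def missed_sub_properties(predicate):
    """Sub-properties whose violation the predicate wrongly accepts."""
    return [name for name, (inp, out) in TARGETED_TESTS.items()
            if predicate(inp, out)]

student_missed = missed_sub_properties(student_is_valid_sort)
# ['output has the same elements as input']
reference_missed = missed_sub_properties(reference_is_valid_sort)  # []
```

Instead of a single pass/fail verdict, the result names the part of the property the predicate does not enforce.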

While this preliminary work was very useful, it suffered from several problems, some of which are not surprising while others became clear to us only in retrospect. In light of that, our new work makes several improvements.

The previous work expected each of the sub-properties to be independent. However, this is too strong a requirement. For one, it masks problems that can lurk in the conjunction of sub-properties. The other problem is more subtle: when you see a surprising or intriguing student error, you want to add a sub-property that would catch that error, so you can generate statistics on it. However, there’s no reason the new property will be independent; in fact, it almost certainly won’t be.
Our tests were being generated by hand, with one exception that was so subtle, we employed Alloy to find the test. Why only once? Why not use Alloy to generate tests in all situations? And while we’re at it, why not also use a generator from a PBT framework (specifically, Hypothesis)?
And if we’re going to use both value-based and SAT-based example generators, why not compare them?

This new paper does all of the above. First, it results in a much more flexible, useful tool for assessing student PBT performance. Second, it revisits our previous findings about student performance. Third, it lays out architectures for PBT evaluation using SAT and a PBT-generator (specifically Hypothesis). In the process it explains various engineering issues we needed to address. Fourth, it compares the two approaches; it also compares how the two approaches did relative to hand-curated test suites.

You can read about all this in our paper.

A Benchmark for Tabular Types

2021-11-21T00:00:00+00:00

Tables are Everywhere

Tables are ubiquitous in the world. Newspapers print tables. Reports include tables. Even children as young as middle-schoolers work comfortably with tables. Tables are, of course, also ubiquitous in programming. Because they provide an easy-to-understand, ubiquitous, already-parsed format, they are also valuable in programming education (e.g., DCIC works extensively with tables before moving on to other compound datatypes).

(Typed) Programming with Tables

When it comes to programming with tables, we have excellent tools like relational databases. However, using external databases creates impedance mismatches, so many programmers like to access tabular data directly from within the language, rather than construct external calls. The popularity of language-embedded query has not diminished with time.

Programming with tables, however, requires attention to types. Tables are inherently heterogeneous: each column is welcome to use whatever type makes most sense. This is all the more so if tables are a part of the language itself: while external data tend to be limited to “wire-format” types like numbers and strings, inside the language they can contain images, functions, other tables, and more. (For instance, we use all of these in Pyret.)

What is the type of a table? To make the output of tabular operations useful, it can’t be something flat like just Table. Because tables are heterogeneous, they can’t have just a single type parameter (like Table<T>). It may conceptually make sense to have a type parameter for each column (e.g., Table<String, Number>), but real-world tables can have 17 or 37 columns! Programmers also like to access table columns by name, not only position. And so on.
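One way to picture what a table type has to track is a schema that maps column names to column types, together with a rule for how each operation changes the schema. The sketch below does this with dynamic checks in Python. It is an illustration of the bookkeeping involved, not a type system, and the names conforms and select_schema are assumptions.

```python
schema = {"name": str, "age": int}
rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 45}]

def conforms(schema, rows):
    """True if every row has exactly the schema's columns, with matching types."""
    return all(
        row.keys() == schema.keys()
        and all(isinstance(row[col], typ) for col, typ in schema.items())
        for row in rows
    )

def select_schema(schema, columns):
    """Schema of the table produced by keeping only the named columns."""
    missing = [c for c in columns if c not in schema]
    if missing:
        raise KeyError(f"no such column(s): {missing}")
    return {c: schema[c] for c in columns}

ages_schema = select_schema(schema, ["age"])   # {'age': int}
```

A static type system would have to derive the equivalent of ages_schema before the program runs, for every operation in the API, which is part of what makes the design space hard.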

Making Results Comparable

In Spring 2021, we ran a seminar to understand the state of knowledge of type systems for tables. While we read several excellent papers, we also came away very frustrated: authors simply did not seem to agree on what a “table” was or what operations to support. The result was an enormous degree of incommensurability.

Therefore, rather than invent Yet Another Tabular Type System, we decided to take a step back and address the incommensurability problem. What we need as a community is a shared, baseline understanding of several aspects of tables. That is what this work does: create a tables benchmark. This is not a performance benchmark, however; rather, it’s an expressivity and design benchmark. We call it B2T2: The Brown Benchmark for Tabular Types.

The benchmark doesn’t spring out of thin air. Rather, we extensively studied tabular support in widely-used industrial languages/libraries: R, Python/Pandas, and Julia. To cover educational needs, we also studied the Pyret-based Bootstrap:Data Science curriculum. You will notice that all are based on dynamic languages. (Though Pyret has an optional static type system, it currently does not support tables in any meaningful manner, so tabular programming is essentially dynamic.) This is intentional! If you start with a typed language, you end up reflecting the (potentially artificial and overly-restrictive) constraints of that type system. Rather, it’s healthy to study what programmers (seem to) want to say and do, filter these for reasonability, and reconcile that with the needs of static types (like decidability).

Do Now!

What do you expect to find in a tabular programming benchmark?

Make a list before you read on!

Benchmark Components

B2T2 has the following parts:

A definition of a table. There is actually a large space of possibilities here. We’ve chosen a definition that is both broad and interesting without being onerous.
Examples of tables. Why did we bother to provide these? We do so because many type systems may have all sorts of internal encodings. They are welcome to do so, but they cannot expect the outside world to conform to their representation. Therefore, these examples represent the canonical versions of these tables. Explaining how these will be converted to the internal format is the responsibility of the type system designers.
An API of table operations. This is of course the heart of the benchmark. In particular, different papers seem to use different subsets of operations. What is unclear is whether the missing operations are just as easy as the ones shown; difficult; or even impossible. This is therefore a big source of incommensurability.
Example programs. Depending on the representation of tables and the nature of the type systems and languages, these programs may have to be rewritten and may (to some observers) look quite unnatural.

All these might be predictable with some thought. There are two more components that may be a little more surprising:

Erroneous programs. In all sophisticated systems, there is a trade-off between complexity and explainability. We are disturbed by how little discussion there is of error-reporting in the papers we’ve read, and think the community should re-balance its emphasis. Even those who only care about technical depth (boo!) can take solace: there can be rich technical work in explaining errors, too! Furthermore, by making errors an explicit component, a team that does research into human factors—even if they leave all other aspects alone—has a “place to stand” to demonstrate their contribution.
A datasheet. To improve commensurability, we want authors to tell each other—and their users—in a standard format not only what they did but also where the bodies are buried.

Of course, all these parts are interesting even in the absence of types. We just expect that types will impose the most interesting challenges.

An Open Process

We expect this benchmark to grow and evolve. Therefore, we’ve put our benchmark in a public repository. You’re welcome to make contributions: correct mistakes, refine definitions, add features, provide more interesting examples, etc. You can also contribute solutions in your favorite language!

For More Details

You can read about all this in our paper and work with our repository.

Student Help-Seeking for (Un)Specified Behaviors

2021-10-02T00:00:00+00:00

Here’s a summary of the full arc, including later work, of the Examplar project.

Over the years we have done a lot of work on Examplar, our system for helping students understand the problem before they start implementing it. Given that students will even use it voluntarily (perhaps even too much), it would seem to be a success story.

However, a full and fair scientific account of Examplar should also examine where it fails. To that end, we conducted an extensive investigation of all the posts students made on our course help forum for a whole semester, identified which posts had to do with problem specification (and under-specification!), and categorized how helpful or unhelpful Examplar was.

The good news is we saw several cases where Examplar had been directly helpful to students. These should indeed be considered a lower-bound, because the point of Examplar is to “answer” many questions directly, so they would never even make it onto the help forum.

But there is also bad news. To wit:

Students sometimes simply fail to use Examplar’s feedback; is this a shortcoming of the UI, of the training, or something inherent to how students interact with such systems?
Students tend to overly focus on inputs, which are only a part of the suite of examples.
Students do not transfer lessons from earlier assignments to later ones.
Students have various preconceptions about problem statements, such as imagining functionality not asked for or constraints not imposed.
Students enlarge the specification beyond what was written.
Students sometimes just don’t understand Examplar.

These serve to spur future research in this field, and may also point to the limits of automated assistance.

To learn more about this work, and in particular to get the points above fleshed out, see our paper!

Adding Function Transformers to CODAP

2021-08-22T00:00:00+00:00

CODAP is a wonderful tool for data transformation. However, it also has important limitations, especially from the perspective of our curricula. So we’ve set about addressing them so that we can incorporate CODAP into our teaching.

CODAP

We at Brown PLT and Bootstrap are big fans of CODAP, a data-analysis tool from the Concord Consortium. CODAP has very pleasant support for working with tables and generating plots, and we often turn to it to perform a quick analysis and generate a graph.

One of the nice things about CODAP, that sets it apart from traditional spreadsheets, is that the basic unit of space is not a table or spreadsheet but a desktop that can contain several objects on it. A workspace can therefore contain many objects side-by-side: a table, a graph, some text, etc.:

Also, a lot of things in CODAP are done through direct manipulation. This is helpful for younger students, who may struggle with formal programming but can use a GUI to manipulate objects.

There are many other nice features in CODAP, such as the ability to track a data case across representations, and so on. We urge you to go try it out! When you launch a new CODAP instance, CODAP will offer you a folder of examples, which can help you get acquainted with it and appreciate its features.

What’s Not to Love?

Unfortunately, we don’t love everything about CODAP. We’ll illustrate with an example. To be clear, this is not a bug in CODAP, but rather an important difference of opinion in ease-of-use.

Let’s say we want to find all the people who are below 50 years of age. In CODAP, there are a few ways to do it, all of which have their issues.

If you don’t mind being imprecise (which may be okay for a quick data exploration, but isn’t if you want to, say, compute a statistic over the result):

Create a new graph.
Drag the Age column to the graph.
Select all the items that are under 50 using visual inspection. (Depending on how much data you have and their spread, you’ll quite possibly under- and/or over-shoot.)
Then do the last few steps below.

If you care to get an accurate selection, instead begin with:

First, add a new column to the original table.
Enter a formula for that column (in this case, Age < 50).
Obtain a selection of the desired items, which can be done in several different ways, also all with trade-offs:

Sort by that column. Unfortunately, this won’t work if there’s grouping in the table. You’d have to select manually. (Try it. This may be a bit harder than it seems.)
Create a graph as above, but of the new column. This will give you a clean separation into two values. Manually select all the values in the true column. At least now it will be visually clear if you didn’t select all the right values (if the dataset is not too large).
Remove the formula for the new column. Now drag it to the leftmost end of the table. (If you don’t remove the formula, you’ll get an error!) Now you have all the elements grouped by true and false (and operations performed to one can also be performed to the other).

You’re not done! You still have more steps to go:

If you aren’t already in the table (e.g., if you made a graph), select the table.
Click on the “Eye” icon.
Choose the “Set Aside Unselected Cases” entry.

Note that, in most or all of these cases:

You’ve added a completely superfluous column to your dataset.
You may have changed the order of items in your dataset.
You’ve lost the ability to see the original data alongside the filtered data.
You had to take numerous steps.
You had to remember to use the Eye icon for filtering, as opposed to other GUI operations for other tasks.
You had to remember where the Eye icon even is: it’s hidden when a table isn’t selected.

But most of all, in every single case:

You had to perform all these operations manually.

Why does this matter? We need data science to be reproducible: we should be able to give others our datasets and scripts so they can re-run them to check that they get the same answer, tweak them so they can check the robustness of our answers, and so on. But when all the operations are done manually, there’s no “script”, only output. That focuses on answers rather than processes, and is anti-reproducibility.

In contrast, we think of filtering as a program operation that we apply to a table to produce a new table, leaving the original intact: e.g., the way it works in Pyret. This addresses almost all of the issues above.
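To make the contrast concrete, here is a minimal sketch of filtering as a program operation. It is written in Python rather than Pyret, and the names `filter_rows` and `people`, as well as the sample data, are our own illustrative assumptions, not part of CODAP or Pyret.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]


def filter_rows(table: Table, keep: Callable[[Row], bool]) -> Table:
    """Return a new table with only the rows where keep(row) is true."""
    # Copy each kept row so later edits to the result cannot touch the original.
    return [dict(row) for row in table if keep(row)]


# Made-up sample data, for illustration only.
people = [
    {"Name": "Ada", "Age": 36},
    {"Name": "Grace", "Age": 85},
    {"Name": "Alan", "Age": 41},
]

# The whole "script" is one line, so anyone can re-run or tweak it.
under_50 = filter_rows(people, lambda row: row["Age"] < 50)

print([row["Name"] for row in under_50])  # names in the filtered table
print(len(people))                        # the original table is untouched
```

No extra column is added, the row order is preserved, and both tables remain available for side-by-side comparison.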

Other Pedagogic Consequences

CODAP had to make certain design choices. They made good choices for some settings: for younger children, in particular, the direct manipulation interface works very nicely. It’s a low floor. However, we feel it’s also a lower-than-we’d-like ceiling. There are many things that the CODAP view of data transformation inhibits:

Making operations explicit, as we noted above.
Introducing the idea of functions or transformations of data as objects in their own right, not only as manual operations.
Having explicit data-transformation functions also connects to other related school curricula, such as algebra.
Saving and naming repeated operations, to learn a bottom-up process of developing abstractions.
Examining old and new tables side-by-side.

This last point is especially important. A critical task in data science is performing “what if” analyses. What-if fundamentally means we should be able to perform some operation (the “if” part) and compare the output (the “what” part). We might even want to look at multiple different scenarios, representing different possible outcomes. But traditional what-if analysis, whether in CODAP or on spreadsheets, often requires you, the human, to remember what has changed, rather than letting the computer do it for you. (Microsoft Excel has limited support to get around this, but its very presence indicates that spreadsheets, traditionally, did not support what-if analysis—even though that’s how they have often been marketed.)

Finally, there’s also a subtle consequence to CODAP’s design: derived tables must look substantially similar to their parents. In computing terms, the schema should be largely the same. That works fine when an operation has little impact on the schema: filtering doesn’t change the schema at all (in principle, though in CODAP you have to add an extra column…), and adding a new column is a conservative extension. But what if you want to perform an operation that results in a radically different schema? For instance, consider the “pivot wider” and “pivot longer” operations when we create tidy data. The results of those operations have substantially different schemas!

Introducing CODAP Transformers

In response to this critique, we’ve added a new plugin to CODAP called Transformers:

(This work was done by undergrad trio Paul Biberstein, Thomas Castleman, and Jason Chen.)

This introduces a new pane that lists several transformation operations, grouped by functionality:

For instance, with no more textual programming than before (the formula is the same), we can perform our same example as before, i.e., finding all the people younger than 50:

The result is a new table, which co-exists with the original:

The resulting table is just as much a table as the original. For instance, we can graph the ages in the two tables and see exactly the difference we’d expect:

(Over time, of course, you may build up many tables. The Transformers plugin chooses names based on the operations, to make them easy to tell apart. CODAP also lets you resize, minimize, and delete tables. In practice, we don’t expect users to have more than 3–4 tables up at a time.)

Saving Transformers

We might want to perform the same operation on multiple tables. This is valuable in several contexts:

We create a hand-curated table, with known answers, as a test case to make sure our operations perform what we expect. After confirming this, we want to be sure that we applied the exact same operation to the real dataset.
We want to perform the same operation to several related datasets: e.g., a table per year.

We might also simply want to give a meaningful name to the operation.

In such cases, we can use the “Save This Transformer” option at the bottom of the Transformers pane:

Following the programming processes we follow and teach, we naturally want you to think about the Design Recipe steps when saving it because, in programming terms, you’re creating a new function.

This now creates a new named transformer:

Every part of this is frozen other than the choice of dataset; it can be applied as many times as you want, to as many datasets as you want. The above use-cases are suggestions, but you can use it however you wish.
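In programming terms, a saved transformer behaves roughly like a named function whose only remaining parameter is the table. The sketch below illustrates that idea in Python; `make_filter_transformer` and the sample tables are hypothetical, and this is not how the plugin itself is implemented.

```python
def make_filter_transformer(name, predicate):
    """Freeze a predicate into a named operation that takes only a table."""
    def transformer(table):
        return [dict(row) for row in table if predicate(row)]
    transformer.__name__ = name
    return transformer


younger_than_50 = make_filter_transformer(
    "younger_than_50", lambda row: row["Age"] < 50
)

# A hand-curated test table with known answers (note the boundary case).
test_table = [{"Name": "P", "Age": 49}, {"Name": "Q", "Age": 50}]
assert younger_than_50(test_table) == [{"Name": "P", "Age": 49}]

# The exact same operation, applied to (made-up) per-year tables.
tables_by_year = {
    2020: [{"Name": "Ada", "Age": 36}, {"Name": "Grace", "Age": 85}],
    2021: [{"Name": "Alan", "Age": 41}],
}
results = {year: younger_than_50(t) for year, t in tables_by_year.items()}
```

Because the predicate is frozen inside the function, we can be sure the operation checked on the test table is the one applied to the real data.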

A Note on Errors

Suppose you try to apply an operation improperly. Say, for instance, you have a table of people that does not have an Age column, and you try to filter people with Age < 50. There are at least two choices that Transformers can take:

Allow you to try to perform the operation, and report an error.
Prevent you from even trying by simply not showing tables that are invalid in the drop-down list of tables that the operation can be applied to.

We know exactly what the programming languages reader reading this is thinking: “You’re going to choose the latter, right? Right?!? PLEASE TELL ME YOU ARE!!!”

Gentle Reader: we’re not.

Here’s why we chose not to.

There’s the messy implementation detail of figuring out exactly when a table should or shouldn’t be shown in the drop-down. And we’d have to maintain that across changes to the CODAP language. There are no such problems in the dynamic version.

But hey, we’re language implementors, we can figure these things out. Rather, our real reason comes from human factors:

Imagine you’re a teacher with a classroom full of students. A student tries to apply an operation to the wrong table. They probably don’t even realize that the operation can’t be applied. All they know is that the table doesn’t appear in the list. Their table doesn’t appear in the list! Their reaction is (perhaps rightly) going to be to raise their hand and say to their teacher, “This tool is broken! It won’t even show me my table!!!” And the teacher, dealing with a whole bunch of students, all in different states, may not immediately realize why the table doesn’t show. Everyone’s frustrated; the student feels stuck, and the teacher may be left feeling inadequate.

In contrast, if we just let the operation happen, here’s what the student sees:

Now the student has a pretty good chance of figuring out for themselves what went wrong: not pulling away the teacher from helping someone else, not blaming the tool, and instead giving themselves a chance of fixing their own problem.
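The two choices can be sketched side by side. This is an illustration in Python under our own assumptions: the function names, the `TransformerError` class, and the wording of the message are all hypothetical, and they do not reproduce what the plugin actually displays.

```python
class TransformerError(Exception):
    """Hypothetical error type for an operation that cannot be applied."""


def valid_targets(tables, column):
    """Static choice: offer only the tables in which every row has the column."""
    return [
        name
        for name, rows in tables.items()
        if all(column in row for row in rows)
    ]


def filter_by_column(rows, column, predicate):
    """Dynamic choice: attempt the operation and explain any failure."""
    for row in rows:
        if column not in row:
            # Illustrative wording only.
            raise TransformerError(
                f"Cannot filter: this table has no '{column}' column."
            )
    return [dict(row) for row in rows if predicate(row[column])]


tables = {
    "people": [{"Name": "Ada", "Age": 36}],
    "pets": [{"Name": "Rex"}],
}

print(valid_targets(tables, "Age"))  # "pets" silently disappears from the list

try:
    filter_by_column(tables["pets"], "Age", lambda age: age < 50)
except TransformerError as err:
    print(err)  # the student gets a reason instead of a missing table
```

In the static version, the student only sees that their table is absent; in the dynamic version, they see a message that names the missing column.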

There’s potentially a broader lesson here about making invalid states unrepresentable. Potentially.

Many, Many Transformers!

We’ve focused on just one transformation here. There are many more. We even have the pivoting operations for tidy data! (It would have been wrong to tease you with that up top, otherwise.)

We even take the what-if part seriously: the Compare Transformer lets you compare numeric and categorical data. Believe it or not, the categorical comparison operator was actually inspired by prior work we’ve done for many years on comparing access control policies, router configurations, and SDN programs (see also our two brief position papers). It’s pretty thrilling to see the flow of ideas from security and networking research to data science education in a small but very non-obvious way: the grouping in the categorical output is directly inspired by the multi-terminal decision diagrams of our original Margrave system.

Examples

For your benefit, we’ve set up a bunch of pre-built CODAP examples that show you the operations in action:

Make Your Own!

As you might have guessed from the examples above, transformers are now part of the official CODAP tool. You can go play with them right now on the CODAP site. Have fun! Tell us what you learned.

Thanks

Special thanks to our friends at the Concord Consortium, especially William Finzer, Jonathan Sandoe, and Chad Dorsey, for their support.

Developing Behavioral Concepts of Higher-Order Functions

2021-07-31T00:00:00+00:00

Higher-Order Functions (HOFs) are an integral part of the programming process. They are so ubiquitous, even Java had to bow and accept them. They’re especially central to cleanly expressing the stages of processing data, as the R community and others have discovered.

How do we teach higher-order functions? In some places, like How to Design Programs, HOFs are presented as abstractions over common patterns. That is, you see a certain pattern of program behavior over and over, and eventually learn to parameterize it and call it map, filter, and so on. That is a powerful method.

In this work, we take a different perspective. Our students often write examples first, so they can think in terms of the behavior they want to achieve. Thus, they need to develop an understanding of HOFs as abstractions over common behaviors: “mapping”, “filtering”, etc.

Our goal in this work is to study how well students form behavioral abstractions having been taught code pattern abstractions. Our main instrument is sets of input-output behavior, such as

(list "red" "green" "blue")
-->
(list 3 5 4)

(Hopefully that calls to mind “mapping”.) We very carefully designed a set of these to capture various specific similarities, overlaps, and impossibilities.
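To illustrate what "thinking behaviorally" might look like, here is a small Python sketch that checks whether an input-output pair is even consistent with mapping or filtering. The two checks are our own simplified assumptions for illustration; they are not the table of properties from the paper.

```python
def is_subsequence(candidate, source):
    """True if candidate's items appear in source, in the same order."""
    remaining = iter(source)
    return all(any(x == y for y in remaining) for x in candidate)


def could_be_map(inp, out):
    # Mapping transforms each element, so the length never changes.
    return len(inp) == len(out)


def could_be_filter(inp, out):
    # Filtering only drops elements; it never changes or reorders them.
    return is_subsequence(out, inp)


# The behavior shown above, which is also what map(len, ...) produces.
colors = ["red", "green", "blue"]
assert list(map(len, colors)) == [3, 5, 4]

print(could_be_map(colors, [3, 5, 4]), could_be_filter(colors, [3, 5, 4]))
print(could_be_map([1, 2, 3, 4], [2, 4]), could_be_filter([1, 2, 3, 4], [2, 4]))
print(could_be_map([1, 2], [1, 2]), could_be_filter([1, 2], [1, 2]))
```

The last pair is an overlap: returning the input unchanged is consistent with both mapping and filtering, so the behavior alone cannot tell them apart.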

We then evaluated students using two main devices, inspired by activities in machine learning:

Clustering: Group behaviors into clusters of similar ones.
Classification: For each behavior, assign a label.

We also tried labeling over visual presentations.

Our paper describes these instruments in detail, and our outcomes. What is most interesting, perhaps, is not our specific outcomes, but this style of thinking about teaching HOFs. We think the materials—especially the input-output pairs, and a table of properties—will be useful to educators who are tackling this topic. More broadly, we think there’s a huge unexplored space of HOF pedagogy combined with meaningful evaluation. We hope this work inspires more work along these lines.

Adversarial Thinking Early in Post-Secondary Education

2021-07-18T00:00:00+00:00

Adversarial Thinking (AT) is often described as “thinking like a hacker” or as a “security mindset”. These quasi-definitions are not only problematic in their own right (in some cases, they can be outright circular), they are also too narrow. We believe that AT applies in many other settings as well: in finding ways that machine learning can go wrong, in identifying problems with user interfaces, and for that matter even in software testing and verification.

All these are, however, quite sophisticated computer science concepts. Does that mean AT can only be covered in advanced computer science courses—security, machine learning, formal methods, and the like? Put differently, how much technical sophistication do students need before they can start to engage in it?

We believe AT can be covered starting from a fairly early stage. In this work, we’ve studied its use with (accelerated) introductory post-secondary (university) students. We find that they do very well, but also exhibit some weaknesses. We also find that they are able to reckon with the consequences of systems well beyond their technical capability. Finally, we find that they focus heavily on social issues, not just on technical ones.

In addition to these findings, we have also assembled a rich set of materials covering several aspects of computer science. Students generally found these engaging and thought-provoking, and responded to them with enthusiasm. We think educators would benefit greatly from this collection of materials.

Want to Learn More?

If you’re interested in this, and in the outcomes, please see our paper.

Teaching and Assessing Property-Based Testing

2021-01-10T00:00:00+00:00

Property-Based Testing (PBT) sees increasing use in industry, but lags significantly behind in education. Many academics have never even heard of it. This isn’t surprising; computing education still hasn’t come to terms with even basic software testing, even when it can address pedagogic problems. So this lag is predictable.

The Problem of Examples

But even people who want to use it often struggle to find good examples of it. Reversing a list drives people to drink, and math examples are hard to relate to. This is a problem in several respects. Without compelling examples, nobody will want to teach it. Even if they do, unless the examples are compelling, students will not pay attention to it. And if the students don’t, they won’t recognize opportunities to use it later in their careers.

This loses much more than a testing technique. We consider PBT a gateway to formal specification. Like a formal spec, it’s an abstract statement about behaviors. Unlike a formal spec, it doesn’t require learning a new language or mathematical formalism, it’s executable, and it produces concrete counter-examples. We therefore use it, in Brown’s Logic for Systems course, as the starting point to more formal specifications. (If they learn nothing about formal specification but simply become better testers, we’d still consider that a win.)

Therefore, for the past 10 years, and with growing emphasis, we’ve been teaching PBT: starting in our accelerated introductory course, then in Logic for Systems, and gradually in other courses as well. But how do we motivate the concept?

Relational Problems

We motivate PBT through what we call relational problems. What are those?

Think about your typical unit test. You write an input-output pair: f(x) is y. Let’s say it fails:

Usually, the function f is wrong. Congrats, you’ve just caught a bug!
Sometimes, the test is wrong: f(x) is not, in fact, y. This can take some reflection, and possibly reveals a misunderstanding of the problem.

That’s usually where the unit-testing story ends. However, there is one more possibility:

Neither is “wrong”. f(x) has multiple legal results, w, y, and z; your test chose y, but this particular implementation happened to return z or w instead.

We call these “relational” because f is clearly more a relation than a function.

Some Examples

So far, so abstract. But many problems in computing actually have a relational flavor:

Consider computing shortest paths in a graph or network; there can be many shortest paths, not just one. If we write a test to check for one particular path, we could easily run into the problem above.
Many other graph algorithms are also relational. There are many legal answers, and the implementation happens to pick just one of them.
Non-deterministic data structures inspire relational behavior.
Various kinds of matching problems—e.g., the stable-marriage problem—are relational.
Combinatorial optimization problems are relational.
Even sorting, when done over non-atomic data, is relational.

In short, computing is full of relational problems. While they are not at all the only context in which PBT makes sense, they certainly provide a rich collection of problems that students already study that can be used to expose this idea in a non-trivial setting.
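As an illustration of the last bullet, here is a minimal Python sketch of a property-based test for sorting people by age. It uses only the standard library rather than a PBT framework, and the names `is_valid_sort_by_age` and `check_sorter`, along with the data generator, are our own assumptions.

```python
import random
from collections import Counter


def is_valid_sort_by_age(original, result):
    """The property: same people, in non-decreasing order of age."""
    same_people = Counter(original) == Counter(result)
    ordered = all(a[1] <= b[1] for a, b in zip(result, result[1:]))
    return same_people and ordered


def check_sorter(sorter, trials=200, seed=0):
    """Run the sorter on random inputs; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        people = [
            (rng.choice("abcdef"), rng.randint(18, 22))
            for _ in range(rng.randint(0, 8))
        ]
        if not is_valid_sort_by_age(people, sorter(list(people))):
            return people
    return None


# Two legal implementations that disagree on how to order ties.
def keep_ties_in_input_order(people):
    return sorted(people, key=lambda p: p[1])


def break_ties_by_name(people):
    return sorted(people, key=lambda p: (p[1], p[0]))


tied = [("b", 20), ("a", 20)]
print(keep_ties_in_input_order(tied) == break_ties_by_name(tied))
print(check_sorter(keep_ties_in_input_order), check_sorter(break_ties_by_name))
```

A unit test that pins down one particular ordering of `tied` would fail one of these two implementations, even though neither is wrong. The property accepts both.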

Assessing Student Performance

Okay, so we’ve been having students write PBT for several years now. But how well do they do? How do we go about measuring such a question? (Course grades are far too coarse, and even assignment grades may include various criteria—like code style—that are not strictly germane to this question.) Naturally, their core product is a binary classifier—it labels a purported implementation as valid or invalid—so we could compute precision and recall. However, these measures still fail to offer any semantic insight into how students did and what they missed.

We therefore created a new framework for assessing this. To wit, we took each problem’s abstract property statement (viewed as a formal specification), and sub-divided it into a set of sub-properties whose conjunction is the original property. Each sub-property was then turned into a test suite, which accepted those validators that enforced the property and rejected those that did not. This let us get a more fine-grained understanding of how students did, and what kinds of mistakes they made.
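Here is a rough sketch of the idea in Python, using sorting a list of numbers as the problem. The two sub-properties, the test cases, and the function `assess` are hypothetical examples of our own; the paper's actual sub-properties and suites are not reproduced here. A "validator" is a student-written function that takes an input and a purported output and says whether the output is acceptable.

```python
# For each sub-property, (input, output) pairs that violate only that one.
SUB_PROPERTY_TESTS = {
    "output is ordered": [
        ([3, 1, 2], [3, 1, 2]),
        ([2, 1], [2, 1]),
    ],
    "output has the same elements": [
        ([3, 1, 2], [1, 2]),        # an element was dropped
        ([2, 1], [1, 2, 3]),        # an element was invented
        ([1, 1, 2], [1, 2, 2]),     # right values, wrong counts
    ],
}

# Pairs that satisfy the whole property, so a validator must accept them.
VALID_CASES = [([3, 1, 2], [1, 2, 3]), ([], []), ([1, 1], [1, 1])]


def assess(validator):
    """Report, per sub-property, whether the validator enforces it."""
    report = {
        name: all(not validator(inp, out) for inp, out in bad_cases)
        for name, bad_cases in SUB_PROPERTY_TESTS.items()
    }
    report["accepts valid outputs"] = all(
        validator(inp, out) for inp, out in VALID_CASES
    )
    return report


# A plausible student validator that compares sets, forgetting duplicates.
def student_validator(inp, out):
    return out == sorted(out) and set(inp) == set(out)


print(assess(student_validator))
```

Instead of a single score, the report says which part of the specification the student captured (ordering) and which part they missed (element counts).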

Want to Learn More?

If you’re interested in this, and in the outcomes, please see our paper.

What’s Next?

The results in this paper are interesting but preliminary. Our follow-up work describes limitations to the approach presented here, thereby improving the quality of evaluation, and also innovates in the generation of classifying tests. Check it out!

Students Testing Without Coercion

2020-10-20T00:00:00+00:00

Here’s a summary of the full arc, including later work, of the Examplar project.

It is practically a trope of computing education that students are over-eager to implement, yet woefully under-eager to confirm they understand the problem they are tasked with, or that their implementation matches their expectations. We’ve heard this stereotype couched in various degrees of cynicism, ranging from “students can’t test” to “students won’t test”. We aren’t convinced, and have, for several years now, experimented with nudging students towards early example writing and thorough testing.

We’ve blogged previously about our prototype IDE, Examplar — our experiment in encouraging students to write illustrative examples for their homework assignments before they actually dig into their implementation. Examplar started life as a separate, complementary tool from Pyret’s usual programming environment, providing a buffer just for the purposes of developing a test suite. Clicking Run in Examplar runs the student’s test suite against our implementations (not theirs) and then reports the degree to which the suite was valid and thorough (i.e., good at catching our buggy implementations). With this tool, students could catch their misconceptions before implementing them.

Although usage of this tool was voluntary for all but the first assignment, students relied on it extensively throughout the semester and the quality of final submissions improved drastically compared to prior offerings of the course. Our positive experience with this prototype encouraged us to fully integrate Examplar’s feedback into students’ development environment. Examplar’s successor provides a unified environment for the development of both tests and implementation:

This new environment — which provides Examplar-like feedback on every Run — no longer requires that students have the self-awareness to periodically switch to a separate tool. The environment also requires students to first click a “Begin Implementation” button before showing the tab in which they write their implementation.

This unified environment enabled us to study, for the first time, whether students wrote examples early, relative to their implementation progress. We tracked the maximum test thoroughness students had achieved prior to each edit-and-run of their implementations file. Since the IDE notified students of their test thoroughness upon each run, and since students could only increase their thoroughness via edits to their tests file, the mean of these scores summarizes how thoroughly a student explored the problem with tests before fully implementing it.
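The measure can be sketched as follows. This is our own reading of the description above, written in Python: the event-log format, the function name, and the assumption that thoroughness is a number between 0 and 1 are all illustrative, not the actual logging format of the environment.

```python
def mean_implementation_interval_thoroughness(runs):
    """runs: chronological list of (edited_file, thoroughness_reported).

    edited_file is "tests" or "impl"; thoroughness is assumed to be in [0, 1].
    Returns None if the student never ran after editing their implementation.
    """
    best_so_far = 0.0
    scores = []
    for edited_file, thoroughness in runs:
        if edited_file == "impl":
            # Record the best thoroughness achieved *before* this run.
            scores.append(best_so_far)
        best_so_far = max(best_so_far, thoroughness)
    return sum(scores) / len(scores) if scores else None


# A made-up session: two rounds of testing, then implementation work,
# with one more round of testing in between.
session = [
    ("tests", 0.25),
    ("tests", 0.5),
    ("impl", 0.5),
    ("tests", 0.75),
    ("impl", 0.75),
    ("impl", 0.75),
]

print(mean_implementation_interval_thoroughness(session))
```

A student who writes all their tests only after finishing the implementation would score 0 under this sketch, while one who reaches full thoroughness before clicking “Begin Implementation” would score 1.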

We find that nearly every student on nearly every assignment achieves some level of thoroughness before their implementation work:

The "Mean Implementation-Interval Thoroughness" of each student, on various assignments.

To read more about our design of this environment, its pedagogic context, and our evaluation of students’ development process, check out the full paper here.

Using Design Alternatives to Learn About Data Organizations

2020-06-27T00:00:00+00:00

A large number of computer science education papers focus on data structures. By this, they mean the canon: lists, queues, stacks, heaps, and so on. These are certainly vital to the design of most programs.

However, there is another kind of “data structure” programmers routinely contend with: how to represent the world your program is about. Suppose, for instance, you’re trying to represent a family’s genealogy. You could:

Represent each person as a node and have references to their two biological parents, who in turn have references to their biological parents, and so on. The tree “bottoms out” when we get to people about whom we have no more information.
Represent each person as a node and have references to their children instead (a list, say, if we want to preserve their birth order). This tree bottoms out at people who have no children.
Represent each coupling as a node, and have references to their children (or issue, as genealogists like to say). Now you may have a kind of node for children and another for coupling.

And so on. There are numerous possibilities. Which one should we pick? It depends on (1) what information we even have, (2) what operations we want to perform, and (3) what complexity we need different operations to take.
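To see how the choice of organization affects operations, here is a small Python sketch of the first two representations. The class and function names are our own, and the family is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersonWithParents:
    """First organization: each person points to their biological parents."""
    name: str
    parent1: Optional["PersonWithParents"] = None  # None: no more information
    parent2: Optional["PersonWithParents"] = None


@dataclass
class PersonWithChildren:
    """Second organization: each person points to their children, in birth order."""
    name: str
    children: List["PersonWithChildren"] = field(default_factory=list)


def ancestors(person: PersonWithParents) -> List[str]:
    """Easy with parent references: just follow them upward."""
    found = []
    for parent in (person.parent1, person.parent2):
        if parent is not None:
            found.append(parent.name)
            found.extend(ancestors(parent))
    return found


def descendants(person: PersonWithChildren) -> List[str]:
    """Easy with child references: just follow them downward."""
    found = []
    for child in person.children:
        found.append(child.name)
        found.extend(descendants(child))
    return found


# The same made-up family, organized both ways.
kid = PersonWithParents("Kid", PersonWithParents("Mom", PersonWithParents("Grandma")))
grandma = PersonWithChildren("Grandma", [PersonWithChildren("Mom", [PersonWithChildren("Kid")])])

print(ancestors(kid))
print(descendants(grandma))
```

Each organization makes one question a simple traversal. Asking for Grandma's descendants in the first organization, or Kid's ancestors in the second, would require searching every person we know about.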

Unfortunately, computing education research doesn’t talk about this problem very much at all; in fact, we don’t seem to even have terminology to talk about this issue. In a sense, this is also very much a matter of data structure, though of a different kind: whereas the purely abstract data structures of computer science we might call computational data structures, these — which center around directly representing real-world information — we might instead call representational data structures. That could get pretty confusing, though, so we’ve adopted the term data organization to refer to the latter.

Learning to think about data organization is an essential computing skill. But how early can we teach it? How well can students wrestle with it? What methods should we use? Do they need to be sophisticated programmers before they can engage in reasoning about representations?

Good news: we can begin this quite early, and students don’t need to be sophisticated computer scientists: they can just think about the world, and their experiences living in it, to reason about data organizations. Representational data structures probably do a far better job of drawing on their lived experience than computational ones do! (Unless they’ve previously lived as a computer.)

There are several ways we could introduce this topic. We chose to expose them to pairs of representations for the same domain, and have them compare the two. This is related to theories of perception. Read the paper to learn more!

Somewhat subtly, this also adds a dimension to “computational thinking” that is usually quite missing from standard discussions about it. Activities like those described in this paper generate new and engaging activities that many students can participate in. Indeed, computing background does not seem to matter much in our data, and a more diverse group of students is likely to make a much richer set of judgments—thereby enabling students in traditionally underrepresented groups to contribute based on their unique experiences, and also feel more valued.

What Help Do Students Seek in TA Office Hours?

2020-05-20T00:00:00+00:00

In computer science, a large number of students get help from teaching assistants (TAs). A great deal of their real education happens in these hours. While TA hours are an excellent resource, they are also rather opaque to the instructors, who do not really know what happens in them.

How do we construct a mechanism to study what happens in hours? It’s actually not obvious at all:

We could set up cameras to record all the interactions in hours. While this would provide a lot of information, it significantly changes the nature of hours. For many students, hours are private time with a TA, where they can freely speak about their discomfort and get help from a peer; they might ask personal questions; they might also complain about the instructor. One does not install cameras in confessionals.
We could ask TAs to write extensive notes (redacting private information) after the student has left. This also has various flaws:
- Their memory may be faulty.
- Their recollection may be biased by their own beliefs.
- It would slow down processing students, who already confront overly-long lines and waits.

What do we instead want? A process that is non-intrusive, lightweight, and yet informative. We have to also give up on perfect knowledge, and focus on information that is actually useful to the instructor.

Part of the problem is that we as a community lack a systematic method to help students in the first place. If students have no structure to how they approach help-seeking, then it’s hard to find patterns and make sense of what they actually do.

However, this is exactly a problem that the How to Design Programs Design Recipe was designed to solve. It provides a systematic way for students to structure their problem-solving and help-seeking. TAs are instructed to focus on the steps of the Design Recipe in order, not addressing later steps until students have successfully completed the earlier ones. This provides an “early warning” diagnostic, addressing root causes rather than their (far-removed) manifestations.

Therefore, we decided to use the Design Recipe steps as a lens for obtaining insight into TA hours. We argue that this provides a preliminary tool that addresses our needs: it is lightweight, non-intrusive, and yet useful to the instructor. Read the paper to learn more!

Combating Misconceptions by Encouraging Example-Writing

2020-01-11T00:00:00+00:00

Here’s a summary of the full arc, including later work, of the Examplar project.

When faced with an unfamiliar programming problem, undergraduate computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission—depriving them of the intended learning goal of the assignment.

Educators must therefore provide students with some mechanism by which students can evaluate their own understanding of a problem—before they waste time implementing some misconceived variation of that problem. To this end, we provide students with Examplar: an IDE for writing input–output examples that provides on-demand feedback on whether the examples are:

valid (consistent with the problem), and
thorough (explore the conceptually interesting corners of the problem).

For a demonstration, watch this brief video!

With its gamification, we believed students would find Examplar compelling to use. Moreover, we believed its feedback would be helpful. Both of these hypotheses were confirmed. We found that students used Examplar extensively—even when they were not required to use it, and even for assignments for which they were not required to submit test cases. The quality of students’ final submissions generally improved over previous years, too. For more information, read the full paper here!

The Hidden Perils of Automated Assessment

2018-07-26T00:00:00+00:00

We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).

To Test a Test Suite

The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.

In this group, we take the perspective that test suites are classifiers of implementations. You give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can use standard metrics for classifiers: true positive rate and true negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.

                      Ground Truth
                      Correct           Faulty
Test Suite  Accept    True Negative     False Negative
Test Suite  Reject    False Positive    True Positive
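The table above translates directly into a computation. Here is a minimal Python sketch, assuming a test suite is modeled as a function that returns True when it accepts an implementation; the function `rates` and the absolute-value example are our own illustrations.

```python
def rates(test_suite, implementations):
    """implementations: list of (implementation, is_correct) with known truth.

    Returns (true positive rate, true negative rate); a rate is None when
    there are no faulty (or no correct) implementations to measure it on.
    """
    tp = fn = tn = fp = 0
    for impl, is_correct in implementations:
        accepted = test_suite(impl)
        if is_correct:
            if accepted:
                tn += 1  # correct and accepted
            else:
                fp += 1  # correct but rejected
        else:
            if accepted:
                fn += 1  # faulty but accepted
            else:
                tp += 1  # faulty and rejected: a bug was caught
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return tpr, tnr


# Made-up ground truth for an "absolute value" assignment.
known = [
    (abs, True),
    (lambda x: x, False),   # forgets to negate negatives
    (lambda x: -x, False),  # always negates
]


# A student suite that never tries a negative input.
def student_suite(f):
    return f(3) == 3 and f(0) == 0


print(rates(student_suite, known))
```

Here the student suite accepts the correct implementation and catches one of the two faulty ones, so it has a true negative rate of 1 and a true positive rate of one half.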

A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Additionally, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.

There are basically two ways of doing this in the literature; let’s see how they fare.

The Axiomatic Model

In the first method, the instructor writes a test suite, and that test suite’s judgments are used as the ground truth; e.g., if the instructor test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.

The Algorithmic Model

The second method does this by taking every test suite you have (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the ones that pass it into one big mega test suite that is used to establish ground truth.
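Both models can be sketched in a few lines of Python. We assume here that a test suite is a list of tests and that each test is a function from an implementation to True or False; the function names and the absolute-value example are hypothetical, and the sketch makes no claim about which of the two answers is right.

```python
def accepts(suite, impl):
    """A suite accepts an implementation when every test in it passes."""
    return all(test(impl) for test in suite)


def axiomatic_ground_truth(instructor_suite, impl):
    """Axiomatic model: the instructor's suite decides what is correct."""
    return accepts(instructor_suite, impl)


def build_mega_suite(all_suites, known_correct):
    """Algorithmic model: glue together every suite that passes a
    known-correct implementation."""
    mega = []
    for suite in all_suites:
        if accepts(suite, known_correct):
            mega.extend(suite)
    return mega


# Made-up suites for an "absolute value" assignment.
instructor_suite = [lambda f: f(3) == 3]
student_a_suite = [lambda f: f(-2) == 2]
student_b_suite = [lambda f: f(-2) == -2]  # mistaken: fails on a correct abs

mega_suite = build_mega_suite(
    [instructor_suite, student_a_suite, student_b_suite], known_correct=abs
)


def identity(x):
    return x  # a faulty "absolute value"


print(axiomatic_ground_truth(instructor_suite, identity))
print(accepts(mega_suite, identity))
```

In this made-up example the two models label the same implementation differently, which is enough to change how every student suite that was run against it gets scored.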

A Tale of Two Assessments

We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rates for each student’s test suite.

The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):

The Axiomatic Model:

Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.

The Algorithmic Model:

Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.

Towards Robust Assessments

So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.

We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.

Mystery Languages

2018-07-05T00:00:00+00:00

How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?

This skill—learning about something by poking at it—is frequently used but almost never taught. However, we have tried to teach it via what we dub mystery languages, and this post is about how we went about it.

Mystery Languages

A mystery language is exactly what it sounds like: a programming language whose behavior is a mystery. Each assignment comes with some vague documentation, and an editor in which you can write programs. However, when you run a program it will produce multiple answers! This is because there are actually multiple languages, with the same syntax but different semantics. The answers you get are the results of running your program in all of the different languages. The goal of the assignment is to find programs that tell the languages apart.

As an example, maybe the languages have the syntax a[i], and the documentation says that this “accesses an array a at index i”. That totally specifies the behavior of this syntax, right?

Not even close. For example, what happens if the index is out of bounds? Does it raise an error (like Java), or return undefined (like JavaScript), or produce nonsense (like C), or wrap the index to a valid one (like Python)? And what happens if the index is 2.5, or "2", or 2.0, or "two"?

(EDIT: Specifically, Python wraps negative indices whose magnitude is no larger than the length of the array.)
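As a rough illustration of the idea, the following Python sketch models three of the out-of-bounds behaviors mentioned above as three "languages" that share the syntax a[i]. The function names and the `run_all` helper are hypothetical, and this is not how the actual mystery languages are implemented. The C-like "nonsense" behavior is left out because it cannot be modeled faithfully here.

```python
def java_like(a, i):
    """Out-of-bounds access raises an error."""
    if not 0 <= i < len(a):
        raise IndexError("index out of bounds")
    return a[i]


def javascript_like(a, i):
    """Out-of-bounds access yields an 'undefined' value (modeled as None)."""
    return a[i] if 0 <= i < len(a) else None


def python_like(a, i):
    """Negative indices wrap around; other out-of-bounds accesses raise."""
    return a[i]


LANGUAGES = {
    "java_like": java_like,
    "javascript_like": javascript_like,
    "python_like": python_like,
}


def run_all(a, i):
    """Run the 'program' a[i] in every language and collect the answers."""
    results = {}
    for name, access in LANGUAGES.items():
        try:
            results[name] = access(a, i)
        except IndexError:
            results[name] = "error"
    return results


boring = run_all([10, 20, 30], 1)      # an in-bounds program
revealing = run_all([10, 20, 30], -1)  # a program that probes an edge case
```

The in-bounds program produces the same answer in all three languages, so it tells us nothing. The program with index -1 produces three different answers, so on its own it is a classifier for these three languages.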

Students engage with the mystery languages in three ways:

The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a classifier).
The second part of the assignment is to describe a theory that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.

This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.

Why Mystery Languages?

We hope to teach a number of skills through mystery languages:

Evaluating Design Decisions: The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
Syntax vs. Semantics: We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
Adversarial Thinking: Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
Experimental Thinking: In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to explain the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.

Adopting Mystery Languages

If you want to use Mystery Languages in your course, email us and we’ll help you get started!

There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.

Learn More

You can learn more about mystery languages in our SNAPL’17 paper, or dive in and try them yourself (see the assignments prefixed with “ML:”).

Resugaring Type Rules

2018-06-19T00:00:00+00:00

This is the final post in a series about resugaring. It focuses on resugaring type rules. See also our posts on resugaring evaluation steps and resugaring scope rules.

No one should have to see the insides of your macros. Yet type errors often reveal them. For example, here is a very simple and macro in Rust (of course you should just use && instead, but we’ll use this as a simple working example):

and the type error message you get if you misuse it:

You can see that it shows you the definition of this macro. In this case it’s not so bad, but other macros can get messy, and you might not want to see their guts. Plus in principle, a type error should only show the erroneous code, not correct code that it happened to call. You wouldn’t be very happy with a type checker that sometimes threw an error deep in the body of a (type correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?

The reason Rust does this is that it does not know the type of and. As a result, it can only type check after and has been desugared (a.k.a., expanded), and so the error occurs in the desugared code.

But what if Rust could automatically infer a type rule for checking and, using only its definition? Then the error could be found in the original program that you wrote (rather than its expansion), and presented as such. This is exactly what we did (albeit for simpler type systems than Rust’s) in our recent PLDI’18 paper Inferring Type Rules for Syntactic Sugar.

We call this process resugaring type rules, akin to our previous work on resugaring evaluation steps and resugaring scope rules. Let’s walk through the resugaring of a type rule for and:

We want to automatically derive a type rule for and, and we want it to be correct. But what does it mean for it to be correct? Well, the meaning of and is defined by its desugaring: α and β is synonymous with if α then β else false. Thus they should have the same type:

(Isurf ||- means “in the surface language type system”, Icore ||- means “in the core language type system”, and the fancy D means “desugar”.)

How can we achieve this? The most straightforward thing to do is to capture the iff with two type rules, one for the forward implication, and one for the reverse:

The first type rule is useful because it says how to type check and in terms of its desugaring. For example, here’s a derivation that true and false has type Bool:

However, while this t-and^→ rule is accurate, it’s not the canonical type rule for and that you’d expect. And worse, it mentions if, so it’s leaking the implementation of the macro!

However, we can automatically construct a better type rule. The trick is to look at a more general derivation. Here’s a generic type derivation for any term α and β:

Notice D_α and D_β: these are holes in the derivation, which ought to be filled in with sub-derivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule, whose premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:

And so we’ve inferred the canonical type rule for and! Notice that (i) it doesn’t mention if at all, so it’s not leaking the inside of the macro, and (ii) it’s guaranteed to be correct, so it’s a good starting point for fixing up a type system to present errors at the right level of abstraction. This was a simple example for illustrative purposes, but we’ve tested the approach on a variety of sugars and type systems.
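To see the difference the inferred rule makes, here is a small Python sketch of a toy checker. It is only an illustration under simplifying assumptions (terms are tuples, the only types are Bool and Num, and the names `desugar`, `type_of`, and `error_message` are hypothetical); it is not the inference algorithm from the paper. The `and` case of `type_of` hard-codes the rule derived above, while `desugar` encodes the definition of the sugar.

```python
def desugar(term):
    """Expand every `and` into the core `if`, following the sugar's definition."""
    tag = term[0]
    if tag == "and":
        _, a, b = term
        return ("if", desugar(a), desugar(b), ("bool", False))
    if tag == "if":
        _, c, t, e = term
        return ("if", desugar(c), desugar(t), desugar(e))
    return term  # literals have nothing to expand


def type_of(term):
    """Type check a term; the `and` case is the resugared surface rule."""
    tag = term[0]
    if tag == "bool":
        return "Bool"
    if tag == "num":
        return "Num"
    if tag == "if":
        _, c, t, e = term
        if type_of(c) != "Bool":
            raise TypeError("if: condition must have type Bool")
        then_type = type_of(t)
        if then_type != type_of(e):
            raise TypeError("if: branches must have the same type")
        return then_type
    if tag == "and":
        _, a, b = term
        if type_of(a) != "Bool" or type_of(b) != "Bool":
            raise TypeError("and: both operands must have type Bool")
        return "Bool"
    raise TypeError(f"unknown term: {tag!r}")


def error_message(term):
    """Return the type error for a term, or None if it type checks."""
    try:
        type_of(term)
        return None
    except TypeError as err:
        return str(err)


good = ("and", ("bool", True), ("bool", False))
bad = ("and", ("num", 1), ("bool", True))

surface_error = error_message(bad)            # checked with the rule for `and`
desugared_error = error_message(desugar(bad)) # checked only after expansion
```

Both routes agree on which programs are well typed, but the error for the bad program mentions `and` when checked at the surface, and mentions `if` when checked after desugaring.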

You can read more in our paper, or play with the implementation.

Picking Colors for Pyret Error Messages

2018-06-11T00:00:00+00:00

Pyret has beautiful error messages, like this one:

Notice the colors. Aren’t they pretty? Whenever a section of code is mentioned in an error message, it gets highlighted with its own color. And we pick colors that are as different as possible, so they don’t get confused with each other. It is useful to keep all of the colors distinct because it provides a very intuitive one-to-one mapping between parts of the code you wrote and the code snippets mentioned in the error messages. If two error messages used the same color for a snippet, it might look at first glance that they were mentioning the same thing.

(We should say up front: while we believe that the approach described in this post should be fairly robust to most forms of color blindness, it’s difficult to reason about so we make no guarantees. However, if two colors are hard to distinguish by sight, you can always hover over one of them to see the matching section of code blink. EDIT: Actually, it’s not as robust as we had hoped. If you know a good approach to this, let us know.)

How did we make them? It should be easy, right? We could have a list of, say, six colors and use those. After all, no error message needs more than six colors.

Except that there might be multiple error messages. In fact, if you have failing test cases, then you’ll have one failure message per failing test case, each with its own highlight, so there is no upper bound on how many colors we need. (Pyret will only show one of these highlights at a time—whichever one you have selected—but even so it’s nice for them to all have different colors.) Thus we’ll need to be able to generate a set of colors on demand.

Ok, so for any given run of the program, we’ll first determine how many colors we need for that run, and then generate that many colors.

Except that it’s difficult to tell how many colors we need beforehand. In fact, Pyret has a REPL, where users can evaluate expressions, which might throw more errors. Thus it’s impossible to know how many colors we’ll need beforehand, because the user can always produce more errors in the REPL.

Therefore, however we pick colors, it must satisfy these two properties:

Distinctness: all of the colors in all of the highlights should be as visually different from each other as possible.
Streaming: we must always be able to pick new colors.

Also, the appearance of the highlights should be pretty uniform; none of them should stand out too much:

Uniformity: all of the colors should have the same saturation (i.e. colorfulness) and lightness as each other. This way none of them blend in with the background color (which is white) or the text color (which is black), or stand out too much.

The Phillotactic Color-Picking Algorithm

Now let’s talk about the algorithm we use!

(Note that this is “phillotactic”, not “phyllotactic”. It has nothing to do with plants.)

To keep uniformity, it makes sense to pick colors from a rainbow. This is a circle in color space, with constant saturation and lightness and varying hue. Which color space should we use? We should not use RGB, because that space doesn’t agree well with how colors actually appear. For example, if we used a rainbow in RGB space, then green would appear far too bright and blue would appear far too dark. Instead, we should use a color space that agrees with how people actually perceive colors. The CIELAB color space is better. It was designed so that if you take the distance between two colors in it, that distance approximately agrees with how different the colors seem when you look at them. (It’s only approximate because, among other things, perceptual color space is non-Euclidean.)

Therefore we’ll pick colors from a circle in CIELAB space. This space has three coordinates: L, for lightness, A for green-red, and B for blue-yellow (hence the LAB). We determined by experimentation that a good lightness to use was 73 out of 100. Given this lightness, we picked the largest saturation possible, using A^2 + B^2 = 40^2.

Now how do we vary the hue? Every color picked needs a new hue, and they need to all be as different as possible. It would be bad, for instance, if we picked 13 colors, and then the 13th color looked just like the 2nd color.

Our solution was to have each color’s hue be the golden angle from the previous hue. From Wikipedia, the golden angle is “the angle subtended by the smaller arc when two arcs that make up a circle are in the golden ratio”. It is also 1/ϕ^2 of a circle, or about 137 degrees.

Thus the phillotactic algorithm keeps track of the number of colors generated so far, and assigns the n’th color (counting from zero) a hue of n times the golden angle. So the first color will have a hue of 0 degrees. The second color will have a hue of 137 degrees. The third will have a hue of 137 * 2 = 274 degrees. The fourth will be 137 * 3 = 411 degrees, which wraps around to 51 degrees. This is a little close to the first color. But even if we knew we’d have four colors total, they’d be at most 90 degrees apart, so 51 isn’t too bad. This trend continues: as we pick more and more colors, they never end up much closer to one another than is necessary.
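The hue sequence is short enough to sketch in a few lines of Python. This is a minimal sketch, not Pyret's actual implementation: the names `hue_degrees` and `lab_color` are hypothetical, and measuring the hue angle from the positive A axis is an assumption. The lightness of 73 and the radius of 40 come from the description above.

```python
import math

PHI = (1 + math.sqrt(5)) / 2      # the golden ratio
GOLDEN_ANGLE = 360 / PHI ** 2     # about 137.5 degrees
LIGHTNESS = 73.0                  # L coordinate, out of 100
RADIUS = 40.0                     # so that A^2 + B^2 = 40^2


def hue_degrees(n):
    """Hue of the n'th color (counting from zero), in degrees."""
    return (n * GOLDEN_ANGLE) % 360


def lab_color(n):
    """CIELAB coordinates (L, A, B) of the n'th color.

    Assumption: hue is the angle measured from the positive A axis.
    """
    hue = math.radians(hue_degrees(n))
    return (LIGHTNESS, RADIUS * math.cos(hue), RADIUS * math.sin(hue))


first_four = [round(hue_degrees(n), 1) for n in range(4)]
```

Because no list of colors is precomputed, this satisfies the streaming property: any index n yields a color. Using the unrounded golden angle, the fourth hue comes out at about 52.5 degrees rather than the 51 obtained from the rounded value of 137.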

There’s a reason that no two colors end up too similar. It follows from the fact that ϕ is the most difficult number to approximate as a fraction. Here’s a proof that colors aren’t similar:

Suppose that the m’th color and the (m+n)’th color end up being very similar. The difference between the m’th and (m+n)’th colors is the same as the difference between the 0’th and the n’th colors. Thus we are supposing that the 0’th color and the n’th color are very similar.

Let’s measure angles in turns, or fractions of 360 degrees. The n’th color’s hue is, by definition, n/ϕ^2 % 1 turns. The 0’th hue is 0. So if these colors are similar, then n/ϕ^2 % 1 ~= 0 (using ~= for “approximately equals”). We can then reason as follows, using in the third step the fact that ϕ^2 - ϕ - 1 = 0 so ϕ^2 = ϕ + 1:

n/ϕ^2 % 1 ~= 0
n/ϕ^2 ~= k    for some integer k
ϕ^2 ~= n/k
1 + ϕ ~= n/k
ϕ ~= (n-k)/k

Now, if n is small, then k is small (because n/k ~= ϕ^2), so (n-k)/k is a fraction with a small denominator. But ϕ is difficult to approximate with fractions, and the smaller the denominator the worse the approximation, so ϕ actually isn’t very close to (n-k)/k, so n/ϕ^2 % 1 actually isn’t very close to 0, so the n’th color actually isn’t very similar to the 0’th color.
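The argument can also be checked numerically. The sketch below is an empirical sanity check under assumed helper names (`hue_turns`, `circular_distance`, `min_gap`, `gap_ratio`), not part of the proof. It measures the smallest gap between any two of the first few hues and compares it with 1/count turns, the gap that perfectly even spacing would give if the number of colors were known in advance.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def hue_turns(n):
    """Hue of the n'th color as a fraction of a full turn."""
    return (n / PHI ** 2) % 1.0


def circular_distance(x, y):
    """Distance between two hues around the circle, in turns."""
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)


def min_gap(count):
    """Smallest distance between any two of the first `count` hues."""
    hues = [hue_turns(n) for n in range(count)]
    return min(
        circular_distance(hues[i], hues[j])
        for i in range(count)
        for j in range(i + 1, count)
    )


def gap_ratio(count):
    """Smallest gap relative to the ideal even spacing of 1/count turns."""
    return min_gap(count) * count


ratios = {count: round(gap_ratio(count), 2) for count in (4, 13, 50)}
```

A ratio of 1 would mean the colors are as spread out as if we had known the count beforehand. The ratios stay bounded away from zero as the count grows, which is the behavior the proof predicts.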

And that’s why the phillotactic colors work.

Can We Crowdsource Language Design?

2017-07-06T00:00:00+00:00

Programming languages are user interfaces. There are several ways of making decisions when designing user interfaces, including:

a small number of designers make all the decisions, or
user studies and feedback are used to make decisions.

Most programming languages have been designed by a Benevolent Dictator for Life or a committee, which corresponds to the first model. What happens if we try out the second?

We decided to explore this question. To get a large enough number of answers (and to engage in rapid experimentation), we decided to conduct surveys on Amazon Mechanical Turk, a forum known to have many technically literate people. We studied a wide range of common programming language features, from numbers to scope to objects to overloading behavior.

We applied two concrete measures to the survey results:

Consistency: whether individuals answer similar questions the same way, and
Consensus: whether we find similar answers across individuals.

Observe that a high value of either one has clear implications for language design, and if both are high, that suggests we have zeroed in on a “most natural” language.

As Betteridge’s Law suggests, we found neither. Indeed,

A surprising percentage of workers expected some kind of dynamic scope (83.9%).
Some workers thought that record access would distribute over the field name expression.
Some workers ignored type annotations on functions.
Over the field and method questions we asked on objects, no worker expected Java’s semantics across all three.

These and other findings are explored in detail in our paper.

Crowdsourcing User Studies for Formal Methods

2017-07-03T00:00:00+00:00

For decades, we have neglected performing serious user studies of formal-methods tools. This is now starting to change. An earlier post introduces our new work in this area.

That study works with students in an upper-level class, who are a fairly good proxy for some developers (and are anyway an audience we have good access to). Unfortunately, student populations are problematic for several reasons:

There are only so many students in a class. There may not be enough to obtain statistical strength, especially on designs that require A/B testing and the like.
The class is offered only so often. It may take a whole year between studies. (This is a common problem in computing education research.)
As students progress through a class, it’s hard to “rewind” them and study their responses at an earlier stage in their learning.

And so on. It would be helpful if we could obtain large numbers of users quickly, relatively cheaply, and repeatedly.

This naturally suggests crowdsourcing. Unfortunately, the tasks we are examining involve using tools based on formal logic, not identifying birds or picking Web site colors (or solving CAPTCHAs…). That would seem to greatly limit the utility of crowd-workers on popular sites like Mechanical Turk.

In reality, this depends on how the problem is phrased. If we view it as “Can we find lots of Turkers with knowledge of Promela (or Alloy or …)?”, the answer is pretty negative. If, however, we can rework the problems somewhat so the question is “Can we get people to work on a puzzle?”, we can find many, many more workers. That is, sometimes the problem is more one of vocabulary (and in particular, the use of specific formal methods languages) than of raw ability.

Concretely, we have taken the following steps:

Adapt problems from being questions about Alloy specifications to being phrased as logic “puzzles”.
Provide an initial training phase to make sure workers understand what we’re after.
Follow that with an evaluation phase to ensure that they “got the idea”. Only consider responses from those workers who score at a high enough threshold on evaluation.
Only then conduct the actual study.

Observe that even if we don’t want to trust the final results obtained from crowdsourcing, there are still uses for this process. Designing a good study requires several rounds of prototyping: even simple wording choices can have huge and unforeseen (negative) consequences. The more rounds we get to test a study, the better it will come out. Therefore, the crowd is useful at least to prototype and refine a study before unleashing it on a more qualified, harder-to-find audience — a group that, almost by definition, you do not want to waste on a first-round study prototype.

For more information, see our paper. We find fairly useful results using workers on Mechanical Turk. In many cases the findings there correspond with those we found with class students.

User Studies of Principled Model Finder Output

2017-07-01T00:00:00+00:00

For decades, formal-methods tools have largely been evaluated on their correctness, completeness, and mathematical foundations while side-stepping or hand-waving questions of usability. As a result, tools like model checkers, model finders, and proof assistants can require years of expertise to negotiate, leaving knowledgeable but uninitiated potential users at a loss. This state of affairs must change!

One class of formal tool, model finders, provides concrete instances of a specification, which can guide a user’s intuition or witness the failure of desired properties. But are the examples produced actually helpful? Which examples ought to be shown first? How should they be presented, and what supplementary information can aid comprehension? Indeed, could they even hinder understanding?

We’ve set out to answer these questions via disciplined user-studies. Where can we find participants for these studies? Ideally, we would survey experts. Unfortunately, it has been challenging to do so in the quantities needed for statistical power. As an alternative, we have begun to use formal methods students in Brown’s upper-level Logic for Systems class. The course begins with Alloy, a popular model-finding tool, so students are well suited to participate in basic studies. With this population, we have found some surprising results that call into question some intuitively appealing answers to (e.g.) the example-selection question.

For more information, see our paper.

Okay, that’s student populations. But there are only so many students in a class, and they take the class only so often, and it’s hard to “rewind” them to an earlier point in a course. Are there audiences we can use that don’t have these problems? Stay tuned for our next post.

A Third Perspective on Hygiene

2017-06-20T00:00:00+00:00

In the last post, we talked about scope inference: automatically inferring scoping rules for syntactic sugar. In this post we’ll talk about the perhaps surprising connection between scope inference and hygiene.

Hygiene can be viewed from a number of perspectives and defined in a number of ways. We’ll talk about two pre-existing perspectives, and then give a third perspective that comes from having scope inference.

First Perspective

The traditional perspective on hygiene (going all the way back to Kohlbecker in ’86) defines it by what shouldn’t happen when you expand macros / syntactic sugar. To paraphrase the idea:

Expanding syntactic sugar should not cause accidental variable capture. For instance, a variable used in a sugar should not come to refer to a variable declared in the program simply because it has the same name.

Thus, hygiene in this tradition is defined by a negative.

It has also traditionally focused strongly on algorithms. One would expect papers on hygiene to state a definition of hygiene, give an algorithm for macro expansion, and then prove that the algorithm obeys these properties. But this last step, the proof, is usually suspiciously missing. At least part of the reason for this, we suspect, is that definitions of hygiene have been too informal to be used in a proof.

And a definition of hygiene has been surprisingly hard to pin down precisely. In 2015, a full 29 years after Kohlbecker’s seminal work, Adams writes that “Hygiene is an essential aspect of Scheme’s macro system that prevents unintended variable capture. However, previous work on hygiene has focused on algorithmic implementation rather than precise, mathematical definition of what constitutes hygiene.” He goes on to discuss “observed-binder hygiene”, which is “not often discussed but is implicitly averted by traditional hygiene algorithms”. The point we’re trying to make is that the traditional view on hygiene is subtle.

Second Perspective

There is a much cleaner definition of hygiene, however, that is more of a positive statement (and subsumes the preceding issues):

If two programs are α-equivalent (that is, they are the same up to renaming variables), then when you desugar them (that is, expand their sugars) they should still be α-equivalent.

Unfortunately, this definition only makes sense if we have scope defined on both the full and base languages. Most hygiene systems can’t use this definition, however, because the full language is not usually given explicit scoping rules; rather, it’s defined implicitly through translation into the base language.

Recently, Herman and Wand have advocated for specifying the scoping rules for the full language (in addition to the base language), and then verifying that this property holds. If the property doesn’t hold, then either the scope specification or the sugar definitions are incorrect. This is, however, an onerous demand to place on the developer of syntactic sugar, especially since scope can be surprisingly tricky to define precisely.
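The notion of α-equivalence at the heart of this definition can be sketched in Python for a tiny lambda calculus. This is a minimal sketch under assumptions: terms are tuples, and the names `to_de_bruijn` and `alpha_equivalent` are hypothetical. It uses the standard technique of replacing each bound variable with the distance to its binder, so that two terms are α-equivalent exactly when their nameless forms are equal.

```python
def to_de_bruijn(term, bound=()):
    """Replace bound variable names with the distance to their binder.

    `bound` lists the enclosing parameters, innermost first.
    """
    tag = term[0]
    if tag == "var":
        name = term[1]
        if name in bound:
            # tuple.index finds the first match, i.e. the innermost binder.
            return ("bound", bound.index(name))
        return ("free", name)
    if tag == "lam":
        _, param, body = term
        return ("lam", to_de_bruijn(body, (param,) + bound))
    if tag == "app":
        _, function, argument = term
        return ("app", to_de_bruijn(function, bound), to_de_bruijn(argument, bound))
    raise ValueError(f"unknown term: {tag!r}")


def alpha_equivalent(left, right):
    """Two terms are α-equivalent if they are equal up to renaming bound variables."""
    return to_de_bruijn(left) == to_de_bruijn(right)


identity_x = ("lam", "x", ("var", "x"))
identity_y = ("lam", "y", ("var", "y"))
constant_y = ("lam", "x", ("var", "y"))   # refers to a free variable y
```

Here `identity_x` and `identity_y` differ only in the name of the bound variable, so they are α-equivalent, whereas `constant_y` is not equivalent to either because its y is free. Checking the hygiene property would mean running such a comparison both before and after desugaring.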

Third Perspective

Scope inference gives a third perspective. Instead of requiring authors of syntactic sugar to specify the scoping rules for the full language, we give an algorithm that infers them. We have to then define what it means for this algorithm to work “correctly”.

We say that an inferred scope is correct precisely if the second definition of hygiene holds: that is, if desugaring preserves α-equivalence. Thus, our scope inference algorithm finds scoping rules such that this property holds, and if no such scoping rules exist then it fails. (And if there are multiple sets of scoping rules to choose from, it chooses the ones that put as few names in scope as possible.)

An analogy would be useful here. Think about type inference: it finds type annotations that could be put in your program such that it would type check, and if there are multiple options then it picks the most general one. Scope inference similarly finds scoping rules for the full language such that desugaring preserves α-equivalence, and if there are multiple options then it picks the one that puts the fewest names in scope.

This new perspective on hygiene allows us to shift the focus from expansion algorithms to the sugars themselves. When your focus is on an expansion algorithm, you have to deal with whatever syntactic sugar is thrown your way. If a sugar introduces unbound identifiers, then the programmer (who uses the macro) just has to deal with it. Likewise, if a sugar uses scope inconsistently, treating a variable either as a declaration or a reference depending on the phase of the moon, the programmer just has to deal with it. In contrast, since we infer scope for the full language, we can check whether a sugar would do one of these bad things, and if so we can call the sugar unhygienic.

To be more concrete, consider a desugaring rule for bad(x, expression) that sometimes expands to lambda x: expression and sometimes expands to just expression, depending on the context. Our algorithm would infer from the first rewrite that the expression must be in scope of x. However, this would mean that the expression was allowed to contain references to x, which would become unbound when the second rewrite was used! Our algorithm detects this and rejects this desugaring rule. Traditional macro systems allow this, and only detect the potential unbound identifier problem when it actually occurs. The paper contains a more interesting example called “lambda flip flop” that is rejected because it uses scope inconsistently.
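The symptom described here can be reproduced with a few lines of Python. This sketch only demonstrates the unbound-identifier problem on one example term; it is not the scope inference algorithm, and the names `free_vars` and `desugar_bad` (with its `wrap_in_lambda` flag standing in for "depending on the context") are hypothetical.

```python
def free_vars(term):
    """Names that a term refers to without binding them itself."""
    tag = term[0]
    if tag == "var":
        return {term[1]}
    if tag == "lam":
        _, param, body = term
        return free_vars(body) - {param}
    if tag == "app":
        _, function, argument = term
        return free_vars(function) | free_vars(argument)
    raise ValueError(f"unknown term: {tag!r}")


def desugar_bad(x, expression, wrap_in_lambda):
    """The two possible expansions of bad(x, expression)."""
    return ("lam", x, expression) if wrap_in_lambda else expression


# A body that refers to x, as the first rewrite suggests it may.
expression = ("app", ("var", "f"), ("var", "x"))

first_rewrite = desugar_bad("x", expression, wrap_in_lambda=True)
second_rewrite = desugar_bad("x", expression, wrap_in_lambda=False)

newly_unbound = free_vars(second_rewrite) - free_vars(first_rewrite)
```

Under the first rewrite the reference to x is bound by the lambda, but under the second the same reference is left dangling, so `newly_unbound` contains x.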

Altogether, scope inference rules out bad sugars that cannot be made hygienic, but if there is any way to make a sugar hygienic it will find it.

Again, here’s the paper and implementation, if you want to read more or try it out.

Scope Inference, a.k.a. Resugaring Scope Rules

2017-06-12T00:00:00+00:00

This is the second post in a series about resugaring. It focuses on resugaring scope rules. See also our posts on resugaring evaluation steps and resugaring type rules.

Many programming languages have syntactic sugar. We would hazard to guess that most modern languages do. This is when a piece of syntax in a language is defined in terms of the rest of the language. As a simple example, x += expression might be shorthand for x = x + expression. A more interesting sugar is Pyret’s for loops. For example:

for fold(p from 1, n from range(1, 6)):
  p * n
end

computes 5 factorial, which is 120. This for is a piece of sugar, though, and the above code is secretly shorthand for:

fold(lam(p, n): p * n end, 1, range(1, 6))

Sugars like this are great for language development: they let you grow a language without making it more complicated.
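For readers who do not know Pyret, here is a rough Python analogue of the two forms. It is only an analogy: the `fold` helper is a hypothetical stand-in for Pyret's fold, written with `functools.reduce`, and the explicit loop plays the role of the sugared for expression.

```python
from functools import reduce


def fold(function, base, items):
    """Stand-in for Pyret's fold: combine items left to right, starting from base."""
    return reduce(function, items, base)


def product_with_loop():
    """The 'sugared' view: p starts at 1, n ranges over 1 to 5."""
    p = 1
    for n in range(1, 6):
        p = p * n
    return p


# The 'desugared' view: a function of (p, n), an initial value, and a sequence.
product_with_fold = fold(lambda p, n: p * n, 1, range(1, 6))
```

Both forms compute 5 factorial. Note how the initial value and the sequence sit next to the variables they feed in the loop, but are separated from them in the call to `fold`.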

Languages also have scoping rules that say where variables are in scope. For instance, the scoping rules should say that a function’s parameters are in scope in the body of the function, but not in scope outside of the function. Many nice features in editors depend on these scoping rules. For instance, if you use autocomplete for variable names, it should only suggest variables that are in scope. Similarly, refactoring tools that rename variables need to know what is in scope.

This breaks down in the presence of syntactic sugar, though: how can your editor tell what the scoping rules for a sugar are?

The usual approach is to write down all of the scoping rules for all of the sugars. But this is error prone (you need to make sure that what you write down matches the actual behavior of the sugars), and tedious. It also goes against a general principle we hold: to add a sugar to a language, you should just add the sugar to the language. You shouldn’t also need to update your scoping rules, or update your type system, or update your debugger: that should all be done automatically.

We’ve just published a paper at ICFP that shows how to automatically infer the scoping rules for a piece of sugar, like the for example above. Here is the paper and implementation. This is the latest work we’ve done with the goal of making the above principle a reality. Earlier, we showed how to automatically find evaluation steps that show how your program runs in the presence of syntactic sugar.

How it Works

Our algorithm needs two things to run:

The definitions of syntactic sugar. These are given as pattern-based rewrite rules, saying what patterns match and what they should be rewritten to.
The scoping rules for the base (i.e. core) language.

It then automatically infers scoping rules for the full language, which includes the sugars. The final step to make this useful would be to add these inferred scoping rules to editors that can use them, such as Sublime, Atom, CodeMirror, etc.

For example, we have tested it on Pyret (as well as other languages). We gave it scoping rules for Pyret’s base language (which included things like lambdas and function application), and we gave it rules for how for desugars, and it determined the scoping rules of for. In particular:

The variables declared in each from clause are visible in the body, but not in the argument of any from clause.
If two from clauses both declare the same variable, the second one shadows the first one.

This second rule is exactly the sort of thing that is easy to overlook if you try to write these rules down by hand, resulting in obscure bugs (e.g. when doing automatic variable refactoring).

Here are the paper and implementation, if you want to read more or try it out.

The PerMission Store

2017-02-21T00:00:00+00:00

This is Part 2 of our series on helping users manage app permissions. Click here to read Part 1.

As discussed in Part 1 of this series, one type of privacy decision users have to make is which app to install. Typically, when choosing an app, users pick from the first few apps that come up when they search a keyword in their app store, so the app store plays a big role in which apps users download.

Unfortunately, most major app stores don’t help users make this decision in a privacy-minded way. Because these stores don’t factor privacy into their ranking, the top few search results probably aren’t the most privacy-friendly, so users are already picking from a problematic pool. Furthermore, users rely on information in the app store to choose from within that limited pool, and most app stores offer very little in the way of privacy information.

We’ve built a marketplace, the PerMission Store, that tackles both the ranking and user information concerns by adding one key component: permission-specific ratings. These are user ratings, much like the star ratings in the Google Play store, but they are specifically about an app’s permissions.¹

To help users find more privacy-friendly apps, the privacy ratings are incorporated into the PerMission Store’s ranking mechanism, so that apps with better privacy scores are more likely to appear in the top hits for a given search. (We also consider factors like the star rating in our ranking, so users are still getting useful apps.) So users are selecting from a more privacy-friendly pool of apps right off the bat.
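One way to picture this kind of blended ranking is the sketch below. The App fields, the 1-to-5 scales, the app names, and the privacy_weight parameter are all assumptions made for illustration; the store’s actual ranking formula is not spelled out in this post.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    stars: float    # general star rating (assumed 1-5 scale)
    privacy: float  # permission-specific rating (assumed 1-5 scale)


def rank(apps, privacy_weight=0.5):
    """Order apps by a weighted blend of privacy and star ratings.

    privacy_weight is a hypothetical knob: 0 ignores privacy entirely,
    1 ranks on privacy alone.
    """
    def score(app):
        return privacy_weight * app.privacy + (1 - privacy_weight) * app.stars

    return sorted(apps, key=score, reverse=True)


# Two made-up search results for the same keyword.
results = [App("FlashlightA", stars=4.5, privacy=2.0),
           App("FlashlightB", stars=4.0, privacy=4.5)]
print([app.name for app in rank(results)])                    # privacy-aware order
print([app.name for app in rank(results, privacy_weight=0)])  # stars-only order
```

With privacy in the mix, the slightly lower-starred but more privacy-respecting app moves to the top, which is the effect described above.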

Apps’ privacy ratings are also displayed in an easy-to-understand way, alongside other basic information like star rating and developer. This makes it straightforward for users to consider privacy along with other key factors when deciding which app to install.

Incorporating privacy into the store itself makes it so that choosing privacy-friendly apps is as natural as choosing useful apps.

The PerMission Store is currently available as an Android app and can be found on Google Play.

A more detailed discussion of the PerMission Store can be found in Section 3.1 of our paper.

¹ As a bootstrapping mechanism, we’ve collected ratings for a couple thousand apps from Mechanical Turk. Ultimately, though, we expect the ratings to come from in-the-wild users.

Examining the Privacy Decisions Facing Users

2017-01-25T00:00:00+00:00

This is Part 1 of our series on helping users manage app permissions. Click here to read Part 2.

It probably comes as no surprise to you that users are taking their privacy in their hands every time they install or use apps on their smartphones (or tablets, or watches, or cars, or…). This raises the question, what kinds of privacy decisions are users actually making? And how can we help them with those decisions?

At first blush, users can manage privacy in two ways: by choosing which apps to install, and by managing their apps’ permissions once they’ve installed them. For the first type of decision, users could benefit from a privacy-conscious app store to help them find more privacy-respecting apps. For the second type of decision, users would be better served by an assistant that helps them decide which permissions to grant.

Users can only make installation decisions when they actually have a meaningful choice between different apps. If you’re looking for Facebook, there really aren’t any other apps that you could use instead. This left us wondering if users ever have a meaningful choice between different apps, or whether they are generally looking for a specific app.

To explore this question, we surveyed Mechanical Turk workers about 66 different Android apps, asking whether they thought the app could be replaced by a different one. The apps covered a broad range of functionality, from weather apps, to games, to financial services.

It turns out that apps vary greatly in their “replaceability,” and, rather than falling cleanly into “replaceable” and “unique” groups, they run along a spectrum between the two. At one end of the spectrum you have apps like Instagram, which fewer than 20% of workers felt could be replaced. On the other end of the spectrum are apps like Waze, which 100% of workers felt was replaceable. In the middle are apps whose replaceability depends on which features you’re interested in. For example, take an app like Strava, which lets you track your physical activity and compete with friends. If you only want to track yourself, it could be replaced by something like MapMyRide, but if you’re competing with friends who all use Strava, you’re pretty much stuck with Strava.

Regardless of exactly which apps fall where on the spectrum, though, there are replaceable apps, so users are making real decisions about which apps to install. And, for irreplaceable apps, they are also having to decide how to manage those apps’ permissions. These two types of decisions require two approaches to assisting users. A privacy-aware marketplace would aid users with installation decisions by helping them find more privacy-respecting apps, while a privacy assistant could help users manage their apps’ permissions.

Click here to read about our privacy-aware marketplace, the PerMission Store, and stay tuned for our upcoming post on a privacy assistant!

A more detailed discussion of this study can be found in Section 2 of our paper.

The Pyret Programming Language: Why Pyret?

2016-06-26T00:00:00+00:00

We need better languages for introductory computing. A good introductory language makes good compromises between expressiveness and performance, and between simplicity and feature-richness. Pyret is our evolving experiment in this space.

Since we expect our answer to this question will evolve over time, we’ve picked a place for our case for the language to live, and will update it over time:

The Pyret Code; or A Rationale for The Pyret Programming Language

The first version answers a few questions that we expect many people have when considering languages in general and languages for education in particular:

Why not just use Java, Python, Racket, OCaml, or Haskell?
Will Pyret ever be a full-fledged programming language?
But there are lots of kinds of “education”!
What are some ways the educational philosophy influences the language?

In this post, it’s worth answering one more immediate question:

What’s going on right now, and what’s next?

We are currently hard at work on three very important features:

Support for static typing. Pyret will have a conventional type system with tagged unions and a type checker, resulting in straightforward type errors without the complications associated with type inference algorithms. We have carefully designed Pyret to always be typeable, but our earlier type systems were not good enough. We’re pretty happy with how this one is going.
Tables are a critical type for storing real-world data. Pyret is adding linguistic and library support for working effectively with tables, which PAPL will use to expose students to “database” thinking from early on.
Our model for interactive computation is based on the “world” model. We are currently revising and updating it in a few ways that will help it better serve our new educational programs.

On the educational side, Pyret is already used by the Bootstrap project. We are now developing three new curricula for Bootstrap:

A CS1 curriculum, corresponding to a standard introduction to computer science, but with several twists based on our pedagogy and materials.
A CS Principles curriculum, for the new US College Board Advanced Placement exam.
A physics/modeling curriculum, to help teach students physics and modeling through the medium of programming.

If you’d like to stay abreast of our developments or get involved in our discussions, please come on board!

Resugaring Evaluation Sequences

2016-02-06T00:00:00+00:00

This is the first post in a series about resugaring. It focuses on resugaring evaluation sequences. See also our later posts on resugaring scope rules and resugaring type rules.

A lot of programming languages are defined in terms of syntactic sugar. This has many advantages, but also a couple of drawbacks. In this post, I’m going to tell you about one of these drawbacks, and the solution we found for it. First, though, let me describe what syntactic sugar is and why it’s used.

Syntactic sugar is when you define a piece of syntax in a language in terms of the rest of the language. You’re probably already familiar with many examples. For instance, in Python, x + y is syntactic sugar for x.__add__(y). I’m going to use the word “desugaring” to mean the expansion of syntactic sugar, so I’ll say that x + y desugars to x.__add__(y). Along the same lines, in Haskell, [f x | x <- lst] desugars to map f lst. (Well, I’m simplifying a little bit; the full desugaring is given by the Haskell 98 spec.)
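To see the Python example in action, here is a minimal sketch. The Meters class and its calls log are made up for illustration. As with the Haskell example, this simplifies a little: Python actually looks the method up on the type, and falls back to the right operand’s __radd__ when __add__ returns NotImplemented.

```python
class Meters:
    """A tiny numeric wrapper that records every call to __add__."""
    calls = []

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        Meters.calls.append((self.value, other.value))
        return Meters(self.value + other.value)


x, y = Meters(2), Meters(3)
sugared = x + y            # surface syntax
desugared = x.__add__(y)   # what the sugar expands to (simplified)

print(sugared.value, desugared.value)
print(Meters.calls)  # both forms above went through __add__
```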

As a programming language researcher I love syntactic sugar, and you should too. It splits a language into two parts: a big “surface” language that has the sugar, and a much smaller “core” language that lacks it. This separation lets programmers use the surface language that has all of the features they know and love, while letting tools work over the much simpler core language, which lets the tools themselves be simpler and more robust.

There’s a problem, though (every blog post needs a problem). What happens when a tool, which has been working over the core language, tries to show code to the programmer, who has been working over the surface? Let’s zoom in on one instance of this problem. Say you write a little snippet of code, like so: (This code is written in an old version of Pyret; it should be readable even if you don’t know the language.)

my-list = [2]
cases(List) my-list:
  | empty() => print("empty")
  | link(something, _) =>
    print("not empty")
end

And now say you’d like to see how this code runs. That is, you’d like to see an evaluation sequence (a.k.a. an execution trace) of this program. Or maybe you already know what it will do, but you’re teaching students, and would like to show them how it will run. Well, what actually happens when you run this code is that it is first desugared into the core, like so:

my-list = list.["link"](2, list.["empty"])
block:
  tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({
    "empty" : fun(): print("empty") end,
    "link" : fun(something, _):
      print("not empty") end
  },
  fun(): raise("cases: no cases matched") end)
end

This core code is then run (each block of code is the next evaluation step):

my-list = obj.["link"](2, list.["empty"])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = obj.["link"](2, list.["empty"])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = <func>(2, list.["empty"])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = <func>(2, obj.["empty"])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = <func>(2, obj.["empty"])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = <func>(2, [])
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

my-list = [2]
block:
tempMODRIOUJ :: List = my-list
  tempMODRIOUJ.["_match"]({"empty" : fun():   print("empty") end,
  "link" : fun(something, _):   print("not empty") end}, fun():
  raise("cases: no cases matched") end)
  end

tempMODRIOUJ :: List = [2]
tempMODRIOUJ.["_match"]({"empty" : fun(): print("empty") end, "link" :
fun(something, _): print("not empty") end}, fun(): raise("cases: no
cases matched") end)

[2].["_match"]({"empty" : fun(): print("empty") end, "link" :
fun(something, _): print("not empty") end}, fun(): raise("cases: no
cases matched") end)

[2].["_match"]({"empty" : fun(): print("empty") end, "link" :
fun(something, _): print("not empty") end}, fun(): raise("cases: no
cases matched") end)

<func>({"empty" : fun(): end, "link" : fun(something, _): print("not
empty") end}, fun(): raise("cases: no cases matched") end)

<func>({"empty" : fun(): end, "link" : fun(): end}, fun():
raise("cases: no cases matched") end)

<func>(obj, fun(): raise("cases: no cases matched") end)

<func>(obj, fun(): end)

<func>("not empty")

"not empty"

But that wasn’t terribly helpful, was it? Sometimes you want to see exactly what a program is doing in all its gory detail (along the same lines, it’s occasionally helpful to see the assembly code a program is compiling to), but most of the time it would be nicer if you could see things in terms of the syntax you wrote the program with! In this particular example, it would be much nicer to see:

my-list = [2]
cases(List) my-list:
  | empty() => print("empty")
  | link(something, _) =>
    print("not empty")
end

my-list = [2]
cases(List) my-list:
  | empty() => print("empty")
  | link(something, _) =>
    print("not empty")
end

cases(List) [2]:
| empty() => print("empty")
| link(something, _) =>
  print("not empty")
end

<func>("not empty")

"not empty"

(You might have noticed that the first step got repeated for what looks like no reason. What happened there is that the code [2] was evaluated to an actual list, which also prints itself as [2].)

So we built a tool that does precisely this. It turns core evaluation sequences into surface evaluation sequences. We call the process resugaring, because it’s the opposite of desugaring: we’re adding the syntactic sugar back into your program. The above example is actual output from the tool, for an old version of Pyret. I’m currently working on a version for modern Pyret.

Resugaring Explained

I always find it helpful to introduce a diagram when explaining resugaring. On the right is the core evaluation sequence, which is the sequence of steps that the program takes when it actually runs. And on the left is the surface evaluation sequence, which is what you get when you try to resugar each step in the core evaluation sequence. As a special case, the first step on the left is the original program.

Here’s an example. The starting program will be not(true) or true, where not is in the core language, but or is defined as a piece of sugar:

x or y    ==desugar==>  let t = x in
                          if t then t else y

And here’s the diagram:

The steps (downarrows) in the core evaluation sequence are ground truth: they are what happens when you actually run the program. In contrast, the steps in the surface evaluation sequence are made up; the whole surface evaluation sequence is an attempt at reconstructing a nice evaluation sequence by resugaring each of the core steps. Notice that the third core term fails to resugar. This is because there’s no good way to represent it in terms of or.

Formal Properties of Resugaring

It’s no good to build a tool without having a precise idea of what it’s supposed to do. To this end, we came up with three properties that (we think) capture exactly what it means for a resugared evaluation sequence to be correct. It will help to look at the diagram above when thinking about these properties.

Emulation says that every term on the left should desugar to the term to its right. This expresses the idea that the resugared terms can’t lie about the term they’re supposed to represent. Another way to think about this is that desugaring and resugaring are inverses.
Abstraction says that the surface evaluation sequence on the left should show a sugar precisely when the programmer used it. So, for example, it should show things using or and not let, because the original program used or but not let.
Coverage says that you should show as many steps as possible. Otherwise, technically the surface evaluation sequence could just be empty! That would satisfy both Emulation and Abstraction, which only say things about the steps that are shown.

We’ve proved that our resugaring algorithm obeys Emulation and Abstraction, and given some good empirical evidence that it obeys Coverage too.
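To make the or example and the Emulation property concrete, here is a much-simplified sketch. It is not the tool’s algorithm: the tuple encoding of terms, the "#or" tag, and the hand-written core steps are all assumptions for illustration, and it ignores hygiene entirely. The idea is that desugaring tags the core nodes it creates, and a tagged node can only be shown on the surface if it still has exactly the shape of the sugar’s expansion.

```python
TAG = "#or"  # marks core nodes that desugaring created for the `or` sugar


def desugar(term):
    """Expand `x or y` into `let t = x in if t then t else y`."""
    if not isinstance(term, tuple):
        return term
    if term[0] == "or":
        _, x, y = term
        body = ("if", TAG, ("var", "t"), ("var", "t"), desugar(y))
        return ("let", TAG, "t", desugar(x), body)
    return (term[0],) + tuple(desugar(t) for t in term[1:])


def resugar(term):
    """Return the surface form of a core term, or None if it has none."""
    if not isinstance(term, tuple):
        return term
    if len(term) > 1 and term[1] == TAG:
        # A tagged node may only be shown if it is exactly the sugar's expansion.
        t = ("var", "t")
        if (term[0] == "let" and isinstance(term[4], tuple)
                and term[4][:4] == ("if", TAG, t, t)):
            x, y = resugar(term[3]), resugar(term[4][4])
            return None if x is None or y is None else ("or", x, y)
        return None
    parts = [resugar(t) for t in term[1:]]
    return None if any(p is None for p in parts) else (term[0],) + tuple(parts)


program = ("or", ("not", True), True)  # not(true) or true
core_steps = [  # steps after the first are written by hand, not computed
    desugar(program),
    ("let", TAG, "t", False, ("if", TAG, ("var", "t"), ("var", "t"), True)),
    ("if", TAG, False, False, True),   # after substituting t := false
    True,
]
surface_steps = [resugar(core) for core in core_steps]

for core, surface in zip(core_steps, surface_steps):
    # Emulation: every surface term that is shown desugars to its core term.
    assert surface is None or desugar(surface) == core

print(surface_steps)
```

The third entry of surface_steps comes out as None: the leftover if carries the tag but is no longer a complete expansion of or, so that step is skipped rather than shown in terms the programmer never wrote.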

I’ve only just introduced resugaring. If you’d like to read more, see the paper, and the followup that deals with hygiene (e.g., preventing variable capture).

Slimming Languages by Reducing Sugar

2016-01-08T00:00:00+00:00

JavaScript is a crazy language. It’s defined by 250 pages of English prose, and even the parts of the language that ought to be simple, like addition and variable scope, are very complicated. We showed before how to tackle this problem using λ_s5, which is an example of what’s called a tested semantics.

You can read about λ_s5 at the above link. But the basic idea is that λ_s5 has two parts:

A small core language that captures the essential parts of JavaScript, without all of its foibles, and
A desugaring function that translates the full language down to this small core.

(We typically call this core language λ_s5, even though technically speaking it’s only part of what makes up λ_s5.)

These two components together give us an implementation of JavaScript: to run a program, you desugar it to λ_s5, and then run that program. And with this implementation, we can run JavaScript’s conformance test suite to check that λ_s5 is accurate: this is why it’s called a tested semantics. And lo, λ_s5 passes the relevant portion of the test262 conformance suite.

The Problem

Every blog post needs a problem, though. The problem with λ_s5 lies in desugaring. We just stated that JavaScript is complicated, while the core language for λ_s5 is simple. This means that the complications of JavaScript must be dealt with not in the core language, but instead in desugaring. Take an illustrative example. Here’s a couple of innocent lines of JavaScript:

function id(x) {
    return x;
}

These couple of lines desugar into the following λ_s5 code:

{let
 (%context = %strictContext)
 { %defineGlobalVar(%context, "id");
  {let
   (#strict = true)
   {"use strict";
    {let
     (%fobj4 =
       {let
         (%prototype2 = {[#proto: %ObjectProto,
                          #class: "Object",
                          #extensible: true,]
                         'constructor' : {#value (undefined) ,
                                          #writable true ,
                                          #configurable false}})
         {let
          (%parent = %context)
          {let
           (%thisfunc3 =
             {[#proto: %FunctionProto,
               #code: func(%this , %args)
                     { %args[delete "%new"];
                       label %ret :
                       { {let
                          (%this = %resolveThis(#strict,
                                                %this))
                          {let
                           (%context =
                             {let
                               (%x1 = %args
                                        ["0" , null])
                               {[#proto: %parent,
                                 #class: "Object",
                                 #extensible: true,]
                                'arguments' : {#value (%args) ,
                                         #writable true ,
                                         #configurable false},
                                'x' : {#getter func
                                         (this , args)
                                         {label %ret :
                                         {break %ret %x1}} ,
                                       #setter func
                                         (this , args)
                                         {label %ret :
                                         {break %ret %x1 := args
                                         ["0" , {[#proto: %ArrayProto,
                                         #class: "Array",
                                         #extensible: true,]}]}}}}})
                           {break %ret %context["x" , {[#proto: null,
                                                  #class: "Object",
                                                  #extensible: true,]}];
                            undefined}}}}},
               #class: "Function",
               #extensible: true,]
              'prototype' : {#value (%prototype2) ,
                             #writable true ,
                             #configurable true},
              'length' : {#value (1.) ,
                          #writable true ,
                          #configurable true},
              'caller' : {#getter %ThrowTypeError ,
                          #setter %ThrowTypeError},
              'arguments' : {#getter %ThrowTypeError ,
                             #setter %ThrowTypeError}})
           { %prototype2["constructor" = %thisfunc3 , null];
             %thisfunc3}}}})
     %context["id" = %fobj4 ,
              {[#proto: null, #class: "Object", #extensible: true,]
               '0' : {#value (%fobj4) ,
                      #writable true ,
                      #configurable true}}]}}}}}

This is a bit much. It’s hard to read, and it’s hard for tools to process. But more to the point, λ_s5 is meant to be used by researchers, and this code bloat has stood in the way of researchers trying to adopt it. You can imagine that if you’re trying to write a tool that works over λ_s5 code, and there’s a bug in your tool and you need to debug it, and you have to wade through that much code just for the simplest of examples, it’s a bit of a nightmare.

The Ordinary Solution

So, there’s too much code. Fortunately there are well-known solutions to this problem. We implemented a number of standard compiler optimization techniques to shrink the generated λ_s5 code, while preserving its semantics. Here’s a boring list of the Semantics-Preserving optimizations we used:

Dead-code elimination
Constant folding
Constant propagation
Alias propagation
Assignment conversion
Function inlining
Infer type & eliminate static checks
Clean up unused environment bindings

Most of these are standard textbook optimizations, though the last two are specific to λ_s5. Anyhow, we did all this and got… 5-10% code shrinkage.
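For readers who haven’t seen these optimizations before, here is a minimal sketch of two of them, constant folding and dead-code elimination, over a toy expression language. The tuple-based terms and the size measure are assumptions for illustration; this is not λ_s5 and not our implementation.

```python
def simplify(e):
    """Fold constant additions and drop branches that can never run."""
    kind = e[0]
    if kind == "add":
        a, b = simplify(e[1]), simplify(e[2])
        if a[0] == "num" and b[0] == "num":
            return ("num", a[1] + b[1])          # constant folding
        return ("add", a, b)
    if kind == "if":
        cond = simplify(e[1])
        if cond[0] == "bool":                    # dead-code elimination
            return simplify(e[2]) if cond[1] else simplify(e[3])
        return ("if", cond, simplify(e[2]), simplify(e[3]))
    return e                                     # num, bool, var: nothing to do


def size(e):
    """Count the nodes in a term, as a crude measure of code size."""
    return 1 + sum(size(child) for child in e[1:] if isinstance(child, tuple))


before = ("if", ("bool", False),
          ("var", "unused"),
          ("add", ("add", ("num", 1), ("num", 2)), ("var", "x")))
after = simplify(before)

print(after)
print(size(before), "->", size(after))
```

Both rewrites leave the value of the program unchanged, which is what makes them semantics-preserving.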

The Extraordinary Solution

That’s it: 5-10%.

Given the magnitude of the code bloat problem, that isn’t nearly enough shrinkage to be helpful. So let’s take a step back and ask where all this bloat came from. We would argue that code bloat can be partitioned into three categories:

Intended code bloat. Some of it is intentional. λ_s5 is a small core language, and there should be some expansion as you translate to it.
Incidental code bloat. The desugaring function from JS to λ_s5 is a simple recursive-descent function. It’s purposefully not clever, and as a result it sometimes generates redundant code. And this is exactly what the semantics-preserving rewrites we just mentioned get rid of.
Essential code bloat. Finally, some code bloat is due to the semantics of JS. JS is a complicated language with complicated features, and they turn into complicated λ_s5 code.

There wasn’t much to gain by way of reducing Intended or Incidental code bloat. But how do you go about reducing Essential code bloat? Well, Essential bloat is the code bloat that comes from the complications of JS. To remove it, you would simplify the language. And we did exactly that! We defined five Semantics-Altering transformations:

(IR) Identifier restoration: pretend that JS is lexically scoped
(FR) Function restoration: pretend that JS functions are just functions and not function-object-things.
(FA) Fixed arity: pretend that JS functions always take as many arguments as they’re declared with.
(UAE) Assertion elimination: unsafely remove some runtime checks (your code is correct anyways, right?)
(SAO) Simplifying arithmetic operators: eliminate strange behavior for basic operators like “+”.

These semantics-altering transformations blasphemously break the language. This is actually OK, though! The thing is, if you’re studying JS or doing static analysis, you probably already aren’t handling the whole language. It’s too complicated, so instead you handle a sub-language. And this is exactly what these semantics-altering transformations capture: they are simplifying assumptions about the JS language.
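To get a feel for what one of these assumptions buys, take Fixed arity (FA). The sketch below models, in Python, the JS calling behavior that FA assumes away: missing arguments are filled with undefined and extra ones are dropped (in real JS the extras remain reachable through arguments, which this sketch ignores). The helper names js_style_call and fixed_arity_call, and the UNDEFINED stand-in, are hypothetical; and where the real transformation simply assumes that arities match, the second helper raises an error to make the assumption visible.

```python
import inspect

UNDEFINED = object()  # stand-in for JavaScript's `undefined`


def js_style_call(f, *args):
    """Call f the way full JS would: pad missing arguments, drop extras."""
    n = len(inspect.signature(f).parameters)
    padded = list(args[:n]) + [UNDEFINED] * (n - len(args))
    return f(*padded)


def fixed_arity_call(f, *args):
    """Call f under the FA assumption: the argument count must match."""
    n = len(inspect.signature(f).parameters)
    if len(args) != n:
        raise TypeError(f"expected {n} arguments, got {len(args)}")
    return f(*args)


def pair(x, y):
    return (x, y)


print(js_style_call(pair, 1) == (1, UNDEFINED))  # too few arguments still "works"
print(js_style_call(pair, 1, 2, 3))              # the extra argument is dropped
print(fixed_arity_call(pair, 1, 2))              # the only shape FA allows
```

All the padding logic in the first helper is the kind of machinery that desugared code has to carry around, and that disappears once you are willing to assume fixed arity.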

Lessons about JavaScript

And we can learn about JavaScript from them. We implemented these transformations for λ_s5, and so we could run the test suite with the transformations turned on and see how many tests broke. This gives a crude measure of “correctness”: a transformation is 50% correct if it breaks half the tests. Here’s the graph:

Notice that the semantics-altering transformations shrink code by more than 50%: this is way better than the 5-10% that the semantics-preserving ones gave. Going back to the three kinds of code bloat, this shows that most code bloat in λ_s5 is Essential: it comes from the complicated semantics of JS, and if you simplify the semantics you can make it go away.

Next, here are the shrinkages of each of the semantics-altering transformations:

Since these semantics-altering transformations are simplifications of JS semantics, and desugared code size is a measure of complexity, you can view this graph as a (crude!) measure of complexity of language features. In this light, notice IR (Identifier Restoration): it crushes the other transformations by giving 30% code reduction. This shows that JavaScript’s scope is complex: by this metric 30% of JavaScript’s complexity is due to its scope.

Takeaway

These semantics-altering transformations give semantic restrictions on JS. Our paper makes these restrictions precise. And they’re exactly the sorts of simplifying assumptions that papers need to make to reason about JS. You can even download λ_s5 from git and implement your analysis over λ_s5 with a subset of these restrictions turned on, and test it. So let’s work toward a future where papers that talk about JS say exactly what sub-language of JS they mean.

The Paper

This is just a teaser: to read more, see the paper.

In-flow Peer Review: An Overview

2016-01-02T00:00:00+00:00

We ought to give students opportunities to practice code review. It’s a fundamental part of modern software development, and communicating clearly and precisely about code is a skill that only develops with time. It also helps shape software projects for the better early on, as discussions about design and direction in the beginning can avoid costly mistakes that need to be undone later.

Introducing code review into a curriculum faces a few challenges. First, there is the question of the pedagogy of code review: what code artifacts are students qualified to review? Reviewing entire solutions may be daunting if students are already struggling to complete their own, and it can be difficult to scope feedback for an entire program. Adding review also introduces a time cost for assignments, if it actually makes up a logically separate assignment from the code under review.

We propose in-flow peer review (IFPR) as a strategy for blending some of these constraints and goals. The fundamental idea is to break assignments into separately-submittable chunks, where each intermediate submittable is designed to be amenable to a brief peer review. The goal is for students to practice review, benefit from the feedback they get from their peers while the assignment is still ongoing, and also learn from seeing other approaches to the same problem. We’ve experimented with in-flow peer review in several settings, and future posts will discuss more of our detailed results. Here, we lay out some of the design space of in-flow peer review, including which intermediate steps might show promise, and what other choices a practitioner of IFPR can make. This discussion is based on our experience and on an ITiCSE working group report that explored many of the design dimensions of IFPR. That report has deeper discussions of the topics we introduce here, along with many other design dimensions and examples for IFPR.

What to Submit

The first question we need to address is what pieces of assignments are usefully separable and reviewable. There are a number of factors at work here. For example, it may be detrimental from an evaluation point of view to share too much of a solution while the assignment is ongoing, so the intermediate steps shouldn’t “give away” the whole solution. The intermediate steps need to be small enough to review in a brief time window, but interesting enough to prompt useful feedback. Some examples of intermediate steps are:

Full test suites for the problem, without the associated solution
A class or structure definition used to represent a data structure, without the associated operations
Function headers for helper functions without the associated body
Full helper functions, without the associated “main” implementation
A task-level breakdown of a work plan for a project (e.g. interface boundaries, test plan, and class layout)
A description of properties (expressed as predicates in a programming language, or informally) that ought to be true of an eventual implementation

Each of these reviewable artifacts can give students hints about a piece of the problem without giving away full solutions, and seem capable of prompting meaningful feedback that will inform later stages of the assignment.

How to Review

The second question has to do with the mechanics of review itself. How many submissions should students review (and how many reviews should they receive)? Should students’ names be attached to their submissions and/or their reviews, or should the process be completely anonymous? What prompts should be given in the review rubric to guide students towards giving useful feedback? How much time should students have to complete reviews, and when should they be scheduled in the assignment process?

These questions, obviously, don’t have right or wrong answers, but some in particular are useful to discuss, especially with respect to different goals for different classrooms.

Anonymity is an interesting choice. Professional code review is seldom anonymous, and having students take ownership of their work encourages an attitude of professionalism. If reviewer-reviewee pairs can identify one another, they can communicate outside the peer review system, as well, which may be encouraged or not desired depending on the course. Anonymity has a number of benefits, in that it avoids any unconscious bias in reviews based on knowing another student’s identity, and may make students feel more comfortable with the process.
Rubric design can heavily shape the kinds of reviews students write. At one extreme, students could simply get an empty text box and provide only free-form comments. Students could also be asked to identify specific features in their review (“does the test suite cover the case of a list with duplicate elements?”), fill in line-by-line comments about each part of the submission, write test cases to demonstrate bugs that they find, or many other structured types of feedback. This is a pretty wide-open design space, and the complexity and structure of the rubric can depend on curricular goals, and on the expected time students should take for peer review.
Scheduling reviews and intermediate submissions is an interesting balancing act. For a week-long assignment, it may be useful to have initial artifacts submitted as early as the second day, with reviews submitted on the third day, in order to give students time to integrate the feedback into their submissions. For longer assignments, the schedule can be stretched or involve more steps. This can have ancillary benefits, in that students are forced to start their assignment early in order to participate in the review process (which can be mandatory), combatting procrastination.
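As one concrete point in this design space, here is a small sketch of a reviewer-assignment scheme in which every student writes k reviews and receives k reviews, and nobody reviews their own work. The function name and the rotation scheme are our own illustration here, not something prescribed by the working group report; a real deployment would likely shuffle the roster first.

```python
def assign_reviews(students, k):
    """Map each student to the k classmates whose submissions they review.

    Students are arranged in a circle and each reviews the next k people,
    so everyone also receives exactly k reviews.
    """
    n = len(students)
    if not 0 < k < n:
        raise ValueError("k must be at least 1 and smaller than the class size")
    return {
        student: [students[(i + offset) % n] for offset in range(1, k + 1)]
        for i, student in enumerate(students)
    }


roster = ["ana", "ben", "cy", "dee"]
assignment = assign_reviews(roster, k=2)
for reviewer, reviewees in assignment.items():
    print(reviewer, "reviews", reviewees)
```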

Logistics (and Software Support)

Setting up an IFPR workflow manually would involve serious instructor effort, so software support is a must for an easy integration of IFPR into the classroom. The software ought to support different review workflows and rubrics, across various assignment durations and types, in order to be useful in more than one class or setting. In the next post, we’ll talk about some design goals for IFPR software and how we’ve addressed them.

Tierless Programming for SDNs: Differential Analysis

2015-06-02T00:00:00+00:00

This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

Verification is a powerful way to make sure a program meets expectations, but what if those expectations aren't written down, or the user lacks the expertise to write formal properties? Flowlog supports a powerful form of property-free analysis: program differencing.

When we make a program change, usually we're starting from a version that "works". We'd like to transfer what confidence we had in the original version to the new version, plus confirm our intuition about the changes. In other words, even if the original program had bugs, we'd like to at least confirm that the edit doesn't introduce any new ones.

Of course, taking the syntactic difference of two programs is easy — just use diff! — but usually that's not good enough. What we want is the behavioral, or semantic difference. Flowlog provides semantic differencing via Alloy, similarly to how it does property checking. We call Flowlog's differencing engine Chimp (short for Change-impact).

Differences in Output and State Transitions

Chimp translates both the old (prog1) and new (prog2) versions to Alloy, then supports asking questions like: Will the two versions ever handle packets differently? More generally, we can ask Chimp whether the program's output behavior ever differs: does there exist some program state and input event such that, in that state, the two programs will disagree on output?

pred changePolicyOut[st: State, ev: Event] {
  some out: Event |
    prog1/outpolicy[st,ev,out] && not prog2/outpolicy[st,ev,out] ||
    prog2/outpolicy[st,ev,out] && not prog1/outpolicy[st,ev,out]
}

Any time one program issues an output event that the other doesn't, Chimp displays an Alloy scenario.
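
To make the question concrete outside of Alloy, here is a minimal Python sketch that answers it by brute-force enumeration over a small finite universe. Everything in it is a hypothetical stand-in: the two policy functions, the state and event universes, and the names are made up for illustration, and Chimp itself works through Alloy rather than by enumeration like this.

```python
from itertools import product

def change_policy_out(out1, out2, states, events):
    """Yield (state, event, only_in_1, only_in_2) wherever the two output
    policies disagree. out1/out2 map (state, event) to a set of outputs."""
    for st, ev in product(states, events):
        a, b = out1(st, ev), out2(st, ev)
        if a != b:
            yield st, ev, a - b, b - a

# Hypothetical example: a state is a frozenset of blocked sources,
# an event is a (source, arrival_port) pair, an output is an egress port.
PORTS = (1, 2)

def prog1(st, ev):
    src, in_port = ev
    # Old version: flood to every port except the arrival port.
    return {p for p in PORTS if p != in_port}

def prog2(st, ev):
    src, in_port = ev
    # New version: same, but drop traffic from blocked sources.
    if src in st:
        return set()
    return {p for p in PORTS if p != in_port}

states = [frozenset(), frozenset({"a"})]
events = [(s, p) for s in ("a", "b") for p in PORTS]
diffs = list(change_policy_out(prog1, prog2, states, events))
```

In this toy setup the two versions disagree only in the state where "a" is blocked, and only for packets from "a", which matches the intuition behind the edit.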

We might also ask: When can the programs change state differently? Similarly to changePolicyOut above, Chimp defines changeStateTransition[st: State, ev: Event] as matching any of the following for each table T in the program:

some x0, ..., xn: univ |
  prog1/afterEvent_T[st, ev, x0, ..., xn] &&
    not prog2/afterEvent_T[st, ev, x0, ..., xn] ||
  prog2/afterEvent_T[st, ev, x0, ..., xn] &&
    not prog1/afterEvent_T[st, ev, x0, ..., xn]

Recall that afterEvent_T keeps track of when each tuple is in the table T after an event is processed.

Refining Differential Analysis

The two predicates above are both built into Chimp. Using them as a starting point, users can ask pointed questions about the effects of the change. For instance, will any TCP packets be handled differently? Just search for a pre-state and a TCP packet that the programs disagree on:

some prestate: State, p: EVtcp_packet |
  changePolicyOut[prestate, p]

This lets users explore the consequences of their change without any formal guidance except their intuition about what the change should do.

Reachability

So far, these queries show scenarios where the programs differ, taking into consideration all potential inputs and starting states; this includes potentially unreachable starting states. We could, for instance, have two programs that behave differently if a table is populated (resulting in a non-empty semantic diff!) yet never actually insert rows into that table. Chimp provides optional reachability-checking to counter this, although users must cap the length of system traces being searched.
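
The effect of a reachability filter with a capped trace length can be sketched in Python as follows. This is an illustration under assumptions, not Chimp's implementation: the transition function, the notion of "differs", and the tiny universe are all hypothetical.

```python
def reachable_states(initial, step, events, max_trace_len):
    """Breadth-first search over the states reachable from `initial`
    using at most `max_trace_len` events."""
    seen = {initial}
    frontier = {initial}
    for _ in range(max_trace_len):
        frontier = {step(st, ev) for st in frontier for ev in events} - seen
        if not frontier:
            break
        seen |= frontier
    return seen

EVENTS = ("a", "b")

def step(st, ev):
    # Hypothetical transition: neither program version ever inserts into
    # the table, so the state never changes.
    return st

def differs(st, ev):
    # Hypothetical diff: the versions disagree only if ev is in the table.
    return ev in st

all_states = [frozenset(), frozenset({"a"}), frozenset({"b"}),
              frozenset({"a", "b"})]
unfiltered = [(st, ev) for st in all_states for ev in EVENTS if differs(st, ev)]

reach = reachable_states(frozenset(), step, EVENTS, max_trace_len=3)
filtered = [(st, ev) for st in reach for ev in EVENTS if differs(st, ev)]
```

Here the unfiltered diff is non-empty, but every differing scenario starts from a populated table that no trace can produce, so the filtered diff is empty. The cap on trace length is what keeps the search finite; states that need longer traces are simply not explored.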

Schema Clashes

Suppose that we want to modify the original source-tracking example to keep track of flows by source and destination, rather than just source addresses. Now instead of one column:

TABLE seen(macaddr);

the seen table has two columns:

TABLE seen(macaddr, macaddr);

This poses a challenge for Chimp; what shape should the seen table be? If Chimp finds a scenario, should it show a seen table with one or two columns? We call this situation a schema clash, and Chimp addresses it by creating two separate tables in the prestate: one with one column (used by the first program) and another with two columns (used by the second program).

Doing this causes a new issue: Chimp searches for arbitrary states that satisfy the change-impact predicates. Since there is no constraint between the values of the two tables, Chimp might return a scenario where (say) the first seen table is empty, but the second contains tuples!

This doesn't match our intuition for the change: we expect that for every source in the first table, there is a source-destination pair in the second table, and vice versa. We can add this constraint to Chimp and filter the scenarios it shows us, but first, we should ask whether that constraint actually reflects the behavior of the two programs.

Differential Properties

Since it's based on Flowlog's verification framework, Chimp allows us to check properties stated over multiple programs. Our expectation above, stated in Alloy for an arbitrary state s, is:

all src: Macaddr | 
  src in s.seen_1
  iff
  some dst: Macaddr | src->dst in s.seen_2

Let's check that this condition holds for all reachable states. We'll proceed inductively. The condition holds trivially at the starting (empty) state; so we only need to show that it is preserved as the program transitions. We search for a counterexample:

some prestate: State, ev: Event | {
  // prestate satisfies the condition
  all src: Macaddr | src in prestate.seen_1 iff 
    some dst: Macaddr | src->dst in prestate.seen_2
	
  // poststate does not
  some src: Macaddr | 
    (prog1/afterEvent_seen_1[prestate,ev,src] and 
     all dst: Macaddr | not prog2/afterEvent_seen_2[prestate,ev,src,dst])
    ||
    (not prog1/afterEvent_seen_1[prestate,ev,src] and 
     some dst: Macaddr | prog2/afterEvent_seen_2[prestate,ev,src,dst])
}

Chimp finds no counterexample. Unfortunately, Chimp can't guarantee that this isn't a false negative; the query falls outside the class where Chimp can guarantee a complete search. Nevertheless, the lack of counterexample serves to increase our confidence that the change respects our intent.
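
The same inductive argument can be mimicked in plain Python by exhaustive search over a tiny universe. The sketch below assumes how the two versions update their tables (the old one inserts the source, the new one inserts the source-destination pair); that transition behavior, the two-address universe, and all names are assumptions made for illustration. Like the bounded Alloy search, finding nothing here only covers the universe that was enumerated.

```python
from itertools import chain, combinations, product

MACS = ("m0", "m1")  # hypothetical two-address universe

def powerset(items):
    items = list(items)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

def invariant(seen_1, seen_2):
    """src is in seen_1 iff some (src, dst) pair is in seen_2."""
    return all((src in seen_1) == any(s == src for s, _ in seen_2)
               for src in MACS)

def find_counterexample():
    """Search for a pre-state satisfying the invariant and a packet whose
    post-state violates it. Returns None if there is none."""
    pairs = list(product(MACS, MACS))
    for seen_1, seen_2 in product(powerset(MACS), powerset(pairs)):
        if not invariant(seen_1, seen_2):
            continue
        for src, dst in pairs:
            post_1 = seen_1 | {src}          # assumed behavior of version 1
            post_2 = seen_2 | {(src, dst)}   # assumed behavior of version 2
            if not invariant(post_1, post_2):
                return seen_1, seen_2, (src, dst)
    return None

result = find_counterexample()
```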

After adding the constraint that, for every source in the first table, there is a source-destination pair in the second table, Chimp shows us that the new program will change the state (to add a new destination) even if the source is already in seen.

Tierless Programming for SDNs: Verification

2015-04-17T00:00:00+00:00

The last post said what it means for Flowlog's compiler to be optimal, which prevents certain bugs from ever occurring. But what about the program itself? Flowlog has built-in features to help verify program correctness, independent of how the network is set up.

To see Flowlog's program analysis in action, let's first expand our watchlist program a bit more. Before, we just flooded packets for demo purposes:

DO forward(new) WHERE
    new.locPt != p.locPt;

Now we'll do something a bit smarter. We'll make the program learn which ports lead to which hosts, and use that knowledge to avoid flooding when possible (this is often called a "learning switch"):

TABLE learned(switchid, portid, macaddr);
ON packet(pkt):
  INSERT (pkt.locSw, pkt.locPt, pkt.dlSrc) INTO learned;

  DO forward(new) WHERE
    learned(pkt.locSw, new.locPt, pkt.dlDst)
    OR
    (NOT learned(pkt.locSw, ANY, pkt.dlDst) AND
     pkt.locPt != new.locPt);

The learned table stores knowledge about where addresses have been seen on the network. If a packet arrives with a destination the switch has seen before as a source, there's no need to flood! While this program is still fairly naive (it will fail if the network has cycles in it) it's complex enough to have a few interesting properties we'd like to check. For instance, if the learned table ever holds multiple ports for the same switch and address, the program will end up sending multiple copies of the same packet. But can the program ever end up in such a state? Since the initial, startup state is empty, this amounts to checking: "Can the program ever transition from a valid state (i.e., one where every switch and address has at most one port in learned) into an invalid one?"
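
To see the duplicate-packet problem in action, here is a small Python model of the program above. It is a sketch under assumptions: the dictionary-based packet representation and the helper names are hypothetical, and it assumes that the forwarding rule is evaluated against the table as it was before the packet's INSERT takes effect.

```python
def packet(sw, pt, src, dst):
    return {"locSw": sw, "locPt": pt, "dlSrc": src, "dlDst": dst}

def forward_ports(learned, ports, pkt):
    """Output ports for pkt. `learned` is a set of (switch, port, mac)
    rows; `ports` maps each switch to its set of ports."""
    sw = pkt["locSw"]
    known = {pt for (s, pt, mac) in learned
             if s == sw and mac == pkt["dlDst"]}
    if known:
        # Known destination: use the learned port(s).
        return known & ports[sw]
    # Unknown destination: flood to all ports except the arrival port.
    return ports[sw] - {pkt["locPt"]}

def handle_packet(learned, ports, pkt):
    out = forward_ports(learned, ports, pkt)  # evaluated on the pre-state
    row = (pkt["locSw"], pkt["locPt"], pkt["dlSrc"])
    return learned | {row}, out

ports = {"s1": {1, 2, 3}}
learned = frozenset()
learned, out1 = handle_packet(learned, ports, packet("s1", 1, "A", "B"))
learned, out2 = handle_packet(learned, ports, packet("s1", 2, "B", "A"))
# Host A moves to port 3 and sends again.
learned, out3 = handle_packet(learned, ports, packet("s1", 3, "A", "B"))
learned, out4 = handle_packet(learned, ports, packet("s1", 2, "B", "A"))
```

The first packet is flooded, the second goes only to port 1, and after host A shows up on a second port the last packet is sent out both port 1 and port 3.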

Verifying Flowlog

Each Flowlog rule defines part of an event-handling function saying how the system should react to each packet seen. Rules compile to logical implications that Flowlog's runtime interprets whenever a packet arrives.

Alloy is a tool for analyzing relational logic specifications. Since Flowlog rules compile to logic, it's easy to describe in Alloy how Flowlog programs behave. In fact, Flowlog can automatically generate Alloy specifications that describe when and how the program takes actions or changes its state.

For example, omitting some Alloy-language foibles for clarity, here's how Flowlog describes our program's forwarding behavior in Alloy.

pred forward[st: State, p: EVpacket, new: EVpacket] {
  // Case 1: known destination
  ((p.locSw->new.locPt->p.dlDst) in learned and
   (p.locSw->new.locPt) in switchHasPort and ...)
  or
  // Case 2: unknown destination
  (all apt: Portid | (p.locSw->apt->p.dlDst) not in learned and
   new.locPt != p.locPt and 
   (p.locSw->new.locPt) in switchHasPort and ...)
}

An Alloy predicate is either true or false for a given input. This one says whether, in a given state st, an arbitrary packet p will be forwarded as a new packet new (containing the output port and any header modifications). It combines both forwarding rules together to construct a logical definition of forwarding behavior, rather than just a one-way implication (as in the case of individual rules).

The automatically generated specification also contains other predicates that say how and when the controller's state will change. One example is afterEvent_learned, which says when a given entry will be present in learned after the controller processes a packet. An afterEvent predicate is automatically defined for every state table in the program.

Using afterEvent_learned, we can verify our goal: that whenever an event ev arrives, the program will never add a second entry (sw, pt2, addr) to learned:

assert FunctionalLearned {
  all pre: State, ev: Event |
    all sw: Switchid, addr: Macaddr, pt1, pt2: Portid |
      (not (sw->pt1->addr in pre.learned) or 
       not (sw->pt2->addr in pre.learned)) and
      afterEvent_learned[pre, ev, sw, pt1, addr] and
      afterEvent_learned[pre, ev, sw, pt2, addr] implies pt1 = pt2
}

Alloy finds a counterexample scenario (in under a second):

The scenario shows an arbitrary packet (L/EVpacket; the L/ prefix can be ignored) arriving at port 1 (L/Portid1) on an arbitrary switch (L/Switchid). The packet has the same source and destination MAC address (L/Macaddr). Before the packet arrived, the controller state had a single row in its learned table; it had previously learned that L/Macaddr can be reached out port 0 (L/Portid0). Since the packet is from the same address, but a different port, it will cause the controller to add a second row to its learned table, violating our property.

This situation isn't unusual if hosts are mobile, like laptops on a campus network are. To fix this issue, we add a rule that removes obsolete mappings from the table:

DELETE (pkt.locSw, pt, pkt.dlSrc) FROM learned WHERE
  not pt = pkt.locPt;

Alloy confirms that the property holds of the modified program. We now know that any reachable state of our program is valid.

Verification Completeness

Alloy does bounded verification: along with properties to check, we provide concrete bounds for each datatype. We might say to check up to 3 switches, 4 IP addresses, and so on. So although Alloy never returns a false positive, it can in general produce false negatives, because it searches for counterexamples only up to the given bounds. Fortunately, for many useful properties, we can compute and assert a sufficient bound. In the property we checked above, a counterexample needs only 1 State (to represent the program's pre-state) and 1 Event (the arriving packet), plus room for its contents (2 Macaddrs for source and destination, etc.), along with 1 Switchid, 2 Portids and 1 Macaddr to cover the possibly-conflicted entries in the state. So when Alloy says that the new program satisfies our property, we can trust it.
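
The bounded search can be imitated in Python by enumerating every pre-state and packet within fixed bounds. The sketch below hard-codes a small universe (one switch, two ports, two addresses) and models only the learned table; the transition function, including the assumption that the DELETE rule removes stale rows before the INSERT adds the new one, is a hypothetical reading of the program rather than Flowlog's actual semantics.

```python
from itertools import chain, combinations, product

SWITCHES = ("s1",)
PORTS = (0, 1)
MACS = ("m0", "m1")

def powerset(items):
    items = list(items)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))

def is_functional(learned):
    """True iff every (switch, address) has at most one port."""
    keys = [(sw, mac) for (sw, pt, mac) in learned]
    return len(keys) == len(set(keys))

def step(learned, pkt, with_delete):
    sw, pt, src = pkt  # only these fields affect the learned table
    if with_delete:
        # Drop rows for the same switch and address on any other port.
        learned = {(s, p, m) for (s, p, m) in learned
                   if not (s == sw and m == src and p != pt)}
    return frozenset(learned) | {(sw, pt, src)}

def find_violation(with_delete):
    """Return (pre_state, packet) leading from a valid state to an invalid
    one, or None if no such pair exists within the bounds."""
    rows = list(product(SWITCHES, PORTS, MACS))
    for pre in powerset(rows):
        if not is_functional(pre):
            continue
        for pkt in rows:
            if not is_functional(step(pre, pkt, with_delete)):
                return pre, pkt
    return None

original = find_violation(with_delete=False)
fixed = find_violation(with_delete=True)
```

Without the DELETE rule, the search finds the same kind of scenario described above: an address learned on port 0 that then sends a packet from port 1. With the rule, no violation exists within these bounds.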

Benefits of Tierlessness

Suppose we enhanced the POX version of this program (Part 1) to learn ports in the same way, and then wanted to check the same property. Since the POX program explicitly manages flow-table rules, and the property involves a mixture of packet-handling (what is sent up to the controller?) and controller logic (how is the state updated?), checking the POX program would mean accounting for those rules and how the controller updates them over time. This isn't necessary for the Flowlog version, because rule updates are all handled optimally by the runtime. This means that property checking is simpler: there's no multi-tiered model of rule updates, just a model of the program's behavior.

You can read more about Flowlog's analysis support in our paper.

In the next post, we'll finish up this sequence on Flowlog by reasoning about behavioral differences between multiple versions of the same program.

Tierless Programming for SDNs: Optimality

2015-04-13T00:00:00+00:00

Since packets can trigger controller-state updates and event output, you might wonder exactly which packets a Flowlog controller needs to see. For instance, a packet without a source in the watchlist will never alter the controller's state. Does such a packet need to grace the controller at all? The answer is no. In fact, there are only three conditions under which switch rules do not suffice, and the controller must be involved in packet-handling:

when the packet will cause a change in controller state;
when the packet will cause the controller to send an event; and
when the packet must be modified in ways that OpenFlow 1.0 does not support on switches.

Flowlog's compiler ensures the controller sees packets if and only if one of these holds; the compiler is therefore optimal with respect to this list. To achieve this, the compiler analyzes every packet-triggered statement in the program. For instance, the INSERT statement above will only change the state for packets with a source in the watchlist (a condition made explicit in the WHERE clause) and without a source in the seen table (implicit in Flowlog's logical semantics for INSERT). Only if both of these conditions are met will the controller see a packet. An optimal compiler prevents certain kinds of bugs from occurring: the controller program will never miss packets that will affect its state, and it will never receive packets it doesn't need.
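
For the watchlist program, the resulting condition can be written as a one-line predicate. The Python sketch below is only an illustration of that condition, with hypothetical names; it models the first of the three cases (state change), since this program sends no events and needs no unsupported packet modifications.

```python
def controller_must_see(src, watchlist, seen):
    """True iff a packet from `src` would change controller state:
    src is watched (explicit WHERE clause) and is not yet in seen
    (implicit in the semantics of INSERT)."""
    return src in watchlist and src not in seen

def run(packet_sources, watchlist):
    """Replay a stream of packet sources; return the final seen table and
    the packets that had to reach the controller."""
    seen = set()
    sent_to_controller = []
    for src in packet_sources:
        if controller_must_see(src, watchlist, seen):
            sent_to_controller.append(src)
            seen.add(src)
    return seen, sent_to_controller

seen, sent = run(["a", "b", "a", "c", "b"], watchlist={"a", "c"})
```

Of the five packets in this made-up stream, only two reach the controller: the first packet from each watched address.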

You can read more about Flowlog in our paper.

In the next post, we'll look at Flowlog's built-in verification support.

Tierless Programming for SDNs: Events

2015-03-01T00:00:00+00:00

This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

The last post introduced Flowlog, a tierless language for SDN controller programming. You might be wondering, "What can I write in Flowlog? How expressive is it?" To support both its proactive compiler and automated program analysis (more on this in the next post) we deliberately limited Flowlog's expressive power. There are no loops in the language, and no recursion.

Instead of trying to be universally expressive, Flowlog embraces the fact that most programs don't run in a vacuum. A controller may need to interact with other services, and developers may wish to re-use pre-existing code. To enable this, Flowlog programs can call out to non-Flowlog libraries. The runtime uses standard RPCs (Thrift) for inter-process communication, so existing programs can be quickly wrapped to communicate with Flowlog. Much like how Flowlog abstracts out switch-rule updates, it also hides the details of inter-process communication.

To see this, let's enhance the address-logger application with a watch-list that external programs can add to. We need a new table ("watchlist"), populated by arriving "watchplease" events. Finally, we make sure only watched addresses are logged:

TABLE seen(macaddr);
TABLE watchlist(macaddr);
EVENT watchplease = {target: macaddr};

ON watchplease(w):
  INSERT (w.target) INTO watchlist;

ON packet(p):
  INSERT (p.dlSrc) INTO seen WHERE
    watchlist(p.dlSrc);
  DO forward(new) WHERE
    new.locPt != p.locPt;

When the program receives a watchplease event (sent via RPC from an external program) it adds the appropriate address to its watchlist.

Sending Events

Flowlog programs can also send events. Suppose we want to notify some other process when a watchlisted address is seen, and the process is listening on TCP port 20000. We just declare a named pipe that carries notifications to that port:

EVENT sawaddress = {addr: macaddr};
OUTGOING sendaddress(sawaddress) THEN
  SEND TO 127.0.0.1:20000;

and then write a notification to that pipe for appropriate packets:

ON packet(p) WHERE watchlist(p.dlSrc):
  DO sendaddress(s) WHERE s.addr = p.dlSrc;

Synchronous Communication

The event system supports asynchronous communication, but Flowlog also allows synchronous queries to external programs. It does this with a remote state abstraction. If we wanted to manage the watchlist remotely, rather than writing

TABLE watchlist(macaddr);

we would write:

REMOTE TABLE watchlist(macaddr)
  FROM watchlist AT 127.0.0.1 20000
  TIMEOUT 10 seconds;

which tells Flowlog it can obtain the current list by sending queries to port 20000. Since these queries are managed behind the scenes, the program doesn't need to change—as far as the programmer is concerned, a table is a table. Finally, the timeout says that Flowlog can cache prior results for 10 seconds.
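
The caching behavior described here can be sketched in Python as a table lookup backed by a remote query with a time-to-live. This is not Flowlog's implementation: the RemoteTable class, the query callable (standing in for the Thrift RPC), and the injectable clock are all hypothetical.

```python
import time

class RemoteTable:
    """Membership lookups answered by a remote query, cached per key."""

    def __init__(self, query, timeout_seconds, clock=time.monotonic):
        self._query = query            # stand-in for the RPC call
        self._timeout = timeout_seconds
        self._clock = clock
        self._cache = {}               # key -> (result, fetched_at)

    def contains(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self._timeout:
            return hit[0]              # cached answer is still fresh
        result = self._query(key)
        self._cache[key] = (result, now)
        return result

# Usage with a fake remote service and a fake clock.
calls = []
remote_watchlist = {"aa:bb"}

def query(mac):
    calls.append(mac)
    return mac in remote_watchlist

now = [0.0]
table = RemoteTable(query, timeout_seconds=10, clock=lambda: now[0])
first = table.contains("aa:bb")    # goes to the remote service
now[0] = 5.0
second = table.contains("aa:bb")   # served from the cache
now[0] = 12.0
third = table.contains("aa:bb")    # cache entry expired, queries again
```

From the caller's point of view, contains behaves like a lookup in a local table, which is the point of the abstraction.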

Interfacing External Programs with Flowlog

Flowlog can interface with code in any language that supports Thrift RPC (including C++, Java, OCaml, and many others). To interact with Flowlog, one only needs to implement the interface Flowlog requires: a function that accepts notifications and a function that responds to queries. Other functions may also (optionally) send notifications. Thrift's library handles the rest.

You can read more about Flowlog's events in our paper.

In the next post, we'll look at what it means for Flowlog's compiler to be optimal.

Tierless Programming for Software-Defined Networks

2014-09-30T00:00:00+00:00

This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

Network devices like switches and routers update their behavior in real-time. For instance, a router may change how it forwards traffic to address an outage or congestion. In a traditional network, devices use distributed protocols to decide on mutually consistent behavior, but Software-Defined Networks (SDN) operate differently. Switches are no longer fully autonomous agents, but instead receive instructions from logically centralized controller applications running on separate hardware. Since these applications can be arbitrary programs, SDN operators gain tremendous flexibility in customizing their network.

The most popular SDN standard in current use is OpenFlow. With OpenFlow, controller applications install persistent forwarding rules on the switches that match on packet header fields and list actions to take on a match. These actions can include header modifications, forwarding, and even sending packets to the controller for further evaluation. When a packet arrives without a matching rule installed, the switch defaults to sending the packet to the controller for instructions.

Let's write a small controller application. It should (1) record the addresses of machines sending packets on the network and (2) cause each switch to forward traffic by flooding (i.e., sending out on all ports except the arrival port). This is simple enough to write in POX, a controller platform for Python. The core of this program is a function that reacts to packets as they arrive at the controller (we have removed some boilerplate and initialization):

def _handle_PacketIn (self, event):
    packet = event.parsed

    def install_nomore ():
        msg = of.ofp_flow_mod()
        msg.match = of.ofp_match(dl_src = packet.src)
        msg.buffer_id = event.ofp.buffer_id
        msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
        self.connection.send(msg)

    def do_flood ():
        msg = of.ofp_packet_out()
        msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
        msg.data = event.ofp
        msg.buffer_id = None
        msg.in_port = event.port
        self.connection.send(msg)

    self.seenTable.add(packet.src)
    install_nomore()
    do_flood()

First, the controller records the packet's source in its internal table. Next, the install_nomore function adds a rule to the switch saying that packets with this source should be flooded. Once the rule is installed, the switch will not send packets with the same source to the controller again. Finally, the do_flood function sends a reply telling the switch to flood the packet.

This style of programming may remind you of the standard three-tier web-programming architecture. Much like a web program generates JavaScript or SQL strings, controller programs produce new switch rules in real-time. One major difference is that switch rules are much less expressive than JavaScript, which means that less computation can be delegated to the switches. A bug in a controller program can throw the entire network's behavior off. But it's easy to introduce bugs when every program produces switch rules in real-time, effectively requiring its own mini-compiler!

SDN Programming Without Tiers

We've been working on a tierless language for SDN controllers: Flowlog. In Flowlog, you write programs as if the controller sees every packet, and never have to worry about the underlying switch rules. This means that some common bugs in controller/switch interaction can never occur, but it also means that the programming experience is simpler. In Flowlog, our single-switch address-monitoring program is just:

TABLE seen(macaddr);
ON ip_packet(p):
  INSERT (p.dlSrc) INTO seen;
  DO forward(new) WHERE new.locPt != p.locPt;

The first line declares a one-column database table, "seen". Line 2 says that the following two lines are triggered by IP packets. Line 3 adds those packets' source addresses to the table, and line 4 sends the packets out all other ports.

As soon as this program runs, the Flowlog runtime proactively installs switch rules to match the current controller state and automatically ensures consistency. As the controller sees more addresses, the switch sends fewer packets back to the controller—but this is entirely transparent to the programmer, whose job is simplified by the abstraction of an all-seeing controller.
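
One way to picture this is a toy simulation in Python in which switch rules are regenerated from the seen table after every state change. The rule format, the action names, and the compile_rules function are hypothetical; the post does not show what Flowlog's compiler actually emits.

```python
def compile_rules(seen):
    """Rules as (match, action) pairs, highest priority first. Packets from
    already-seen sources are flooded on the switch; anything else is
    flooded and also sent to the controller."""
    rules = [({"dl_src": mac}, "flood") for mac in sorted(seen)]
    rules.append(({}, "flood_and_send_to_controller"))  # default rule
    return rules

def switch_lookup(rules, pkt):
    """Return the action of the first rule whose match fields all agree."""
    for match, action in rules:
        if all(pkt.get(k) == v for k, v in match.items()):
            return action

def simulate(sources):
    seen, to_controller = set(), 0
    rules = compile_rules(seen)
    for src in sources:
        action = switch_lookup(rules, {"dl_src": src})
        if action == "flood_and_send_to_controller":
            to_controller += 1
            seen.add(src)                 # controller state changes...
            rules = compile_rules(seen)   # ...so the rules are refreshed
    return seen, to_controller

seen, to_controller = simulate(["a", "b", "a", "a", "b", "c"])
```

In this made-up run of six packets, only three reach the controller, one per new source address, even though the program text never mentions switch rules.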

Examples and Further Reading

Flowlog is good for more than just toy examples. We've used Flowlog for many different network applications: ARP-caching, network address translation, and even mediating discovery and content-streaming for devices like Apple TVs. You can read more about Flowlog and Flowlog applications in our paper.

The next post talks more about what you can use Flowlog to write, and also shows how Flowlog allows programs to call out to external libraries in other languages.

CS Student Work/Sleep Habits Revealed As Possibly Dangerously Normal

2014-06-14T00:00:00+00:00
Written by Jesse Polhemus, and originally posted at the Brown CS blog
Imagine a first-year computer science concentrator (let’s call him Luis) e-mailing friends and family back home after a few weeks with Brown Computer Science (BrownCS). Everything he expected to be challenging is even tougher than anticipated: generative recursion, writing specifications instead of implementations, learning how to test his code instead of just writing it. Worst of all is the workload. On any given night, he’s averaging –this seems too cruel to be possible– no more than eight or nine hours of sleep.

Wait, what? Everyone knows that CS students don't get any sleep, so eight or nine hours is out of the question. Or is it? Recent findings from PhD student Joseph Gibbs Politz, adjunct professor Kathi Fisler, and professor Shriram Krishnamurthi analyze when students completed tasks in two different BrownCS classes, shedding interesting light on an age-old question: when do our students work, and when (if ever) do they sleep? The question calls to mind a popular conception of the computer scientist that Luis has likely seen in countless movies and books:

Hours are late. (A recent poster to boardgames@lists.cs.brown.edu requests a 2 PM start time in order to avoid being “ridiculously early” for prospective players.)

Sleep is minimal. BrownCS alumnus Andy Hertzfeld, writing about the early days of Apple Computer in Revolution in the Valley, describes the “gigantic bag of chocolate-covered espresso beans” and “medicinal quantities of caffeinated beverages” that allowed days of uninterrupted coding.

Part 1: Deadline Experiments

The story begins a few years before Luis’s arrival, when Shriram would routinely schedule his assignments to be due at the 11:00 AM start of class. “Students looked exhausted,” he remembers. “They were clearly staying up all night in order to complete the assignment just prior to class.”

Initially, he moved the deadline to 2:00 AM, figuring that night owl students would finish work in the early hours of the morning and then get some sleep. This was effective, but someone pointed out that it was unfair to other professors who taught earlier classes and were forced to deal with tired students who had finished Shriram’s assignment but not slept sufficiently.

“My final step,” he explains, “was to change deadlines to midnight. I also began penalizing late assignments on a 24-hour basis instead of an hourly one. This encourages students to get a full night’s sleep even if they miss a deadline.”

This was the situation when Luis arrived. The next task was to start measuring the results.

Part 2: Tracking Events

Shriram, Kathi, and Joe analyzed two of Shriram’s classes, CS 019 and CS 1730. For each class, Luis must submit test suites at any time he chooses, then read reviews of his work from fellow students. He then continues working on the solution, eventually producing a final implementation that must be submitted prior to the midnight deadline.

Part 3: Reality And Mythology

Given these parameters, what work and sleep patterns would you expect? We asked professor Tom Doeppner to reflect on Luis and share his experience of working closely with students as Director of Undergraduate Studies and Director of the Master’s Program. “Do students work late? I know I get e-mail from students at all hours of the night,” he says, “and I found out quickly that morning classes are unpopular, which is why I teach in the afternoon. Maybe it’s associated with age? I liked to work late when I was young, but I got out of the habit in my thirties.”

Asked about the possible mythologizing of late nights and sleeplessness, Tom tells a story from his own teaching: “Before we broke up CS 169 into two classes, the students had t-shirts made: ‘CS 169: Because There Are Only 168 Hours In A Week’. I think there’s definitely a widespread belief that you’re not really working hard unless you’re pulling multiple all-nighters.”

This doesn’t exactly sound like Luis’s sleep habits! Take a look at the graphs below to see how mythology and reality compare.

Part 4: Results And Conclusions

The graphs below depict test suite submissions, with time displayed in six-hour segments. For example, between 6 PM and the midnight deadline (“6-M”), 50 CS 173 students are submitting tests.

This graph is hypothetical, showing Joe, Kathi, and Shriram’s expectations for submission activity. They expected activity to be slow and increase steadily, culminating in frantic late-night activity just before the deadline. Generally taller “M-6” (midnight to 6 AM) bars indicate late-night work and a corresponding flurry of submissions, followed by generally shorter “6-N” (6 AM to noon) bars when students tried to get a few winks in. Cumulatively, these two trends depict the popular conception of the computer science student who favors late hours and perpetually lacks sleep.

These graphs show actual submissions. As expected, activity generally increases over time and the last day contains the majority of submissions. However, unexpectedly, the “N-6” (noon to 6 PM) and “6-M” (6 PM to midnight) segments are universally the most active. In the case of the CS 173 graph, this morning segment contains far more submissions than any other of the day’s three segments. In both of these graphs, the “M-6” (midnight to 6 AM) segments are universally the least active, even the day the assignment is due. For example, the final segment of this type, which represents the last available span of early morning hours, is among the lowest of all segments, with only ten submissions occurring. In contrast, the corresponding “6-N” (6 AM to noon) shows more than four times as many submissions, suggesting that most students do their work before or after the pre-dawn hours but not during them.

“I wouldn’t have expected that,” Joe comments. “I think of the stories folks tell of when they work not lining up with that, in terms of staying up late and getting up just in time for class. Our students have something important to do at midnight other than work: they cut off their work before midnight and do something else. For the majority it’s probably sleep, but it could just be social time or other coursework. Either way, it’s an interesting across-the-board behavior.”

If word of these results gets out, what can Luis and his fellow students expect? “People will realize,” Shriram says, “that despite what everyone likes to claim, students even in challenging courses really are getting sleep, so it’s okay for them to, too.” Joe agrees: “There isn’t so much work in CS that you have to sacrifice normal sleeping hours for it.”

Luis, his family, and his well-rested classmates will undoubtedly be glad to hear it. The only question is: will their own descriptions of their work/sleep habits change to match reality, or are tales of hyper-caffeinated heroics too tempting to resist?

Appendix

The graphs above are simplified for readability, and aggregated into 6-hour increments. Below we include graphs of the raw data in 3-hour increments. This shows that there is some work going on from 12am-3am the night before assignments are due, but essentially nothing from 3am-6am.

In both of these classes, we were also performing experiments on code review, so the raw data includes when students read the code reviews they received, in addition to when they submitted their work. Since the review necessarily happens after submission, and the reading of the review after that, we see many more “late” events for reading reviews.
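
For readers who want to produce similar charts from their own event logs, the bucketing into 6-hour or 3-hour segments can be sketched in Python as follows. The timestamps are made up, the function names are hypothetical, and the numeric labels ("00-06", "18-24") stand in for the "M-6" and "6-M" labels used in the graphs.

```python
from collections import Counter
from datetime import datetime

def segment_label(ts, hours_per_segment=6):
    """Label the segment of the day containing ts, e.g. '18-24'.
    hours_per_segment is expected to divide 24 evenly."""
    start = (ts.hour // hours_per_segment) * hours_per_segment
    return f"{start:02d}-{start + hours_per_segment:02d}"

def histogram(timestamps, hours_per_segment=6):
    """Count events per (date, segment) pair."""
    return Counter((ts.date(), segment_label(ts, hours_per_segment))
                   for ts in timestamps)

# Made-up submission times on a single day.
stamps = [
    datetime(2014, 3, 3, 23, 10),
    datetime(2014, 3, 3, 19, 45),
    datetime(2014, 3, 3, 2, 5),
    datetime(2014, 3, 3, 14, 0),
]
six_hour = histogram(stamps)
three_hour = histogram(stamps, hours_per_segment=3)
```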

CS019 in 3-hour increments:

CS173 in 3-hour increments:

Parley: User Studies for Syntax Design

2014-04-01T00:00:00+00:00

Programming languages' syntax choices have always been cause for spirited, yet unresolvable, debate. Opinions and so-called best practices dominate discussion, with little empirical evidence to justify them. As part of the Pyret project, we've been performing one of the most comprehensive empirical studies on programming language syntax.

To recap some of the issues: Many languages repurpose plain English words for keywords, and run afoul of the impedance mismatch between, for instance, the dictionary meaning of switch and its semantics within the language (see Language-Independent Conceptual "Bugs" in Novice Programming for a much more detailed discussion). Another alternative, using non-ASCII symbols which cannot have their meaning conflated, is promising but doesn't work well with traditional editors (APL, we're looking at you).

We are, instead, evaluating the use of unconventional syntax motivated by a non-technical, easily-interpreted lexicon. We refer to these cohesive lexicons as lingos, and have begun experimenting with one lingo-based syntax for Pyret, which we call Parley. It is best to see the Parley lingo in action, compared to the more traditional syntax in early versions of Pyret:

Normal Pyret
var sum = 0
var arr = [1,2,3,4,5,6,7,8]
for each(piece from arr):
  sum := sum + piece
end

With Parley Lingo Enabled
yar sum be 0
yar arr be [1,2,3,4,5,6,7,8]
fer each(piece of arr):
  sum now be sum + piece
end

While it should seem obvious to the casual reader that Parley lingo is a strict improvement, this is not a substitute for empirical evaluation. To this end, we have been running a rummy series of user studies to validate the new syntax. Our latest experiments are testing program comprehension using an aye-tracker. The results, which will be submitted to the Principles of Pirate Lingo conference, are pending clearance from our Aye Arr Bee.

Typechecking Uses of the jQuery Language

2014-01-17T00:00:00+00:00

Manipulating HTML via JavaScript is often a frustrating task: the APIs for doing so, known as the Document Object Model (or DOM), are implemented to varying degrees of fidelity across different browsers, leading to browser-specific workarounds that contort the code. Worse, the APIs are too low-level: they amount to an “assembly language for trees”, allowing programmers to navigate from a node to its parent, children or adjacent siblings; add, remove, or modify a node at a time; etc. And like assembly programming, these low-level operations are imperative and often error-prone.

Fortunately, developers have created libraries that abstract away from this low-level tedium to provide a powerful, higher-level API. The best-known example of these is jQuery, and its API is so markedly different from the DOM API that it might as well be its own language — a domain-specific language for HTML manipulation.

With great language comes great responsibility

Every language has its own idiomatic forms, and therefore has its own characteristic failure modes and error cases. Our group has been studying various (sub-)languages of JavaScript (JS itself, ADsafety, private browsing violations), so the natural question to ask is, what can we do to analyze jQuery as a language?

This post takes a tour of the solution we have constructed to this problem. In it, we will see how a sophisticated technique, a dependent type system, can be specialized for the jQuery language in a way that:

Retains decidable type-inference,

Requires little programmer annotation overhead,

Hides most of the complexity from the programmer, and yet

Yields non-trivial results.

Engineering such a system is challenging, but is aided by the fact that jQuery’s APIs are well-structured enough that we can extract a workable typed description of them.

What does a jQuery program look like?

Let’s get a feel for those idiomatic pitfalls of jQuery by looking at a few tiny examples. We’ll start with a (seemingly) simple question: what does this one line of code do?

$(".tweet span").next().html()

With a little knowledge of the API, jQuery’s code is quite readable: get all the ".tweet span" nodes, get their next() siblings, and…get the HTML code of the first one of them. In general, a jQuery expression consists of three parts: an initial query to select some nodes, a sequence of navigational steps that select nearby nodes related to the currently selected set, and finally some manipulation to retrieve or set data on those node(s).

This glib explanation hides a crucial assumption: that there exist nodes in the page that match the initial selector in the first place! So jQuery programs’ behavior depends crucially on the structure of the pages in which they run.

Given the example above, the following line of code should behave analogously:

$(".tweet span").next().text()

But it doesn’t—it actually returns the concatenated text content of all the selected nodes, rather than just the first. So jQuery APIs have behavior that depends heavily on how many nodes are currently selected.
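To make the difference concrete, here is a minimal sketch of the two getters. It is a toy model in plain JavaScript, not jQuery itself: the ToyQuerySet class and the html and text fields on each node are our own hypothetical stand-ins, chosen only to mirror how the html() and text() getters treat a multi-element selection.

```javascript
// Toy stand-in for a jQuery query set; this is NOT jQuery itself.
// Each "node" is a plain object with hypothetical `html` and `text` fields.
class ToyQuerySet {
  constructor(nodes) {
    this.nodes = nodes;
  }

  // Mirrors the .html() getter: reads only the first selected node.
  html() {
    return this.nodes.length > 0 ? this.nodes[0].html : undefined;
  }

  // Mirrors the .text() getter: concatenates the text of every selected node.
  text() {
    return this.nodes.map((node) => node.text).join("");
  }
}

const selected = new ToyQuerySet([
  { html: "<b>10:01</b>", text: "10:01" },
  { html: "<b>10:05</b>", text: "10:05" },
]);

console.log(selected.html()); // "<b>10:01</b>"
console.log(selected.text()); // "10:0110:05"

// With nothing selected, neither call fails; both quietly return a "nothing" value.
const empty = new ToyQuerySet([]);
console.log(empty.html()); // undefined
console.log(empty.text()); // ""
```

The empty case is the one that tends to bite: neither getter raises an error, so a selector that matches nothing only shows up later as a missing value.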

Finally, we might try to manipulate the nodes by mapping a function across them via jQuery’s each() method, and here we have a classic problem that appears in any language: we must ensure that the mapped function is only applied to the correct sorts of HTML nodes.

Our approach

The examples above give a quick flavor of what can go wrong:

The final manipulation might not be appropriate for the type of nodes actually in the query-set.

The navigational steps might “fall off the edge of the tree”, navigating by too many steps and resulting in a vacuous query-set.

The initial query might not match any nodes in the document, or match too many nodes.

Our tool of choice for resolving all these issues is a type system for JavaScript. We’ve used such a system before to analyze the security of Firefox browser extensions. The type system is quite sophisticated to handle JavaScript idioms precisely, and while we take advantage of that sophistication, we also completely hide the bulk of it from the casual jQuery programmer.

To adapt our prior system for jQuery, we need two technical insights: we define multiplicities, a new kind that allows us to approximate the size of a query set in its type and ensure that APIs are only applied to correctly-sized sets, and we define local structure, which allows developers to inform the type system about the query-related parts of the page.

Technical details

Multiplicities

Typically, when type system designers are confronted with the need to keep track of the size of a collection, they turn to a dependently typed system, where the type of a collection can depend on, say, a numeric value. So “a list of five booleans” has a distinct type from “a list of six booleans”. This is very precise, allowing some APIs to be expressed with exactitude. It does come at a heavy cost: most dependent type systems lose the ability to infer types, requiring hefty programmer annotations. But is all this precision necessary for jQuery?

Examining jQuery’s APIs reveals that its behavior can be broken down into five cases: methods might require their invocant query-set contain

Zero elements of any type T, written 0<T>

One element of some type T, written 1<T>

Zero or one elements of some type T, written 01<T>

One or more elements of some type T, written 1+<T>

Any number (zero or more) elements of some type T, written 0+<T>

These cases effectively abstract the exact size of a collection into a simple, finite approximation. None of jQuery’s APIs actually care if a query set contains five or fifty elements; all that matters is that it contains one-or-more. And therefore our system can avoid the very heavyweight onus of a typical dependent type system, instead using this simpler abstraction without any loss of necessary precision.

Moreover, we can manipulate these cases using interval arithmetic: for example, combining one collection of zero-or-one elements with another collection containing exactly one element yields a collection of one-or-more elements: 01<T> + 1<T> = 1+<T>. This is just interval addition.

jQuery’s APIs all effectively map some behavior across a query-set. Consider the next() method: given a collection of at least one element, it transforms each element into at most one element (because some element might not have any next sibling). The result is a collection of zero-or-more elements (zero if no element has a next sibling). Symbolically: 1+<01<T>> = 0+<T>. This is just interval multiplication.
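The two operations can be sketched in a few lines of JavaScript. This is only an illustration under our own assumptions, not the implementation in our typechecker: we model each multiplicity as an interval [lo, hi] of possible sizes, and the names MULTIPLICITIES, approximate, add, and multiply are hypothetical.

```javascript
// Each multiplicity is modeled as an interval [lo, hi] on the number of elements.
const MULTIPLICITIES = {
  "0": [0, 0],
  "1": [1, 1],
  "01": [0, 1],
  "1+": [1, Infinity],
  "0+": [0, Infinity],
};

// Multiply two bounds, treating 0 * Infinity as 0 (plain JS arithmetic gives NaN).
function timesBound(a, b) {
  return a === 0 || b === 0 ? 0 : a * b;
}

// Round an arbitrary interval to the tightest of the five multiplicities containing it.
function approximate([lo, hi]) {
  if (hi === 0) return "0";
  if (lo === 1 && hi === 1) return "1";
  if (lo === 0 && hi === 1) return "01";
  if (lo >= 1) return "1+";
  return "0+";
}

// Interval addition: combining two collections.
function add(m, n) {
  const [lo1, hi1] = MULTIPLICITIES[m];
  const [lo2, hi2] = MULTIPLICITIES[n];
  return approximate([lo1 + lo2, hi1 + hi2]);
}

// Interval multiplication: `outer` many elements, each mapped to `inner` many.
function multiply(outer, inner) {
  const [lo1, hi1] = MULTIPLICITIES[outer];
  const [lo2, hi2] = MULTIPLICITIES[inner];
  return approximate([timesBound(lo1, lo2), timesBound(hi1, hi2)]);
}

console.log(add("01", "1"));       // "1+"
console.log(multiply("1+", "01")); // "0+"
```

Note that 01 + 1 is really the interval [1, 2]; rounding it up to 1+ loses nothing that any jQuery API cares about.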

We can use these two operations to describe the types of all of jQuery’s APIs. For example, the next() method can be given the type Forall T, 1+<T> -> 0+<@next<T>>.

Local structure

Wait, what’s @next? Recall that we need to connect the type system to the content of the page. So we ask programmers to define local structures that describe the portions of their pages that they intend to query. Think of them as “schemas for page fragments”: while it is not reasonable to ask programmers to schematize parts of the page they neither control nor need to manipulate (such as third-party ads), they certainly must know the structure of those parts of the page that they query! A local structure for, say, a Twitter stream might say:

(Tweet : Div classes = {tweet} optional = {starred}
  (Author : Span classes = {span})
  (Time : Span classes = {time})
  (Content : Span classes = {content}))

Read this as “A Tweet is a Div that definitely has class tweet and might have class starred. It contains three children elements in order: an Author, a Time, and a Content. An Author is a Span with class span…”

(If you are familiar with templating libraries like mustache.js, these local structures might look familiar: they are just the widgets you would define for template expansion.)

From these definitions, we can compute crucial relationships: for instance, the @next sibling of an Author is a Time. And that in turn completes our understanding of the type for next() above: for any local structure type T, next() can be called on a collection of at least one T and will return a collection of zero or more elements that are the @next siblings of Ts.

Note crucially that while the uses of @next are baked into our system, the function itself is not defined once-and-for-all for all pages: it is computed from the local structures that the programmer defined. In this way, we’ve parameterized our type system. We imbue it with knowledge of jQuery’s fixed API, but leave a hole for programmers to provide their page-specific information, and thereby specialize our system for their code.

Of course, @next is not the only function we compute over the local structure. We can compute @parents, @siblings, @ancestors, and more. But all of these functions are readily deducible from the local structure.
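To give a feel for how such functions fall out of the local structure, here is a rough sketch. The object encoding of the Tweet structure and the names computeRelations and nextOfAll are assumptions made for illustration; the real system works over richer structures (classes, optional classes, repetition) that this sketch ignores.

```javascript
// Hypothetical object encoding of the Tweet local structure from the post.
const tweetStructure = {
  name: "Tweet",
  children: [
    { name: "Author", children: [] },
    { name: "Time", children: [] },
    { name: "Content", children: [] },
  ],
};

// Walk the structure once, recording each element's next sibling and parent.
function computeRelations(root) {
  const nextOf = new Map([[root.name, null]]);
  const parentOf = new Map([[root.name, null]]);
  function visit(node) {
    node.children.forEach((child, index) => {
      const sibling = node.children[index + 1];
      nextOf.set(child.name, sibling ? sibling.name : null);
      parentOf.set(child.name, node.name);
      visit(child);
    });
  }
  visit(root);
  return { nextOf, parentOf };
}

// Apply @next to a set of structure names, dropping names with no next sibling.
function nextOfAll(names, nextOf) {
  const result = [];
  for (const name of names) {
    const sibling = nextOf.get(name);
    if (sibling && !result.includes(sibling)) result.push(sibling);
  }
  return result;
}

const { nextOf, parentOf } = computeRelations(tweetStructure);
console.log(nextOf.get("Author")); // "Time"
console.log(parentOf.get("Time")); // "Tweet"

// Repeatedly stepping from the children of a Tweet eventually falls off the edge.
const children = tweetStructure.children.map((child) => child.name);
const afterOne = nextOfAll(children, nextOf);   // ["Time", "Content"]
const afterTwo = nextOfAll(afterOne, nextOf);   // ["Content"]
const afterThree = nextOfAll(afterTwo, nextOf); // []
console.log(afterOne, afterTwo, afterThree);
```

Once the set of possible element kinds is empty, the query set is known to be vacuous, which is exactly the situation a type error should report.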

Ok, so what does this all do for me?

Continuing with our Tweet example above, our system provides the following output for these next queries:

// Mistake made: one too many calls to .next(), so the query set is empty
$(".tweet").children().next().next().next().css("color", "red")
// ==> ERROR: 'css' expects 1+<Element>, got 0<Element>
// Mistake made: one too few calls to .next(), so the query set is too big
$(".tweet #myTweet").children().next().css("color")
// ==> ERROR: 'css' expects 1<Element>, got 1+<Author+Time>
// Correct query
$(".tweet #myTweet").children().next().next().css("color")
// ==> Typechecks successfully

The paper contains additional examples, showing how the types progress across more elaborate jQuery expressions.

The big picture

We have defined a set of types for the jQuery APIs that captures their intended behavior. These types are expressed using helper functions such as @next, whose behavior is specific to each page. We ask programmers merely to define the local structures of their page, and from that we compute the helper functions we need. And finally, from that, our system can produce type errors whenever it encounters the problematic situations we listed above. No additional programmer effort is needed, and the type errors produced are typically local and pertinent to fixing buggy code.

Further reading

Obviously we have elided many of the nitty-gritty details that make our system work. We’ve written up our full system, with more formal definitions of the types and worked examples of finding errors in buggy queries and successfully typechecking correct ones. The writeup also explains some surprising subtleties of the type environment, and proposes some API enhancements to jQuery that were suggested by difficulties in engineering and using the types we defined.

Verifying Extensions’ Compliance with Firefox's Private Browsing Mode

2013-08-19T00:00:00+00:00

All modern browsers now support a “private browsing mode”, in which the browser ostensibly leaves behind no traces on the user's file system of the user's browsing session. This is quite subtle: browsers have to handle caches, cookies, preferences, bookmarks, deliberately downloaded files, and more. So browser vendors have invested considerable engineering effort to ensure they have implemented it correctly.

Firefox, however, supports extensions, which allow third party code to run with all the privilege of the browser itself. What happens to the security guarantee of private browsing mode, then?

The current approach

Currently, Mozilla curates the collection of extensions, and any extension must pass through a manual code review to flag potentially privacy-violating behaviors. This is a daunting and tedious task. Firefox contains well over 1,400 APIs, of which about twenty are obviously relevant to private-browsing mode, and another forty or so are less obviously relevant. (It depends heavily on exactly what we mean by the privacy guarantee of “no traces left behind”: surely the browser should not leave files in its cache, but should it let users explicitly download and save a file? What about adding or deleting bookmarks?) And, if the APIs or definition of private-browsing policy ever change, this audit must be redone for each of the thousands of extensions.

The asymmetry in this situation should be obvious: Mozilla auditors should not have to reconstruct how each extension works; it should be the extension developers' responsibility to convince the auditor that their code complies with private-browsing guarantees. After all, they wrote the code! Moreover, since auditors are fallible people, too, we should look to (semi-)automated tools to lower their reviewing effort.

Our approach

So what property, ultimately, do we need to confirm about an extension's code to ensure its compliance? Consider the pseudo-code below, which saves the current preferences to disk every few minutes:

var prefsObj = ...;
const thePrefsFile = "...";
function autoSavePreferences() {
  if (inPrivateBrowsingMode()) {
    // ...must store data only in memory...
    return;
  } else {
    // ...allowed to save data to disk...
    var file = openFile(thePrefsFile);
    file.write(prefsObj.toString());
  }
}
window.setTimeout(autoSavePreferences, 3000);

The key observation is that this code really defines two programs that happen to share the same source code: one program runs when the browser is in private browsing mode, and the other runs when it isn't. And we simply do not care about one of those programs, because extensions can do whatever they'd like when not in private-browsing mode. So all we have to do is “disentangle” the two programs somehow, and confirm that the private-browsing version does not contain any file I/O.

Technical insight

Our tool of choice for this purpose is a type system for JavaScript. We've used such a system before to analyze the security of the ADsafe sandbox. The type system is quite sophisticated to handle JavaScript idioms precisely, but for our purposes here we need only part of its expressive power. We need three pieces: first, three new types; second, specialized typing rules; and third, an appropriate type environment.

We define one new primitive type: Unsafe. We will ascribe this type to all the privacy-relevant APIs.

We use union types to define Ext, the type of “all private-browsing-safe extensions”, namely: numbers, strings, booleans, objects whose fields are Ext, and functions whose argument and return types are Ext. Notice that Unsafe “doesn’t fit” into Ext, so attempting to use an unsafe function, or pass it around in extension code, will result in a type error.

Instead of defining Bool as a primitive type, we define True and False as primitive types, and define Bool as their union.

We'll also add two specialized typing rules:

If an expression has some union type, and only one component of that union actually typechecks, then we optimistically say that the expression typechecks even with the whole union type. This might seem very strange at first glance: surely, the expression 5("true") shouldn't typecheck? But remember, our goal is to prevent privacy violations, and the code above will simply crash; it will never write to disk. Accordingly, we permit this code in our type system.

We add special rules for typechecking if-expressions. When the condition typechecks at type True, we only check the then-branch; when the condition typechecks at type False, we only check the else-branch. (Otherwise, we check both branches as normal.)

Finally, we add the typing environment which initializes the whole system:

We give all the privacy-relevant APIs the type Unsafe.

We give the API inPrivateBrowsingMode() the type True. Remember: we just don't care what happens when it's false!

Put together, what do all these pieces achieve? Because Unsafe and Ext are disjoint from each other, we can safely segregate any code into two pieces that cannot communicate with each other. By carefully initializing the type environment, we make Unsafe precisely delineate the APIs that extensions should not use in private browsing mode. The typing rules for if-expressions plus the type for inPrivateBrowsingMode() amount to disentangling the two programs from each other: essentially, it implements dead-code elimination at type-checking time. Lastly, the rule about union types makes the system much easier for programmers to use, since they do not have to spend any effort satisfying the typechecker about properties other than this privacy guarantee.
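The disentangling step can be illustrated with a deliberately tiny checker. This is a sketch under our own assumptions rather than the real type system: the three-node AST, the env table, and the log API are hypothetical, and the sketch simply flags calls to Unsafe APIs instead of checking that every value fits in Ext.

```javascript
// Hypothetical type environment; API names follow the post's pseudo-code.
const env = {
  inPrivateBrowsingMode: "True", // we only care about the private-browsing program
  openFile: "Unsafe",            // privacy-relevant API
  log: "Ext",                    // hypothetical harmless API
};

// Tiny AST: { kind: "call", fn }, { kind: "seq", body: [...] },
// and { kind: "if", cond, then, else }.
function check(node, errors = []) {
  switch (node.kind) {
    case "call":
      if (env[node.fn] === "Unsafe") errors.push(`use of Unsafe API '${node.fn}'`);
      break;
    case "seq":
      node.body.forEach((child) => check(child, errors));
      break;
    case "if": {
      // Conditions of unknown type are treated as Bool: both branches get checked.
      const condType = (node.cond.kind === "call" && env[node.cond.fn]) || "Bool";
      check(node.cond, errors);
      if (condType !== "False") check(node.then, errors); // skipped when cond : False
      if (condType !== "True") check(node.else, errors);  // skipped when cond : True
      break;
    }
  }
  return errors;
}

const call = (fn) => ({ kind: "call", fn });

// if (inPrivateBrowsingMode()) { log() } else { openFile() }
const guarded = {
  kind: "if",
  cond: call("inPrivateBrowsingMode"),
  then: call("log"),
  else: call("openFile"),
};

// openFile() with no guard at all
const unguarded = { kind: "seq", body: [call("openFile")] };

console.log(check(guarded));   // []
console.log(check(unguarded)); // [ "use of Unsafe API 'openFile'" ]
```

Because the condition has type True, the else-branch is never examined, so the disk write there is invisible to the checker; the same write outside the guard is rejected.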

In short, if a program passes our typechecker, then it must not call any privacy-violating APIs while in private-browsing mode, and hence is safe. No audit needed!

Wait, what about exceptions to the policy?

Sometimes, extensions have good reasons for writing to disk even while in private-browsing mode. Perhaps they're updating their anti-phishing blacklists, or they implement a download-helper that saves a file the user asked for, or they are a bookmark manager. In such cases, there simply is no way for the code to typecheck. As in any type system, we provide a mechanism to break out of the type system: an unchecked typecast. We currently write such casts as cheat(T). Such casts must be checked by a human auditor: they are explicitly marking the places where the extension is doing something unusual that must be confirmed.

(In our original version, we took our cue from Haskell and wrote such casts as unsafePerformReview, but sadly that is tediously long to write.)

But does it work?

Yes.

We manually analyzed a dozen Firefox extensions that had already passed Mozilla's auditing process. We annotated the extensions with as few type annotations as possible, with the goal of forcing the code to pass the typechecker, cheating if necessary. These annotations found five extensions that violated the private-browsing policy: they could not be typechecked without using cheat, and the unavoidable uses of cheat pointed directly to where the extensions violated the policy.

Further reading

We've written up our full system, with more formal definitions of the types and worked examples of the annotations needed. The writeup also explains how we create the type environment in more detail, and what work is necessary to adapt this system to changes in the APIs or private-browsing implementation.

From MOOC Students to Researchers

2013-06-18T00:00:00+00:00

Much has been written about MOOCs, including the potential for their users to be treated, in effect, as research subjects: with tens of thousands of users, patterns in their behavior will stand out starkly with statistical significance. Much less has been written about using MOOC participants as researchers themselves. This is the experiment we ran last fall, successfully.

Our goal was to construct a “tested semantics” for Python, a popular programming language. This requires some explanation. A semantics is a formal description of the behavior of a language so that, given any program, a user can precisely predict what the program is going to do. A “tested” semantics is one that is validated by checking it against real implementations of the language itself (such as the ones that run on your computer).

Constructing a tested semantics requires covering all of a large language, carefully translating its every detail into a small core language. Sometimes, a feature can be difficult to translate. Usually, this just requires additional quantities of patience, creativity, or elbow grease; in rare instances, it may require extending the core language. Doing this for a whole real-world language is thus a back-breaking effort.

Our group has had some success building such semantics for multiple languages and systems. In particular, our semantics for JavaScript has come to be used widely. The degree of interest and rapidity of uptake of that work made clear that there was interest in this style of semantics for other languages, too. Python, which is not only popular but also large and complex (much more so than JavaScript), therefore seemed like an obvious next target. However, whereas the first JavaScript effort (for version 3 of the language) took a few months for a PhD student and an undergrad, the second one (focusing primarily on the strict-mode of version 5) took far more effort (a post-doc, two PhD students, and a master's student). JavaScript 5 approaches, but still doesn't match, the complexity of Python. So the degree of resources we would need seemed daunting.

Crowdsourcing such an effort through, say, Mechanical Turk did not seem very feasible (though we encourage someone else to try!). Rather, we needed a trained workforce with some familiarity with the activity of formally defining a programming language. In some sense, Duolingo has a similar problem: to be able to translate documents it needs people who know languages. Duolingo addresses it by...teaching languages! In a similar vein, our MOOC on programming languages was going to serve the same purpose. The MOOC would deliver a large and talented workforce; if we could motivate them, we could then harness them to help perform the research.

During the semester, we therefore gave three assignments to get students warmed up on Python: 1, 2, and 3. By the end of these three assignments, all students in the class had had some experience wrestling with the behavior of a real (and messy) programming language, writing a definitional interpreter for its core, desugaring the language to this core, and testing this desugaring against (excerpts of) real test suites. The set of features was chosen carefully to be both representative and attainable within the time of the course.

(To be clear, we didn't assign these projects only because we were interested in building a Python semantics. We felt there would be genuine value for our students in wrestling with these assignments. In retrospect, however, this was too much work, and it interfered with other pedagogic aspects of the course. As a result, we're planning to shift this workload to a separate, half-course on implementing languages.)

Once the semester was over, we were ready for the real business to begin. Based on the final solutions, we invited several students (out of a much larger population of talent) to participate in taking the project from this toy sub-language to the whole Python language. We eventually ended up with an equal number of people who were Brown students and who were from outside Brown. The three Brown students were undergraduates; the three outsiders were an undergraduate student, a professional programmer, and a retired IT professional who now does triathlons. The three outsiders were from Argentina, China, and India. The project was overseen by a Brown CS PhD student.

Even with this talented workforce, and the prior preparation done through the course and in creating its assignments, getting the semantics to a reasonable state was a daunting task. It is clear to us that it would have been impossible to produce an artifact of equivalent quality, or even to come close, without this many people participating. As such, we feel our strategy of using the MOOC was vindicated. The resulting paper has just been accepted at a prestigious venue that was the most appropriate one for this kind of work, with eight authors: the lead PhD student, the three Brown undergrads, the three non-Brown people, and the professor.

A natural question is whether making the team even larger would have helped. As we know from Fred Brooks's classic The Mythical Man-Month, adding people to projects can often hurt rather than help. Therefore, the determinant is to what extent the work can be truly parallelized. Creating a tested semantics, as we did, has a fair bit of parallelism, but we may have been reaching its limits. Other tasks that have previously been crowdsourced (such as looking for stellar phenomena or trying to fold proteins) are, as the colloquial phrase has it, “embarrassingly parallel”. Most real research problems are unlikely to have this property.

In short, the participants of a MOOC don't only need to be thought of as passive students. With the right training and motivation, they can become equal members of a distributed research group, one that might even have staying power over time. Also, participation in such a project can help a person establish their research abilities even when they are at some remove from a research center. Indeed, the foreign undergraduate in our project will be coming to Brown as a master's student in the fall!

Would we do it again? For this fall, we discussed repeating the experiment, and indeed considered ways of restructuring the course to better support this goal. But before we do, we have decided to try to use machine learning instead. Once again, machines may put people out of work.

Social Ratings of Application Permissions (Part 4: The Goal)

2013-05-31T00:00:00+00:00

(This is the fourth post in our series on Android application permissions. Click through for Part 1, Part 2, and Part 3.)

In this, the final post in our application permissions series, we'll discuss our trajectory for this research. Ultimately, we want to enable users to make informed decisions about the apps they install on their smartphones. Unfortunately, informed consent becomes difficult when you are asking users to make decisions in an area in which they have little expertise. Rating systems allow users to rely on the collective expertise of other users.

We intend to integrate permission ratings into the app store in much the same way that functionality ratings are already there. This allows users to use visual cues they are already familiar with, such as the star rating that appears on the app page.

We may also wish to convey to users how each individual permission is rated. This finer-grained information gives users the ability to make decisions in line with their own priorities. For example, if a user is particularly concerned about the integrity of their email account, an app that has a low-rated email access permission may be unacceptable to them, even if the app receives otherwise high scores for permissions. We can again leverage well-known visual cues to convey this information, perhaps with meters similar to password meters, as seen in the mock-up image below.

There are a variety of other features we may want to incorporate into a permission rating system: allowing users to select favorite or trusted raters could enable them to rely on a particularly savvy relative or friend. Additionally, users could build a privacy profile, and view ratings only from like-minded users. Side-by-side comparisons of different apps' permission ratings could let users choose between similar apps more easily.

Giving users an easy way to investigate app permissions will allow them to make privacy a part of their decision-making process without requiring extra work or research on their part. This will improve the overall security of using a smartphone (or other permission-rich device), leaving users less vulnerable to unintended sharing of their personal data.

There's more! Click through to read Part 1, Part 2, and Part 3 of the series!

Social Ratings of Application Permissions (Part 3: Permissions Within a Domain)

2013-05-29T00:00:00+00:00

(This is the third post in our series on Android application permissions. Click through for Part 1, Part 2, and Part 4.)

In a prior post we discussed the potential value for a social rating system for smartphone apps. Such a system would give non-expert users some information about apps before installing them. Ultimately, the goal of such a system would be to help users choose between different apps with similar functionality (for an app they need) or decide if the payoff of an app is worth the potential risk of installing it (for apps they want). Both of these use cases would require conscientious ratings of permissions.

We chose to study this issue by considering the range of scores that respondents give to permissions. If respondents were not considering the permissions carefully, we would expect the score to be uniform across different permissions. We examined the top five weather forecasting apps in the Android marketplace: The Weather Channel, WeatherBug Elite, Acer Life Weather, WeatherPro, and AccuWeather Platinum. We chose weather apps because they demonstrate a range of permission requirements; Acer Life Weather requires only four permissions while AccuWeather Platinum and WeatherBug Elite each require eleven permissions. We asked respondents to rate an app's individual permissions as either acceptable or unacceptable.

Our findings, which we present in detail below, show that users will rate application permissions conscientiously. In short, we found that although the approval ratings for each permission are all over 50%, they vary significantly from permission to permission. Approval ratings for individual permissions ranged from 58.8% positive (for “Modify or delete the contents of your USB storage”) to 82.5% (for “Find accounts on the device”). The table at the bottom of this post shows the percentage of users who considered a given permission acceptable. Because the ratings range from acceptable to unacceptable, they are likely representative of a given permission's risk (unlike uniformly positive or negative reviews). This makes them effective tools for users in determining which applications they wish to install on their phones.

Meaningful ratings tell us that it is possible to build a rating system for application permissions to accompany the existing system for functionality. In our next post, we'll discuss what such a system might look like!

Permission                                          Rated acceptable
Modify or delete the contents of your USB storage   58.8%
Send sticky broadcast                               60%
Control vibration                                   67.5%
View Wi-Fi connections                              70%
Read phone status and identity                      70%
Test access to protected storage                    72.5%
Google Play license check                           73.8%
Run at startup                                      75.8%
Read Google service configuration                   76.3%
Full network access                                 76.5%
Approximate location                                79%
View network connections                            80.5%
Find accounts on the device                         82.5%

There's more! Click through to read Part 1, Part 2, and Part 4 of the series!

Social Ratings of Application Permissions (Part 2: The Effect of Branding)

2013-05-22T00:00:00+00:00

(This is the second post in this series. Click through for Part 1, Part 3, and Part 4.)

In a prior post, we introduced our experiments investigating user ratings of smartphone application permissions. In this post we'll discuss the effect that branding has on users' evaluation of an app's permissions. Specifically, what effect does a brand name have on users' perceptions and ratings of an app?

We investigated this question using four well-known apps: Facebook, Gmail, Pandora Radio, and Angry Birds. Subjects were presented with a description of the app and its required permissions. We created surveys displaying the information presented to users in the Android app store, and asked users to rate the acceptability of the app's required permissions, and indicate whether they would install the app on their phone. Some of the subjects were presented with the true description of the app including its actual name, and the rest were presented with the same description, but with the well-known name replaced by a generic substitute. For example, Gmail was disguised as Mangogo Mail.

In the cases of Pandora and Angry Birds, there were no statistically significant differences in subjects' responses between the two conditions. However, there were significant differences in the responses for Gmail and Facebook.

For Gmail, participants rated the generic version's permissions as less acceptable and were less likely to install that version. For Facebook, however, participants rated the permissions for the generic version as less acceptable, but it had no effect on whether subjects would install the app. These findings raise interesting questions. Are the differences in responses caused by privacy considerations or other concerns, such as ability to access existing accounts? Why are people more willing to install a less secure social network than an insecure email client?

It is possible that people would be unwilling to install a generic email application because they want to be certain they could access their existing email or social network accounts. To separate access concerns from privacy concerns, we did a follow-up study in which we asked subjects to evaluate an app that was an interface over a brand-name app. In Gmail's case, for instance, subjects were presented with Gmore!, an app purporting to offer a smoother interaction with one's Gmail account.

Our findings for the interface apps were similar to those for the generic apps: for Facebook, subjects rated the permissions as less acceptable, but there was no effect on the likelihood of their installing the app; for Gmail, subjects rated the permissions as less acceptable and were less likely to install the app. In fact, the app that interfaced with Gmail had the lowest installation rate of any of the apps: just 47% of respondents would install the app, as opposed to 83% for brand-name Gmail, and 71% for generic Mangogo Mail. This suggests that subjects were concerned about the privacy of the apps, not just their functionality.

It is interesting that the app meant to interface with Facebook showed no significant difference in installation rates. Perhaps users are less concerned about the information on a social network than the information in their email, and see the potential consequences of installing an insecure social network as less dire than those associated with installing an insecure email client. This is just speculation, and this question requires further examination.

Overall, it seems that branding may play a role in how users perceive a given app's permissions, depending on the app. We would like to examine the nuances of this effect in greater detail. Why does this effect occur in some apps but not others? When does the different perception of permissions affect installation rates and why? These questions are exciting avenues for future research!
There's more! Click through to read Part 1, Part 3, and Part 4 of the series!

Social Ratings of Application Permissions (Part 1: Some Basic Conditions)

2013-05-18T00:00:00+00:00

(This is the first post in our series on Android application permissions. Click through for Part 2, Part 3, and Part 4.)

Smartphones obtain their power from allowing users to install arbitrary apps to customize the device’s behavior. However, with this versatility comes risk to security and privacy.

Different manufacturers have chosen to handle this problem in different ways. Android requires all applications to display their permissions to the user before being installed on the phone (then, once the user installs it, the application is free to use its permissions as it chooses). The Android approach allows users to make an informed decision about the applications they choose to install (and to do so at installation time, not in the midst of a critical task), but making this decision can be overwhelming, especially for non-expert users who may not even know what a given permission means. Many applications present a large number of permissions to users, and it's not always clear why an application requires certain permissions. This requires users to gamble on how dangerous they expect a given application to be.

One way to help users is to rely on the expertise or experiences of other users, an approach that is already common in online marketplaces. Indeed, the Android application marketplace already allows users to rate applications. However, these reviews are meant to rate the application as a whole, and are not specific to the permissions required by the application. Therefore the overall star rating of an application is largely indicative of users’ opinions of the functionality of an application, not the security of the application. When users do offer opinions about security and privacy, as they sometimes do, these views are buried in text and lost unless the user reads all the comments.

Our goal is to make security and privacy ratings first-class members of the marketplace rating system. We have begun working on this problem, and will explain our preliminary results in this and a few more blog posts. All the experiments below were conducted on Mechanical Turk.

In this post, we examine the following questions:

Will people even rate the app's permissions? Even when there are lots of permissions to rate?

Does users’ willingness to install a given application change depending on when they are asked to make this choice - before they’ve reflected on the individual permissions or after?

Do their ratings differ depending on how they were told about the app?

The short answers to these questions are: yes (and yes), not really, and not really. In later posts we will introduce some more interesting questions and explore their effects.

We created surveys that mirrored the data provided by the Android installer (and as visible on the Google Play Web site). We examined four applications: Facebook, Gmail, Pandora, and Angry Birds. We asked respondents to rate the acceptability of the permissions required by each application and state whether they would install the application if they needed an app with that functionality.

In the first condition, respondents were asked whether they would install the app before or after they were asked to rate the app’s individual permissions. In this case, only Angry Birds showed any distinction between the two conditions: respondents were more likely to install the application if they were asked after rating the permissions.
Overall, however, the effect of asking before or after was very small; this is good, because it suggests that in the future we can ignore the overall rating, and it also offers some flexibility for interface design.

The second condition was how the subject heard about the app (or rather, how they were asked to imagine they heard about it). Subjects were asked to imagine either that the app had been recommended to them by a colleague, that the app was a top “featured app” in the app store, or that the app was a top rated app in the app store. In this case, only Facebook showed any interesting results: respondents were less likely to install the application if it had been recommended by a colleague than if it was featured or highly rated. This result is particularly odd given that, due to the network effect of an app like Facebook, we would expect the app to be more valuable if friends or colleagues also use it. We would like to study this phenomenon further.
Again, though this finding may be interesting, the fact that it has so little impact means we can set this condition aside in our future studies, thus narrowing the search space of factors that do affect how users rate permissions.

That concludes this first post on this topic. In future posts we’ll examine the effect of branding, and present detailed ratings of apps in one particular domain. Stay tuned!

There's more! Click through to read Part 2, Part 3, and Part 4 of the series!

Aluminum: Principled Scenario Exploration Through Minimality

2013-05-13T00:00:00+00:00

Software artifacts are hard to get right. Not just programs, but protocols, configurations, etc. as well! Analysis tools can help mitigate the risk at every stage in the development process.

Tools designed for scenario-finding produce examples of how an artifact can behave in practice. Scenario-finders often allow the results to be targeted to a desired output or test a particular portion of the artifact—e.g., producing counterexamples that disprove an assumption—and so they are fundamentally different from most forms of testing or simulation. Even better, concrete scenarios appeal to the intuition of the developer, revealing corner-cases and potential bugs that may never have been considered.

Alloy

The Alloy Analyzer is a popular light-weight scenario-finding tool. Let's look at a small example: a (partial) specification of the Java type system that is included in the Alloy distribution. The explanations below each portion are quoted from comments in the example's source file.

    abstract sig Type {subtypes: set Type}
    sig Class, Interface extends Type {}
    one sig Object extends Class {}
    sig Instance {type: Class}
    sig Variable {holds: lone Instance, type: Type}
Each type is either a class or an interface, and each has a set of subtypes. The Object class is unique. Every instance has a creation type. Each variable has a declared type and may (but need not) hold an instance.
    fact TypeHierarchy {
      Type in Object.*subtypes
      no t: Type | t in t.^subtypes
      all t: Type | lone t.~subtypes & Class
    }
    fact TypeSoundness {
      all v: Variable | v.holds.type in v.type.*subtypes
    }
These facts say that (1) every type is a direct or indirect subtype of Object; (2) no type is a direct or indirect subtype of itself; (3) every type is a subtype of at most one class; and (4) all instances held by a variable have types that are direct or indirect subtypes of the variable's declared type.
Alloy will let us examine how the model can behave, up to user-provided constraints. We can tell Alloy to show us scenarios that also meet some additional constraints:

    run {
      some Class - Object
      some Interface
      some Variable.type & Interface
    } for 4
We read this as: "Find a scenario in which (1) there is a class distinct from Object; (2) there is some interface; and (3) some variable has a declared type that is an interface."
Alloy always checks for scenarios up to a finite bound on each top-level type—4 in this case. Here is the one that Alloy gives:

Alloy's first scenario

This scenario illustrates a possible instance of the Java type system. It's got one class and one interface definition; the interface extends Object (by default), and the class implements the interface. There are two instances of that class. Variable0 and Variable1 both hold Instance1, and Variable2 holds Instance0.

Alloy provides a Next button that will give us more examples, when they exist. If we keep clicking it, we get a parade of hundreds more:

Another Scenario

Yet Another Scenario

Still Another Scenario...

Even a small specification like this one, with a relatively small size bound (like 4), can yield many hundreds of scenarios. That's after Alloy has automatically ruled out lots of superfluous scenarios that would be isomorphic to those already seen. Scenario-overload means that the pathological examples--the ones that the user needs to see--may be buried beneath many normal ones.

In addition, the order in which scenarios appear is up to the internal SAT-solver that Alloy uses. There is no way for a user to direct which scenario they see next without stopping their exploration, returning to the specification, adding appropriate constraints, and starting the parade of scenarios afresh. What if we could reduce the hundreds of scenarios that Alloy gives down to just a few, each of which represented many other scenarios in a precise way? What if we could let a user's goals guide their exploration without forcing a restart? It turns out that we can.

Aluminum: Beyond the Next Button

We have created a modified version of Alloy that we call Aluminum. Aluminum produces fewer, more concise scenarios and gives users additional control over their exploration. Let's look at Aluminum running on the same example as before, versus the first scenario that Alloy returned:

For reference:
The 1st Scenario from Alloy

The 1st Scenario from Aluminum

Compared to the scenario that Alloy produced, this one is quite small. That's not all, however: Aluminum guarantees that it is minimal: nothing can be removed from it without violating the specification! Above and beyond just seeing a smaller concrete example, a minimal scenario embodies necessity and gives users a better sense of the dependencies in the specification.
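To make the notion of minimality concrete, here is a small brute-force sketch. It is not how Aluminum works internally; it only illustrates the definition on a toy encoding. The fact strings, `toySpec`, `properSubsets`, and `isMinimal` are all hypothetical names invented for this illustration, and `toySpec` only loosely mimics the `run` constraints above.

```javascript
// Toy illustration of minimality; NOT Aluminum's implementation.
// A scenario is a list of facts (strings); a spec is a predicate over a Set of facts.

// Enumerate every proper subset of a list of facts (fine for tiny examples only).
function properSubsets(facts) {
  const subsets = [];
  const total = 1 << facts.length;
  for (let mask = 0; mask < total - 1; mask++) { // total - 1 is the full set, so skip it
    subsets.push(facts.filter((_, i) => mask & (1 << i)));
  }
  return subsets;
}

// A scenario is minimal if it satisfies the spec and no proper subset does.
function isMinimal(facts, satisfies) {
  if (!satisfies(new Set(facts))) return false;
  return properSubsets(facts).every((sub) => !satisfies(new Set(sub)));
}

// Hypothetical spec: there is some class, and some variable whose declared
// type is an interface that exists in the scenario.
function toySpec(facts) {
  const names = (prefix) =>
    [...facts].filter((f) => f.startsWith(prefix)).map((f) => f.slice(prefix.length));
  const classes = names("class:");
  const interfaces = names("interface:");
  const variables = names("var:");
  const typedByInterface = variables.some((v) =>
    interfaces.some((i) => facts.has(`type:${v}=${i}`)));
  return classes.length > 0 && typedByInterface;
}

const small = ["class:C0", "interface:I0", "var:V0", "type:V0=I0"];
const large = [...small, "var:V1", "instance:N0"];

console.log(isMinimal(small, toySpec)); // true
console.log(isMinimal(large, toySpec)); // false: `small` is a satisfying proper subset
```

Checking every proper subset, not just single removals, matters because a specification need not be monotone: removing one fact might break it while removing two restores it.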

There are usually far fewer minimal scenarios than scenarios overall. If we keep clicking Next, we get:

The 2nd Minimal Scenario

The 3rd Minimal Scenario

Asking for a 4th scenario via Next results in an error. That's because in this case, there are only three minimal scenarios.

What if...?

Alloy's random, rich scenarios may contain useful information that minimal scenarios do not. For instance, the Alloy scenarios we saw allowed multiple variables to exist and classes to be instantiated. None of the minimal scenarios illustrate these possibilities.

Moreover, a torrent of scenarios--minimal or otherwise--doesn't encourage a user to really explore what can happen. After seeing a scenario, a user may ask whether various changes are possible. For instance, "Can I add another variable to this scenario?" More subtle (but just as important) is: "What happens if I add another variable?" Aluminum can answer that question.

To find out, we instruct Aluminum to try augmenting the starting scenario with a new variable. Aluminum produces a fresh series of scenarios, each of which illustrates a way that the variable can be added:

The Three Augmented Scenarios
(The newly added variable is Variable0.)

From these, we learn that the new variable must have a type. In fact, these new scenarios cover the three possible types that the variable can receive.

It's worth mentioning that the three augmented scenarios are actually minimal themselves, just over a more restricted space—the scenarios containing the original one, plus an added variable. This ensures that the consequences shown are all truly necessary.

More Information

To learn more about Aluminum, see our paper and watch our video here or download the tool here.

A Privacy-Affecting Change in Firefox 20

2013-04-10T00:00:00+00:00

Attention, privacy-conscious Firefox users! Firefox has supported private browsing mode (PBM) for a long time, and Mozilla's guidelines for manipulating privacy sensitive data have not changed...but the implementation of PBM has, with potentially surprising consequences. Previously innocent extensions may now be leaking private browsing information. Read on for details.

Global Private-Browsing Mode: Easy!

In earlier versions of Firefox, PBM was an all-or-nothing affair: either all currently open windows were in private mode, or none of them were. This made handling sensitive data easy; the essential logic for privacy-aware extensions is simply:

    if (inPBM)
      ...must store data only in memory...
    else
      ...allowed to save data to disk...

(For the technically curious reader, there were some additional steps to take if extension code used shared modules: they additionally had to listen to events signalling exit from PBM, to flush any sensitive data from memory. This was not terribly hard; trying to "do the right thing" would generally work.)

Per-window Private Browsing: Trickier!

However, in Firefox 20, PBM is now a per-window setting. This means that both public and private windows can be open simultaneously. Now, the precautions above are not sufficient. Consider a session-management extension that periodically saves all open windows and tabs:

    window.setInterval(function() {
      if (inPBM)
        return;
      var allWindowsAndTabs = enumerateAllWindowsAndTabs();
      save(allWindowsAndTabs);
    }, 3000);

Most likely, this code internally uses Firefox's nsIWindowMediator API to produce the enumeration. The trouble is, that API does exactly what it claims: it enumerates all windows—public and private—regardless of the privacy status of the calling window. In particular, suppose that both public and private windows were open simultaneously, and the callback above ran in one of the public windows. Then inPBM would be false, and so the rest of the function would continue, enumerate all windows, and save them: a clear violation of private browsing, even though no code was running in the private browsing windows!

This code was perfectly safe in earlier versions of Firefox, because the possibility of having private and public windows open simultaneously just could not occur. This example demonstrates the need to carefully audit interactions between seemingly unrelated APIs, features, and modes—ideally, mechanically.
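One way to repair the session-manager example is to make the privacy decision per window instead of once for the calling window. The sketch below runs on mock window objects; `collectPublicWindows`, `saveSession`, the `private` flag, and the `isWindowPrivate` callback are all hypothetical stand-ins for whatever privacy query and enumeration the extension really uses.

```javascript
// Sketch of a per-window privacy check on mock data.
// `isWindowPrivate` is an assumed callback, not a specific Firefox API.
function collectPublicWindows(allWindows, isWindowPrivate) {
  // Decide for each enumerated window, not once for the window running the code.
  return allWindows.filter((win) => !isWindowPrivate(win));
}

function saveSession(allWindows, isWindowPrivate, save) {
  const publicWindows = collectPublicWindows(allWindows, isWindowPrivate);
  save(publicWindows);
  return publicWindows.length;
}

// Mock data standing in for the result of enumerating every open window.
const mockWindows = [
  { id: 1, private: false, tabs: ["news"] },
  { id: 2, private: true, tabs: ["bank"] },
  { id: 3, private: false, tabs: ["mail"] },
];

const saved = [];
saveSession(mockWindows, (win) => win.private, (wins) => saved.push(...wins));
console.log(saved.map((win) => win.id)); // [ 1, 3 ]
```

The key design choice is that the filter sits next to the enumeration, so no code path after it ever sees a private window.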

Takeaway Lessons

Private browsing “mode” is no longer as modal as it used to be. Privacy-conscious users need to take a careful look at the extensions they use—especially ones that observe browser-wide changes, like the session-manager example above—and double-check that they appear to behave properly with the new private-browsing changes, or (better yet) have been updated to support PBM explicitly.

Privacy-conscious developers need to take a careful look at their code and ensure that it's robust enough to handle these changed semantics for PBM, particularly in all code paths that occur after a check for PBM has returned false.

The New MOOR's Law

2013-04-01T00:00:00+00:00

Though this was posted on April 1, and the quotes in this article should be interpreted relative to that date, MOOR itself is not a joke. Students from our online course produced a semantics for Python, with a paper describing it published at OOPSLA 2013.

PROVIDENCE, RI, USA - With the advent of Massively Open and Online Courses (MOOCs), higher education faces a number of challenges. Services like Udacity and Coursera will provide learning services of equivalent quality for a fraction of the cost, and may render the need for traditional education institutions obsolete. With the majority of universities receiving much of their cash from undergraduate tuition, this makes the future of academic research uncertain as well.

"[Brown] lacks the endowment of a school like Harvard to weather the coming storm."
-Roberto Tamassia

Schools may eventually adapt, but it's unclear what will happen to research programs in the interim. “I'm concerned for the future of my department at Brown University, which lacks the endowment of a school like Harvard to weather the coming storm,” said Roberto Tamassia, chair of the Computer Science department at Brown.

But one Brown professor believes he has a solution for keeping his research program alive through the impending collapse of higher education. He calls it MOOR: Massively Open and Online Research. The professor, Shriram Krishnamurthi, claims that MOOR can be researchers' answer to MOOCs: “We've seen great user studies come out of crowdsourcing platforms like Mechanical Turk. MOOR will do the same for more technical research results; it's an effective complement to Turk.”

"MOOR ... is an effective complement to [Mechanical] Turk."
-Shriram Krishnamurthi

MOOR will reportedly leverage the contributions of novice scholars from around the globe in an integrated platform that aggregates experimental results and proposed hypotheses. This is combined with a unique algorithm that detects and flags particularly insightful contributions. This will allow research results never before possible, says Joe Gibbs Politz: “I figure only one in 10,000 people can figure out decidable and complete principal type inference for a language with prototype-based inheritance and equirecursive subtyping. So how could we even hope to solve this problem without 10,000 people working on it?”

Krishnamurthi has historically recruited undergraduate Brown students to aid him in his research efforts. Leo Meyerovich, a former Krishnamurthi acolyte, is excited about changing the structure of the educational system, and shares concerns about Krishnamurthi's research program. With the number of undergraduates sure to dwindle to nothing in the coming years, he says, “Shriram will need to come up with some new tricks.”

Brown graduates Ed Lazowska and Peter Norvig appear to agree. While Lazowska refused to comment on behalf of the CRA, saying its position on the matter is “still evolving,” speaking personally, he added, “evolve or die.” As of this writing, Peter Norvig has pledged a set of server farms and the 20% time of 160,000 Google Engineers to support MOOR.

With additional reporting from Kathi Fisler, Hannah Quay-de la Vallee, Benjamin S. Lerner, and Justin Pombrio.

Essentials of Garbage Collection

2013-02-19T00:00:00+00:00

How do we teach students the essential ideas behind garbage collection?

A garbage collector (GC) implementation must address many overlapping concerns. There are algorithmic decisions, e.g., the GC algorithm to use; there are representation choices, e.g., the layout of the stack; and there are assumptions about the language, e.g., unforgeable pointers. These choices, and more, affect each other and make it very difficult for students to get off the ground.

Programming language implementations can be similarly complex. However, over the years, starting from McCarthy's seminal paper, we've found it invaluable to write definitional interpreters for simple core languages that allow us to easily explore the consequences of language design. What is the equivalent for garbage collectors?

Testing and Curricular Dependencies

Testing is important for any program, and is doubly so for a garbage collector, which must correctly deal with complex inputs (i.e., programs!). Furthermore, correctness itself is subtle: besides being efficient, a collector must terminate, be sound (not collect non-garbage), be effective (actually collect enough true garbage), and preserve the essential structure of live data: e.g., given a cyclic datum, it must terminate, and given data with sharing, it must preserve the sharing-structure and not duplicate data.
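Two of these properties, termination on cyclic data and correct handling of sharing, can be illustrated with a marking traversal over an object graph. This is a minimal sketch on ordinary JavaScript objects, not a collector from the framework described in this post; `markReachable` and the node shape (`name`, `children`) are assumptions made for the example.

```javascript
// Mark phase over an object graph: terminates on cycles, visits shared nodes once.
function markReachable(roots) {
  const marked = new Set();
  const worklist = [...roots];
  while (worklist.length > 0) {
    const node = worklist.pop();
    if (marked.has(node)) continue; // already seen: this check handles cycles and sharing
    marked.add(node);
    for (const child of node.children) worklist.push(child);
  }
  return marked;
}

// Test data: a node shared by two parents, a self-referential node, and garbage.
const shared = { name: "shared", children: [] };
const left = { name: "left", children: [shared] };
const right = { name: "right", children: [shared] };
const cyclic = { name: "cyclic", children: [] };
cyclic.children.push(cyclic);
const garbage = { name: "garbage", children: [shared] };

const live = markReachable([left, right, cyclic]);
console.log(live.size);         // 4
console.log(live.has(garbage)); // false
```

Note that `garbage` points at live data but is not itself reachable from a root, which is exactly the kind of case a soundness test should include.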

For instructors, a further concern is the number of deep, curricular dependencies that teaching GC normally incurs. In a course where students already learn compilers and interpreters, it is possible (but hard) to teach GC. However, GC should be teachable in several contexts, such as a systems course, an architecture course, or even as a cool application of graph algorithms. Eliminating GC's curricular dependencies is thus valuable.

Even in courses that have students implement compilers or interpreters, testing a garbage collector is very hard. A test-case is a program that exercises the collector in an interesting way. Good tests require complex programs, and complex programs require a rich programming language. In courses, students tend to write compilers and interpreters for simple, core languages and are thus stuck writing correspondingly simple programs. Worse, some important GC features, such as the treatment of cycles and first-class functions, are impossible to test unless the programming language being implemented has the necessary features. Growing a core language for the sake of testing GC would be a case of the tail wagging the dog.

In short, we would like a pedagogic framework where:

Students can test their collectors using a full-blown programming language, instead of a stilted core language, and

Students can experiment with different GC algorithms and data-representations, without learning to implement languages.

The Key Idea

Over the past few years, we've built a framework for teaching garbage collection, first started by former Brown PhD student Greg Cooper.

Consider the following Racket program, which would make a great test-case for a garbage collector (with a suitably small heap):
    #lang racket
    (define (map f l)
      (cond
        [(empty? l) empty]
        [(cons? l) (cons (f (first l))
                         (map f (rest l)))]))
    (map (lambda (x) (+ x 1)) (list 1 2 3))
As written, Racket's native garbage collector allocates the closure created by lambda and the lists created by cons and list. But, what if we could route these calls to a student-written collector and have it manage memory instead?

Greg made the following simple observation. Why are these values being managed by the Racket memory system? That's because the bindings for lambda, cons and list (and other procedures and values) are provided by #lang racket. Instead, suppose that with just one change, we could swap in a student-written collector:
    #lang plai/mutator
    (allocator-setup "my-mark-and-sweep.rkt" 30)
Using the magic of Racket's modules and macros, #lang plai/mutator maps memory-allocating procedures to student-written functions. The student designates a module, such as my-mark-and-sweep that exports a small, prescribed interface for memory allocation. That module allocates values in a private heap and runs its own GC algorithm when it runs out of space. In our framework, the heap size is a parameter that is trivial to change. The program above selects a heap size of 30, which is small enough to trigger several GC cycles.

We've designed the framework so that students can easily swap in different collectors, change the size of the heap, and even use Racket's native collector, as a sanity check.

Our framework for writing garbage collectors.

Discovering GC Roots

Routing primitive procedures and constants is only half the story. A collector also needs to inspect and modify the program stack, which holds the roots of garbage collection. Since we allow students to write programs in Racket, the stack is just the Racket stack, which is not open for reflection. Therefore, we must also transform the program to expose the contents of the stack.

Our framework automatically transforms Racket programs to A-normal form, which names all intermediate expressions in each activation record. Unlike other transformations, say CPS, A-normalization preserves stack-layout. This makes it easy for students to co-debug their programs and collectors using the built-in DrRacket debugger.

In addition to inspecting individual activation records, a collector also needs to traverse the entire list of records on the stack. We maintain this list on the stack itself using Racket's continuation marks. Notably, continuation marks preserve tail-calls, so the semantics of Racket programs are unchanged.

The combination of A-normalization and continuation marks gives us a key technical advantage. In their absence, we would have to manually convert programs to continuation-passing style (followed by defunctionalization) to obtain roots. Not only would these be onerous curricular dependencies, but they would make debugging very hard.

Controlling and Visualizing the Heap

Of course, we would like to prevent collectors from cheating. An easy way to cheat is to simply allocate on the Racket heap. We do not go to great lengths to prevent all forms of cheating (these are carefully-graded homeworks, after all). However, we do demand that collectors access the “heap” through a low-level interface (where the heap is actually just one large Racket array). This interface prevents the storage of non-flat values, thereby forcing students to construct compound representations for compound data. Notably, our framework is flexible enough that we have students design their own memory representations.
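To give a feel for what "compound representations for compound data" means on a flat heap, here is a small sketch. It is written in JavaScript for illustration and is not the framework's actual interface: `heapSet`, `reserve`, `allocFlat`, `allocCons`, the tag strings, and the slot layout are all assumptions chosen for this example.

```javascript
// Hypothetical flat heap: one array whose slots may hold only flat values
// (numbers, booleans, short strings used as tags or symbols).
const HEAP_SIZE = 12;
const heap = new Array(HEAP_SIZE).fill(null);
let nextFree = 0;

// Store a value, rejecting anything that is not flat (e.g. objects or arrays).
function heapSet(addr, value) {
  const flat = ["number", "boolean", "string"].includes(typeof value);
  if (!flat) throw new Error("only flat values may be stored in the heap");
  if (addr < 0 || addr >= HEAP_SIZE) throw new Error("address out of bounds");
  heap[addr] = value;
}

// Bump allocation; a real collector would run its GC algorithm when space runs out.
function reserve(slots) {
  if (nextFree + slots > HEAP_SIZE) throw new Error("out of memory");
  const addr = nextFree;
  nextFree += slots;
  return addr;
}

// A flat value occupies two slots: a tag and the value itself.
function allocFlat(value) {
  const addr = reserve(2);
  heapSet(addr, "flat");
  heapSet(addr + 1, value);
  return addr;
}

// A pair occupies three slots: a tag and two addresses. Addresses are numbers,
// so the pair is built from flat values only.
function allocCons(firstAddr, restAddr) {
  const addr = reserve(3);
  heapSet(addr, "cons");
  heapSet(addr + 1, firstAddr);
  heapSet(addr + 2, restAddr);
  return addr;
}

const one = allocFlat(1);           // slots 0-1
const empty = allocFlat("empty");   // slots 2-3
const list = allocCons(one, empty); // slots 4-6
console.log(heap.slice(0, 7)); // [ 'flat', 1, 'flat', 'empty', 'cons', 0, 2 ]
```

Because a pair can only refer to its parts by address, a collector that moves or frees cells has to understand this layout, which is the point of forcing the encoding.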

The low-level heap interface has an added benefit. Because our framework can tell “where the heap is”, and has access to the student's heap-inspection procedures, we are able to provide a fairly effective heap visualizer:

Visualizing the heap of a student-written semispace collector. The visualizer dereferences pointers and draws arrows too!

Without this visualizer, students would have to print and examine memory themselves. We agree that studying core dumps builds character, but first-class debugging tools make for a gentler and more pleasant learning experience. Due to this and other conveniences, our students implement at least two distinct GC algorithms and extensive tests within two weeks.

HexFiend: Core dumps build character, but we don't inflict them on our students.

Learn More

We've been using this framework to teach garbage collection for several years at several universities. It is now included with Racket, too. If you'd like to play with the framework, check out one of our GC homework assignments.

A pedagogic paper about this work will appear at SIGCSE 2013.

References

[1] John Clements and Matthias Felleisen. A Tail-Recursive Machine with Stack Inspection. ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 26, No. 6, November 2004. (PDF)

[2] Gregory H. Cooper, Arjun Guha, Shriram Krishnamurthi, Jay McCarthy, and Robert Bruce Findler. Teaching Garbage Collection without Implementing Compilers or Interpreters. In ACM Technical Symposium on Computer Science Education (SIGCSE) 2013. (paper)

[3] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The Essence of Compiling with Continuations. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1993. (PDF)

[4] John McCarthy. Recursive Functions of Symbolic Expressions and their Computation by Machine, Part 1. Communications of the ACM, Volume 3, Issue 4, April 1960. (paper)

(Sub)Typing First Class Field Names

2012-12-10T00:00:00+00:00

This post picks up where a previous post left off, and jumps back into the action with subtyping.

Updating Subtyping

There are two traditional rules for subtyping records, historically known as width and depth subtyping. Width subtyping allows generalization by dropping fields; one record is a supertype of another if it contains a subset of the sub-record's fields at the same types. Depth subtyping allows generalization within a field; one record is a supertype of another if it is identical aside from generalizing one field to a supertype of the type in the same field in the sub-record.

We would like to understand both of these kinds of subtyping in the context of our first-class field names. With traditional record types, fields are either mentioned in the record or not. Thus, for each possible field in both types, there are four combinations to consider. We can describe width and depth subtyping in a table:

T₁        T₂        T₁ <: T₂ if...
f: S      f: T      S <: T
f: -      f: T      Never
f: S      f: -      Always
f: -      f: -      Always

We read f: S as saying that T₁ has the field f with type S, and we read f: - as saying that the corresponding type doesn't mention the field f. The first row of the table corresponds to depth subtyping, where the field f is still present, but at a more general type in T₂. The second row is a failure to subtype, when T₂ has a field that isn't mentioned at all in T₁. The third row corresponds to width subtyping, where a field is dropped and not mentioned in the supertype. The last row is a trivial case, where neither type mentions the field.
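The four rows can be read as a per-field check. The sketch below encodes record types as plain JavaScript objects from field names to base-type names, with a missing key playing the role of `f: -`. The encoding, `recordSubtype`, and the toy `intNum` relation are assumptions made for illustration, not part of the type system described here.

```javascript
// Record types are plain objects mapping field names to base-type names.
// A missing key plays the role of "f: -".
const has = (type, field) => Object.prototype.hasOwnProperty.call(type, field);

// `baseSubtype` is an assumed subtyping relation on field types (default: equality).
function recordSubtype(t1, t2, baseSubtype = (s, t) => s === t) {
  const fields = new Set([...Object.keys(t1), ...Object.keys(t2)]);
  for (const f of fields) {
    if (has(t1, f) && has(t2, f)) {
      if (!baseSubtype(t1[f], t2[f])) return false; // row 1: depth subtyping
    } else if (!has(t1, f) && has(t2, f)) {
      return false; // row 2: T2 mentions a field that T1 does not
    }
    // rows 3 and 4: T2 does not mention f, so there is nothing to check
  }
  return true;
}

// Toy base relation: Int <: Num, plus reflexivity.
const intNum = (s, t) => s === t || (s === "Int" && t === "Num");

console.log(recordSubtype({ x: "Int", y: "Bool" }, { x: "Num" }, intNum)); // true
console.log(recordSubtype({ x: "Int" }, { x: "Int", y: "Bool" }, intNum)); // false
```

The first call combines width subtyping (dropping `y`) with depth subtyping (generalizing `x`); the second fails on row 2.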

For records with string patterns, we can extend this table with new combinations to account for ○ and ↓ annotations. The old rows remain, and become the ↓ cases, and new rows are added for ○ annotations:

T₁         T₂         T₁ <: T₂ if...
f^↓: S     f^↓: T     S <: T
f^○: S     f^↓: T     Never
f: -       f^↓: T     Never
f^↓: S     f^○: T     S <: T
f^○: S     f^○: T     S <: T
f: -       f^○: T     Never
f^↓: S     f: -       Always
f^○: S     f: -       Always
f: -       f: -       Always

Here, we see that it is safe to treat a definitely present field as a possibly-present one (the case where we compare f^↓: S to f^○: T). The dual of this case, treating a possibly-present field as definitely-present, is unsafe, as the comparison of f^○: S to f^↓: T shows. Possibly-present annotations do not allow us to invent fields: having f: - on the left-hand side is still only permissible if the right-hand side also doesn't mention f.

Giving Types to Values

In order to ascribe these rich types to object values, we need rules for typing basic objects, and then we need to apply these subtyping rules to generalize them. As a working example, one place where objects with field patterns come up every day in practice is in JavaScript arrays. Arrays in JavaScript hold their elements in fields named by stringified numbers. Thus, a simplified type for a JavaScript array of booleans is roughly:

BoolArrayFull: { [0-9]+^○: Bool }

That is, each field made up of a sequence of digits is possibly present, and if it is there, it has a boolean value. For simplicity, let's consider a slightly neutered version of this type, where only single digit fields are allowed:

BoolArray: { [0-9]^○: Bool }

Let's think about how we'd go about typing a value that should clearly have this type: the array [true, false]. We can think of this array literal as desugaring into an object like the following (indeed, this is what λJS does):

{"0": true, "1": false}

We would like to be able to state that this object is a member of the BoolArray type above. The traditional rule for record typing would ascribe a type mapping the names that are present to the types of their right hand side. Since the fields are certainly present, in our notation we can write:

{"0"^↓: Bool, "1"^↓: Bool}

This type should certainly be generalizable to BoolArray. That is, it should hold (using the rules in the table above) that:

{"0"^↓: Bool, "1"^↓: Bool} <: { [0-9]^○: Bool }

Let's see what happens when we instantiate the table for these two types:

T₁ | T₂ | T₁ <: T₂ if...
0^↓: Bool | 0^○: Bool | Bool <: Bool
1^↓: Bool | 1^○: Bool | Bool <: Bool
2: - | 2^○: Bool | Fail!
3: - | 3^○: Bool | Fail!
4: - | 4^○: Bool | Fail!
... | ... | ...

(We cut off the table for 5-9, which are the same as the cases for 3 and 4.) Subtyping fails to hold for these types: the type on the left gives us no way to reflect the fact that fields like 3 and 4 are actually absent, in which case we should be allowed to consider them as possibly present at the boolean type. In fact, our straightforward rule for typing records is responsible for throwing away this information! The type that it ascribed,

{"0"^↓: Bool, "1"^↓: Bool}

is actually the type of many objects, including those that happen to have fields like "banana" : 42. Traditional record typing drops fields when it doesn't care if they are present or absent, which loses information about definitive absence.

We extend our type language once more to keep track of this information. We add an explicit piece of a record type that tracks a description of the fields that are definitely absent on an object, and use this for typing object literals:

p = ○ | ↓
T = ... | { L^p : T ..., L_A: abs }

Thus, the new type for {"0": true, "1": false} would be:

{"0"^↓: Bool, "1"^↓: Bool, ¬("0"|"1"): abs}

Here, ¬ denotes regular-expression complement, and this type is expressing that all fields other than "0" and "1" are definitely absent.

Adding another type of field annotation requires that we again extend our table of subtyping options, so we now have a complete description with 16 cases:

T₁ | T₂ | T₁ <: T₂ if...
f^↓: S | f^↓: T | S <: T
f^○: S | f^↓: T | Never
f: abs | f^↓: T | Never
f: - | f^↓: T | Never
f^↓: S | f^○: T | S <: T
f^○: S | f^○: T | S <: T
f: abs | f^○: T | Always
f: - | f^○: T | Never
f^↓: S | f: abs | Never
f^○: S | f: abs | Never
f: abs | f: abs | Always
f: - | f: abs | Never
f^↓: S | f: - | Always
f^○: S | f: - | Always
f: abs | f: - | Always
f: - | f: - | Always

We see that absent fields cannot be generalized to be definitely present (the abs to f^↓ case), but they can be generalized to be possibly present at any type. This is expressed in the case that compares f: abs to f^○: T, which always holds for any T. To see these rules in action, we can instantiate them for the array example we've been working with to ask a new question (where ¬ denotes regular-expression complement):

{"0"^↓: Bool, "1"^↓: Bool, ¬("0"|"1"): abs} <: { [0-9]^○: Bool }

And the table:

T₁ | T₂ | T₁ <: T₂ if...
0^↓: Bool | 0^○: Bool | Bool <: Bool
1^↓: Bool | 1^○: Bool | Bool <: Bool
2: abs | 2^○: Bool | OK!
3: abs | 3^○: Bool | OK!
4: abs | 4^○: Bool | OK!
... | ... | ...
9: abs | 9^○: Bool | OK!
foo: abs | foo: - | OK!
bar: abs | bar: - | OK!
... | ... | ...

Two things make this possible. First, it is sound to generalize the absent fields that are possibly present on the array type, because the larger type doesn't guarantee their presence either. Second, it is sound to generalize absent fields that aren't mentioned on the array type, because unmentioned fields can be present or absent with any type. The combination of these two features of our subtyping relation lets us generalize from particular array instances to the more general type for arrays.
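The field-by-field reading of the 16-case table can be sketched in JavaScript. This is only an illustration under stated assumptions: the representation of field entries, the names `fieldSubtype`, `entryFor`, and `subtypeOnSample`, and the choice to test a finite sample of field names are all hypothetical, and base-type subtyping is reduced to equality, which is enough for the Bool-only array example.

```javascript
// A field entry is one of (hypothetical encoding):
//   { kind: "present", type }   for f^↓: type
//   { kind: "possible", type }  for f^○: type
//   { kind: "abs" }             for f: abs
//   { kind: "none" }            for f: -  (field not mentioned)

// Assumption: base types are only related by equality (Bool <: Bool).
function baseSubtype(s, t) {
  return s === t;
}

// One row of the 16-case table: can `left` (from T₁) be generalized to `right` (from T₂)?
function fieldSubtype(left, right) {
  switch (right.kind) {
    case "present":
      return left.kind === "present" && baseSubtype(left.type, right.type);
    case "possible":
      if (left.kind === "abs") return true;   // absent generalizes to possibly present
      if (left.kind === "none") return false; // unmentioned fields cannot be invented
      return baseSubtype(left.type, right.type);
    case "abs":
      return left.kind === "abs";
    case "none":
      return true;                            // anything may be forgotten
    default:
      throw new Error("unknown field kind: " + right.kind);
  }
}

// A record type is a list of { matches, entry }; the first matching pattern wins.
function entryFor(recordType, name) {
  for (const { matches, entry } of recordType) {
    if (matches(name)) return entry;
  }
  return { kind: "none" };
}

// Check the table only on a finite sample of names (the real table is infinite).
function subtypeOnSample(t1, t2, names) {
  return names.every((n) => fieldSubtype(entryFor(t1, n), entryFor(t2, n)));
}

const literalType = [
  { matches: (n) => n === "0", entry: { kind: "present", type: "Bool" } },
  { matches: (n) => n === "1", entry: { kind: "present", type: "Bool" } },
];
const literalTypeWithAbs = [
  ...literalType,
  { matches: (n) => !/^(0|1)$/.test(n), entry: { kind: "abs" } }, // complement of ("0"|"1")
];
const boolArray = [
  { matches: (n) => /^[0-9]$/.test(n), entry: { kind: "possible", type: "Bool" } },
];

const sampleNames = ["0", "1", "2", "3", "9", "foo", "bar"];
console.log(subtypeOnSample(literalType, boolArray, sampleNames));        // false
console.log(subtypeOnSample(literalTypeWithAbs, boolArray, sampleNames)); // true
```

Checking a sample is not a decision procedure; it only mirrors how the tables above are read row by row.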

Capturing the Whole Table

The tables above present subtyping on a field-by-field basis, and the patterns we considered at first were finite. In the last case, however, the pattern of “fields other than 0 and 1” was in fact infinite, and we cannot actually construct that infinite table to describe subtyping. The writeup and its associated proof document lay out an algorithmic version of the rules presented in the tables above, and also provide a proof of their soundness.

The writeup also discusses another interesting problem, which is the interaction between these pattern types and inheritance, where patterns on the child and parent objects may overlap in subtle ways. It goes further and discusses what happens in cases like JavaScript, where the field "__proto__" is an accessible member that has inheritance semantics. Check it all out here!

Typing First Class Field Names

2012-12-03T00:00:00+00:00

In a previous post, we discussed some of the powerful features of objects in scripting languages. One feature that stood out was the use of first-class strings as member names for objects. That is, in programs like

var o = {name: "Bob", age: 22};
function lookup(f) {
  return o[f];
}
lookup("name");
lookup("age");

the name position in field lookup has been abstracted over. Presumably only a finite set of names actually works with the lookup (o appears to only have two fields, after all).

It turns out that so-called “scripting” languages aren't the only ones that compute fields for lookup. For example, even within the constraints of Java's type system, the Bean framework computes method names to call at runtime. Developers can provide information about the names of fields and methods on a Bean with a BeanInfo instance, but even if they don't provide complete information, “the rest will be obtained by automatic analysis using low-level reflection of the bean classes’ methods and applying standard design patterns.” These “standard design patterns” include, for example, concatenating the strings "get" and "set" onto field names to produce method names to invoke at runtime.
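To make the idea of computed method names concrete, here is a small JavaScript sketch of the same pattern. It is not the Bean framework itself: `getterName`, `personBean`, and `readProperty` are hypothetical names, and the capitalization rule is an assumption modeled on the usual getter naming convention.

```javascript
// Assumed convention: property "name" is read through a method called "getName".
function getterName(field) {
  return "get" + field.charAt(0).toUpperCase() + field.slice(1);
}

// A hypothetical bean-like object whose accessors follow that convention.
const personBean = {
  getName() { return "Bob"; },
  getAge() { return 22; },
};

// The method name is computed from a string at runtime, then looked up and invoked.
function readProperty(obj, field) {
  const method = obj[getterName(field)];
  if (typeof method !== "function") {
    throw new TypeError("no getter for property: " + field);
  }
  return method.call(obj);
}

console.log(readProperty(personBean, "name")); // Bob
console.log(readProperty(personBean, "age"));  // 22
```

The lookup `obj[getterName(field)]` is exactly the kind of access that a type system with only fixed field names cannot describe.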

Traditional type systems for objects and records have little to say about these computed accesses. In this post, we're going to build up a description of object types that can describe these values, and explore their use. The ideas in this post are developed more fully in a writeup for the FOOL Workshop.

First-class Singleton Strings

In the JavaScript example above, we said that it's likely that the only intended lookup targets―and thus the only intended arguments to lookup―are "name" and "age". Giving a meaningful type to this function is easy if we allow singleton strings as types in their own right. That is, if our type language is:

T = s | Str | Num | { s : T ... } | T → T | T ∩ T

Where s stands for any singleton string, Str and Num are base types for strings and numbers, respectively, record types are a map from singleton strings s to types, arrow types are traditional pairs of types, and intersections are allowed to express a conjunction of types on the same value.

Given these definitions, we could write the type of lookup as:

lookup : ("name" → Str) ∩ ("age" → Num)

That is, if lookup is provided the string "name", it produces a string, and if it is provided the string "age", it produces a number.

In order to type-check the body of lookup, we need a type for o as well. That can be represented with the type { "name" : Str, "age" : Num }. Finally, to type-check the object lookup expression o[f], we need to compare the singleton string type of f with the fields of o. In this case, only the two strings that are already present on o are possible choices for f, so the comparison is easy and type-checking works out.
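The comparison of singleton string types against the fields of o can be sketched as follows. This is a toy checker, not the system from the writeup: types are plain strings, the intersection type is a list of arrows, and `typeOfLookup` and `checkLookup` are hypothetical helper names.

```javascript
// The type { "name" : Str, "age" : Num }, as a map from singleton names to types.
const oType = { name: "Str", age: "Num" };

// Type the expression o[f] when f has the singleton string type `singleton`.
function typeOfLookup(objType, singleton) {
  if (!Object.prototype.hasOwnProperty.call(objType, singleton)) {
    throw new TypeError('no field "' + singleton + '" in object type');
  }
  return objType[singleton];
}

// lookup : ("name" → Str) ∩ ("age" → Num), written as a list of arrows.
const lookupType = [
  { arg: "name", result: "Str" },
  { arg: "age", result: "Num" },
];

// Check the body o[f] once per arrow of the intersection.
function checkLookup(objType, arrows) {
  return arrows.every(({ arg, result }) => typeOfLookup(objType, arg) === result);
}

console.log(checkLookup(oType, lookupType)); // true
```

Checking each arrow separately reflects the reading of the intersection given above: the function body must work for "name" and, independently, for "age".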

For a first cut, all we did was make the string labels on objects' fields a first-class entity in our type system, with singleton string types s. But what can we say about the Bean example, where get* and set* method invocations are computed rather than just used as first-class values?

String Pattern Types

In order to express the type of objects like Beans, we need to express field name patterns, rather than just singleton field names. For example, we might say that a Bean with Int-typed parameters has a type like:

IntBean = { ("get".+) : → Int, ("set".+) : Int → Void, "toString" : → Str }

Here, we are using .+ as regular expression notation for any non-empty string. We read the type above as saying that all fields that begin with get, followed by any non-empty string, are functions that return Int values. Similarly, all fields that begin with set are functions from Int to Void. The singleton string "toString" is also a field, and is simply a function that returns strings.

To express this type, we need to extend our type language to handle these string patterns, which we write down as regular expressions (the write-up outlines the actual limits on what kinds of patterns we can support). We extend our type language to include patterns as types, and as field names:

L = regular expressions
T = L | Str | Num | { L : T ... } | T → T | T ∩ T

This new specification gives us the ability to write down types like IntBean, which have field patterns that describe infinite sets of fields. Let's stop and think about what that means as a description of a runtime object. Our type for o above, { "name" : Str, "age" : Num }, says that values bound to o at runtime certainly have name and age fields at the listed types. The type for IntBean, on the other hand, seems to assert that these objects will have the fields getUp, getDown, getSerious, and infinitely more. But a runtime object can't actually have all of those fields, so a pattern indicating an infinite number of field names is describing a fundamentally different kind of value.

What an object type with an infinite pattern represents is that all the fields that match the pattern are potentially present. That is, at runtime, they may or may not be there, but if they are there, they must have the annotated type. We extend object types again to make this explicit with presence annotations, which explicitly list fields as definitely present, written ↓, or possibly present, written ○:

p = ○ | ↓
T = ... | { L^p : T ... }

In this notation, we would write:

IntBean = { ("get".+)^○ : → Int, ("set".+)^○ : Int → Void, "toString"^↓ : → Str }

Which indicates that all the fields in ("get".+) and ("set".+) are possibly present with the given arrow types, and toString is definitely present.
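One way to picture such a type is as a list of regular expressions paired with a presence annotation and a field type. The sketch below is an assumption-laden illustration: `intBean` and `fieldInfo` are hypothetical names, JavaScript's `RegExp` stands in for the pattern language, and field types are kept as plain strings.

```javascript
// IntBean, encoded as (pattern, presence, type) triples.
// The `s` flag lets `.` match any character, so `.+` means "any non-empty string".
const intBean = [
  { pattern: /^get.+$/s, presence: "possible", type: "→ Int" },
  { pattern: /^set.+$/s, presence: "possible", type: "Int → Void" },
  { pattern: /^toString$/, presence: "present", type: "→ Str" },
];

// What does the type say about a concrete field name?
// Returns null when the type does not mention the field at all.
function fieldInfo(objType, name) {
  const hit = objType.find((field) => field.pattern.test(name));
  return hit ? { presence: hit.presence, type: hit.type } : null;
}

console.log(fieldInfo(intBean, "getUp"));    // { presence: 'possible', type: '→ Int' }
console.log(fieldInfo(intBean, "toString")); // { presence: 'present', type: '→ Str' }
console.log(fieldInfo(intBean, "get"));      // null
```

A "possible" answer only promises the field's type if the field happens to exist; a "present" answer also promises that it exists.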

Subtyping

Now that we have these rich object types, it's natural to ask what kinds of subtyping relationships they have with one another. A detailed account of subtyping will come soon; in the meantime, can you guess what subtyping might look like for these types?

Update: The answer is in the next post.

S5: Engineering Eval

2012-10-21T00:00:00+00:00

In an earlier post, we introduced S5, our semantics for ECMAScript 5.1 (ES5). S5 is no toy, but strives to correctly model JavaScript's messy details.

One such messy detail of JavaScript is eval. The behavior of eval was updated in the ES5 specification to make its behavior less surprising and give more control to programmers. However, the old behavior was left intact for backwards compatibility. This has led to a language construct with a number of subtle behaviors. Today, we're going to explore JavaScript's eval, explain its several modes, and describe our approach to engineering an implementation of it.

Quiz Time!

We've put together a short quiz to give you a tour of the various types of eval in JavaScript. How many can you get right on the first try?

Question 1

function f(x) {
  eval("var x = 2;");
  return x;
}
f(1) === ?;

f(1) === 2
This example returns 2 because the var declaration in the eval actually refers to the same variables as the body of the function. So, the eval body overwrites the x parameter and returns the new value.

Question 2

function f(x) {
  eval("'use strict'; var x = 2;");
  return x;
}
f(1) === ?;

f(1) === 1
The 'use strict'; directive creates a new scope for variables defined inside the eval. So, the var x = 2; still evaluates, but doesn't affect the x that is the function's parameter. These first two examples show that strict mode changes the scope that eval affects. We might ask, now that we've seen these, what scope does eval see?

Question 3

function f(x) {
  eval("var x = y;");
  return x;
}
f(1) === ?;

f(1) --> ReferenceError: y is not defined
OK, that was sort of a trick question. This program throws an exception saying that y is unbound. But it serves to remind us of an important JavaScript feature: if a variable isn't defined in a scope, trying to access it throws an exception. Now we can ask the obvious question: can we see y if we define it outside the eval?

Question 4

function f(x) {
  var y = 2;
  eval("var x = y;");
  return x;
}
f(1) === ?;

f(1) === 2
OK, here's our real answer. The y is certainly visible inside the eval, which can both see and affect the outer scope. What if the eval is strict?

Question 5

function f(x) {
  var y = 2;
  eval("'use strict'; var x = y;");
  return x;
}
f(1) === ?;

f(1) === 1
Interestingly, we don't get an error here, so it seems like y was visible to the eval even in strict mode. However, as before the assignment doesn't escape. New topic next.

Question 6

function f(x) {
  var avel = eval;
  avel("var x = y;");
  return x;
}
f(1) === ?;

f(1) --> ReferenceError: y is not defined
OK, that was a gimme. Let's add the variable declaration we need.

Question 7

function f(x) {
  var avel = eval;
  var y = 2;
  avel("var x = y;");
  return x;
}
f(1) === ?;

f(1) --> ReferenceError: y is not defined
What's going on here? We defined a variable and it isn't visible like it was before, and all we did was rename eval. Let's try a simpler example.

Question 8

function f(x) {
  var avel = eval;
  avel("var x = 2;");
  return x;
}
f(1) === ?;

f(1) === 1
OK, so somehow we aren't seeing the assignment to x either... Let's try making one more observation:

Question 9

function f(x) {
  var avel = eval;
  avel("var x = 2;");
  return x;
}
f(1);
x === ?;

x === 2
Whoa! So that eval changed the x in the global scope. This is what the specification refers to as an indirect eval: a call to eval that doesn't use a direct reference to the variable eval.

Question 10 (On the home stretch!)

function f(x) {
  "use strict";
  eval("var x = 2;");
  return x;
}
f(1) === ?;
x === ?;

f(1) === 1
Before, when we had "use strict"; inside the eval, we saw that the variable declarations did not escape. Here, the "use strict"; is outside, but we see the same thing: the value of 1 simply flows through to the return statement unaffected. Second, we know that we aren't doing the same thing as the indirect eval from the previous question, because we didn't affect the global scope.

Question 11 (last one!)

function f(x) {
  "use strict";
  var avel = eval;
  avel("var x = 2;");
  return x;
}
f(1) === ?;
x === ?;

f(1) === 1
x === 2
Unlike in the previous question, this indirect eval has the same behavior as before: it affects the global scope. The presence of a "use strict"; appears to mean something different to an indirect versus a direct eval.
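The quiz answers can be replayed in one runnable snippet. As an assumption of this sketch, each quiz function is built with the `Function` constructor so that its body is non-strict unless it opts in, no matter how the snippet itself is loaded (ES modules, for instance, are always strict). The global name `evalDemoX` is hypothetical and replaces the quiz's global `x` to avoid clobbering anything.

```javascript
// Question 1: direct eval, nothing strict. The eval overwrites the parameter x.
const directSloppy = new Function("x", `eval("var x = 2;"); return x;`);

// Question 2: direct eval of strict code. The declaration stays inside the eval.
const directStrictInside = new Function("x", `eval("'use strict'; var x = 2;"); return x;`);

// Question 10: direct eval inside a strict function. Same outcome as Question 2.
const directStrictOutside = new Function("x", `"use strict"; eval("var x = 2;"); return x;`);

// Questions 8 and 9: indirect eval. The declaration lands in the global scope.
const indirectSloppy = new Function("x", `var avel = eval; avel("var evalDemoX = 2;"); return x;`);

console.log(directSloppy(1));        // 2
console.log(directStrictInside(1));  // 1
console.log(directStrictOutside(1)); // 1
console.log(indirectSloppy(1));      // 1
console.log(globalThis.evalDemoX);   // 2
```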

Capturing all the Evals

We saw three factors that could affect the behavior of eval above:

Whether the code passed to eval was in strict mode;

Whether the code surrounding the eval was in strict mode; and

Whether the eval was direct or indirect.

Each of these is a binary choice, so there are eight potential configurations for an eval. Each of the eight cases specifies both:

Whether the eval sees the current scope or the global one;

Whether variables introduced in the eval are seen outside of it.

We can crisply describe all of these choices in a table:

Strict outside? | Strict inside? | Direct or Indirect? | Local or global scope? | Affects scope?
Yes | Yes | Indirect | Global | No
No | Yes | Indirect | Global | No
Yes | No | Indirect | Global | Yes
No | No | Indirect | Global | Yes
Yes | Yes | Direct | Local | No
No | Yes | Direct | Local | No
Yes | No | Direct | Local | No
No | No | Direct | Local | Yes

Rows where eval can affect some scope have "Yes" in the last column, and rows where the string passed to eval is strict mode code have "Yes" in the second column. Some patterns emerge here that make some of the design decisions of eval clear. For example:

If the eval is indirect it always uses global scope; if direct it always uses local scope.

If the string passed to eval is strict mode code, then variable declarations will not be seen outside the eval.

An indirect eval behaves the same regardless of the strictness of its context, while direct eval is sensitive to it.
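The eight rows collapse into two short formulas, which the following sketch encodes. The function name `evalBehavior` and its option names are hypothetical; the logic is only a restatement of the table above, not of the specification text.

```javascript
// Summarize one row of the table from its three binary inputs.
function evalBehavior({ strictOutside, strictInside, direct }) {
  return {
    // Indirect evals always see the global scope; direct evals see the local one.
    scope: direct ? "local" : "global",
    // Strict eval code never leaks declarations; a direct eval also picks up
    // strictness from its surrounding code, while an indirect eval ignores it.
    affectsScope: !strictInside && !(direct && strictOutside),
  };
}

console.log(evalBehavior({ strictOutside: true, strictInside: false, direct: false }));
// { scope: 'global', affectsScope: true }
console.log(evalBehavior({ strictOutside: true, strictInside: false, direct: true }));
// { scope: 'local', affectsScope: false }
```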

Engineering eval

To specify eval, we need to somehow both detect these different configurations, and evaluate code with the right combination of visible environment and effects. To do so, we start with a flexible primitive that lets us evaluate code in an environment expressed as an object:

internal-eval(string, env-object)

This internal-eval expects env-object to be an object whose fields represent the environment to evaluate in. No identifiers other than those in the passed-in environment are bound. For example, a call like:

internal-eval("x + y", { "x" : 2, "y" : 5 })

Would evaluate to 7, using the values of the "x" and "y" fields from the environment object as the bindings for the identifiers x and y. With this core primitive, we have the control we need to implement all the different versions of eval.
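A rough JavaScript approximation of this primitive can be built with the `Function` constructor. This is a sketch, not S5's implementation: `internalEval` is a hypothetical name, it handles only expressions, and, unlike the primitive described above, global identifiers remain visible to the evaluated code.

```javascript
// Evaluate the expression `code` with the fields of envObject bound as identifiers.
function internalEval(code, envObject) {
  const names = Object.keys(envObject);
  const values = names.map((name) => envObject[name]);
  // Each field name becomes a parameter, so the code sees it as a local binding.
  const body = new Function(...names, `"use strict"; return (${code});`);
  return body(...values);
}

console.log(internalEval("x + y", { x: 2, y: 5 })); // 7
```

Passing a different environment object changes what the code can see, which is the control the eval variants need.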

In previous posts, we talked about the overall strategy of our evaluator for JavaScript. The relevant high-level point for this discussion is that we define a core language, dubbed S5, that contains only the essential features of JavaScript. Then, we define a source-to-source transformer, called desugar, that converts JavaScript programs to S5 programs. Since our evaluator is defined only over S5, we need to use desugar in our interpreter to perform the evaluation step. Semantically, the evaluation of internal-eval is then:

internal-eval(string, env-object) -> desugar(string)[x₁ / v₁, ...] for each x₁ : v₁ in env-object (where [x / v] indicates substitution)

It is the combination of desugar and the customizable environment argument to internal-eval that let us implement all of JavaScript's eval forms. We actually desugar all calls to JavaScript's eval into a function call defined in S5 called maybeDirectEval, which performs all the necessary checks to construct the correct environment for the eval.

Leveraging S5's Eval

With our implementation of eval, we have made progress on a few fronts.

Analyzing more JavaScript: We can now tackle more programs than any of our prior formal semantics for JavaScript. For example, we can actually run all of the complicated evals in Secure ECMAScript, and print the heap inside a use of a sandboxed eval. This enables new kinds of analyses that we haven't been able to perform before.

Understanding scripting languages' eval: Other scripting languages, like Ruby and Python, also have eval. Their implementations are closer to our internal-eval, in that they take dictionary arguments that specify the bindings that are available inside the evaluation. Is something like internal-eval, which was inspired by well-known semantic considerations, a useful underlying mechanism to use to describe all of these?

The implementation of S5 is open-source, and a detailed report of our strategy and test results is appearing at the Dynamic Languages Symposium. Check them out if you'd like to learn more!

Progressive Types

2012-09-01T00:00:00+00:00

Adding types to untyped languages has been studied extensively, and with good reason. Type systems offer developers strong static guarantees about the behavior of their programs, and adding types to untyped code allows developers to reap these benefits without having to rewrite their entire code base. However, these guarantees come at a cost. Retrofitting types on to untyped code can be an onerous and time-consuming process. In order to mitigate this cost, researchers have developed methods to type partial programs or sections of programs, or to allow looser guarantees from the type system. (Gradual typing and soft typing are some examples.) This reduces the up-front cost of typing a program.

However, these approaches only address a part of the problem. Even if the programmer is willing to expend the effort to type the program, they still cannot control what counts as an acceptable program; that is determined by the type system. This significantly reduces the flexibility of the language and forces the programmer to work within a very strict framework. To demonstrate this, observe the following program in Racket...

#lang racket
(define (gt2 x y)
  (> x y))

...and its Typed Racket counterpart.

#lang typed/racket
(: gt2 (Number Number -> Boolean))
(define (gt2 x y)
  (> x y))

The typed example above, which appears to be logically typed, fails to type-check. This is due to the sophistication with which Typed Racket handles numbers. It can distinguish between complex numbers and real numbers, integers and non-integers, even positive and negative integers. In this system, Number is actually an alias for Complex. This makes sense in that complex numbers are in fact the supertype of all other numbers. However, it would also be reasonable to assume that Number means Real, because that's what people tend to think of when they think “number”. Because of this, a developer may expect all functions over real numbers to work over Numbers. However, this is not the case. Greater-than, which is defined over reals, cannot be used with Number because it is not defined over complex numbers. Now, this could be resolved by changing the type of gt2 to take Reals, rather than Numbers. But then consider this program:

#lang typed/racket
(: plus (Number Number -> Number))
(define (plus x y)
  (+ x y)) ;Looks fine so far...

(: gt2 (Real Real -> Boolean))
(define (gt2 x y)
  (> x y)) ;...Still ok...

(gt2 (plus 3 4) 5) ;...Here (plus 3 4) evaluates to a Number which causes gt2 to give
                   ;the type error “Expected Real but got Complex”.

Now, in order to make this program type, we would have to adjust plus to return Reals, even though it works with its current typing! And we'd have to do the same for every program that calls plus. This can cause a ripple effect through the program, making typing the program labor-intensive, despite the fact that the program will actually run just fine on some inputs, which may be all we care about. But we still have to jump through hoops to get the program to run at all!

In the above example, the type system in Typed Racket requires the programmer to ensure that there are no runtime errors caused by using a complex number where a real number is expected, even if it means significant extra programming effort. There are cases, however, where type systems do not provide guarantees because it would cross the threshold of too much work for programmers. One such guarantee is ensuring that vector references are always given positive integer inputs. The Typed Racket type system does not offer this guarantee because of the required programming effort, and so it traded that particular guarantee for convenience and ease of programming.

In both these cases, type systems are trying to determine the best balance between safety and convenience. However, the best a system can do is choose either safety or convenience and apply that to all programs. Vector references cannot be checked in any program, because it isn't worth the extra engineering effort, whereas all programs must be checked for number realness, because it's worth the extra engineering effort. This seems pretty arbitrary! Type systems are trying to guess at what the developer might want, instead of just asking. However, the developer has a much better idea of which checks are relevant and important for a specific program and which are irrelevant or unimportant. The type system should leverage this information and offer the useful guarantees without requiring unhelpful ones.

Progressive Types

To this end, we have developed progressive types, which allow the developer to require type guarantees that are significant to the program, and ignore those that are irrelevant. From the total set of possible type errors, the developer would select which among them must be detected as compile time type errors, and which should be allowed to possibly cause runtime failures. In the above example, the developer could allow errors caused by treating a Number as a Real at runtime, trusting that they will never occur or that it won't be catastrophic if they do or that the particular error is orthogonal to the reasons for type-checking the program at all. Thus, the developer can disregard an insignificant error while still reaping the benefits of the rest of the type system. This addresses a problem that underlies all type systems: The programmer doesn't get to choose which classes of programs are “good” and which are “bad.” Progressive types give the programmer that control.

In order to allow this, the type system has an allowed error set, Ω, in addition to the type environment. So while a traditional typing rule takes the form Γ⊢e:τ, a rule in a progressive type system would take the form Ω;Γ⊢e:τ. Here, Ω is the set of errors the developer wants to allow to cause runtime failures. Expressions may evaluate to errors, and if those errors are in Ω, the expression will type to ⊥, otherwise it will fail to type. This is reflected in the progress theorem that goes along with the type system.

If Typed Racket were a progressively typed language, the above program would type only if the programmer had selected “Expected Real, but got Complex” to be in Ω. This means that if numerical calculations are really orthogonal to the point of the program, or there are other checks in place ensuring the function will only get the right type of input, the developer can just tell the type checker not to worry about those errors! However, if it's important to ensure that complex numbers never appear where reals are required, the developer can tell the type checker to detect those errors. Thus the programmer can determine what constitutes a “good” program, rather than working around a different, possibly inconvenient, definition of “good”. By passing this control over to the developer, progressive type systems allow the balance between ease of engineering and safety to be set at a level appropriate to the program.
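The role of Ω can be illustrated with a toy checker for the plus/gt2 program, written here in JavaScript. Everything in it is an assumption made for illustration: the `signatures` table, the two-point numeric hierarchy (Real below Number), and the error string, which says "Number" where Typed Racket would report Complex. There are no variables, so Γ is omitted. It is not the formal system from the paper.

```javascript
const BOTTOM = "⊥";

// Hypothetical signatures for the two functions in the example program.
const signatures = {
  plus: { params: ["Number", "Number"], result: "Number" },
  gt2: { params: ["Real", "Real"], result: "Boolean" },
};

// Real is a subtype of Number, and ⊥ is a subtype of everything.
function isSubtype(s, t) {
  return s === BOTTOM || s === t || (s === "Real" && t === "Number");
}

// Ω ⊢ e : τ. A mismatch whose error is in omega makes the call type to ⊥;
// any other mismatch is a static type error.
function typeOf(expr, omega) {
  if (typeof expr === "number") return "Real"; // literals like 3 are treated as Reals
  const [fn, ...args] = expr;
  const sig = signatures[fn];
  let result = sig.result;
  args.forEach((arg, i) => {
    const actual = typeOf(arg, omega);
    const expected = sig.params[i];
    if (!isSubtype(actual, expected)) {
      const error = "Expected " + expected + " but got " + actual;
      if (!omega.has(error)) throw new TypeError(error);
      result = BOTTOM;
    }
  });
  return result;
}

const program = ["gt2", ["plus", 3, 4], 5];
console.log(typeOf(program, new Set(["Expected Real but got Number"]))); // ⊥
// typeOf(program, new Set()) throws TypeError: Expected Real but got Number
```

The same program is accepted or rejected depending only on the set passed as `omega`, which is the control the paragraph above describes.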

Progressive typing differs from gradual typing in that while gradual typing allows the developer to type portions of a program with a fixed type system, progressive types instead allow the developer to vary the guarantees offered by the type system. Further, like soft typing, progressive typing allows for runtime errors instead of static guarantees, but unlike soft typing, it restricts which classes of runtime failures are allowed to occur. Because our system allows programmers to progressively adjust the restrictions imposed by the type system, either to loosen or tighten them, they can reap many of the flexibility benefits of a dynamic language, but get the static guarantees of a type system in the way best suited to each of their programs or preferences.

If you are interested in learning more about progressive types, look here.

Modeling DOM Events

2012-07-17T00:00:00+00:00

In previous posts, we’ve talked about our group’s work on providing an operational semantics for JavaScript, including the newer features of the language. While that work is useful for understanding the language, most JavaScript programs don’t run in a vacuum: they run in a browser, with a rich API to access the contents of the page.

That API, known as the Document Object Model (or DOM), consists of several parts:

A graph of objects encoding the structure of the page (this graph is optimistically called a "tree" since the HTML markup is indeed tree-shaped, but the graph has extra pointers between objects),

Methods to manipulate the HTML tree structure,

A sophisticated event model to allow scripts to react to user interactions.

These three parts of the DOM interact with one another, making reasoning about any one of them in isolation challenging. Moreover, the specs describing them are long, heavily self-referential, and difficult to understand incrementally. So what to do?

What makes this event programming so special?

To a first approximation, the execution of every web page looks roughly like: load the markup of the page, load scripts, set up lots of event handlers … and wait. For events. To fire. Accordingly, to understand the control flow of a page, we have to understand what happens when events fire.

Let’s start with this:

<div id="d1">
  In outer div
  <p id="p1">
    In paragraph in div.
    <span id="s1" style="background:white;">
      In span in paragraph in div.
    </span>
  </p>
</div>
<script>
document.getElementById("s1").addEventListener("click", function() {
  this.style.color = "red";
});
</script>

If you click on the text "In span in paragraph in div" the event listener that gets added to element span#s1 is triggered by the click, and turns the text red. But consider the slightly more complicated example:

<div id="d2">
  In outer div
  <p id="p2">
    In paragraph in div.
    <span id="s2" style="background:white;">
      In span in paragraph in div.
    </span>
  </p>
</div>
<script>
document.getElementById("d2").addEventListener("click", function() {
  this.style.color = "red";
});
document.getElementById("s2").addEventListener("click", function() {
  this.style.color = "blue";
});
</script>

Now, clicking anywhere in the box will turn all the text red. That makes sense: we just clicked on the <div> element, so its listener fires. But clicking on the <span> will turn it blue and still turn the rest red. Why? We didn’t click on the <div>! Well, not directly…

The key feature of event dispatch, as implemented for the DOM, is that it takes advantage of the page structure. Clicking on an element of the page (or typing into a text box, moving the mouse over an element, etc.) will cause an event to fire "at" that element: the element is the target of the event, and any event listener installed for that event on that target node will be called. But in addition, the event will also trigger event listeners on the ancestors of the target node: this is called the dispatch path. So in the example above, because div#d2 is an ancestor of span#s2, its event listener is also invoked, turning the text red.
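The behavior of the second example can be imitated without a browser. The sketch below is a deliberate simplification and not the DOM API: nodes are plain objects, `makeNode`, `addListener`, `dispatchPath`, and `dispatch` are hypothetical names, and listeners are only run from the target upward through its ancestors (the real DOM has additional phases).

```javascript
// A node knows only its parent, its listeners, and a style record.
function makeNode(id, parent = null) {
  return { id, parent, listeners: {}, style: {} };
}

function addListener(node, type, listener) {
  if (!node.listeners[type]) node.listeners[type] = [];
  node.listeners[type].push(listener);
}

// The dispatch path: the target first, then each ancestor up to the root.
function dispatchPath(target) {
  const path = [];
  for (let node = target; node !== null; node = node.parent) path.push(node);
  return path;
}

// Run every listener for `type` on every node of the path, with `this` bound to the node.
function dispatch(target, type) {
  for (const node of dispatchPath(target)) {
    for (const listener of node.listeners[type] || []) listener.call(node);
  }
}

const d2 = makeNode("d2");
const p2 = makeNode("p2", d2);
const s2 = makeNode("s2", p2);
addListener(d2, "click", function () { this.style.color = "red"; });
addListener(s2, "click", function () { this.style.color = "blue"; });

dispatch(s2, "click");
console.log(s2.style.color, d2.style.color); // blue red
```

Clicking the span runs its own listener and then the listener on its ancestor div, which is why the span turns blue while the rest turns red.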

What Could Possibly Go Wrong?

In a word: mutation. The functions called as event listeners are arbitrary JavaScript code, which can do anything they want to the state of the page, including modifying the DOM. So what might happen?

The event listener might move the current target in the page. What happens to the dispatch path?

The event listener adds (or removes) other listeners for the event being dispatched. Should newly installed listeners be invoked before or after existing ones? Should those listeners even be called?

The event listener tries to cancel event dispatch. Can it do so?

The listener tries to (programmatically) fire another event while the current one is active. Is event dispatch reentrant?

There are legacy mechanisms to add event "handlers" as well as listeners. How should they interact with listeners?

Modeling Event Dispatch

Continuing our group’s theme of reducing a complicated, real-world system to a simpler operational model, we developed an idealized version of event dispatch in PLT Redex, a domain-specific language embedded in Racket for specifying operational semantics. Because we are focusing on exactly how event dispatch works, our model does not include all of JavaScript, nor does it need to—instead, it includes a miniature statement language containing the handful of DOM APIs that manipulate events. Our model does not include all the thousands of DOM properties and methods, instead including just a simplified tree-structured heap of nodes: this is all the structure we need to faithfully model the dispatch path of an event.

Our model is based on the DOM Level 3 Events specification. It expresses the key behaviors of event dispatch, and does so far more compactly than the spec: roughly 1000 lines of commented Redex code replace several pages’ worth of (at times self-contradictory!) requirements that are spread throughout a spec over a hundred pages long. From this concise model, for example, we can easily extract a state machine describing the key stages of dispatch:
From this state machine, it’s much easier to answer the questions raised above, precisely and formally. For example, if an event listener moves the event target in the page, nothing happens to the dispatch path: only the first state of the machine constructs the dispatch path, while all the others just read from it. Done! It’s unfortunate that this state machine isn't sketched in the spec anywhere…
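
To make the "build once, then only read" behavior concrete, here is a minimal Python sketch. It is not our Redex model: the `Node` class, `build_path`, and `dispatch` are hypothetical names invented for illustration, and the sketch only walks from the target up through its ancestors, ignoring the other stages of dispatch.

```python
class Node:
    """A DOM-like tree node with a parent pointer and per-event listeners."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.listeners = {}  # event type -> list of callables

    def add_listener(self, event_type, fn):
        self.listeners.setdefault(event_type, []).append(fn)


def build_path(target):
    """Return [target, parent, grandparent, ...] as the tree looks right now."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def dispatch(target, event_type):
    """Build the path once, then only read from it while calling listeners."""
    path = build_path(target)  # the only step that inspects the tree
    visited = []
    for node in path:
        visited.append(node.name)
        # Iterate over a copy so listeners added mid-dispatch are not run here
        # (a simplifying choice of this sketch, not a claim about the spec).
        for fn in list(node.listeners.get(event_type, [])):
            fn(target)
    return visited


root = Node("root")
d1 = Node("d1", root)
d2 = Node("d2", root)
s2 = Node("s2", d2)


def move_target(target):
    target.parent = d1  # mutate the tree in the middle of dispatch


s2.add_listener("click", move_target)

first = dispatch(s2, "click")   # path was fixed before the move
second = dispatch(s2, "click")  # a fresh dispatch sees the new parent
print(first)   # ['s2', 'd2', 'root']
print(second)  # ['s2', 'd1', 'root']
```

The first dispatch still visits d2 even though the listener on s2 has already reparented the target; only a later dispatch picks up the change.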

Moreover, the model is executable: Redex allows us to construct test cases—randomly, systematically, or ad-hoc, as we choose—and then run them through our model and see what output it produces. Even better, we can export our tests to HTML and JavaScript, and run them in real browsers and compare results:

Comparing a test model (tree structure, event listeners, and an event to be fired) in our semantics, and in various browsers.

Most importantly, our model agrees with all browsers on most test cases: this gives us confidence that our model is faithful to the intent of the spec. But not all test cases—not too surprisingly, we identified examples where real-world browsers differ in their behavior. Under our reading of the spec, at least one of these browsers is wrong—but since the spec is so intricate, it is easy to see why browsers have a hard time agreeing in all cases!

What’s Done

Here’s what we’ve got so far:

A PLT Redex model of event dispatch,

An annotated copy of the DOM Level 3 Events spec, showing exactly which lines of our model correspond to which text in the spec, and

A paper describing the model (and some applications of it) in greater detail.

What’s Next

Since our original JavaScript semantics was also written in Redex, we can combine our model of event dispatch with the JavaScript one, for a much higher-fidelity model of what event listeners can do in a browser setting. Then of course there are further applications, such as building a precise control-flow analysis of web pages and analyzing their code. And other uses? If you’re interested in using our model, let us know!

Mechanized LambdaJS

2012-06-04T00:00:00+00:00

See the discussion on Lambda the Ultimate about this work.

In an earlier post, we introduced λ_JS, our operational semantics for JavaScript. Unlike many other operational semantics, λ_JS is no toy, but strives to correctly model JavaScript's messy details. To validate these claims, we test λ_JS with randomly generated tests and with portions of the Mozilla JavaScript test suite.

Testing is not enough. Despite our work, other researchers found a missing case in λ_JS. Today, we're introducing Mechanized λ_JS, which comes with a machine-checked proof of correctness, using the Coq proof assistant.

Recap: The Structure of λ_JS

λ_JS has two key parts: an operational semantics and a desugaring function. Our earlier post discusses how we tackle the minutiae of JavaScript with our desugaring function. This post focuses on the operational semantics, which is where others found a bug and which now has a machine-checked proof of correctness.

The operational semantics is typical of programming languages research. It specifies the sequence of steps required to evaluate the program. For example, the following sequence evaluates to a value:
{ x: 2 + 3, y : 9 }["x"] * (11 + 23) → { x: 5, y: 9 }["x"] * (11 + 23) → 5 * (11 + 23) → 5 * 34 → 170
The sequence above evaluates expressions from left to right—a detail spelled out in the operational semantics.

Not all expressions reduce to values. For example, the following reduces to an error:
null["x"] → err "Cannot read property 'x' of null"
An operational semantics specifies exactly which errors may occur.
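
For readers who have not seen a small-step semantics before, the two reduction sequences above can be reproduced with a toy reducer. This is a rough Python sketch and not λ_JS itself: the tuple encoding of terms and the names `is_value`, `step`, and `evaluate` are assumptions made for illustration, and only the handful of forms used in the examples are covered.

```python
def is_value(e):
    """Numbers, strings, null (None), and objects whose fields are all values."""
    if isinstance(e, (int, str)) or e is None:
        return True
    return e[0] == "obj" and all(is_value(v) for _, v in e[1])


def step(e):
    """Perform one reduction step on a non-value term, leftmost redex first."""
    tag = e[0]
    if tag == "obj":
        fields = list(e[1])
        for i, (name, v) in enumerate(fields):
            if not is_value(v):
                fields[i] = (name, step(v))
                return ("obj", fields)
    if tag in ("add", "mul"):
        _, left, right = e
        if not is_value(left):
            return (tag, step(left), right)
        if not is_value(right):
            return (tag, left, step(right))
        return left + right if tag == "add" else left * right
    if tag == "get":
        _, obj, field = e
        if not is_value(obj):
            return ("get", step(obj), field)
        if obj is None:
            return ("err", f"Cannot read property '{field}' of null")
        return dict(obj[1])[field]
    raise ValueError(f"stuck term: {e!r}")


def evaluate(e):
    """Step until we reach a value or an error, recording every term."""
    trace = [e]
    while not is_value(e) and e[0] != "err":
        e = step(e)
        trace.append(e)
    return trace


# { x: 2 + 3, y: 9 }["x"] * (11 + 23)
program = ("mul",
           ("get", ("obj", [("x", ("add", 2, 3)), ("y", 9)]), "x"),
           ("add", 11, 23))
trace = evaluate(program)
error_trace = evaluate(("get", None, "x"))  # null["x"]
print(trace[-1])        # 170
print(error_trace[-1])  # ('err', "Cannot read property 'x' of null")
```

The trace for the first program has five terms, matching the four arrows in the sequence above. The final `raise` in `step` marks a term that is neither a value nor reducible, which is what it means for a term to get "stuck".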

Finally, an operational semantics allows some programs to run forever. This is a basic infinite loop, and its non-terminating reduction sequence:
while (true) { 1; } → if true then 1; while (true) { 1; } else undefined → 1; while (true) { 1; } → while (true) { 1; } → if true then 1; while (true) { 1; } else undefined → 1; while (true) { 1; } …

In general, these are the only three cases that the semantics should allow—an expression must either (1) evaluate to a value, (2) signal an error, or (3) not terminate. In fact, we can state that as a theorem.
Theorem 1 (Soundness). For all λ_JS programs, p, either:

p →* v,
p →* err, or
p →* p₂, and there exists a p₃ such that p₂ → p₃.

This is a standard theorem worth proving for any language. Since languages and their correctness proofs involve detailed, delicate designs and decisions, the proofs are easy to do wrong, and tedious for humans to get right. If only computers could help.

PLT Redex: Lightweight Mechanization

We first developed λ_JS in PLT Redex, a domain-specific language embedded in Racket for specifying operational semantics.

Redex brings dull semantics to life. It doesn't just make a semantics executable, but also lets you visualize it. For example, here is our first example sequence in Redex (parentheses included):

The visualizer is a lot of fun, and a very effective debugging tool. It helped us catch several bugs in the early design of λ_JS.

Redex can also generate random tests to exercise your semantics. Random testing caught several more bugs in λ_JS.

Coq: A Machine-Checked Proof

Testing is not enough. We shipped λ_JS with a bug that breaks the soundness theorem above. We didn't discover it for a year. David van Horn and Ian Zerny both reported it to us independently. We'd missed a case in the semantics, which caused certain terms to get "stuck". It turned out to be a simple fix, but we were left wondering if anything else was left lurking.

To gain further assurance, we mechanized λ_JS with the Coq proof assistant. The soundness theorem now has a machine-checked proof of correctness. You still need to read the Coq definition of λ_JS and ensure it matches your intuitions. But once that's done, you can be confident that the proofs are valid.

Doing this proof was surprisingly easy, once we'd read Software Foundations and Certified Programming with Dependent Types. We'd like to thank Benjamin Pierce and his co-authors, and Adam Chlipala, for putting their books online.

What's Done

Here's what we've got so far:

A PLT Redex model,

A Coq model, and

A proof of soundness in Coq.

What's Next

We're not done. Here's what's coming up:

There are a few easy bits missing from the Coq model (e.g., a parameterized delta-function).

Once those easy bits are done, we'll wire it together with desugaring.

Finally, we'll upgrade the model to support semantics for ECMAScript 5.

ECMA Announces Official λJS Adoption

2012-04-01T00:00:00+00:00

GENEVA - ECMA's Technical Committee 39, which oversees the standardization of ECMAScript, has completed the adoption of Brown PLT's λ_JS as the new basis for the language. "We were being hampered by the endless debates about the semantics of ECMAScript 5", said J. Neumann, the Chairman of the Committee. "By adopting λ_JS, we can return to focusing on the important parts of the programming language instead, such as its interaction with parts of the W3C DOM Specification."

"The replacement of scope objects with substitution is a clear design flaw."
-Arjun Guha

Improvements to λ_JS - Neumann added that the standardization process uncovered a significant weakness in λ_JS: the absence of the with construct. The Technical Committee therefore mandated its introduction. Lead designer Arjun Guha agreed, stating, "The replacement of scope objects with substitution is a clear design flaw. It was pointed out to me by numerous academic researchers who have obtained considerable mileage from them, but it took me a while to appreciate their value." The Committee also recommended a "strict mode", so Guha removed first-class functions, which are widely believed to induce laxity by deferring decision-making.

Opposition to the Change - The adoption of λ_JS has not, however, met with unanimous approval. When asked for comment, Douglas Crockford of Yahoo! complained that the small parts are not good while the good parts are not small. Another detractor, Northeastern University researcher Sam Tobin-Hochstadt, had pushed for the adoption of Racket as the core language instead of λ_JS, but he admitted that Racket was untenable as it suffered from having a working module system. The team from Apple declined response, but it is widely rumored that Jonathan Ive is at work on a new core calculus that will have only one operation, which will automatically take the step that the user did not know they should have performed.

"We see this as a fight for the future of the Internet."
-David Herman

Influential Support - Nevertheless, the adoption has support from various influential circles. The Internet Explorer group at Microsoft has already agreed to implement λ_JS in the core engine of their upcoming release, IE12; lead designer Dean Hachamovitch said it is second in innovation only to the introduction of tabs. Strict mode will be supported in IE13. Google's Mark Miller pointed out, "With the aid of membranes, any primordial vat can be instantiated with desirable liveness properties." When asked to comment about λ_JS instead of the Miller-Urey experiment, Miller repeated the comment. Finally, noted Mozilla researcher Dave Herman commented, "For Mozilla, we see this as a fight for the future of the Internet." On questioning, he admitted that he diverts all interviews into conversations about Boot2Gecko.

Objects in Scripting Languages

2012-02-28T00:00:00+00:00

We've been studying scripting languages in some detail, and have collected a number of features of their object systems that we find unusually expressive. This expressiveness can be quite powerful, but also challenges attempts to reason about and understand programs that use these features. This post outlines some of these exceptionally expressive features for those who may not be intimately familiar with them.

Dictionaries with Inheritance

Untyped scripting languages implement objects as dictionaries mapping member names (strings) to values. Inheritance affects member lookup, but does not affect updates and deletion. This won't surprise any experienced JavaScript programmer:

var parent = {"z": 9};
// Using __proto__ sets up inheritance directly in most browsers
var obj = { "x": 1, "__proto__": parent};
obj.x       // evaluates to 1
obj.z       // evaluates to 9
obj.z = 50  // creates new field on obj
obj.z       // evaluates to 50, z on parent is "overridden"
parent.z    // evaluates to 9; parent.z was unaffected by obj.z = 50

In other scripting languages, setting up this inheritance can't be done quite so directly. Still, its effect can be accomplished, and the similar object structure observed. For example, in Python:

class parent(object):
    z = 9  # class member
    def __init__(self):
        self.x = 1  # instance member

obj = parent()
obj.x       # evaluates to 1
obj.z       # evaluates to 9
obj.z = 50  # creates new field on obj
obj.z       # evaluates to 50, z on parent is "overridden"
parent.z    # evaluates to 9, just like JavaScript

We can delete the field in both languages, which returns obj to its original state, before it was extended with a z member. In JavaScript:

delete obj.z; obj.z // evaluates to 9 again

This also works in Python:

delattr(obj, "z"); obj.z # evaluates to 9 again

In both languages, we could have performed the assignments and lookups with computed strings as well:

// JavaScript
obj["x " + "yz"] = 99  // creates a new field, "x yz"
obj["x y" + "z"]       // evaluates to 99

# Python
setattr(obj, "x " + "yz", 99)  # creates a new field, "x yz"
getattr(obj, "x y" + "z")      # evaluates to 99

We can go through this entire progression in Ruby, as well:

class Parent; def z; return 9; end; end
obj = Parent.new
class << obj; def x; return 1; end; end
obj.x # returns 1
obj.z # returns 9
class << obj; def z; return 50; end; end
obj.z # returns 50
# no simple way to invoke shadowed z method
class << obj; remove_method :z; end
obj.z # returns 9
class << obj
  define_method("xyz".to_sym) do; return 99; end
end
print obj.xyz # prints 99
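
The common behavior across all three languages can be summarized in a few lines. The following Python sketch models "a dictionary with a parent link" directly. The `Obj` class and its method names are invented for illustration and do not correspond to any language's actual implementation.

```python
class Obj:
    """An object as a dictionary of members plus an optional parent link."""

    def __init__(self, members, proto=None):
        self.members = dict(members)
        self.proto = proto

    def get(self, name):
        # Lookup walks the inheritance chain.
        obj = self
        while obj is not None:
            if name in obj.members:
                return obj.members[name]
            obj = obj.proto
        return None  # stands in for JavaScript's undefined

    def set(self, name, value):
        # Update always writes to the object itself, never to a parent.
        self.members[name] = value

    def delete(self, name):
        # Deletion also affects only the object's own members.
        self.members.pop(name, None)


parent = Obj({"z": 9})
obj = Obj({"x": 1}, proto=parent)

results = [obj.get("x"), obj.get("z")]       # own member, inherited member
obj.set("z", 50)
results += [obj.get("z"), parent.get("z")]   # shadowed on obj, parent untouched
obj.delete("z")
results.append(obj.get("z"))                 # inherited value is visible again
obj.set("x " + "yz", 99)                     # computed member name
results.append(obj.get("x y" + "z"))
print(results)  # [1, 9, 50, 9, 9, 99]
```

Only `get` consults the parent; `set` and `delete` never do, which is exactly the asymmetry the examples above demonstrate.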

Classes Do Not Shape Objects

The upshot is that a class definition in a scripting language says little about the structure of its instances. This is in contrast to a language like Java, in which objects' structure is completely determined by their class, to the point where memory layouts can be predetermined for runtime objects. In scripting languages, this isn't the case. An object is an instance of a 'class' in JavaScript, Python, or Ruby merely by virtue of several references to other runtime objects. Some of these can be changed at runtime, others cannot, but in all cases, members can be added to and removed from the inheriting objects. This flexibility can lead to some unusual situations.

Brittle inheritance: Fluid classes make inheritance brittle. If we start with this Ruby class:

class A
  def initialize; @privateFld = 90; end
  def myMethod; return @privateFld * @privateFld; end
end

Then we might conclude that the implementation of myMethod assumes a numeric type for @privateFld. This assumption can be broken by subclasses, however:

class B < A
  def initialize; super(); @privateFld = "string (not num)"; end
end

Since both A and B use the same name, and it is simply a dictionary key, B instances violate the assumptions of A's methods:

obj = B.new
obj.myMethod # error: cannot multiply strings

Ruby's authors are well aware of this; the Ruby manual states "it is only safe to extend Ruby classes when you are familiar with (and in control of) the implementation of the superclass" (page 240).

Mutable Inheritance: JavaScript and Python expose the inheritance chain through mutable object members. In JavaScript, we already saw that the member "__proto__" could be used to implement inheritance directly. The "__proto__" member is mutable, so class hierarchies can be changed at runtime. We found it a bit more surprising when we realized the same was possible in Python:

class A(object):
    def method(self): return "from class A"

class B(object):
    def method(self): return "from class B"

obj = A()
obj.method()        # evaluates to "from class A"
isinstance(obj, A)  # evaluates to True
obj.__class__ = B   # the __class__ member determines inheritance
obj.method()        # evaluates to "from class B"
isinstance(obj, B)  # evaluates to True: obj's 'class' has changed!

Methods?

These scripting languages also have flexible, and different, definitions of "methods".

JavaScript simply does not have methods. The syntax

obj.method(...)

binds this to the value of obj in the body of method. However, the method member is just a function and can be easily extracted and applied:

var f = obj.method; f(...);

Since f() does not use the method call syntax above, it is treated as a function call. In this case, it is a well-known JavaScript wart that this is bound to a default "global object" rather than obj.

Python and Ruby make a greater effort to retain a binding for the this parameter. Python doesn't care about the name of the parameter (though self is canonically used), and simply has special semantics for the first argument of a method. If a method is extracted via member access, it returns a function that binds the object from the member access to the first parameter:

class A(object):
    def __init__(self_in_init): self_in_init.myField = 900
    def method(self_in_method): return self_in_method.myField

obj = A()
f1 = obj.method  # the access binds self_in_method to obj
f1()             # evaluates to 900, using the above binding

If the same method is accessed as a field multiple times, it isn't the same function both times―a new function is created for each access:

obj = A()
f1 = obj.method  # first extraction
f2 = obj.method  # second extraction
f1 is f2         # evaluates to False, no reference equality

Python lets programmers access the underlying function without the first parameter bound through the member im_func (named __func__ in Python 3). This is actually the same reference across all extracted methods, regardless of even the original object of extraction:

obj = A()
f1 = obj.method       # first extraction
f2 = obj.method       # second extraction
otherobj = A()
f3 = otherobj.method  # extraction from another object
# evaluates to True, same function referenced from extractions on the
# same object
f1.im_func is f2.im_func
# evaluates to True, same function referenced from extractions on
# different objects
f2.im_func is f3.im_func
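
The examples above use the Python 2 spelling. For readers on Python 3, here is a rough equivalent that can be run as-is; it assumes only that the bound method's underlying function is reached through `__func__` and its receiver through `__self__`, as the Python 3 data model describes.

```python
class A(object):
    def __init__(self):
        self.myField = 900

    def method(self):
        return self.myField


obj = A()
otherobj = A()
f1 = obj.method       # first extraction
f2 = obj.method       # second extraction
f3 = otherobj.method  # extraction from another object

distinct_objects = f1 is not f2                   # a new bound method per access
same_func_same_obj = f1.__func__ is f2.__func__   # one underlying function
same_func_other_obj = f2.__func__ is f3.__func__  # even across receivers
receiver_kept = f3.__self__ is otherobj           # the binding made by the access

print(distinct_objects, same_func_same_obj, same_func_other_obj, receiver_kept)
# True True True True
print(f1(), f3())  # 900 900
```

Each access to `obj.method` yields a fresh bound-method object, but all of them wrap the single function defined in the class body.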

Ruby has a similar treatment of methods, their extraction, and their reapplication to new arguments.

But Why?

These features aren't just curiosities; we've found examples where they are used in practice. For example, Django's ORM builds classes dynamically, modifying them based on strings that come from modules describing database tables and relationships (base.py):

attr_name = '%s_ptr' % base._meta.module_name
field = OneToOneField(base, name=attr_name,
                      auto_created=True, parent_link=True)
new_class.add_to_class(attr_name, field)

Ruby on Rails' ActiveRecord uses dynamic field names as well, iterating over fields and invoking methods only when their names match certain patterns (base.rb):

attributes.each do |k, v|
  if k.include?("(")
    multi_parameter_attributes << [ k, v ]
  elsif respond_to?("#{k}=")
    if v.is_a?(Hash)
      nested_parameter_attributes << [ k, v ]
    else
      send("#{k}=", v)
    end
  else
    raise(UnknownAttributeError, "unknown attribute: #{k}")
  end
end

These applications use objects as dictionaries (with inheritance) to build up APIs that they couldn't otherwise.
These expressive features aren't without their perils. Django has explicit warnings that things can go awry if relationships between tables expressed in ORM classes overlap. And the fact that __proto__ is in the same namespace as the other members bit Google Docs, whose editor would crash if the string "__proto__" was entered. The implementation was using an object as a hashtable keyed by strings from the document, which led to an assignment to __proto__ that changed the behavior of the map.

So?

The languages presented here are widely adopted and used, and run critical systems. Yet, they contain features that defy conventional formal reasoning, at the very least in their object systems. Perhaps these features' expressiveness outweighs the cognitive load of using them. If it doesn't, and using these features is too difficult or error-prone, we should build tools to help us use them, or find better ways to implement the same functionality. And if it does, we should take notice and recall that we have these powerful techniques at our disposal in the next object system we design.

S5: Wat?

2012-01-31T00:00:00+00:00

Gary Bernhardt's Wat talk has been making a well-deserved round of the blogodome in the past few weeks. If you haven't seen it, go give it a watch (you can count it as work time, since you saw it on the Brown PLT Blog, and we're Serious Researchers). The upshot of the second half of the talk is that JavaScript has some less than expected behaviors. We happen to have a JavaScript implementation floating around in the form of S5, and like to claim that it handles the hairy corners of the language. We decided to throw Gary's examples at it.

The Innocuous +

Gary's first JavaScript example went like this:

failbowl:~(master!?) $ jsc
> [] + []

> [] + {}
[object Object]
> {} + []
0
> {} + {}
NaN

S5 lacks a true REPL―it simply takes JavaScript strings and produces output and answers―so we started by approximating a little bit. We first tried a series of print statements to see if we got the same effect:

$ cat unit-tests/wat-arrays.js
print([] + []);
print([] + {});
print({} + []);
print({} + {});
$ ./s5 < unit-tests/wat-arrays.js

[object Object]
[object Object]
[object Object][object Object]
undefined

WAT.

Well, that doesn't seem good at all. Only half of the answers are right, and there's an undefined at the end. What went wrong? It turns out the semantics of REPLs are to blame. If we take the four programs and run them on their own, we get something that looks quite a bit better:

$ ./s5 "[] + []"
""
$ ./s5 "[] + {}"
"[object Object]"
$ ./s5 "{} + []"
0.
$ ./s5 "{} + {}"
nan

There are two issues here:

Why do 0. and nan print like that?

Why did this work, when the previous attempt didn't?

The answer to the first question is pretty straightforward: under the covers, S5 is using OCaml floats and printing OCaml values at the end of its computation, and OCaml makes slightly different decisions than JavaScript in printing numbers. We could change S5 to print answers in JavaScript-printing mode, but the values themselves are the right ones.

The second question is more interesting. Why do we get such different answers depending on whether we evaluate individual strings versus printing the expressions? The answer is in the semantics of JavaScript REPLs. When parsing a piece of JavaScript, the REPL needs to make a choice. Sensible decisions would be to treat each new JavaScript string as a Statement, or as an entire JavaScript Program. Most REPLs choose the Program production.

The upshot is that the parsing of {} + {} is quite different from [] + []. With S5, it's trivial to print the desugared representation and understand the difference. When we parse and desugar, we get very different results for {} + {} and [] + []:

$ ./s5-print "{} + {}"
{undefined;
 %UnaryPlus({[#proto: %ObjectProto, #class: "Object", #extensible: true,] })}
$ ./s5-print "[] + []"
%PrimAdd({ [#proto: %ArrayProto, #class: "Array", #extensible: true,]
           'length' : {#value 0., #writable true, #configurable false} },
         { [#proto: %ArrayProto, #class: "Array", #extensible: true,]
           'length' : {#value 0., #writable true, #configurable false} } )

It is clear that {} + {} parses as two statements (an undefined followed by a UnaryPlus), and [] + [] as a single statement containing a binary addition expression. What's happening is that in the Program production, for the string {} + {}, the first {} is matched with the Block syntactic form, with no internal statements. The rest of the expression is parsed as a UnaryExpression. This is in contrast to [] + [], which only correctly parses as an ExpressionStatement containing an AdditiveExpression.

In the example where we used successive print statements, every expression in the argument position to print was parsed in the second way, hence the different answers. The lesson? When you're at a REPL, be it Firebug, Chrome, or the command line, make sure the expression you're typing is what you think it is: not being aware of this difference can make it even more difficult to know what to expect!

If You Can't Beat 'Em...

Our first example led us on an interesting excursion into parsing, from which S5 emerged triumphant, correctly modelling the richness and/or weirdness of the addition examples. Next up, Gary showed some straightforward uses of Array.join():

failbowl:~(master!?) $ jsc
> Array(16)
,,,,,,,,,,,,,,,,
> Array(16).join("wat")
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
> Array(16).join("wat" + 1)
wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1
> Array(16).join("wat" - 1) + " Batman"
NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN Batman

Our results look oh-so-promising, right up until the last line (note: we call String on the first case, because S5 doesn't automatically toString answers, which the REPL does).

$ ./s5 "String(Array(16))"
",,,,,,,,,,,,,,,,"
$ ./s5 "Array(16).join('wat')"
"watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat"
$ ./s5 "Array(16).join('wat' + 1)"
"wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1"
$ ./s5 "Array(16).join('wat' - 1) + ' Batman'"
"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull Batman"

WAT.

Are we really that awful that we somehow yield null rather than NaN? A quick glance at the desugared code shows us that we actually have the constant value null as the argument to join(). How did that happen? Interestingly, the following version of the program works:

$ ./s5 "var wat = 'wat'; Array(16).join(wat - 1) + ' Batman';"
"NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN Batman"

This leads us to our answer. We use SpiderMonkey's very handy Parser API as part of our toolchain. Reflect.parse() takes strings and converts them to JSON structures with rich AST information, which we stringify and pass off to the innards of S5 to do desugaring and evaluation. Reflect.parse() is part of a JavaScript implementation that strives for performance, and to that end it performs constant folding. That is, as an optimization, when it sees the expression "wat" - 1, it automatically converts it to NaN. All good so far.

The issue is that the NaN yielded by constant folding is not quite the same NaN we might expect in JavaScript programs. In JavaScript, the identifier NaN is a property of the global object with the value NaN. The Parser API can't safely fold to the identifier NaN (as was pointed out to us when we reported this bug), because it might be shadowed in a different context. Presumably to avoid this pitfall, the folding yields a JSON structure that looks like:

expression:{type:"Literal", value:NaN}

But we can't sensibly use JSON.stringify() on this structure, because NaN isn't valid JSON! Any guesses on what SpiderMonkey's JSON implementation turns NaN into? If you guessed null, we owe you a cookie.

We have designed a hack based on suggestions from the bug report to get around this (passing a function to stringify to look for NaNs and return a stylized object literal instead). There's a bug open to make constant folding optional in Reflect.parse(), so this will be fixed in Mozilla's parser. (Update) The bug is fixed, and we've updated our version of SpiderMonkey. This example now works happily, thanks to Dave Herman.
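
The shape of that workaround can be sketched in a few lines. The version below is an analogy in Python rather than the JavaScript we actually used: it replaces NaN with a tagged object before serializing and restores it afterwards. The `NaNLiteral` tag and the helper names are hypothetical and are not S5's real encoding.

```python
import json
import math

NAN_TAG = {"type": "NaNLiteral"}  # hypothetical marker, not S5's actual encoding


def encode_nans(value):
    """Recursively replace float NaN with a tagged object that is valid JSON."""
    if isinstance(value, float) and math.isnan(value):
        return dict(NAN_TAG)
    if isinstance(value, dict):
        return {k: encode_nans(v) for k, v in value.items()}
    if isinstance(value, list):
        return [encode_nans(v) for v in value]
    return value


def decode_nans(value):
    """Inverse of encode_nans: turn the tagged object back into a float NaN."""
    if value == NAN_TAG:
        return float("nan")
    if isinstance(value, dict):
        return {k: decode_nans(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode_nans(v) for v in value]
    return value


ast = {"expression": {"type": "Literal", "value": float("nan")}}

# allow_nan=False makes the serializer reject NaN outright, so a successful
# call shows that the encoded tree contains only valid JSON values.
text = json.dumps(encode_nans(ast), allow_nan=False)
restored = decode_nans(json.loads(text))

print(text)
print(math.isnan(restored["expression"]["value"]))  # True
```

The point is the same as in the JavaScript fix: since NaN has no JSON representation, the serializer needs an explicit, agreed-upon stand-in rather than a silent conversion to null.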

Producing a working JavaScript implementation leads to a whole host of exciting moments and surprising discoveries. Building this semantics and its desugaring gives us much more confidence that our tools say something meaningful about real JavaScript programs. These examples show that getting perfect correspondence is difficult, but we strive to be as close as possible.

Belay Lessons: Smarter Web Programming

2011-12-18T00:00:00+00:00

This post comes from the keyboard of Matt Carroll, who has worked with us for the past two years. He's the main implementer of desugaring for S5, and spent this semester rebuilding and improving in-house Brown PLT web applications. He writes about his experience here.

The Brown computer science department uses a home-grown web application called Resume to conduct its faculty recruitment process. This semester, Joe and I re-wrote Resume with Belay. Belay is the product of Joe and Arjun's summer research at Google: it's an ongoing inquiry into web development best practices, specifically concerning identity, account management, and security. From my perspective (that of a novice web programmer), getting to grips with the Belay philosophy was a thought-provoking experience, and a great education in the pitfalls that a web developer must (unfortunately) bear in mind.

I Am Not My Cookies

Standard web applications make use of cookies for authentication. When you visit a site and enter your credentials, the site's response sets a session cookie in your browser. Subsequent requests to the site use the information in the cookie to determine 'who you are' and whether 'you' are allowed to do what 'your' request is trying to do. I use quotations in the prior sentence to highlight the fact that HTTP cookies are a poor method of establishing user identity. If another, malicious, web site you visit manages to trick you into sending a request to the original site, that request will contain your cookie, and the good site may treat that request as legitimate and execute it. This is the infamous cross-site request forgery (CSRF) attack.

Belay applications eschew the use of cookies, especially for authentication, and thus they are secure by design against this type of vulnerability. This raises the question: without cookies, how do Belay applications decide whether a request is authenticated? The answer may shock you (as it did me): all requests that reach request handler code are treated as legitimate. At this point, we must examine the server-side of Belay apps in greater detail.

Web Capabilities

Your everyday possibly-CSRF-vulnerable site probably has a URL scheme with well-known endpoints that lead directly to application functionality. For example, to post to your blog, you (typically via your browser) send a POST request to www.blog.com/post with your cookies and the blog body's text. The server-side handler finds your account in the database using your cookie, checks that your account can post to that blog, and adds a new post. If the whole surface of the site's URL space is well-known, a CSRF-ing attacker can exercise the entirety of a user's view of the site with one compromised cookie.

In contrast, Belay applications have few well-known URLs, corresponding to the public entry points to the site (the login and contact pages, for instance). Instead, Belay's libraries allow server-side code to dynamically generate random unique URLs and map them to request handler functions. Each of these handlers services a particular type of request for a particular set of data. The randomly generated "capability" URLs are embedded in the JavaScript or markup returned to the browser. In a well-designed Belay application, each page has the minimum necessary set of capabilities to carry out its mission, and the capabilities are scoped to the minimum set of data with which they need concern themselves. After you successfully log in to a Belay site, the response will contain the set of capabilities needed by the page, and scoped to only that data which is needed by the page's functionality and associated with your user account. No cookies are necessary to identify you as a user or to authenticate your requests.

A Belay app uses its limited URL scheme as its primary security mechanism, ignoring requests unless they come along trusted capability URLs created by a prior, explicit grant. As long as we can rely on our platform's ability to generate unguessable large random numbers, attackers are out of luck. And, even if a capability URL is leaked from its page, it is scoped to only a small set of data on the server, so the vulnerability is limited. This is a much-improved situation compared to a site using cookie-based authentication---leaking a cookie leaks access to the user's entire view of the site.

Grants and Cap Handlers

Here's a Belay request handler, taken from Resume:

class GetLetterHandler(bcap.CapHandler):
    def get(self, reference):
        filename = get_letter_filename(reference)
        return file_response(filename, 'letter')

This handler simply looks up the filename associated with a reference and returns it (using a few helper functions). Accessing a letter written by an applicant's reference is quite a sensitive operation---letting the wrong person access a letter would be a serious security bug. Yet, GetLetterHandler is a two-liner with no apparent security checks or guards. How can this be safe?

To answer this, we need to consider how a client can cause GetLetterHandler to be invoked. The Belay server library will only invoke this handler via capability URLs created with a grant to GetLetterHandler. So, we can search the codebase for code that granted such access. A quick search shows one spot:

class GetApplicantsHandler(bcap.CapHandler):
    def get(self, reviewer):
        applicants_json = []
        for applicant in reviewer.get_applicants():
            # ... some processing
            refs_json = []
            for ref in applicant.get_references():
                refs_json.append({
                    'refName': ref.name,
                    'getLetter': bcap.grant(GetLetterHandler, ref)
                })
            # ... add some things to applicants_json
        return bcap.bcapResponse(applicants_json)

When GetApplicantsHandler is invoked, it will return a structure that, for each applicant, shows something like:

{ name: 'Theodore Roosevelt', getLetter: 'https://resume.cs.brown.edu/cap/f7327056-4b91-ad57-e5e4f6c514b6' }

On the server, the string f7327056-4b91-ad57-e5e4f6c514b6 was created and mapped to the pair of GetLetterHandler and the Reference database item for Theodore Roosevelt. A GET request to the URL above will return the reference letter. Note a nice feature of this setup: the server doesn't use any information from the client, other than the capability URL, to decide which reference's letter to return. Thus, a client cannot try providing different IDs or other parameters to explore which letters they have access to. Only those explicitly granted are accessible.
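To make the mapping concrete, here is a minimal sketch of a capability table. It is written in JavaScript rather than Belay's Python, and it is not Belay's API: the names capTable, grant, dispatch, and getLetter, and the /cap/ path prefix, are all hypothetical and chosen only for illustration.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory capability table: random token -> { handler, data }.
const capTable = new Map();

// Create an unguessable URL path bound to one handler and one data item.
function grant(handler, data) {
  const token = crypto.randomBytes(16).toString('hex'); // 128 random bits
  capTable.set(token, { handler: handler, data: data });
  return '/cap/' + token;
}

// Dispatch a request using nothing but the URL path.
function dispatch(path) {
  const match = /^\/cap\/([0-9a-f]{32})$/.exec(path);
  const entry = match ? capTable.get(match[1]) : undefined;
  if (!entry) {
    return { status: 404, body: null }; // unknown or guessed URLs get nothing
  }
  return { status: 200, body: entry.handler(entry.data) };
}

// Hypothetical handler: it receives only the data it was granted over,
// so it has no ID parameter for a client to tamper with.
function getLetter(reference) {
  return 'Letter from ' + reference.name;
}

const capUrl = grant(getLetter, { name: 'Theodore Roosevelt' });
const ok = dispatch(capUrl);
const probe = dispatch('/cap/' + '0'.repeat(32));
```

In this sketch the handler contains no access check of its own, just like GetLetterHandler. The only way to reach it is through a path that grant produced, and a probe with a made-up token falls through to the 404 branch.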

Poking around in the codebase more, we can see that GetApplicantsHandler is only granted to reviewers, who can only create accounts via an email from the administrator. This reasoning is how we convince ourselves, as developers, that we haven't screwed up and given away the ability to see a letter to the wrong user. We do all of this without worrying about a check on accessing the letter, instead relying on the unguessability of the URLs generated by grant to enforce our access restrictions.

This may seem like a new-concept overload, and indeed, I had that exact reaction at first. Over time I gained familiarity with the Belay style, and I became more and more convinced by the benefits it offers. Porting Resume became a fairly straightforward process of identifying each server-side request handler, converting it to a Belay handler, and ensuring that whatever pages needed that functionality received grants to call the handler. There were wrinkles, many due to the fact that Resume also uses Flapjax (a language/library for reactive programming in the browser). Flapjax is another Brown PLT product and it is certainly worthy of its own blog post. We had to account for the interaction between Belay's client-side library and Flapjax.

Note that Belay isn't the first place these ideas have surfaced. Belay builds on foundational research: Waterken and PLT Web Server both support cookie-less, capability-based web interactions. The Belay project addresses broader goals in identity management and sharing on the web, but we've leveraged its libraries to build a more robust system for ourselves.

In the end, the benefits of the redesigned Resume are numerous. Cookies are no longer involved. JavaScript code doesn't know or care about unique IDs for picking items out of the database. Random HTTP request probes result in a 404 response and a line in the server's log, instead of possible data corruption. You can open as many tabs as you like, with each one logged into its own Resume account, and experience no unwanted interference. We were able to realize these improvements while re-using a significant portion of the original Resume code, unchanged.

After my experience with the Resume port, I'm certainly a Belay fan. The project has more to say about topics such as cross-site authorization, sharing, and multi-site identity management, so check out their site and stay tuned for future updates:
Belay Research

S5: Semantics for Accessors

2011-12-11T00:00:00+00:00

Getters and setters (known as accessors) are a new feature in ECMAScript 5 that extend the behavior of assignment and lookup expressions on JavaScript objects. If a field has a getter defined on it, rather than simply returning the value in field lookup, a getter function is invoked, and its return value is the result of the lookup:

var timesGotten = 0;
var o = {get x() { timesGotten++; return 22; }};
o.x; // calls the function above, evaluates to 22
timesGotten; // is now 1, due to the increment in the getter
o.x; // calls the function above, still evaluates to 22
timesGotten; // is now 2, due to another increment in the getter

Similarly, if a field has a setter defined on it, the setter function is called on field update. The setter function gets the assigned value as its only argument, and its return value is ignored:

var foo = 0;
var o = {set x(v) { foo = v; }};
o.x = 37; // calls the function above (with v=37)
foo; // evaluates to 37
o.x; // evaluates to undefined

Getters and setters have a number of proposed uses―they can be used to wrap DOM objects that have interesting effects on assignment, like onmessage and onbeforeunload, for example. We leave discovering good uses to more creative JavaScript programmers, and focus on their semantic properties here.

The examples above are straightforward, and it seems like a simple model might work out quite easily. First, we need some definitions, so we'll start with what's in λ_JS. Here's a fragment of the values that λ_JS works with, and the most basic of the operations on objects:

v := str | { str₁:v₁, ⋯, str_n:v_n } | func(x ⋯) . e | ⋯
e := e[e] | e[e=e] | e(e, ⋯) | ⋯

(E-Lookup) { ⋯, str:v, ⋯ }[str_x] → v when str_x = str
(E-Update) { ⋯, str:v, ⋯}[str_x=v'] → { ⋯, str:v', ⋯} when str_x = str
(E-UpdateAdd) { str₁:v₁, ⋯}[str=v] → { str:v, str₁:v₁, ⋯} when str ≠ str₁, ⋯

We look up and update fields when they are found, and add fields if there is an update on a not-found field. Clearly, this isn't enough to model the semantics of getters and setters. On lookup, if the value of a field is a getter, we need to have our semantics step to an invocation of the function. We need to make the notion of a field richer, so the semantics can have behavior that depends on the kind of field. We distinguish two kinds of fields p, one for simple values and one for accessors:

p := [get: v_g, set: v_s] | [value: v]
v := str | { str₁:p₁, ⋯, str_n:p_n } | func(x ⋯) . e | ⋯
e := e[e] | e[e=e] | e(e, ⋯) | ⋯

The updated rules for simple values are trivial to write down (the only difference is that each field's value is now wrapped as [value: v]):

(E-Lookup) { ⋯, str:[value:v], ⋯ }[str_x] → v when str_x = str
(E-Update) { ⋯, str:[value:v], ⋯}[str_x=v'] → { ⋯, str:[value:v'], ⋯} when str_x = str
(E-UpdateAdd) { str₁:v₁, ⋯}[str=v] → { str:[value:v], str₁:v₁, ⋯} when str ≠ str₁, ⋯

But now we can also handle the cases where we have a getter or setter. If a lookup expression e[e] finds a getter, it applies the function, and the same goes for setters, which get the value as an argument:

(E-LookupGetter) { ⋯, str:[get:v_g, set:v_s], ⋯ }[str_x] → v_g() when str_x = str
(E-UpdateSetter) { ⋯, str:[get:v_g, set:v_s], ⋯}[str_x=v'] → v_s(v') when str_x = str

Great! This can handle the two examples from the beginning of the post. But those two examples weren't the whole story for getters and setters, and our first fragment wasn't the whole story for λ_JS objects.

Consider this program:

var o = {
  get x() { return this._x + 1; },
  set x(v) { this._x = v * 2; }
};
o.x = 5; // calls the set function above (with v=5)
o._x; // evaluates to 10, because of assignment in the setter
o.x; // evaluates to 11, because of addition in the getter

Here, we see that the functions also have access to the target object of the assignment or lookup, via the this parameter. We could try to encode this into our rules, but let's not get too far ahead of ourselves. JavaScript objects have more subtleties up their sleeves. We can't forget about prototype inheritance. Let's start with the same object o, this time called parent, and use it as the prototype of another object:

var parent = {
  get x() { return this._x + 1; },
  set x(v) { this._x = v * 2; }
};
var child = Object.create(parent);
child.x = 5; // Sets... what exactly to 10?
parent._x; // ???
child._x; // ???
parent.x; // ???
child.x; // ???

Take a minute to guess what you think each of the values should be. The answers appear below (hopefully they are what you expected).

(Update: These answers were changed on June 20, 2012 when we noticed a bug. parent.x used to have undefined listed as the answer, which is incorrect.)

parent._x; // undefined
child._x; // 10
parent.x; // NaN (parent._x is undefined, undefined + 1 = NaN)
child.x; // 11

So, JavaScript is passing the object in the lookup expression into the function, for both field access and field update. Something else subtle is going on, as well. Recall that before, when an update occurred on a field that wasn't present, JavaScript simply added it to the object. Now, on field update, we see that the assignment traverses the prototype chain to check for setters. This is fundamentally different from JavaScript before accessors―assignment never considered prototypes. So, our semantics needs to do two things:

Pass the correct this argument to getters and setters;

Traverse the prototype chain for assignments.
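The difference in assignment behavior is easy to observe directly. The following sketch contrasts a plain data field on a prototype with an accessor on a prototype; the variable names are ours and purely illustrative.

```javascript
// Data property on the prototype: assignment adds an own field to the child
// and leaves the parent alone.
var dataParent = { x: 1 };
var dataChild = Object.create(dataParent);
dataChild.x = 5;

// Accessor on the prototype: assignment finds the setter up the chain and
// calls it with `this` bound to the child.
var accessorParent = {
  set x(v) { this._x = v * 2; }
};
var accessorChild = Object.create(accessorParent);
accessorChild.x = 5;

var has = Object.prototype.hasOwnProperty;
var results = {
  dataChildOwnsX: has.call(dataChild, 'x'),                        // true
  dataParentX: dataParent.x,                                       // 1
  accessorChildOwnsX: has.call(accessorChild, 'x'),                // false
  accessorChildOwnsUnderscoreX: has.call(accessorChild, '_x'),     // true
  accessorParentOwnsUnderscoreX: has.call(accessorParent, '_x')    // false
};
```

In the accessor case no x field is ever added to the child. The only new field is _x, and it lands on the child because that is the object the setter received as this.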

Let's think about a simple way to pass the this argument to getters:

(E-LookupGetter) { ⋯, str:[get:v_g, set:v_s], ⋯ }[str_x] → v_g({ ⋯, str:[get:v_g, set:v_s], ⋯ }) when str_x = str

Here, we simply copy the object over into the first argument to the function v_g. We can (and do) desugar functions to have an implicit first this argument to line up with this invocation. But we need to think carefully about this rule's interaction with prototype inheritance.

Here is E-Lookup-Proto from the original λ_JS:

(E-Lookup-Proto) { str₁:v₁, ⋯, "__proto__": v_p, str_n:v_n, ⋯}[str] → v_p[str] when str ≠ str₁, ⋯, str_n, ⋯

Let's take a moment to look at this rule in conjunction with E-LookupGetter. If the field isn't found, and __proto__ is present, it looks up the __proto__ field and performs the same lookup on that object (we are eliding the case where __proto__ is not present or not an object for this presentation). But note something crucial: the expression on the right-hand side drops everything about the original object except its prototype. If we applied this rule to child above, the getter rule would pass parent to the getter instead of child!
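We can reproduce this problem in a toy model. The sketch below is not the λ_JS implementation; it assumes a hypothetical object representation (a fields map plus a proto link, built by a helper we call makeObj) and a naiveLookup function that mirrors the two rules as written, handing the getter whichever object holds the field.

```javascript
// Hypothetical object model: fields map names to {value} or {get, set};
// proto plays the role of the "__proto__" field.
function makeObj(fields, proto) {
  return { fields: fields, proto: proto || null };
}

// Naive lookup: the getter receives whichever object holds the field.
function naiveLookup(obj, name) {
  if (obj === null) { return undefined; }
  var field = obj.fields[name];
  if (field === undefined) {
    return naiveLookup(obj.proto, name); // the original object is forgotten here
  }
  if ('get' in field) { return field.get(obj); }
  return field.value;
}

var parent = makeObj({
  x: {
    get: function (self) { return naiveLookup(self, '_x') + 1; },
    set: undefined
  }
});
var child = makeObj({ _x: { value: 10 } }, parent);

// The getter is handed parent, which has no _x, so we get undefined + 1.
var naiveResult = naiveLookup(child, 'x'); // NaN

// Real JavaScript passes the child, so the same shape of program gives 11.
var realParent = { get x() { return this._x + 1; } };
var realChild = Object.create(realParent);
realChild._x = 10;
var realResult = realChild.x; // 11
```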

The solution is to keep track of the original object as we traverse the prototype chain. If we don't, the reduction relation simply won't have the information it needs to pass in to the getter or setter when it reaches the right point in the chain. This is a deep change―we need to modify our expressions to get it right:

p := [get: v_g, set: v_s] | [value: v]
v := str | { str₁:p₁, ⋯, str_n:p_n } | func(x ⋯) . e | ⋯
e := e[e] | e[e=e] | e^v[e] | e^v[e=e] | e(e, ⋯) | ⋯

And now, when we do a prototype lookup, we can keep track of the same this argument (written as v_t) the whole way up the chain, and the rules for getters and setters can use this new piece of the expression:

(E-Lookup-Proto) { str₁:v₁, ⋯, "__proto__": v_p, str_n:v_n, ⋯}^v_t[str] → v_p^v_t[str] when str ≠ str₁, ⋯, str_n, ⋯
(E-LookupGetter) { ⋯, str:[get:v_g, set:v_s], ⋯ }^v_t[str_x] → v_g(v_t) when str_x = str
(E-UpdateSetter) { ⋯, str:[get:v_g, set:v_s], ⋯}^v_t[str_x=v'] → v_s(v_t,v') when str_x = str

This idea was inspired by Di Gianantonio, Honsell, and Liquori's 1998 paper, A lambda calculus of objects with self-inflicted extension. They use a similar encoding to model method dispatches in a small prototype-based object calculus. The original expressions, e[e] and e[e=e], simply copy values into the new positions once the subexpressions have reduced to values:

(E-Lookup) v[str] → v^v[str]
(E-Update) v[str=v'] → v^v[str=v']

The final set of evaluation rules and expressions is a little larger:

p := [get: v_g, set: v_s] | [value: v]
v := str | { str₁:p₁, ⋯, str_n:p_n } | func(x ⋯) . e | ⋯
e := e[e] | e[e=e] | e^v[e] | e^v[e=e] | e(e, ⋯) | ⋯

(E-Lookup) v[str] → v^v[str]
(E-Update) v[str=v'] → v^v[str=v']
(E-LookupGetter) { ⋯, str:[get:v_g, set:v_s], ⋯ }^v_t[str_x] → v_g(v_t) when str_x = str
(E-Lookup-Proto) { str₁:v₁, ⋯, "__proto__": v_p, str_n:v_n, ⋯}^v_t[str] → v_p^v_t[str] when str ≠ str₁, ⋯, str_n, ⋯
(E-UpdateSetter) { ⋯, str:[get:v_g, set:v_s], ⋯}^v_t[str_x=v'] → v_s(v_t,v') when str_x = str
(E-Update-Proto) { str₁:v₁, ⋯, "__proto__": v_p, str_n:v_n, ⋯}^v_t[str=v'] → v_p^v_t[str=v'] when str ≠ str₁, ⋯, str_n, ⋯
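As a sanity check, the rules can be transcribed into a small executable sketch. This is not the S5 interpreter: the object representation and the names makeObj, lookup, and update are hypothetical, objects are mutated in place rather than rebuilt as in the reduction rules, and the handling of plain value fields on update (which the rules above elide) is our own assumption.

```javascript
// Hypothetical object model: fields map names to {value} or {get, set};
// proto plays the role of the "__proto__" field.
function makeObj(fields, proto) {
  return { fields: fields, proto: proto || null };
}

// E-Lookup: v[str] -> v^v[str]
function lookup(obj, name) {
  return lookupWithThis(obj, obj, name);
}

function lookupWithThis(obj, thisArg, name) {
  if (obj === null) { return undefined; }
  var field = obj.fields[name];
  if (field === undefined) {
    return lookupWithThis(obj.proto, thisArg, name); // E-Lookup-Proto
  }
  if ('get' in field) { return field.get(thisArg); } // E-LookupGetter
  return field.value;
}

// E-Update: v[str=v'] -> v^v[str=v']
function update(obj, name, value) {
  return updateWithThis(obj, obj, name, value);
}

function updateWithThis(obj, thisArg, name, value) {
  if (obj !== null) {
    var field = obj.fields[name];
    if (field === undefined) {
      return updateWithThis(obj.proto, thisArg, name, value); // E-Update-Proto
    }
    if ('set' in field) {
      field.set(thisArg, value); // E-UpdateSetter; the setter's result is ignored
      return value;
    }
  }
  // Assumption: with no setter on the chain, write a value field on the
  // original object (this covers both overwriting and adding).
  thisArg.fields[name] = { value: value };
  return value;
}

// The parent/child program from the post, with `this` as an explicit first argument.
var parent = makeObj({
  x: {
    get: function (self) { return lookup(self, '_x') + 1; },
    set: function (self, v) { update(self, '_x', v * 2); }
  }
});
var child = makeObj({}, parent);
update(child, 'x', 5);

var answers = {
  parentUnderscoreX: lookup(parent, '_x'), // undefined
  childUnderscoreX: lookup(child, '_x'),   // 10
  parentX: lookup(parent, 'x'),            // NaN
  childX: lookup(child, 'x')               // 11
};
```

The four answers match the ones listed for the parent and child example earlier in the post, which is a small piece of evidence that threading v_t through the prototype traversal gives the intended behavior.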

These are most of the rules; we've elided some details to present only the key insight behind the new ones. Our full semantics (discussed in our last post) handles the details of the arguments object that is implicitly available within getters and setters, and the use of built-ins, like defineProperty, to add already-defined functions to existing objects as getters and setters.
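For readers who haven't seen that last feature, here is a short illustration in plain JavaScript of Object.defineProperty installing existing functions as a getter and setter. The names counter, readCount, and writeCount are ours; the log also records arguments.length to show what the implicit arguments object looks like inside each accessor.

```javascript
var log = [];

// Ordinary functions, defined before the object knows anything about them.
function readCount() {
  log.push('get:' + arguments.length); // a getter is called with no arguments
  return this._count;
}
function writeCount(v) {
  log.push('set:' + arguments.length); // a setter is called with exactly one
  this._count = v;
}

var counter = { _count: 0 };
Object.defineProperty(counter, 'count', {
  get: readCount,
  set: writeCount,
  enumerable: true,
  configurable: true
});

counter.count = 3;        // calls writeCount with this = counter
var seen = counter.count; // calls readCount with this = counter, evaluates to 3
```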

S5: A Semantics for Today's JavaScript

2011-11-11T00:00:00+00:00

The JavaScript language isn't static―the ECMAScript committee is working hard to improve the language, and browsers are implementing features both in and outside the spec, making it difficult to understand just what "JavaScript" means at any point in time. Existing implementations aren't much help―their goal is to serve pages well and fast. We need a JavaScript architecture that can help us make sense of the upcoming (and existing!) features of the language.

To this end, we've developed S5, an ECMAScript 5 runtime, built on λ_JS, with the explicit goal of helping people understand and tinker with the language. We built it to understand the features in the new standard, building on our previous efforts for the older standard. We've now begun building analyses for this semantics, and are learning more about it as we do so. We're making it available with the hope that you can join us in playing with ES5, extending it with new features, and building tools for it.

S5 implements the core features of ES5 strict mode. How do we know this? We've tested S5 against Test262 to measure our progress. We are, of course, not feature complete, but we're happy with our progress, which you can check out here.

A Malleable Implementation

The semantics of S5 is designed to be two things: a language for writing down the algorithms of the specification, and a translation target for JavaScript programs. We've implemented an interpreter for S5, and a desugaring function that translates JavaScript source into S5 programs.

We have a number of choices to make in defining desugaring. The ECMAScript standard defines a whole host of auxiliary functions and library routines that we must model. Putting these implementations directly in the desugaring function would work, but would make desugaring unnecessarily brittle, and require recompilation on every minor change. Instead, we implement the bulk of this functionality as an S5 program. The majority of our work happens in an environment file that defines the spec in S5 itself. The desugaring defines a translation from the syntactic forms of JavaScript to the (smaller) language of S5, filled with calls into the functions defined in this environment.

This separation of concerns is what makes our implementation so amenable to exploration. If you want to try something out, you can edit the environment file and rerun whatever tests you care to learn about. Want to try a different implementation of the == operator? Just change the definition, as it was pulled from the spec, at line 300. Want a more expressive Object.toString() that doesn't just print "[object Object]"? That's right here. No changing an interpreter, no recompiling a desugaring function necessary.
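The shape of this design can be sketched in a few lines. What follows is a toy, not S5: the AST format, the desugar and interpret functions, and the environment entry named %EqEq are all hypothetical, and the real environment is written in S5 itself rather than in JavaScript.

```javascript
// Hypothetical environment: spec-level algorithms live here as ordinary functions.
var environment = {
  '%EqEq': function (a, b) { return a == b; }
};

// Hypothetical desugaring: surface syntax becomes calls into the environment.
function desugar(expr) {
  switch (expr.type) {
    case 'literal':
      return expr;
    case 'binop':
      if (expr.op === '==') {
        return {
          type: 'app',
          fn: '%EqEq',
          args: [desugar(expr.left), desugar(expr.right)]
        };
      }
      throw new Error('unsupported operator: ' + expr.op);
    default:
      throw new Error('unsupported expression: ' + expr.type);
  }
}

// A tiny interpreter for the core language: literals and environment calls.
function interpret(core, env) {
  if (core.type === 'literal') { return core.value; }
  if (core.type === 'app') {
    var args = core.args.map(function (arg) { return interpret(arg, env); });
    return env[core.fn].apply(null, args);
  }
  throw new Error('unknown core form: ' + core.type);
}

// The surface program: 1 == '1'
var program = {
  type: 'binop',
  op: '==',
  left: { type: 'literal', value: 1 },
  right: { type: 'literal', value: '1' }
};
var core = desugar(program);
var looseResult = interpret(core, environment); // true

// Trying a different == only touches the environment, not desugar or interpret.
var altEnvironment = {
  '%EqEq': function (a, b) { return a === b; }
};
var altResult = interpret(core, altEnvironment); // false
```

The same desugared program runs against both environments, which is the property that makes experimentation cheap: the translation stays fixed while the definitions it calls into change.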

The environment we've written reflects the standard's algorithms as we understand them in terms of S5. The desugaring from JavaScript to S5 code with calls into this library is informed by the specification's definitions of expression and statement evaluation. We have confidence in the combination of desugaring and library implementation, given our increasing test coverage. Further, we know how to continue―implement more of the spec and pass more test cases!

More than λ_JS

S5 is built on λ_JS, but extends it in three significant ways:

Explicit getters and setters;

Object fields with attributes, like writable and configurable, built-in;

Support for eval().

For those who haven't fiddled with getters and setters, they are a new feature introduced in ECMAScript 5 that allow programmer-defined behavior on property access and assignment. Getters and setters fundamentally change how property access and assignment work. They make property assignment interact with the prototype chain, which used to not be the case, and cause syntactically similar expressions to behave quite differently at runtime. In a separate post we'll discuss the interesting problems they introduce for desugaring and how we implement them in the semantics. (Update: This post has been written, check it out!)

Attributes on objects weren't treated directly in the original λ_JS. In 5th Edition, they are crucial to several security-relevant operations on objects. For example, the standard specifies Object.freeze(), which makes an object's properties forever unwritable. S5 directly models the writable and configurable attributes that make this operation possible, and make its implementation in S5 easy to understand.
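The effect of Object.freeze() on these attributes can be seen from ordinary JavaScript. This sketch only observes the language-level behavior; it says nothing about how S5 represents attributes internally, and the names point and tryWrite are ours.

```javascript
var point = { x: 1 };

var before = Object.getOwnPropertyDescriptor(point, 'x');
Object.freeze(point);
var after = Object.getOwnPropertyDescriptor(point, 'x');

// In strict mode, writing to an unwritable field raises a TypeError.
function tryWrite(obj) {
  'use strict';
  try {
    obj.x = 2;
    return 'written';
  } catch (e) {
    return e instanceof TypeError ? 'TypeError' : 'other error';
  }
}

var writeOutcome = tryWrite(point);
var summary = {
  writableBefore: before.writable,          // true
  configurableBefore: before.configurable,  // true
  writableAfter: after.writable,            // false
  configurableAfter: after.configurable,    // false
  extensibleAfter: Object.isExtensible(point), // false
  writeOutcome: writeOutcome,               // 'TypeError'
  xAfter: point.x                           // 1
};
```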

λ_JS explicitly elided eval() from its semantics. S5 implements eval() by performing desugaring within the interpreter and then interpreting the desugared code. We implement only the strict mode version of eval, which restricts the environment that the eval'd code can affect. With these restrictions, we can implement eval in a straightforward way within our interpreter. We'll cover the details of how we do this, and why it works, in another post.
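The restriction is visible from JavaScript itself. The sketch below compares a direct eval in non-strict code with one in strict code; it demonstrates the language behavior only, not S5's implementation. We build both functions with the Function constructor so that each body controls its own mode regardless of how the surrounding script is loaded.

```javascript
// Functions built with the Function constructor are non-strict unless their
// body says otherwise, so each body below picks its own mode.
var sloppyEval = new Function(
  "eval('var leaked = 1;'); return typeof leaked;"
);
var strictEval = new Function(
  "'use strict'; eval('var leaked = 1;'); return typeof leaked;"
);

// Non-strict: the eval'd var is added to the calling function's scope.
var sloppyEvalResult = sloppyEval(); // "number"

// Strict: the eval'd code gets its own scope, so nothing leaks out.
var strictEvalResult = strictEval(); // "undefined"
```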

Building on S5

There's a ton we can do with S5. More, in fact, than we can do by ourselves:

Experiment with Harmony features: ECMAScript 6, or Harmony, as it is often called, is being designed right now. Proxies, string interpolation, syntactic sugar for classes, and modules are just a few of the upcoming features. Modeling them in S5 would help us understand these features better as they get integrated into the language.

Build Verification Tools: Verification based on objects' attributes is an interesting research problem―what can we prove about interacting programs if we know about unwritable fields and inextensible objects? Building this knowledge into a type-checker or a program analysis could give interesting new guarantees.

Abstract Our Machine: Matt Might and David van Horn wrote about abstracting λ_JS for program analysis. We've added new constructs to the language since then. Do they make abstraction any harder?

Complete the Implementation: We've made a lot of progress, but there's still more ground to cover. We need support for more language features, like JSON and regular expressions, that would move our implementation along immensely. We'll work on this more, but anyone who wants to get involved is welcome to help.

If any of this sounds interesting, or if you're just curious, go ahead and check out S5! It's open source and lives in a GitHub repository. Let us know what you do with it!

The Essence of JavaScript

2011-09-29T00:00:00+00:00

Back in 2008, the group decided to really understand JavaScript. Arjun had built a static analysis for JavaScript from scratch. Being the honest chap that he is, he was forced to put the following caveat into the paper:

"We would like to formally prove that our analysis is sound. A sound analysis would guarantee that our tool will never raise a false alarm, an important usability concern. However, a proof of soundness would require a formal semantics for JavaScript and the DOM in browsers, and this does not exist."

A "formal semantics for JavaScript [...] does not exist"? Didn't he know about the official documents on such matters, the ECMAScript standard? ECMAScript 3rd edition, the standard at the time, was around 180 pages long, written in prose and pseudocode. Reading it didn't help much. It includes gems such as this description of the switch statement:

1. Let A be the list of CaseClause items in the first CaseClauses, in source text order.
2. For the next CaseClause in A, evaluate CaseClause. If there is no such CaseClause, go to step 7.
3. If input is not equal to Result(2), as defined by the !== operator, go to step 2.
4. Evaluate the StatementList of this CaseClause.
5. If Result(4) is an abrupt completion then return Result(4).
6. Go to step 13.
7. Let B be the list of CaseClause items in the second CaseClauses, in source text order.
8. For the next CaseClause in B, evaluate CaseClause. If there is no such CaseClause, go to step 15.
9. If input is not equal to Result(8), as defined by the !== operator, go to step 8.
10. Evaluate the StatementList of this CaseClause.
11. If Result(10) is an abrupt completion then return Result(10).
12. Go to step 18.
...

And this is just one of 180 pages of lesser or greater eloquence. With this as his formal reference, it's no wonder Arjun had a hard time making soundness claims.

Around the same time, Ankur Taly, Sergio Maffeis, and John Mitchell noticed the same problem. They presented a formal semantics for JavaScript in their APLAS 2008 paper. You can find their semantics here, and it is a truly staggering effort, running for 40+ pages (that's at least four times easier to understand!). However, we weren't quite satisfied. Their semantics formalizes the ECMAScript specification as written, and therefore inherits some of its weirdness, such as heap-allocated "scope objects", implicit coercions, etc. We still couldn't build tools over it, and were unwilling to do 40-page case analyses for proofs. Leo Meyerovich, peon extraordinaire and friend of the blog, felt the same:

"Challenging current attempts to analyze JavaScript, there is no formal semantics realistic enough to include many of the attack vectors we have discussed yet structured and tractable enough that anyone who is not the inventor has been able to use; formal proofs are therefore beyond the scope of this work."

How To Tackle JavaScript: The PLT Way

We decided to start smaller. In the fall of 2009, Arjun wrote down a semantics for the "core" of JavaScript that fits on just three pages (that's 60 times easier to understand!). This is great programming languages research—we defined away the hairy parts of the problem and focused on a small core that was amenable to proof. For these proofs, we could assume the existence of a trivial desugaring that maps real JavaScript programs into programs in the core semantics, which Arjun dubbed λ_JS.

Things were looking great until one night Arjun had a few too many glasses of wine and decided to implement desugaring. Along with Claudiu Saftoiu, he wrote a thousand lines of Haskell that turns JavaScript programs into λ_JS programs. Even worse, they implemented an interpreter for λ_JS, so the resulting programs actually run. They had therefore produced a JavaScript runtime.

Believe it or not, there are other groups in the business of creating JavaScript runtimes, namely Google, Mozilla, Microsoft, and a few more. And since they care about the correctness of their implementations, they have actual test suites. Which Arjun's system could run, and give answers for, that may or may not be the right ones:

As it turns out, Arjun and Claudiu did a pretty good job. λ_JS agrees with Mozilla SpiderMonkey on a few thousand lines of tests. We say "agrees" and not "passes" because SpiderMonkey fails some of its own tests. Without any other standard of correctness, λ_JS strives for bug-compatibility with SpiderMonkey on those tests.

Building on λ_JS

λ_JS is discussed in our ECOOP paper, but it's the work built on λ_JS that's most interesting. We've built the following systems ourselves:

A type-checker for JavaScript that employs a novel mix of type-checking and flow analysis ("flow typing"), discussed in our ESOP 2011 paper, and

An extension to the above type-checker to verify ADsafe, as discussed in our USENIX Security 2011 paper.

Others have built on λ_JS too:

David van Horn and Matt Might use λ_JS to build an analytic framework for JavaScript,

Rodolfo Toledo and Éric Tanter use λ_JS to specify aspects for JavaScript,

IBEX, from Microsoft Research, uses λ_JS for its JavaScript backend to produce verified Web browser extensions, and

Others have a secret reimplementation of λ_JS in Java. We are now enterprise-ready.

Want to use λ_JS to write JavaScript tools? Check out our software and let us know what you think!

Coming up next: The latest version of JavaScript, ECMAScript 5th ed., is vastly improved. We've nearly finished updating our JavaScript semantics to match ECMAScript 5th ed. Our new semantics uses the official ECMAScript test suite and tackles problems, such as eval, that the original λ_JS elided. We'll talk about it next time. Update: We've written about our update, dubbed S5, its semantics for accessors, and a particularly interesting example.

ADsafety

2011-09-13T00:00:00+00:00

A mashup is a webpage that mixes and mashes content from various sources. Facebook apps, Google gadgets, and various websites with embedded maps are obvious examples of mashups. However, there is an even more pervasive use case of mashups on the Web. Any webpage that displays third-party ads is a mashup. It's well known that third-party content can include third-party cookies; your browser can even block these if you're concerned about "tracking cookies". However, third-party content can also include third-party JavaScript that can do all sorts of wonderful and malicious things (just some examples).

Is it possible to safely embed untrusted JavaScript on a page? Google Caja, Microsoft Web Sandbox, and ADsafe are language-based Web sandboxes that try to do so. Language-based sandboxing is a programming language technique that restricts untrusted code using static and runtime checks, and by rewriting potentially dangerous calls into calls to safe, trusted functions.
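To give a flavor of the runtime-check half of this technique, here is a deliberately tiny sketch. It is not how ADsafe, Caja, or Web Sandbox implement their checks: the safeGet function and the contents of the BANNED list are hypothetical, and a real sandbox has to cover far more cases.

```javascript
// Hypothetical ban list; a real sandbox's list is longer and carefully audited.
var BANNED = ['constructor', 'eval', 'prototype', '__proto__', 'caller', 'callee'];

// Trusted wrapper that untrusted code must go through for computed field access.
function safeGet(obj, name) {
  var key = String(name); // coerce once, so the check and the access agree
  if (BANNED.indexOf(key) !== -1) {
    throw new Error('blocked field access: ' + key);
  }
  return obj[key];
}

// A rewriting step would turn untrusted `o[f]` into `safeGet(o, f)`.
var widget = { title: 'ad' };
var allowed = safeGet(widget, 'title'); // 'ad'

var blocked;
try {
  safeGet(widget, 'constructor');
  blocked = false;
} catch (e) {
  blocked = true;
}
```

Even this toy hints at why sandboxing is hard: the check is only as good as the ban list, and every path to a field access has to be routed through the wrapper.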

Sandboxing JavaScript, with all its corner cases, is particularly hard. A single bug can easily break the entire sandboxing system. JavaScript sandboxes do not clearly state their intended guarantees, nor do they clearly argue why they are safe.

This is how ADsafe works.

Verifying Web Sandboxes

A year ago, we embarked on a project to verify ADsafe, Douglas Crockford's Web sandbox. ADsafe is admittedly the simplest of the aforementioned sandboxes. But, we were also after the shrimp bounty that Doug offers for sandbox-breaking bugs:

"Write a program [...] that calls the alert function when run on any browser. If the program produces no errors when linted with the ADsafe option, then I will buy you a plate of shrimp."

A year later, we've produced a USENIX Security paper on our work, which we presented in San Francisco in August. The paper discusses the many common techniques employed by Web sandboxes and discusses the intricacies of their implementations. (TLDR: JavaScript and the DOM are really hard.) Focusing on ADsafe, it precisely states what ADsafety actually means. The meat of the paper is our approach to verifying ADsafe using types. Our verification leverages our earlier work on semantics and types for JavaScript, and also introduces some new techniques:

Check out the ★s and ☠s in our object types; we use them to type-check "scripty" features of JavaScript. ☠ marks a field as "banned" and ★ specifies the type of all other fields.

We also characterize JSLint as a type-checker. The Widget type presented in the paper specifies, in 20 lines, the syntactic restrictions of JSLint's ADsafety checks.

Unlike conventional type systems, ours does not prevent runtime errors. After all, stuck programs are safe because they trivially don't execute any code. If you think type systems only catch "method not found" errors, you should have a look at ours.

We found bugs in both ADsafe and JSLint that manifested as type errors. We reported all of them and they were promptly fixed by Doug Crockford. A big thank you to Doug for his encouragement, for answering our many questions, and for buying us every type of shrimp dish in the house.

Doug Crockford, Joe, Arjun, and seven shrimp dishes

Learn more about ADsafety! Check out:

The paper, code, and proofs;

Video of Arjun presenting at USENIX Security;

ADsafe and JSLint.

Modify or delete the contents of your USB storage	58.8 %
Send sticky broadcast	60 %
Control vibration	67.5 %
View Wi-Fi connections	70 %
Read phone status and identity	70 %
Test access to protected storage	72.5 %
Google Play license check	73.8 %
Run at startup	75.8 %
Read Google service configuration	76.3 %
Full network access	76.5 %
Approximate location	79 %
View network connections	80.5 %
Find accounts on the device	82.5 %

`T₁`	`T₂`	`T₁ <: T₂` if...
`f: S`	`f: T`	`S <: T`
`f: -`	`f: T`	Never
`f: S`	`f: -`	Always
`f: -`	`f: -`	Always

`T₁`	`T₂`	`T₁ <: T₂` if...
`f^↓: S`	`f^↓: T`	`S <: T`
`f^○: S`	`f^↓: T`	Never
`f: -`	`f^↓: T`	Never
`f^↓: S`	`f^○: T`	`S <: T`
`f^○: S`	`f^○: T`	`S <: T`
`f: -`	`f^○: T`	Never
`f^↓: S`	`f: -`	Always
`f^○: S`	`f: -`	Always
`f: -`	`f: -`	Always

`T₁`	`T₂`	`T₁ <: T₂` if...
`0^↓: Bool`	`0^○: Bool`	`Bool <: Bool`
`1^↓: Bool`	`1^○: Bool`	`Bool <: Bool`
`3: -`	`3^○: Bool`	Fail!
`4: -`	`4^○: Bool`	Fail!
...	...	...

Strict outside?	Strict inside?	Direct or Indirect?	Local or global scope?	Affects scope?
Yes	Yes	Indirect	Global	No
No	Yes	Indirect	Global	No
Yes	No	Indirect	Global	Yes
No	No	Indirect	Global	Yes
Yes	Yes	Direct	Local	No
No	Yes	Direct	Local	No
Yes	No	Direct	Local	No
No	No	Direct	Local	Yes