Articles by tag: Verification

Sharing is Scaring: Why is Cloud File-Sharing Hard?
Practical Static Analysis for Privacy Bugs
Lightweight Diagramming for Lightweight Formal Methods
LTL Tutor
Misconceptions In Finite-Trace and Infinite-Trace Linear Temporal Logic
Differential Analysis: A Summary
Forge: A Tool to Teach Formal Methods
Little Tricky Logics
Applying Cognitive Principles to Model-Finding Output
Automated, Targeted Testing of Property-Based Testing Predicates
Crowdsourcing User Studies for Formal Methods
User Studies of Principled Model Finder Output
Tierless Programming for SDNs: Differential Analysis
Tierless Programming for SDNs: Verification
Tierless Programming for SDNs: Optimality
Tierless Programming for SDNs: Events
Tierless Programming for Software-Defined Networks
Verifying Extensions' Compliance with Firefox's Private Browsing Mode
Aluminum: Principled Scenario Exploration Through Minimality
ADsafety

Sharing is Scaring: Why is Cloud File-Sharing Hard?

Tags: Semantics, User Studies, Verification

Posted on 25 August 2025.

Here is an actual situation we were asked to help non-technical computer users with:

Alice and Bob want to collaborate on a flyer for a social event. They are more comfortable with Word than with cloud-based tools like Google Docs. They have both heard that Dropbox is a good way to share and jointly edit files.

Alice thus creates a draft of the flyer F on Dropbox. She then shares F with Bob. Bob makes a change using Word and informs Alice he has done so. Alice opens F and sees no change. Bob confirms he is editing a file that is in Dropbox. They have several baffling rounds of exchange.

Do you see the problem? No? Here’s the critical question: How did Alice share F? Alice dragged the file from the Folder interface into Google Mail. (Do you see it now?)

The problem is that “sharing” means at least two very different things: providing a link to a single shared resource, or making a copy of the resource. Each one is correct for some contexts and wrong for others.

Does this sound familiar? To us, it certainly does! It’s the same kind of problem we saw when studying programming misconceptions, which we have examined in great detail as part of our work on SMoL and the SMoL Tutor. Indeed, in this work, we call this the central analogy: the deep semantic similarity between

sharing document links ↔ aliasing,
downloading or attaching documents ↔ copying objects, and
editing documents ↔ mutation.

We therefore find it especially unsurprising that when you combine mutation with sharing, people get confused. (Of course, there may be a deeper problem: that people are just not very good at reasoning about sharing and mutation. Cloud file-sharing and programming may just be two concrete instantiations of that abstract difficulty, and the same problems may arise in other domains as well.) Essentially, cloud file-sharing operations form an end-user instruction set architecture (EUISA) over which users “program” to achieve their goals.
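The central analogy can be made concrete in a few lines of Python. This is a minimal sketch of the Alice-and-Bob scenario, not code from the paper; the names `flyer`, `shared_link`, and `attachment` are illustrative assumptions.

```python
import copy

# Alice's flyer, modeled as a mutable object (the "document").
flyer = {"title": "Social Event", "body": ["Draft text"]}

# Sharing a link ~ aliasing: both names refer to the same object.
shared_link = flyer

# Attaching the file to an email ~ copying: Bob gets an independent object.
attachment = copy.deepcopy(flyer)

# Bob edits the attachment (mutation of the copy).
attachment["body"].append("Bob's change")
alice_sees_attachment_edit = "Bob's change" in flyer["body"]

# Bob edits through the link instead (mutation through the alias).
shared_link["body"].append("Bob's second change")
alice_sees_link_edit = "Bob's second change" in flyer["body"]

print(alice_sees_attachment_edit)  # False
print(alice_sees_link_edit)        # True
```

The first edit is invisible to Alice for the same reason a mutation to a copied object is invisible through the original reference: what Alice sent Bob was a copy.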

This paper builds on this idea in the following ways:

We identify this analogy both in the abstract and through several concrete examples (drawn from personal experience).
We use our knowledge of programming semantics misconceptions to design corresponding cloud file-sharing situations, and show that people are confused there also.
Drawing on the computing education literature, we devise an interesting way to get crowdsourcing subjects to not only “trace” but also “program” over the EUISA, and show that difficulties extend to both situations.
We present a formal semantics for cloud file-sharing.
We then discuss how this formal semantics might be employed to help end-users.

For more details, see our paper!

As a fun aside: The formal model is built using Forge. We were going to build a custom visualization but, to get a quick prototype in place, we used Cope and Drag. But the CnD spec was so good that we never needed to build that custom visualization!

Practical Static Analysis for Privacy Bugs

Tags: Privacy, Rust, Tools, Verification

Posted on 03 August 2025.

Privacy bugs are deeply problematic for software users (“once it’s out there you can’t take it back”), legally significant (due to laws like the GDPR), and difficult for programmers to find and to keep out. Static program analysis would therefore appear to be very helpful here.

Unfortunately, making an effective tool runs into several problems:

Rules need to be expressed in a way that is auditable by legal and policy experts, which means they cannot be too close to the level of code.
Policies need to then be mapped to the actual code.
The analysis needs to be as precise as possible.
The analysis needs to also be as quick as possible—ideally, fast enough to integrate into interactive development.

Our new system, Paralegal, checks off all these boxes:

It supports policies expressed in a first-order logic, but written in a stylized English form. For instance, here is a policy stating that there is a way for all personal data to be deleted:

Somewhere:
1. For each "user data" type marked user_data:
  A. There is a "source" that produces "user data" where:
    a. There is a "deleter" marked deletes where:
      i) "source" goes to "deleter"

It introduces markers as a key abstraction for mapping the policy onto the program. Several of the terms above — e.g., user_data — are designated in a lightweight way in the source program:
```
#[paralegal::marker(user_data)]
```
It deeply leverages Rust’s type system to obtain useful summaries of function behavior without having to traverse their bodies. In particular, this avoids the pain of writing mock versions of functions, which are time-consuming and error-prone, without sacrificing correctness or precision. It also has some additional optimizations, such as adaptive approximation.
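To give a feel for what checking the deletion policy above involves, here is a rough sketch in Python over a toy dataflow graph. It is not Paralegal's implementation or API: the graph, the node names, and the `produces` table are all hypothetical, and "goes to" is approximated by plain graph reachability.

```python
from collections import deque

# Hypothetical toy dataflow graph: node -> nodes its output flows into.
flows = {
    "load_posts": ["delete_rows"],
    "load_emails": ["send_newsletter"],
    "delete_rows": [],
    "send_newsletter": [],
}

# Hypothetical marker assignments (stand-ins for source-level annotations).
type_markers = {"Post": {"user_data"}, "Email": {"user_data"}}
node_markers = {"delete_rows": {"deletes"}}

# Hypothetical table: which node produces values of which type.
produces = {"load_posts": "Post", "load_emails": "Email"}


def reachable(start):
    """Nodes that data from `start` can flow to (breadth-first search)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in flows.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def deletion_policy_violations():
    """Return the user_data types with no source that flows to a deleter."""
    violations = []
    for ty, marks in type_markers.items():
        if "user_data" not in marks:
            continue
        sources = [node for node, t in produces.items() if t == ty]
        satisfied = any(
            "deletes" in node_markers.get(target, set())
            for src in sources
            for target in reachable(src)
        )
        if not satisfied:
            violations.append(ty)
    return violations


print(deletion_policy_violations())  # ['Email']
```

In this toy graph, `Post` data reaches a node marked `deletes` but `Email` data never does, so only `Email` violates the policy.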

As a result, Paralegal is able to efficiently and effectively analyze several third-party, real-world codebases.

For more details, see our paper!

Lightweight Diagramming for Lightweight Formal Methods

Tags: Diagram, Formal Methods, Spatial, Tools, Verification, Visualization

Posted on 09 June 2025.

We now also have Spytial, a newer version of this idea for diagramming program values.

Formal methods tools like Alloy and Forge help users define, explore, verify, and diagnose specifications for complex systems incrementally. A crucial feature of these tools is their visualizer, which lets users explore generated models through graphical representations.

In some cases, they suffer from familiar usability issues—such as overlapping lines, unclear labels, or cluttered layouts—that make it hard to understand what the model represents. But even when the diagrams are clean and well-organized, they can still be confusing if the layout doesn’t reflect the structure that users expect.

For example, a visualization of a binary tree might place left and right children arbitrarily, rather than arranging them to the left and right of their parent node. This breaks with common conventions and can make it harder for users to quickly recognize familiar structures. Similarly, if a model is meant to convey a specific organization (e.g., the grouping of files in a directory) but the visualization fails to reflect this, it becomes difficult to discern the intended relationships, making the model harder to interpret and reason about.

Fig 1. Default Forge visualization of a binary tree with 10 nodes. A node’s left and right children are not consistently laid out to its left and right.

Fig 2. Alloy visualization of a file system model. Aside from the overlapping edges, the layout fails to convey relationships between entries that share a directory.

Our previous research has explored using custom, domain-specific visualizations to address this challenge. Yet existing tools for custom visualization come with several significant drawbacks:

Users must learn and use entirely different languages to create custom visualizations, learning skills (like CSS) that have nothing to do with formal modeling.
The level of detail required by these languages (such as controlling how elements are rendered) often makes the visualizer code larger and more complex than the models themselves.
Most critically, these visualizations can be brittle. While careful visualization design can handle certain edge cases, they are inherently limited by their authors’ assumptions about potential issues. The very “unknown unknowns” that lightweight formal methods exist to help surface are often masked by visualizers.

We encountered this issue even when building visualizations for relatively simple data structures that we understand well. It is well-documented that experts suffer from blind spots about what mistakes students might make. When faced with one such modeling mistake (failing to specify that a node’s left and right children must be distinct), our custom visualizer failed silently.

Fig 3. The custom visualizer fails to guard against instances that are DAGs: a graph in which Node -1’s left and right children are both Node -2 (left) is rendered as a tree in which Node -1 has two distinct children, each with value -2 (right).

Such failures aren’t merely aesthetic—they actively prevent users from discovering a specification error.

Cope and Drag (or CnD) is a novel lightweight diagramming language built to address these issues. CnD’s design was driven by two approaches:

A top-down exploration of cognitive science principles that influence spatial reasoning, visualization, and diagramming.
A bottom-up analysis that distills patterns from dozens of actual custom visualizations.

The resulting language is small, requires minimal annotation, and can be used incrementally. The key idea is that each CnD operation meaningfully refines the default visualization. These operations include constraining spatial layout (such as positioning child nodes below their parents in a binary tree), grouping elements (like clustering related components in a software architecture), and directing drawing style (for instance, coloring nodes in a red-black tree based on their color).

Fig 4. CnD visualization of the 10-node binary tree in Fig 1. A node’s left and right children are consistently laid out to its left and right.

Fig 5. CnD visualization of the file system model in Fig 2. Grouping conveys how files are related.

Rather than prioritizing aesthetics, CnD focuses on encoding the spatial intuitions implicit in communicating the model. Its lightweight, declarative structure captures these relationships directly.

Fig 6. This figure shows how to incrementally refine default Forge output of an instance describing a face using (A) grouping, (B) orientation, and (C) icons.

Lightweight Specs, Not Programs: Diagramming with CnD resembles writing a spec, not coding a full program. An empty spec yields a default diagram; each constraint refines it. Unlike traditional visualization tools like D3 (where you don’t get a visualization until you’ve written a full program), CnD supports incremental visualization, making it easy to start and evolve alongside your model.
Useful, not pretty: Generated diagrams may lack the visual polish of more sophisticated tools, but they prioritize structural clarity and correctness over style. The trade-off is a lower ceiling: users have less fine-grained control over how diagram elements are rendered (e.g., spacing, fonts, shapes).
Fail Loudly: CnD constraints are hard constraints. When a diagram fails to match the model, the system prevents visualization and produces a solver-generated error.

For instance, a CnD specification for a binary tree might encode tree layouts as two constraints (lay the left child below and to the left of its parent, and the right child below and to the right of the parent). When faced with the DAG described earlier, CnD identifies that these visualization constraints are unsatisfiable, and produces an error message instead of a misleading diagram.
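The reason the two tree-layout constraints cannot both hold for the DAG can be sketched in Python. This is an illustration of the idea only, not CnD's solver or syntax: the function names are hypothetical, and each constraint is reduced to a strict "must be drawn to the left of" ordering, which is satisfiable exactly when the ordering has no cycle.

```python
def horizontal_order_edges(left, right):
    """Build 'a must be drawn strictly left of b' edges.

    `left` and `right` map each parent to its left/right child.
    """
    edges = {}
    for parent, child in left.items():
        edges.setdefault(child, set()).add(parent)   # left child left of parent
    for parent, child in right.items():
        edges.setdefault(parent, set()).add(child)   # parent left of right child
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the ordering graph."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    nodes = set(edges) | {n for targets in edges.values() for n in targets}
    return any(color.get(n, WHITE) == WHITE and visit(n) for n in nodes)


def layout_is_satisfiable(left, right):
    """Strict left-of constraints are satisfiable iff they are acyclic."""
    return not has_cycle(horizontal_order_edges(left, right))


# A proper tree: A has distinct children B (left) and C (right).
print(layout_is_satisfiable(left={"A": "B"}, right={"A": "C"}))  # True

# The DAG from the modeling mistake: B is both children of A.
print(layout_is_satisfiable(left={"A": "B"}, right={"A": "B"}))  # False
```

In the second case, B would have to be both left of and right of A, so no drawing exists, and reporting that is more useful than silently drawing B twice.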

Fig 7. When visualization constraints are unsatisfiable, CnD produces an error message instead of a diagram.

CnD isn’t the final word on diagramming. It’s one design point in a larger landscape, trading visual polish for ease of use, structural clarity, and exploration of model-diagram mismatches. Other tools will (and should) explore different trade-offs.

CnD is embedded in an open-source visualizer for Forge. We encourage you to try it as part of your existing Forge workflows. To learn more about CnD, please read our paper!

LTL Tutor

Tags: Linear Temporal Logic, Education, Formal Methods, Misconceptions, Properties, Tools, Verification

Posted on 08 August 2024.

We have been engaged in a multi-year project to improve education in Linear Temporal Logic (LTL) [Blog Post 1, Blog Post 2]. In particular, we have arrived at a detailed understanding of typical misconceptions that learners and even experts have. One useful outcome of our studies is a set of instruments (think “quizzes”) that instructors can deploy in their classes to understand how well their students understand the logic and what weaknesses they have.

However, as educators, we recognize that it isn’t always easy to add new materials to classes. Furthermore, suppose your students make certain mistakes: now what? They need explanations of what went wrong, additional drill problems, and checks on whether they got the additional ones right. It’s hard for an educator to make time for all that. And independent learners don’t even have access to such educators.

Recognizing these practical difficulties, we have distilled our group’s expertise in LTL into a free online tutor:

https://ltl-tutor.xyz

We have leveraged insights from our studies to create a tool designed to be used by learners with minimal prior instruction. All you need is a brief introduction to LTL (or even just propositional logic) to get started. As an instructor, you can deliver your usual lecture on the topic (using your preferred framing), and then have your students use the tool to grow and to reinforce their learning.

In contrast to traditional tutoring systems, which are often tied to a specific curriculum or course, our tutor adaptively generates multiple-choice question-sets in the context of common LTL misconceptions.

The tutor provides students who get a question wrong with feedback in terms of their answer’s relationship to the correct answer. Feedback can take the form of visual metaphors, counterexamples, or an interactive trace-stepper that shows the evaluation of an LTL formula across time.

In this example of LTL tutor feedback, a diagram is used to show students that their answer is logically more permissive than the correct answer to the question. The learner is also shown an example of an LTL trace that satisfies their answer but not the correct answer.

In this example of interactive stepper usage, the user examines the satisfaction of the formula `(G (z <-> X(a)))` in the third state of a trace. While the overall formula is not satisfied at this moment in time, the sub-formula `(X a)` is satisfied. This allows learners to explore where their understanding of a formula may have diverged from the correct answer.

If learners consistently demonstrate the same misconception, the tutor provides them further insight in the form of tailored text grounded in our previous research.

Here, a student who consistently assumes the presence of the `Globally` operator even when it is not present, is given further insight into the pertinent semantics of the operator.

Once it has a history of misconceptions exhibited by the student, the tutor generates novel, personalized question sets designed to drill students on their specific weaknesses. As students use the tutor, the system updates its understanding of their evolving needs, generating question sets to address newly uncovered or pertinent areas of difficulty.

We also designed the LTL Tutor with practical instructor needs in mind:

Curriculum Agnostic: The LTL Tutor is flexible and not tied to any specific curriculum. You can seamlessly integrate it into your existing course without making significant changes. It both generates exercises for students and allows you to import your own problem sets.
Detailed Reporting: To track your class’s progress effectively, you can create a unique course code for your students to enter, so you can get detailed insights into their performance.
Self-Hostable: If you prefer to have full control over your data, the LTL Tutor can easily be self-hosted.

To learn more about the Tutor, please read our paper!

Misconceptions In Finite-Trace and Infinite-Trace Linear Temporal Logic

Tags: Linear Temporal Logic, Crowdsourcing, Misconceptions, User Studies, Verification

Posted on 07 July 2024.

We now also have an automated tutor that puts this material to work to help students directly.

Over the past three years and with a multi-national group of collaborators, we have been digging deeper into misconceptions in LTL (Linear Temporal Logic) and studying misconceptions in LTLf, a promising variant of LTL restricted to finite traces. Why LTL and LTLf? Because beyond their traditional uses in verification and now robot synthesis, they support even more applications, from image processing to web-page testing to process-rule mining. Why study misconceptions? Because ultimately, human users need to fully understand what a formula says before they can safely apply synthesis tools and the like.

Our original post on LTL misconceptions gives more background and motivation. It also explains the main types of questions we use: translations between English specifications and formal specifications.

So what’s new this time around?

First, we provide two test instruments that have been field tested with several audiences:

One instrument [PDF] focuses on the delta between LTL and LTLf. If you know LTL but not LTLf, give it a try! You’ll come away with hands-on experience of the special constraints that finite traces bring.
The other instrument [PDF] is for LTL beginners — to see what preconceptions they bring to the table. It assumes basic awareness of G (“always”), F (“eventually”), and X (“next state”). It does not test the U (“until”) operator. Live survey here.

Second, we find evidence for several concrete misconceptions in the data. Some misconceptions were identified in prior work and are confirmed here. Others are new to this work.

For example, consider the LTLf formula: G(red => X(X(red))). What finite traces satisfy it?

In particular, can any finite traces that have red true at some point satisfy the formula?

Answer:
No, because whenever red is true it must be true again two states later, but every finite trace will eventually run out of states.
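This answer can be checked mechanically with a small finite-trace evaluator. The sketch below is our own illustration, not one of the instruments: formulas are nested tuples (a hypothetical encoding), a trace is a list of sets of true propositions, and `X` is read as the strong next of LTLf, which requires a successor state to exist.

```python
def holds(formula, trace, i=0):
    """Evaluate an LTLf formula at position i of a finite, non-empty trace."""
    op = formula[0]
    if op == "atom":
        return formula[1] in trace[i]
    if op == "implies":
        return (not holds(formula[1], trace, i)) or holds(formula[2], trace, i)
    if op == "X":  # strong next: fails in the last state
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "G":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "F":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    raise ValueError(f"unknown operator: {op}")


red = ("atom", "red")
formula = ("G", ("implies", red, ("X", ("X", red))))  # G(red => X(X(red)))

no_red = [set(), set(), set()]
red_once = [set(), {"red"}, set(), set()]
red_every_other = [{"red"}, set(), {"red"}, set(), {"red"}]

print(holds(formula, no_red))           # True
print(holds(formula, red_once))         # False
print(holds(formula, red_every_other))  # False
```

Even the trace that repeats red every two states fails, because its final red state has no state two steps later to satisfy `X(X(red))`.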

Now consider the LTL formula F(X(X(red))). Is it true for an infinite trace where red is true exactly once, somewhere after the first two states?

Answer:
Yes. But interestingly, some of our LTL beginners said no on the grounds that X(X(red)) ought to "spread out" and constrain three states in a row.
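The infinite-trace case can be checked in the same spirit. In this sketch (again our own illustration, with a hypothetical tuple encoding), an infinite trace is given as a finite `prefix` followed by a non-empty `loop` that repeats forever. The example assumes the single red state sits at index 2 or later; if red occurs only in one of the first two states, there is no earlier state from which `X(X(red))` can hold.

```python
def holds(formula, prefix, loop, i=0):
    """Evaluate an LTL formula on the infinite trace prefix + loop + loop + ..."""
    states = prefix + loop
    n, loop_start = len(states), len(prefix)
    op = formula[0]
    if op == "atom":
        return formula[1] in states[i]
    if op == "X":  # the last loop state wraps around to the loop start
        nxt = i + 1 if i + 1 < n else loop_start
        return holds(formula[1], prefix, loop, nxt)
    if op in ("F", "G"):
        # Positions reachable from i: the rest of the prefix and the whole loop.
        future = range(min(i, loop_start), n)
        results = (holds(formula[1], prefix, loop, j) for j in future)
        return any(results) if op == "F" else all(results)
    raise ValueError(f"unknown operator: {op}")


red = ("atom", "red")
formula = ("F", ("X", ("X", red)))  # F(X(X(red)))

# red is true exactly once, in the third state; afterwards nothing is true.
print(holds(formula, prefix=[set(), set(), {"red"}], loop=[set()]))  # True

# red is true exactly once, but only in the very first state.
print(holds(formula, prefix=[{"red"}], loop=[set()]))  # False
```

The first trace satisfies the formula with a single red state, so `X(X(red))` does not "spread out" over three consecutive states.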

Third, we provide a code book of misconceptions and how to identify them in new data [PDF].

For more details, see the paper.

See also our LTL Tutor (traditional LTL only, not finite-trace).

Differential Analysis: A Summary

Tags: Differential Analysis, Formal Methods, Properties, Verification

Posted on 27 June 2024.

For multiple decades we have worked on the problem of differential analysis. This post explains where it comes from, what it means, and what its consequences are.

Context: Verification

For decades, numerous researchers have worked on the problem of verification. To a first approximation, we can describe this problem as follows:

P ⊧ ɸ

That is, checking whether some program P satisfies some property ɸ. There have been many productive discussions about exactly what methods we should use, but this remains the fundamental question.

Starting in around 2004, we started to build tools to verify a variety of interesting system descriptions (the Ps), starting with access-control policies. They could also be security policies, network configurations, and more. We especially recognized that many of these system descriptions are (sufficiently) sub-Turing-complete, which means we can apply rich methods to precisely answer questions about them. That is, there is a rich set of problems we can verify.

The Problem

Attractive as this idea is, it runs into a significant problem in practice. When you speak to practitioners, you find that they are not short of system descriptions (Ps), but they are severely lacking in properties (ɸs). The problem is not what some might imagine — that they can’t express their properties in some precise logic (which is a separate problem!) — but rather that they struggle to express non-trivial properties at all. A typical conversation might go like:

We have a tool for verification!
Nice, what does it do?
It consumes system descriptions of the kind you produce!
Oh, that’s great!
You just need to specify your properties.
My what?
Your properties! What you want your system to do!
I don’t have any properties!
But what do you want the system to do?
… Work correctly?

This is not to mock practitioners: not at all. Quite the contrary! Even formal methods experts would struggle to precisely describe the expected behavior of complex systems. In fact, talk to formal methods researchers long enough and they’ll admit knowing that this is a problem. It’s just not something we like to think about.

An Alternative

In fact, the “practitioner” answer shows us the path forward. What does it mean for a system to “work”? How does a system’s maintainer know that it “works”?

As a practical matter, many things help us confirm that a system is working well enough. We might have some test suites, we might have monitoring of its execution, we observe it run and use it; lots of people are using it every day; we might even have verified a few properties of a few parts! The net result is that we have confidence in a system.

And then, things happen! Typically, one of two things:

We find a bug, and need to fix it.
We modify an existing feature or add a new one (or — all too rarely — remove one!).

So the problem we run into is the following:

How do we transfer the confidence we had
in the old version to the new one?

Put differently, the core of formal methods is checking for compatibility between two artifacts. Traditionally, we have a system description (P) and a property (ɸ); these are meant to be expressed independent of one another, so that compatibility gives us confidence and incompatibility indicates an error. But now we have two different artifacts: an old system (call it P) and a new one (call it P’). These are obviously not going to be the same (except in rare cases; see below), but broadly, we want to know, how are they different?

P - P'

Of course, what we care about is not the syntactic difference, but the semantic change. Large syntactic differences may have small semantic ones and vice versa.

Defining the Difference

Computing the semantic difference is often not easy. There is a long line of work on computing the difference of programs. However, it is difficult for Turing-complete languages; it is also not always clear what the type of the difference should be. Computing it is a lot easier when the language is sub-Turing-complete (as our papers show). The question of exactly what a “difference” is is also interesting.

Many of the system description languages we have worked with tend to be of the form Request ⇒ Response. For instance, an access control policy might have the type:

Request ⇒ {Accept, Deny}

(In practice, policy languages can be much richer, but this suffices for illustration.) So what is the type of the difference? It maps every request to the cross-product of responses:

Request ⇒ {Accept↦Accept, Deny↦Deny, Accept↦Deny, Deny↦Accept}

That is: some requests that used to be accepted still are; some that were denied still are; but some that were accepted are now denied, and some that were denied are now accepted. (This assumes the domains are exactly the same; otherwise some requests that previously produced a decision no longer do, and vice versa. We’ll assume you can work out the details of domains with bottom values.) The difference is of course the requests whose outcomes change: in this case,

Request ⇒ {Accept↦Deny, Deny↦Accept}
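For a tiny, finite request space, this difference can be computed by brute force. The sketch below is only an illustration of the types involved: the two policies, the `ROLES` and `ACTIONS` domains, and the function names are made up, and enumerating every request would not scale to realistic policies.

```python
from itertools import product

# Hypothetical request space: every (role, action) pair.
ROLES = ["admin", "staff", "guest"]
ACTIONS = ["read", "write"]


def old_policy(role, action):
    """Old version P: admins do anything; staff and guests may read."""
    if role == "admin":
        return "Accept"
    if action == "read":
        return "Accept"
    return "Deny"


def new_policy(role, action):
    """New version P': staff may now write; guests lose all access."""
    if role in ("admin", "staff"):
        return "Accept"
    return "Deny"


def semantic_diff(old, new):
    """Map each request whose decision changed to its (old, new) decisions."""
    full = {req: (old(*req), new(*req)) for req in product(ROLES, ACTIONS)}
    return {req: change for req, change in full.items() if change[0] != change[1]}


for request, (before, after) in semantic_diff(old_policy, new_policy).items():
    print(request, f"{before} -> {after}")
# ('staff', 'write') Deny -> Accept
# ('guest', 'read') Accept -> Deny
```

The intermediate `full` dictionary has the four-outcome type shown earlier; filtering out unchanged decisions leaves exactly the two-outcome difference.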

Using the Difference

Depending on how we compute the difference, we can treat the difference essentially as a database. That is, it is a set of pairs of request and change-of-response. The database perspective is very productive, because we can do many things with this database:

Queries: What is the set of requests whose decisions go from Deny↦Accept? These are places where we might look for data leaks.
Views: What are all the requests whose decisions go from Accept↦Deny? These are all the requests that lost access. We might then want to perform queries over this smaller database: e.g., who are the entities whose requests fall in it?

And perhaps most surprisingly:

Verification: Confirm that as a result of this change, certain entities did not gain access.

In our experience, administrators who would not be able to describe properties of their system can define properties of their changes. Indeed, classical verification even has a name for some such properties: they’re called “frame conditions”, and mean, in effect, “and nothing else changed”. Here, we can actually check for these. It’s worth noting that these properties are not true of either the old or new systems! For instance, certain individuals may have had privileges before and will have them after the alteration; all we’re checking is that their privileges did not change.
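Treating the difference as a database makes these three uses easy to sketch. The records below are hypothetical (the `user`, `role`, and `action` fields are assumptions made for illustration), and each row describes one request whose decision changed.

```python
# Hypothetical difference "database": one row per request whose decision changed.
diff = [
    {"user": "alice", "role": "staff", "action": "write", "old": "Deny", "new": "Accept"},
    {"user": "bob", "role": "guest", "action": "read", "old": "Accept", "new": "Deny"},
    {"user": "carol", "role": "guest", "action": "read", "old": "Accept", "new": "Deny"},
]

# Query: which requests went from Deny to Accept (possible data leaks)?
gained = [row for row in diff if (row["old"], row["new"]) == ("Deny", "Accept")]

# View: the requests that lost access, followed by a query over that view.
lost = [row for row in diff if (row["old"], row["new"]) == ("Accept", "Deny")]
who_lost = sorted({row["user"] for row in lost})


def no_access_gained(diff, users):
    """Verification: none of `users` appears in a row that is now accepted.

    Every row in `diff` is a change, so new == "Accept" means access was gained.
    """
    return all(not (row["user"] in users and row["new"] == "Accept") for row in diff)


print([row["user"] for row in gained])          # ['alice']
print(who_lost)                                 # ['bob', 'carol']
print(no_access_gained(diff, {"bob", "carol"}))  # True
print(no_access_gained(diff, {"alice"}))         # False
```

Note that `no_access_gained` is a property of the change alone: it says nothing about what bob or carol could do before or can do now, only that the edit did not grant them anything new.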

Uses

Having a general “semantic difference engine”, and being able to query it (interactively), is very powerful. We can perform all the operations we have described above. We can use it to check the consequences of an intended edit. In some rare cases, we expect the difference to be empty: e.g., when we refactor the policy to clean it up syntactically, but expect that the refactoring had no semantic impact. Finally, a semantic differencing engine is also useful as an oracle when performing mutation testing, as Martin and Xie demonstrated.

A Cognitive Perspective

We think there are a few different, useful framings of differential analysis.

One might sound like a truism: we’re not very good at thinking about the things that we didn’t think about. That is, when we make a change to the system, there was some intent behind the change; but it can be very difficult to determine all the consequences of that change. Our focus on the intended change can easily blind us to its other consequences. Overcoming these blind spots is very difficult for humans. A semantic differential analysis engine lays them bare.

Another is that we lack good tools to figure out what the properties of a system should be. Model-exploration tools (such as Alloy, or our derivative of it, Forge) are useful at prompting people to think about how they expect systems to behave and not behave. Differential output can also be such a spur: in articulating why something should or should not happen with a change, we learn more about the system itself.

Finally, it’s worth distinguishing the different conditions that lead to system changes. When working on features, we can often do so with some degree of flexibility. But when fixing bugs, we’re often in a hurry: we need to make a change to immediately block, say, a data leakage. If we’re being principled, we might add some tests to check for the intended behavior (and perhaps also to avoid regression); but at that moment, we are likely to be in especially poor shape to think through unintended consequences. Differential analysis helps keep the fix for one problem from introducing another.

Readings

Here are some of our papers describing differential analysis (which we have also called “change-impact analysis” in the past):

For access-control policies: paper
For obligations: paper
For firewalls: paper
For SDNs: blog; paper

Forge: A Tool to Teach Formal Methods

Tags: Education, Formal Methods, Properties, Tools, User Studies, Verification, Visualization

Posted on 21 April 2024.

For the past decade we have been studying how best to get students into formal methods (FM). Our focus is not on the 10% or so of students who will automatically gravitate towards it, but on the “other 90%” who don’t view it as a fundamental part of their existence (or of the universe). In particular, we decided to infuse FM thinking into the students who go off to build systems. Hence the course, Logic for Systems.

The bulk of the course focuses on solver-based formal methods. In particular, we began by using Alloy. Alloy comes with numerous benefits: it feels like a programming language, it can “Run” code like an IDE, it can be used for both verification and state-exploration, it comes with a nice visualizer, and it allows lightweight exploration with gradual refinement.

Unfortunately, over the years we have also run into various issues with Alloy, a full catalog of which is in the paper. In response, we have built a new FM tool called Forge. Forge is distinguished by the following three features:

Rather than plunging students into the full complexity of Alloy’s language, we instead layer it into a series of language levels.
We use the Sterling visualizer by default, which you can think of as a better version of Alloy’s visualizer. But there’s much more! Sterling allows you to craft custom visualizations. We use this to create domain-specific visualizations. As we show in the paper, the default visualization can produce unhelpful, confusing, or even outright misleading images. Custom visualization takes care of these.
In the past, we have explored property-based testing as a way to get students on the road from programming to FM. In turn, we are asking the question, “What does testing look like in this FM setting?” Forge provides preliminary answers, with more to come.

Just to whet your appetite, here is an example of what a default Sterling output looks like (Alloy’s visualizer would produce something similar, with fewer distinct colors, making it arguably even harder to see):

Default Sterling output

Here’s what custom visualization shows:

Custom Sterling output

See the difference?

For more details, see the paper. And please try out Forge!

Acknowledgements

We are grateful for support from the U.S. National Science Foundation (award #2208731).

Little Tricky Logics

Tags: Linear Temporal Logic, Crowdsourcing, Misconceptions, User Studies, Verification

Posted on 05 November 2022.

We also have followup work that continues to explore LTL and now also studies finite-trace LTL. In addition, we also have an automated tutor that puts this material to work to help students directly.

LTL (Linear Temporal Logic) has long been central in computer-aided verification and synthesis. Lately, it’s also been making significant inroads into areas like planning for robots. LTL is powerful, beautiful, and concise. What’s not to love?

However, any logic used in these settings must also satisfy the central goal of being understandable by its users. Especially in a field like synthesis, there is no second line of defense: a synthesizer does exactly what the specification says. If the specification is wrong, the output will be wrong in the same way.

Therefore, we need to understand how people comprehend these logics. Unfortunately, the human factors of logics have seen almost no attention in the research community. Indeed, if anything, the literature is rife with claims about what is “easy” or “intuitive” without any rigorous justification for such claims.

With this paper, we hope to change that conversation. We bring to bear on this problem several techniques from diverse areas—but primarily from education and other social sciences (with tooling provided by computer science)—to understand the misconceptions people have with logics. Misconceptions are not merely mistakes; they are validated understanding difficulties (i.e., having the wrong concept), and hence demand much greater attention. We are especially inspired by work in physics education on the creation of concept inventories, which are validated instruments for rapidly identifying misconceptions in a population, and take steps towards the creation of one.

Concretely, we focus on LTL (given its widespread use) and study the problem of LTL understanding from three different perspectives:

LTL to English: Given an LTL formula, can a reader accurately translate it into English? This is similar to what a person does when reading a specification, e.g., when code-reviewing work or studying a paper.
English to LTL: Given an English statement, can a reader accurately express it in LTL? This skill is essential for specification and verification.

Furthermore, “understanding LTL” needs to be divided into two parts: syntax and semantics. Therefore, we study a third issue:

Trace satisfaction: Given an LTL formula and a trace (sequence of states), can a reader accurately label the trace as satisfying or violating? Such questions directly test knowledge of LTL semantics.
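Trace-satisfaction questions have a mechanical answer, which is part of what makes them a clean test of semantics. As a rough sketch (not the instrument used in our studies), here is a small evaluator for infinite traces that eventually repeat. The encoding is an assumption made for this example: formulas are nested tuples, a state is the set of propositions true in it, and a trace is a finite prefix followed by a loop that repeats forever.

```python
def holds(formula, prefix, loop):
    """Check `formula` at position 0 of the infinite trace prefix . loop^omega."""
    if not loop:
        raise ValueError("loop must be non-empty so the trace is infinite")
    states = list(prefix) + list(loop)
    n = len(states)
    loop_start = len(prefix)

    def succ(i):
        # The last position wraps around to the start of the loop.
        return i + 1 if i + 1 < n else loop_start

    def sat(f):
        """Truth value of f at each of the n positions."""
        op = f[0]
        if op == "true":
            return [True] * n
        if op == "ap":
            return [f[1] in s for s in states]
        if op == "not":
            return [not v for v in sat(f[1])]
        if op == "and":
            return [a and b for a, b in zip(sat(f[1]), sat(f[2]))]
        if op == "or":
            return [a or b for a, b in zip(sat(f[1]), sat(f[2]))]
        if op == "X":
            inner = sat(f[1])
            return [inner[succ(i)] for i in range(n)]
        if op == "U":
            left, right = sat(f[1]), sat(f[2])
            result = [False] * n
            changed = True
            while changed:  # least fixpoint: the right side must eventually hold
                changed = False
                for i in range(n):
                    if not result[i] and (right[i] or (left[i] and result[succ(i)])):
                        result[i] = True
                        changed = True
            return result
        if op == "F":  # F f  ==  true U f
            return sat(("U", ("true",), f[1]))
        if op == "G":  # G f  ==  not F not f
            return sat(("not", ("F", ("not", f[1]))))
        raise ValueError(f"unknown operator: {op!r}")

    return sat(formula)[0]


p = ("ap", "p")
q = ("ap", "q")
# p holds in every other state of the loop: "infinitely often p" is
# satisfied, but "eventually always p" is not.
alternating = ([set()], [{"p"}, set()])
```

On the `alternating` trace, `holds(("G", ("F", p)), *alternating)` is true while `holds(("F", ("G", p)), *alternating)` is false. Swapping those two operators is an easy slip to make when reading a formula.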

Our studies were conducted over multiple years, with multiple audiences, and using multiple methods, with both formative and confirmatory phases. The net result is that we find numerous misconceptions in the understanding of LTL in all three categories. Notably, our studies are based on small formulas and traces, so we expect the set of issues will only grow as the instruments contain larger artifacts.

Ultimately, in addition to

finding concrete misconceptions,

we also:

create a codebook of misconceptions that LTL users have, and
provide instruments for finding these misconceptions.

We believe all three will be of immediate use to different communities, such as students, educators, tool-builders, and designers of new logic-based languages.

For more details, see the paper.

Applying Cognitive Principles to Model-Finding Output

Tags: User Studies, Verification, Visualization

Posted on 26 April 2022.

Model-finders produce output to help users understand the specifications they have written. They therefore effectively make assumptions about how these will be processed cognitively, but are usually unaware that they are doing so. What if we apply known principles from cognitive science to try to improve the output of model-finders?

Model Finding and Specification Exploration

Model-finding is everywhere. SAT and SMT solvers are the canonical model-finders: given a logical specification, they generate a satisfying instance (a “model”) or report that it’s impossible. Their speed and generality have embedded them in numerous back-ends. They are also used directly for analysis and verification, e.g., through systems like Alloy.

One powerful modality enabled by tools like Alloy is the exploration of specifications. Usually, model-finders are used for verification: you have a specification and some properties about it, and a verifier tells you whether the properties are satisfied or not. However, we often don’t have properties; we just want to understand the consequences of a design. While a conventional verifier is useless in this setting, model-finders have no problem with it: they will generate models of the specification that show different possible ways in which it can be realized.

Presenting Exploration

The models generated by exploration (or even by verification, where they are typically counterexamples) can be presented in several ways. For many users, the most convenient output is visual. Here, for instance, is a typical image generated using the Sterling visualizer:

Model visualization

As of this writing, Alloy will let you sequentially view one model at a time.

Exploration for Understanding

The purpose of showing these models is to gain understanding. It is therefore reasonable to ask what forms of presentation would be most useful to enable the most understanding. In earlier work we studied details of how each model is shown. That work is orthogonal to what we do here.

Here, we are interested in how many models, and of what kind, should be displayed. We draw on a rich body of literature in perceptual psychology going back to seminal work by Gibson and Gibson in 1955. A long line of work since then has explored several dimensions of this, resulting in a modern understanding of contrasting cases. In this work, you don’t show a single result; rather, you show a set of similar examples, to better help people build models of what they are seeing. Since our goal is to help people understand a specification through visual output, it was natural to ask whether any of this literature could help in our setting.

Our Study

We concretely studied numerous experimental conditions involving different kinds of contrasting cases, where we show multiple models on screen at once. Critically, we looked at the use of both positive and negative models. Positive models are what you expect: models of the specification. In contrast, “negative” models are ones that don’t model the specification.

There can, of course, be an infinite number of negative models, most of which are of no use whatsoever: if I write a specification of a leader-election protocol, a whale or a sandwich are legitimate negative models. What we are interested in is “near miss” models, i.e., ones that could almost have been models but for a small difference. Our theory was that showing these models would help a user better understand the “space” of their model. (In this, we were inspired by prior work by Montaghami and Rayside.)
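To make “near miss” concrete, here is a minimal sketch under assumptions of our own choosing for this post: an instance is a set of edges over three atoms, the specification is an ordinary Python predicate, and “a small difference” means toggling exactly one tuple. None of this is how Alloy or our study tooling computes negative instances; it only illustrates the idea.

```python
from itertools import product

NODES = (0, 1, 2)
UNIVERSE = sorted(product(NODES, NODES))  # every possible edge


def spec(edges):
    """Toy specification: no self-loops, and at most one outgoing edge per node."""
    no_self_loops = all(a != b for a, b in edges)
    at_most_one_successor = all(
        sum(1 for a, _ in edges if a == node) <= 1 for node in NODES
    )
    return no_self_loops and at_most_one_successor


def near_misses(instance, universe, satisfies):
    """Instances exactly one tuple-toggle away from `instance` that violate the spec."""
    misses = []
    for edge in universe:
        neighbor = instance ^ {edge}  # add the edge if absent, remove it if present
        if not satisfies(neighbor):
            misses.append(neighbor)
    return misses


positive = frozenset({(0, 1), (1, 2)})
negatives = near_misses(positive, UNIVERSE, spec)
```

Each element of `negatives` differs from `positive` by a single edge, so a viewer can see which small change breaks the specification.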

Our Findings

We study these questions through both crowdsourced and talkaloud studies, and using both quantitative and qualitative methods. We find that in this setting, the use of multiple models does not seem to have been a big win. (Had it been, we would still have to confront the problem of how to fit all that information onto a screen in the general case.) The use of negative instances does seem to be helpful. We also constructed novel modes of output such as where a user can flip between positive and negative instances, and these seem especially promising.

Of course, our findings come with numerous caveats. Rather than think of our results as in any way definitive, we view this as formative work for a much longer line of research at the intersection of formal methods and human-computer interaction. We especially believe there is enormous potential to apply cognitive science principles in this space, and our paper provides some very rough, preliminary ideas of how one might do so.

For More Details

You can read about all this in our paper. Be warned, the paper is a bit of heavy going! There are a lot of conditions and lots of experiments and data. But hopefully you can get the gist of it without too much trouble.

Automated, Targeted Testing of Property-Based Testing Predicates

Tags: Formal Methods, Properties, Testing, Verification

Posted on 24 November 2021.

Property-Based Testing (PBT) is not only a valuable software quality improvement method in its own right, it’s a critical bridge between traditional software development practices (like unit testing) and formal specification. We discuss this in our previous work on assessing student performance on PBT. In particular, we introduce a mechanism to investigate how students do by decomposing the property into a collection of independent sub-properties. This gives us semantic insight into how students perform: rather than a binary scale, we can identify specific sub-properties that they may have difficulty with.

While this preliminary work was very useful, it suffered from several problems, some of which are not surprising while others became clear to us only in retrospect. In light of that, our new work makes several improvements.

The previous work expected each of the sub-properties to be independent. However, this is too strong a requirement. For one, it masks problems that can lurk in the conjunction of sub-properties. The other problem is more subtle: when you see a surprising or intriguing student error, you want to add a sub-property that would catch that error, so you can generate statistics on it. However, there’s no reason the new property will be independent; in fact, it almost certainly won’t be.
Our tests were being generated by hand, with one exception that was so subtle, we employed Alloy to find the test. Why only once? Why not use Alloy to generate tests in all situations? And while we’re at it, why not also use a generator from a PBT framework (specifically, Hypothesis)?
And if we’re going to use both value-based and SAT-based example generators, why not compare them?

This new paper does all of the above. First, it results in a much more flexible, useful tool for assessing student PBT performance. Second, it revisits our previous findings about student performance. Third, it lays out architectures for PBT evaluation using SAT and a PBT-generator (specifically Hypothesis). In the process it explains various engineering issues we needed to address. Fourth, it compares the two approaches; it also compares how the two approaches did relative to hand-curated test suites.
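As an illustration of the sub-property idea (not the tool from the paper), consider a student-written predicate that is supposed to accept exactly the correct input/output pairs for a sorting function. The sketch below decomposes the reference property into sub-properties and reports which ones a student predicate fails to enforce. The problem, the sub-property names, the deliberately buggy `student_predicate`, and the use of the standard `random` module in place of Hypothesis or Alloy are all assumptions made for this example.

```python
import random
from collections import Counter


def is_sorted(inp, out):
    return all(a <= b for a, b in zip(out, out[1:]))


def same_length(inp, out):
    return len(inp) == len(out)


def same_elements(inp, out):
    return Counter(inp) == Counter(out)


# Note that these are not independent: same_elements implies same_length.
SUB_PROPERTIES = {
    "is_sorted": is_sorted,
    "same_length": same_length,
    "same_elements": same_elements,
}


def reference(inp, out):
    return all(check(inp, out) for check in SUB_PROPERTIES.values())


def student_predicate(inp, out):
    # Hypothetical student answer: checks order and length, forgets the elements.
    return is_sorted(inp, out) and same_length(inp, out)


def diagnose(student, trials=500, seed=0):
    """Count, per sub-property, the bad pairs that the student wrongly accepts."""
    rng = random.Random(seed)
    missed = Counter()
    for _ in range(trials):
        inp = [rng.randint(0, 3) for _ in range(rng.randint(0, 4))]
        out = [rng.randint(0, 3) for _ in range(rng.randint(0, 4))]
        if student(inp, out) and not reference(inp, out):
            for name, check in SUB_PROPERTIES.items():
                if not check(inp, out):
                    missed[name] += 1
    return missed


report = diagnose(student_predicate)
```

Because the sub-properties overlap, a single bad pair can be counted under more than one name. That is the kind of flexibility the independence requirement ruled out.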

You can read about all this in our paper.

Crowdsourcing User Studies for Formal Methods

Tags: Crowdsourcing, User Studies, Verification

Posted on 03 July 2017.

For decades, we have neglected performing serious user studies of formal-methods tools. This is now starting to change. An earlier post introduces our new work in this area.

That study works with students in an upper-level class, who are a fairly good proxy for some developers (and are anyway an audience we have good access to). Unfortunately, student populations are problematic for several reasons:

There are only so many students in a class. There may not be enough to obtain statistical strength, especially on designs that require A/B testing and the like.
The class is offered only so often. It may take a whole year between studies. (This is a common problem in computing education research.)
As students progress through a class, it’s hard to “rewind” them and study their responses at an earlier stage in their learning.

And so on. It would be helpful if we could obtain large numbers of users quickly, relatively cheaply, and repeatedly.

This naturally suggests crowdsourcing. Unfortunately, the tasks we are examining involve using tools based on formal logic, not identifying birds or picking Web site colors (or solving CAPTCHAs…). That would seem to greatly limit the utility of crowd-workers on popular sites like Mechanical Turk.

In reality, this depends on how the problem is phrased. If we view it as “Can we find lots of Turkers with knowledge of Promela (or Alloy or …)?”, the answer is pretty negative. If, however, we can rework the problems somewhat so the question is “Can we get people to work on a puzzle?”, we can find many, many more workers. That is, sometimes the problem is more one of vocabulary (and in particular, the use of specific formal methods languages) than of raw ability.

Concretely, we have taken the following steps:

Adapt problems from being questions about Alloy specifications to being phrased as logic “puzzles”.
Provide an initial training phase to make sure workers understand what we’re after.
Follow that with an evaluation phase to ensure that they “got the idea”. Only consider responses from those workers who score at a high enough threshold on evaluation.
Only then conduct the actual study.

Observe that even if we don’t want to trust the final results obtained from crowdsourcing, there are still uses for this process. Designing a good study requires several rounds of prototyping: even simple wording choices can have huge and unforeseen (negative) consequences. The more rounds we get to test a study, the better it will come out. Therefore, the crowd is useful at least to prototype and refine a study before unleashing it on a more qualified, harder-to-find audience — a group that, almost by definition, you do not want to waste on a first-round study prototype.

For more information, see our paper. We find fairly useful results using workers on Mechanical Turk. In many cases the findings there correspond with those we found with class students.

User Studies of Principled Model Finder Output

Tags: Crowdsourcing, Formal Methods, User Studies, Verification, Visualization

Posted on 01 July 2017.

For decades, formal-methods tools have largely been evaluated on their correctness, completeness, and mathematical foundations while side-stepping or hand-waving questions of usability. As a result, tools like model checkers, model finders, and proof assistants can require years of expertise to negotiate, leaving knowledgeable but uninitiated potential users at a loss. This state of affairs must change!

One class of formal tool, model finders, provides concrete instances of a specification, which can guide a user’s intuition or witness the failure of desired properties. But are the examples produced actually helpful? Which examples ought to be shown first? How should they be presented, and what supplementary information can aid comprehension? Indeed, could they even hinder understanding?

We’ve set out to answer these questions via disciplined user-studies. Where can we find participants for these studies? Ideally, we would survey experts. Unfortunately, it has been challenging to do so in the quantities needed for statistical power. As an alternative, we have begun to use formal methods students in Brown’s upper-level Logic for Systems class. The course begins with Alloy, a popular model-finding tool, so students are well suited to participate in basic studies. With this population, we have found some surprising results that call into question some intuitively appealing answers to (e.g.) the example-selection question.

For more information, see our paper.

Okay, that’s student populations. But there are only so many students in a class, and they take the class only so often, and it’s hard to “rewind” them to an earlier point in a course. Are there audiences we can use that don’t have these problems? Stay tuned for our next post.

Tierless Programming for SDNs: Differential Analysis

Tags: Differential Analysis, Flowlog, Programming Languages, Software-Defined Networking, Verification

Posted on 02 June 2015.

This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

Verification is a powerful way to make sure a program meets expectations, but what if those expectations aren't written down, or the user lacks the expertise to write formal properties? Flowlog supports a powerful form of property-free analysis: program differencing.

When we make a program change, usually we're starting from a version that "works". We'd like to transfer what confidence we had in the original version to the new version, plus confirm our intuition about the changes. In other words, even if the original program had bugs, we'd like to at least confirm that the edit doesn't introduce any new ones.

Of course, taking the syntactic difference of two programs is easy — just use diff! — but usually that's not good enough. What we want is the behavioral, or semantic difference. Flowlog provides semantic differencing via Alloy, similarly to how it does property checking. We call Flowlog's differencing engine Chimp (short for Change-impact).

Differences in Output and State Transitions

Chimp translates both the old (prog1) and new (prog2) versions to Alloy, then supports asking questions like: Will the two versions ever handle packets differently? More generally, we can ask Chimp whether the program's output behavior ever differs: does there exist some program state and input event such that, in that state, the two programs will disagree on output?

pred changePolicyOut[st: State, ev: Event] {
  some out: Event |
    prog1/outpolicy[st,ev,out] && not prog2/outpolicy[st,ev,out] ||
    prog2/outpolicy[st,ev,out] && not prog1/outpolicy[st,ev,out]
}

Any time one program issues an output event that the other doesn't, Chimp displays an Alloy scenario.
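The same question can be posed by brute force over a tiny universe. The sketch below is not Chimp and does not use Alloy. It assumes a made-up pair of programs, modeled as Python functions from a state and an event to a set of output ports, and enumerates every state/event pair looking for disagreement.

```python
from itertools import combinations, product

PORTS = (1, 2, 3)
MACS = ("a", "b")


def all_subsets(items):
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


STATES = all_subsets(MACS)           # hypothetical state: a set of blocked sources
EVENTS = list(product(MACS, PORTS))  # hypothetical event: (source, ingress port)


def prog1(state, event):
    """Old version: flood on every port except the ingress port."""
    _src, in_port = event
    return {p for p in PORTS if p != in_port}


def prog2(state, event):
    """New version: same flooding, but drop packets from blocked sources."""
    src, in_port = event
    if src in state:
        return set()
    return {p for p in PORTS if p != in_port}


def change_policy_out(old, new, states, events):
    """All (state, event) pairs where one program emits an output the other does not."""
    return [
        (st, ev)
        for st, ev in product(states, events)
        if old(st, ev) ^ new(st, ev)  # non-empty symmetric difference
    ]


witnesses = change_policy_out(prog1, prog2, STATES, EVENTS)
```

Every witness has the packet's source in the blocked set, which matches the intent of the edit. A witness outside that set would point to an unintended change.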

We might also ask: When can the programs change state differently? Similarly to changePolicyOut above, Chimp defines changeStateTransition[st: State, ev: Event] as matching any of the following for each table T in the program:

some x0, ..., xn: univ |
  prog1/afterEvent_T[st, ev, x0, ..., xn] &&
    not prog2/afterEvent_T[st, ev, x0, ..., xn] ||
  prog2/afterEvent_T[st, ev, x0, ..., xn] &&
    not prog1/afterEvent_T[st, ev, x0, ..., xn]

Recall that afterEvent_T keeps track of when each tuple is in the table T after an event is processed.

Refining Differential Analysis

The two predicates above are both built into Chimp. Using them as a starting point, users can ask pointed questions about the effects of the change. For instance, will any TCP packets be handled differently? Just search for a pre-state and a TCP packet that the programs disagree on:

some prestate: State, p: EVtcp_packet |
  changePolicyOut[prestate, p]

This lets users explore the consequences of their change without any formal guidance except their intuition about what the change should do.

Reachability

So far, these queries show scenarios where the programs differ, taking into consideration all potential inputs and starting states; this includes potentially unreachable starting states. We could, for instance, have two programs that behave differently if a table is populated (resulting in a non-empty semantic diff!) yet never actually insert rows into that table. Chimp provides optional reachability-checking to counter this, although users must cap the length of system traces being searched.
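A bounded reachability check can be sketched as a breadth-first search from the empty starting state, capped at a fixed number of events. This is an illustration only: the state representation and both transition functions below are invented for the example, and Chimp itself performs the check in Alloy.

```python
from itertools import product

MACS = ("a", "b")
EVENTS = list(product(MACS, MACS))  # hypothetical event: (source, destination)


def reachable_states(initial, events, step, max_depth):
    """States reachable from `initial` using at most `max_depth` events."""
    seen = {initial}
    frontier = [initial]
    for _ in range(max_depth):
        next_frontier = []
        for state in frontier:
            for event in events:
                successor = step(state, event)
                if successor not in seen:
                    seen.add(successor)
                    next_frontier.append(successor)
        frontier = next_frontier
    return seen


def step_never_inserts(state, event):
    """A program that reads its table but never populates it."""
    return state


def step_records_source(state, event):
    """A program that inserts each packet's source into its table."""
    return state | {event[0]}


EMPTY = frozenset()
only_empty = reachable_states(EMPTY, EVENTS, step_never_inserts, 3)
within_one = reachable_states(EMPTY, EVENTS, step_records_source, 1)
```

For the program that never inserts, only the empty state is reachable, so any difference that requires a populated table can be discarded. The depth cap is the price: a state that takes more events to reach than the cap allows is silently excluded.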

Schema Clashes

Suppose that we want to modify the original source-tracking example to keep track of flows by source and destination, rather than just source addresses. Now instead of one column:

TABLE seen(macaddr);

the seen table has two columns:

TABLE seen(macaddr, macaddr);

This poses a challenge for Chimp; what shape should the seen table be? If Chimp finds a scenario, should it show a seen table with one or two columns? We call this situation a schema clash, and Chimp addresses it by creating two separate tables in the prestate: one with one column (used by the first program) and another with two columns (used by the second program).

Doing this causes a new issue: Chimp searches for arbitrary states that satisfy the change-impact predicates. Since there is no constraint between the values of the two tables, Chimp might return a scenario where (say) the first seen table is empty, but the second contains tuples!

This doesn't match our intuition for the change: we expect that for every source in the first table, there is a source-destination pair in the second table, and vice versa. We can add this constraint to Chimp and filter the scenarios it shows us, but first, we should ask whether that constraint actually reflects the behavior of the two programs.

Differential Properties

Since it's based on Flowlog's verification framework, Chimp allows us to check properties stated over multiple programs. Our expectation above, stated in Alloy for an arbitrary state s, is:

all src: Macaddr |
  src in s.seen_1
  iff
  some dst: Macaddr | src->dst in s.seen_2

Let's check that this condition holds for all reachable states. We'll proceed inductively. The condition holds trivially at the starting (empty) state; so we only need to show that it is preserved as the program transitions. We search for a counterexample:

some prestate: State, ev: Event | {
  // prestate satisfies the condition
  all src: Macaddr | src in prestate.seen_1 iff 
    some dst: Macaddr | src->dst in prestate.seen_2
	
  // poststate does not
  some src: Macaddr | 
    (prog1/afterEvent_seen_1[prestate,ev,src] and 
     all dst: Macaddr | not prog2/afterEvent_seen_2[prestate,ev,src,dst])
    ||
    (not prog1/afterEvent_seen_1[prestate,ev,src] and 
     some dst: Macaddr | prog2/afterEvent_seen_2[prestate,ev,src,dst])
}

Chimp finds no counterexample. Unfortunately, Chimp can't guarantee that this isn't a false negative; the query falls outside the class where Chimp can guarantee a complete search. Nevertheless, the lack of a counterexample serves to increase our confidence that the change respects our intent.
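The shape of this inductive check can be reproduced by brute force over two addresses. The sketch below is not Chimp's search. It assumes simplified models of the two programs (the first inserts the packet's source, the second inserts the source-destination pair) and enumerates every prestate and event, looking for a step that breaks the condition. A deliberately broken variant is included to show what a counterexample looks like.

```python
from itertools import combinations, product

MACS = ("a", "b")
PAIRS = list(product(MACS, MACS))  # hypothetical packet event: (source, destination)


def all_subsets(items):
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def condition(seen_1, seen_2):
    """src is in seen_1 iff some (src, dst) pair is in seen_2."""
    return all(
        (src in seen_1) == any((src, dst) in seen_2 for dst in MACS)
        for src in MACS
    )


def prog1_after(seen_1, event):
    src, _dst = event
    return seen_1 | {src}


def prog2_after(seen_2, event):
    return seen_2 | {event}


def prog2_after_broken(seen_2, event):
    # Assumed bug for illustration: ignores packets sent to oneself.
    src, dst = event
    return seen_2 if src == dst else seen_2 | {event}


def find_counterexample(after_1, after_2):
    """A prestate satisfying the condition whose poststate does not, or None."""
    for seen_1, seen_2, event in product(all_subsets(MACS), all_subsets(PAIRS), PAIRS):
        if condition(seen_1, seen_2) and not condition(
            after_1(seen_1, event), after_2(seen_2, event)
        ):
            return seen_1, seen_2, event
    return None
```

With the faithful models, `find_counterexample(prog1_after, prog2_after)` returns `None`. With the broken variant it returns a concrete prestate and packet. Unlike Chimp's query, this search is exhaustive only for the two-address universe it enumerates.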

After adding the constraint that, for every source in the first table, there is a source-destination pair in the second table, Chimp shows us that the new program will change the state (to add a new destination) even if the source is already in seen.

Tierless Programming for SDNs: Verification

Tags: Flowlog, Programming Languages, Software-Defined Networking, Verification

Posted on 17 April 2015.

The last post said what it means for Flowlog's compiler to be optimal, which prevents certain bugs from ever occurring. But what about the program itself? Flowlog has built-in features to help verify program correctness, independent of how the network is set up.

To see Flowlog's program analysis in action, let's first expand our watchlist program a bit more. Before, we just flooded packets for demo purposes:

DO forward(new) WHERE
    new.locPt != p.locPt;

Now we'll do something a bit smarter. We'll make the program learn which ports lead to which hosts, and use that knowledge to avoid flooding when possible (this is often called a "learning switch"):

TABLE learned(switchid, portid, macaddr);
ON packet(pkt):
  INSERT (pkt.locSw, pkt.locPt, pkt.dlSrc) INTO learned;

  DO forward(new) WHERE
    learned(pkt.locSw, new.locPt, pkt.dlDst)
    OR
    (NOT learned(pkt.locSw, ANY, pkt.dlDst) AND
     pkt.locPt != new.locPt);

The learned table stores knowledge about where addresses have been seen on the network. If a packet arrives with a destination the switch has seen before as a source, there's no need to flood! While this program is still fairly naive (it will fail if the network has cycles in it) it's complex enough to have a few interesting properties we'd like to check. For instance, if the learned table ever holds multiple ports for the same switch and address, the program will end up sending multiple copies of the same packet. But can the program ever end up in such a state? Since the initial, startup state is empty, this amounts to checking: "Can the program ever transition from a valid state (i.e., one where every switch and address has at most one port in learned) into an invalid one?"

Verifying Flowlog

Each Flowlog rule defines part of an event-handling function saying how the system should react to each packet seen. Rules compile to logical implications that Flowlog's runtime interprets whenever a packet arrives.

Alloy is a tool for analyzing relational logic specifications. Since Flowlog rules compile to logic, it's easy to describe in Alloy how Flowlog programs behave. In fact, Flowlog can automatically generate Alloy specifications that describe when and how the program takes actions or changes its state.

For example, omitting some Alloy-language foibles for clarity, here's how Flowlog describes our program's forwarding behavior in Alloy.

pred forward[st: State, p: EVpacket, new: EVpacket] {
  // Case 1: known destination
  ((p.locSw->new.locPt->p.dlDst) in learned and
   (p.locSw->new.locPt) in switchHasPort and ...)
  or
  // Case 2: unknown destination
  (all apt: Portid | (p.locSw->apt->p.dlDst) not in learned and
   new.locPt != p.locPt and 
   (p.locSw->new.locPt) in switchHasPort and ...)
}

An Alloy predicate is either true or false for a given input. This one says whether, in a given state st, an arbitrary packet p will be forwarded as a new packet new (containing the output port and any header modifications). It combines both forwarding rules together to construct a logical definition of forwarding behavior, rather than just a one-way implication (as in the case of individual rules).

The automatically generated specification also contains other predicates that say how and when the controller's state will change. For instance, afterEvent_learned, which says when a given entry will be present in learned after the controller processes a packet. An afterEvent predicate is automatically defined for every state table in the program.

Using afterEvent_learned, we can verify our goal: that whenever an event ev arrives, the program will never add a second entry (sw, pt2, addr) to learned:

assert FunctionalLearned {
  all pre: State, ev: Event |
    all sw: Switchid, addr: Macaddr, pt1, pt2: Portid |
      (not (sw->pt1->addr in pre.learned) or 
       not (sw->pt2->addr in pre.learned)) and
      afterEvent_learned[pre, ev, sw, pt1, addr] and
      afterEvent_learned[pre, ev, sw, pt2, addr] implies pt1 = pt2
}

Alloy finds a counterexample scenario (in under a second):

The scenario shows an arbitrary packet (L/EVpacket; the L/ prefix can be ignored) arriving at port 1 (L/Portid1) on an arbitrary switch (L/Switchid). The packet has the same source and destination MAC address (L/Macaddr). Before the packet arrived, the controller state had a single row in its learned table; it had previously learned that L/Macaddr can be reached out port 0 (L/Portid0). Since the packet is from the same address, but a different port, it will cause the controller to add a second row to its learned table, violating our property.

This situation isn't unusual if hosts are mobile, like laptops on a campus network are. To fix this issue, we add a rule that removes obsolete mappings from the table:

DELETE (pkt.locSw, pt, pkt.dlSrc) FROM learned WHERE
  not pt = pkt.locPt;

Alloy confirms that the property holds of the modified program. We now know that any reachable state of our program is valid.
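To see the same check outside Alloy, here is a small brute-force sketch. It models the learned table as a set of (switch, port, address) rows and the two versions of the program as Python functions. The bounds (one switch, two ports, two addresses) and the function names are assumptions for illustration, not Flowlog output.

```python
from itertools import combinations, product

SWITCHES = ("s1",)
PORTS = (0, 1)
MACS = ("m1", "m2")
ROWS = list(product(SWITCHES, PORTS, MACS))  # every possible row of `learned`


def is_valid(learned):
    """Every (switch, address) pair has at most one port."""
    keys = [(sw, mac) for sw, _pt, mac in learned]
    return len(keys) == len(set(keys))


def step_original(learned, pkt):
    """INSERT only: record where the packet's source was seen."""
    sw, pt, src = pkt
    return learned | {(sw, pt, src)}


def step_with_delete(learned, pkt):
    """INSERT plus the DELETE rule: drop rows for this source on other ports."""
    sw, pt, src = pkt
    kept = {
        row for row in learned
        if not (row[0] == sw and row[2] == src and row[1] != pt)
    }
    return frozenset(kept | {(sw, pt, src)})


def counterexamples(step):
    """Valid prestates and packets whose poststate is invalid."""
    found = []
    for size in range(len(ROWS) + 1):
        for combo in combinations(ROWS, size):
            pre = frozenset(combo)
            if not is_valid(pre):
                continue
            # Only (switch, ingress port, source) affects the table,
            # so packets range over the same triples as rows.
            for pkt in ROWS:
                if not is_valid(step(pre, pkt)):
                    found.append((pre, pkt))
    return found
```

`counterexamples(step_original)` includes the scenario described above (an address learned on port 0, then seen on port 1), while `counterexamples(step_with_delete)` is empty within these bounds.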

Verification Completeness

Alloy does bounded verification: along with properties to check, we provide concrete bounds for each datatype. We might say to check up to 3 switches, 4 IP addresses, and so on. So although Alloy never returns a false positive, it can in general produce false negatives, because it searches for counterexamples only up to the given bounds. Fortunately, for many useful properties, we can compute and assert a sufficient bound. In the property we checked above, a counterexample needs only 1 State (to represent the program's pre-state) and 1 Event (the arriving packet), plus room for its contents (2 Macaddrs for source and destination, etc.), along with 1 Switchid, 2 Portids and 1 Macaddr to cover the possibly-conflicted entries in the state. So when Alloy says that the new program satisfies our property, we can trust it.

Benefits of Tierlessness

Suppose we enhanced the POX version of this program (Part 1) to learn ports in the same way, and then wanted to check the same property. Since the POX program explicitly manages flow-table rules, and the property involves a mixture of packet-handling (what is sent up to the controller?) and controller logic (how is the state updated?), checking the POX program would mean accounting for those rules and how the controller updates them over time. This isn't necessary for the Flowlog version, because rule updates are all handled optimally by the runtime. This means that property checking is simpler: there's no multi-tiered model of rule updates, just a model of the program's behavior.

You can read more about Flowlog's analysis support in our paper.

In the next post, we'll finish up this sequence on Flowlog by reasoning about behavioral differences between multiple versions of the same program.

Tierless Programming for SDNs: Optimality

Tags: Flowlog, Programming Languages, Software-Defined Networking, Verification

Posted on 13 April 2015.

Since packets can trigger controller-state updates and event output, you might wonder exactly which packets a Flowlog controller needs to see. For instance, a packet without a source in the watchlist will never alter the controller's state. Does such a packet need to grace the controller at all? The answer is no. In fact, there are only three conditions under which switch rules do not suffice, and the controller must be involved in packet-handling:

when the packet will cause a change in controller state;
when the packet will cause the controller to send an event; and
when the packet must be modified in ways that OpenFlow 1.0 does not support on switches.

Flowlog's compiler ensures the controller sees packets if and only if one of these holds; the compiler is therefore optimal with respect to this list. To achieve this, the compiler analyzes every packet-triggered statement in the program. For instance, the INSERT statement above will only change the state for packets with a source in the watchlist (a condition made explicit in the WHERE clause) and without a source in the seen table (implicit in Flowlog's logical semantics for INSERT). Only if both of these conditions are met will the controller see a packet. An optimal compiler prevents certain kinds of bugs from occurring: the controller program will never miss packets that will affect its state, and it will never receive packets it doesn't need.
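For the watchlist program, the decision reduces to a simple test on the packet's source. The sketch below expresses that test in Python. It is an illustration of the condition described above, not the compiler's analysis, and it assumes the program sends no events and needs no unsupported packet modifications, so only the state-change condition applies.

```python
def controller_must_see(src, watchlist, seen):
    """True when the INSERT into `seen` would change the controller's state."""
    on_watchlist = src in watchlist  # explicit WHERE clause
    not_yet_seen = src not in seen   # implicit in INSERT: re-inserting is a no-op
    return on_watchlist and not_yet_seen


def split_packets(sources, watchlist, seen):
    """Partition packet sources into controller-bound and switch-only."""
    to_controller = [s for s in sources if controller_must_see(s, watchlist, seen)]
    switch_only = [s for s in sources if not controller_must_see(s, watchlist, seen)]
    return to_controller, switch_only


to_controller, switch_only = split_packets(
    ["aa", "bb", "cc"], watchlist={"aa", "bb"}, seen={"bb"}
)
```

Here only `"aa"` goes to the controller: `"bb"` is watched but already recorded, and `"cc"` is not watched at all.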

You can read more about Flowlog in our paper.

In the next post, we'll look at Flowlog's built-in verification support.

Tierless Programming for SDNs: Events

Tags: Flowlog, Programming Languages, Software-Defined Networking, Verification

Posted on 01 March 2015.

This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

The last post introduced Flowlog, a tierless language for SDN controller programming. You might be wondering, "What can I write in Flowlog? How expressive is it?" To support both its proactive compiler and automated program analysis (more on this in the next post) we deliberately limited Flowlog's expressive power. There are no loops in the language, and no recursion.

Instead of trying to be universally expressive, Flowlog embraces the fact that most programs don't run in a vacuum. A controller may need to interact with other services, and developers may wish to re-use pre-existing code. To enable this, Flowlog programs can call out to non-Flowlog libraries. The runtime uses standard RPCs (Thrift) for inter-process communication, so existing programs can be quickly wrapped to communicate with Flowlog. Much like how Flowlog abstracts out switch-rule updates, it also hides the details of inter-process communication.

To see this, let's enhance the address-logger application with a watch-list that external programs can add to. We need a new table ("watchlist"), populated by arriving "watchplease" events. Finally, we make sure only watched addresses are logged:

TABLE seen(macaddr);
TABLE watchlist(macaddr);
EVENT watchplease = {target: macaddr};

ON watchplease(w):
  INSERT (w.target) INTO watchlist;

ON packet(p):
  INSERT (p.dlSrc) INTO seen WHERE
    watchlist(p.dlSrc);
  DO forward(new) WHERE
    new.locPt != p.locPt;

When the program receives a watchplease event (sent via RPC from an external program) it adds the appropriate address to its watchlist.
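For readers more at home in Python, the following toy model mimics what the program above does with its two tables. It is only a sketch of the program's observable behavior, not of how Flowlog executes it; the class name, method names, and the explicit list of ports are assumptions for the example.

```python
class AddressLogger:
    """Toy model of the Flowlog program's two tables and two event handlers."""

    def __init__(self):
        self.seen = set()       # TABLE seen(macaddr)
        self.watchlist = set()  # TABLE watchlist(macaddr)

    def on_watchplease(self, target):
        # ON watchplease(w): INSERT (w.target) INTO watchlist;
        self.watchlist.add(target)

    def on_packet(self, dl_src, in_port, all_ports):
        # INSERT (p.dlSrc) INTO seen WHERE watchlist(p.dlSrc);
        if dl_src in self.watchlist:
            self.seen.add(dl_src)
        # DO forward(new) WHERE new.locPt != p.locPt;  (flood to every other port)
        return [port for port in all_ports if port != in_port]


app = AddressLogger()
first_out = app.on_packet("aa:aa", in_port=1, all_ports=[1, 2, 3])
seen_before = set(app.seen)   # still empty: "aa:aa" was not yet watched
app.on_watchplease("aa:aa")
app.on_packet("aa:aa", in_port=1, all_ports=[1, 2, 3])
```

A packet is always flooded, but its source is logged only once a watchplease event has put that address on the watchlist.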

Sending Events

Flowlog programs can also send events. Suppose we want to notify some other process when a watchlisted address is seen, and the process is listening on TCP port 20000. We just declare a named pipe that carries notifications to that port:

EVENT sawaddress = {addr: macaddr};
OUTGOING sendaddress(sawaddress) THEN
  SEND TO 127.0.0.1:20000;

and then write a notification to that pipe for appropriate packets:

ON packet(p) WHERE watchlist(p.dlSrc):
  DO sendaddress(s) WHERE s.addr = p.dlSrc;

Synchronous Communication

The event system supports asynchronous communication, but Flowlog also allows synchronous queries to external programs. It does this with a remote state abstraction. If we wanted to manage the watchlist remotely, rather than writing

TABLE watchlist(macaddr);

we would write:

REMOTE TABLE watchlist(macaddr)
  FROM watchlist AT 127.0.0.1 20000
  TIMEOUT 10 seconds;

which tells Flowlog it can obtain the current list by sending queries to port 20000. Since these queries are managed behind the scenes, the program doesn't need to change—as far as the programmer is concerned, a table is a table. Finally, the timeout says that Flowlog can cache prior results for 10 seconds.
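The caching behavior implied by the timeout can be sketched in a few lines of Python. This is an illustrative model only: the class, the fetch callback (standing in for the remote query), and the injectable clock are all assumptions, and Flowlog's actual caching may differ in granularity.

```python
import time


class CachedRemoteTable:
    """A table whose rows come from a remote query, cached for a fixed time."""

    def __init__(self, fetch, timeout_seconds=10.0, clock=time.monotonic):
        self._fetch = fetch              # hypothetical stand-in for the RPC query
        self._timeout = timeout_seconds
        self._clock = clock              # injectable so the cache can be tested
        self._rows = None
        self._fetched_at = None

    def contains(self, row):
        now = self._clock()
        stale = self._fetched_at is None or now - self._fetched_at >= self._timeout
        if stale:
            # Cached result is missing or too old: ask the remote side again.
            self._rows = set(self._fetch())
            self._fetched_at = now
        return row in self._rows


calls = []
fake_time = [0.0]

def fetch_watchlist():
    calls.append("query")
    return ["aa:aa"]

table = CachedRemoteTable(fetch_watchlist, 10.0, clock=lambda: fake_time[0])
hit = table.contains("aa:aa")
miss = table.contains("bb:bb")      # answered from the cache, no new query
queries_within_timeout = len(calls)
fake_time[0] = 10.0                 # the cached result has now expired
table.contains("aa:aa")
queries_after_timeout = len(calls)
```

The caller only ever asks whether a row is in the table; whether that answer came from the cache or from a fresh query stays hidden, which mirrors the "a table is a table" view above.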

Interfacing External Programs with Flowlog

Flowlog can interface with code in any language that supports Thrift RPC (including C++, Java, OCaml, and many others). To interact with Flowlog, one only needs to implement the interface Flowlog requires: a function that accepts notifications and a function that responds to queries. Other functions may also (optionally) send notifications. Thrift's library handles the rest.

You can read more about Flowlog's events in our paper.

In the next post, we'll look at what it means for Flowlog's compiler to be optimal.

Tierless Programming for Software-Defined Networks

Tags: Flowlog, Programming Languages, Software-Defined Networking, Verification
Posted on 30 September 2014.
This post is part of our series about tierless network programming with Flowlog:
Part 1: Tierless Programming
Part 2: Interfacing with External Events
Part 3: Optimality
Part 4: Verification
Part 5: Differential Analysis

Network devices like switches and routers update their behavior in real-time. For instance, a router may change how it forwards traffic to address an outage or congestion. In a traditional network, devices use distributed protocols to decide on mutually consistent behavior, but Software-Defined Networks (SDN) operate differently. Switches are no longer fully autonomous agents, but instead receive instructions from logically centralized controller applications running on separate hardware. Since these applications can be arbitrary programs, SDN operators gain tremendous flexibility in customizing their network.

The most popular SDN standard in current use is OpenFlow. With OpenFlow, controller applications install persistent forwarding rules on the switches that match on packet header fields and list actions to take on a match. These actions can include header modifications, forwarding, and even sending packets to the controller for further evaluation. When a packet arrives without a matching rule installed, the switch defaults to sending the packet to the controller for instructions.

Let's write a small controller application. It should (1) record the addresses of machines sending packets on the network and (2) cause each switch to forward traffic by flooding (i.e., sending out on all ports except the arrival port). This is simple enough to write in POX, a controller platform for Python. The core of this program is a function that reacts to packets as they arrive at the controller (we have removed some boilerplate and initialization):

def _handle_PacketIn (self, event):
  packet = event.parsed

  def install_nomore ():
    msg = of.ofp_flow_mod()
    msg.match = of.ofp_match(dl_src = packet.src)
    msg.buffer_id = event.ofp.buffer_id
    msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
    self.connection.send(msg)

  def do_flood ():
    msg = of.ofp_packet_out()
    msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
    msg.data = event.ofp
    msg.buffer_id = None
    msg.in_port = event.port
    self.connection.send(msg)

  self.seenTable.add(packet.src)
  install_nomore()
  do_flood()

First, the controller records the packet's source in its internal table. Next, the install_nomore function adds a rule to the switch saying that packets with this source should be flooded. Once the rule is installed, the switch will not send packets with the same source to the controller again. Finally, the do_flood function sends a reply telling the switch to flood the packet.
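The interaction just described can be simulated without any networking library. The sketch below is a toy model, not POX or OpenFlow code: the ToySwitch and ToyController classes and their fields are invented for illustration, and rules match on the source address only.

```python
FLOOD = "flood"


class ToyController:
    def __init__(self):
        self.seen_table = set()
        self.packet_ins = 0          # how many packets reached the controller

    def handle_packet_in(self, switch, src):
        self.packet_ins += 1
        self.seen_table.add(src)     # record the source
        switch.rules[src] = FLOOD    # plays the role of install_nomore
        switch.flooded.append(src)   # plays the role of do_flood


class ToySwitch:
    def __init__(self, controller):
        self.rules = {}              # source address -> action
        self.controller = controller
        self.flooded = []            # source of every packet that was flooded

    def receive(self, src):
        if src in self.rules:
            # A matching rule exists: handle locally, the controller sees nothing.
            self.flooded.append(src)
        else:
            # No matching rule: default to asking the controller.
            self.controller.handle_packet_in(self, src)


controller = ToyController()
switch = ToySwitch(controller)
for source in ["a", "a", "b"]:
    switch.receive(source)
```

All three packets are flooded, but the controller is consulted only for the first packet from each source.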

This style of programming may remind you of the standard three-tier web-programming architecture. Much like a web program generates JavaScript or SQL strings, controller programs produce new switch rules in real-time. One major difference is that switch rules are much less expressive than JavaScript, which means that less computation can be delegated to the switches. A bug in a controller program can throw the entire network's behavior off. But it's easy to introduce bugs when every program produces switch rules in real-time, effectively requiring its own mini-compiler!

SDN Programming Without Tiers

We've been working on a tierless language for SDN controllers: Flowlog. In Flowlog, you write programs as if the controller sees every packet, and never have to worry about the underlying switch rules. This means that some common bugs in controller/switch interaction can never occur, and it also means that the programming experience is simpler. In Flowlog, our single-switch address-monitoring program is just:

TABLE seen(macaddr);
ON ip_packet(p):
  INSERT (p.dlSrc) INTO seen;
  DO forward(new) WHERE new.locPt != p.locPt;

The first line declares a one-column database table, "seen". Line 2 says that the following two lines are triggered by IP packets. Line 3 adds those packets' source addresses to the table, and line 4 sends the packets out all other ports.

As soon as this program runs, the Flowlog runtime proactively installs switch rules to match the current controller state and automatically ensures consistency. As the controller sees more addresses, the switch sends fewer packets back to the controller—but this is entirely transparent to the programmer, whose job is simplified by the abstraction of an all-seeing controller.

Examples and Further Reading

Flowlog is good for more than just toy examples. We've used Flowlog for many different network applications: ARP-caching, network address translation, and even mediating discovery and content-streaming for devices like Apple TVs. You can read more about Flowlog and Flowlog applications in our paper.

The next post talks more about what you can use Flowlog to write, and also shows how Flowlog allows programs to call out to external libraries in other languages.

Verifying Extensions' Compliance with Firefox's Private Browsing Mode

Tags: Browsers, JavaScript, Programming Languages, Security, Types, Verification
Posted on 19 August 2013.
All modern browsers now support a “private browsing mode”, in which the browser ostensibly leaves behind no traces on the user's file system of the user's browsing session. This is quite subtle: browsers have to handle caches, cookies, preferences, bookmarks, deliberately downloaded files, and more. So browser vendors have invested considerable engineering effort to ensure they have implemented it correctly.

Firefox, however, supports extensions, which allow third party code to run with all the privilege of the browser itself. What happens to the security guarantee of private browsing mode, then?

The current approach

Currently, Mozilla curates the collection of extensions, and any extension must pass through a manual code review to flag potentially privacy-violating behaviors. This is a daunting and tedious task. Firefox contains well over 1,400 APIs, of which about twenty are obviously relevant to private-browsing mode, and another forty or so are less obviously relevant. (It depends heavily on exactly what we mean by the privacy guarantee of “no traces left behind”: surely the browser should not leave files in its cache, but should it let users explicitly download and save a file? What about adding or deleting bookmarks?) And, if the APIs or definition of private-browsing policy ever change, this audit must be redone for each of the thousands of extensions.

The asymmetry in this situation should be obvious: Mozilla auditors should not have to reconstruct how each extension works; it should be the extension developers' responsibility to convince the auditor that their code complies with private-browsing guarantees. After all, they wrote the code! Moreover, since auditors are fallible people, too, we should look to (semi-)automated tools to lower their reviewing effort.

Our approach

So what property, ultimately, do we need to confirm about an extension's code to ensure its compliance? Consider the pseudo-code below, which saves the current preferences to disk every few minutes:

var prefsObj = ...
const thePrefsFile = "...";
function autoSavePreferences() {
  if (inPrivateBrowsingMode()) {
    // ...must store data only in memory...
    return;
  } else {
    // ...allowed to save data to disk...
    var file = openFile(thePrefsFile);
    file.write(prefsObj.toString());
  }
}
window.setTimeout(autoSavePreferences, 3000);

The key observation is that this code really defines two programs that happen to share the same source code: one program runs when the browser is in private browsing mode, and the other runs when it isn't. And we simply do not care about one of those programs, because extensions can do whatever they'd like when not in private-browsing mode. So all we have to do is “disentangle” the two programs somehow, and confirm that the private-browsing version does not contain any file I/O.

Technical insight

Our tool of choice for this purpose is a type system for JavaScript. We've used such a system before to analyze the security of the ADsafe sandbox. The type system is quite sophisticated to handle JavaScript idioms precisely, but for our purposes here we need only part of its expressive power. We need three pieces: first, three new types; second, specialized typing rules; and third, an appropriate type environment.

We define one new primitive type: Unsafe. We will ascribe this type to all the privacy-relevant APIs.

We use union types to define Ext, the type of “all private-browsing-safe extensions”, namely: numbers, strings, booleans, objects whose fields are Ext, and functions whose argument and return types are Ext. Notice that Unsafe “doesn’t fit” into Ext, so attempting to use an unsafe function, or pass it around in extension code, will result in a type error.

Instead of defining Bool as a primitive type, we will instead define True and False as primitive types, and define Bool as their union.

We'll also add two specialized typing rules:

If an expression has some union type, and only one component of that union actually typechecks, then we optimistically say that the expression typechecks even with the whole union type. This might seem very strange at first glance: surely, the expression 5("true") shouldn't typecheck? But remember, our goal is to prevent privacy violations, and the code above will simply crash; it will never write to disk. Accordingly, we permit this code in our type system.

We add special rules for typechecking if-expressions. When the condition typechecks at type True, we only check the then-branch; when the condition typechecks at type False, we only check the else-branch. (Otherwise, we check both branches as normal.)

Finally, we add the typing environment which initializes the whole system:

We give all the privacy-relevant APIs the type Unsafe.

We give the API inPrivateBrowsingMode() the type True. Remember: we just don't care what happens when it's false!

Put together, what do all these pieces achieve? Because Unsafe and Ext are disjoint from each other, we can safely segregate any code into two pieces that cannot communicate with each other. By carefully initializing the type environment, we make Unsafe precisely delineate the APIs that extensions should not use in private browsing mode. The typing rules for if-expressions plus the type for inPrivateBrowsingMode() amount to disentangling the two programs from each other: essentially, it implements dead-code elimination at type-checking time. Lastly, the rule about union types makes the system much easier for programmers to use, since they do not have to spend any effort satisfying the typechecker about properties other than this privacy guarantee.

In short, if a program passes our typechecker, then it must not call any privacy-violating APIs while in private-browsing mode, and hence is safe. No audit needed!
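The if-expression rule and the type environment can be illustrated with a very small checker. This is a sketch in Python over a made-up tuple-based mini-language, not the JavaScript type system itself: the expression forms, the PrivacyTypeError name, and the isWeekend API are assumptions, and union types and Ext are collapsed into a single "Ext" result.

```python
class PrivacyTypeError(Exception):
    """Raised when code reachable in private-browsing mode uses an Unsafe API."""


def check(expr, env):
    kind = expr[0]
    if kind == "lit":
        return "Ext"
    if kind == "call":
        api_type = env[expr[1]]
        if api_type == "Unsafe":
            raise PrivacyTypeError(f"{expr[1]} is Unsafe in private-browsing mode")
        return api_type
    if kind == "if":
        _, cond, then_branch, else_branch = expr
        cond_type = check(cond, env)
        if cond_type == "True":
            return check(then_branch, env)   # the else-branch is never checked
        if cond_type == "False":
            return check(else_branch, env)   # the then-branch is never checked
        check(then_branch, env)              # otherwise, check both as normal
        check(else_branch, env)
        return "Ext"
    raise ValueError(f"unknown expression kind: {kind}")


env = {
    "inPrivateBrowsingMode": "True",  # we only care about the private-mode program
    "openFile": "Unsafe",             # a privacy-relevant API
    "isWeekend": "Bool",              # hypothetical API with an ordinary boolean result
}

# if (inPrivateBrowsingMode()) { in-memory only } else { openFile(...) }
auto_save = ("if", ("call", "inPrivateBrowsingMode"), ("lit", None), ("call", "openFile"))
# The same shape, but guarded by a condition the checker knows nothing about.
weekend_save = ("if", ("call", "isWeekend"), ("lit", None), ("call", "openFile"))
```

Checking auto_save succeeds because the branch containing openFile is skipped, while checking weekend_save raises PrivacyTypeError because both branches must be checked.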

Wait, what about exceptions to the policy?

Sometimes, extensions have good reasons for writing to disk even while in private-browsing mode. Perhaps they're updating their anti-phishing blacklists, or they implement a download-helper that saves a file the user asked for, or they are a bookmark manager. In such cases, there simply is no way for the code to typecheck. As in any type system, we provide a mechanism to break out of the type system: an unchecked typecast. We currently write such casts as cheat(T). Such casts must be checked by a human auditor: they are explicitly marking the places where the extension is doing something unusual that must be confirmed.

(In our original version, we took our cue from Haskell and wrote such casts as unsafePerformReview, but sadly that is tediously long to write.)

But does it work?

Yes.

We manually analyzed a dozen Firefox extensions that had already passed Mozilla's auditing process. We annotated the extensions with as few type annotations as possible, with the goal of forcing the code to pass the typechecker, cheating if necessary. These annotations found five extensions that violated the private-browsing policy: they could not be typechecked without using cheat, and the unavoidable uses of cheat pointed directly to where the extensions violated the policy.

Further reading

We've written up our full system, with more formal definitions of the types and worked examples of the annotations needed. The writeup also explains how we create the type environment in more detail, and what work is necessary to adapt this system to changes in the APIs or private-browsing implementation.

Aluminum: Principled Scenario Exploration Through Minimality

Tag: Verification
Posted on 13 May 2013.
Software artifacts are hard to get right. Not just programs, but protocols, configurations, etc. as well! Analysis tools can help mitigate the risk at every stage in the development process.

Tools designed for scenario-finding produce examples of how an artifact can behave in practice. Scenario-finders often allow the results to be targeted to a desired output or test a particular portion of the artifact—e.g., producing counterexamples that disprove an assumption—and so they are fundamentally different from most forms of testing or simulation. Even better, concrete scenarios appeal to the intuition of the developer, revealing corner-cases and potential bugs that may never have been considered.

Alloy

The Alloy Analyzer is a popular light-weight scenario-finding tool. Let's look at a small example: a (partial) specification of the Java type system that is included in the Alloy distribution. The explanations below each portion are quoted from comments in the example's source file.

abstract sig Type {subtypes: set Type}
sig Class, Interface extends Type {}
one sig Object extends Class {}
sig Instance {type: Class}
sig Variable {holds: lone Instance, type: Type}
Each type is either a class or an interface, and each has a set of subtypes. The Object class is unique. Every instance has a creation type. Each variable has a declared type and may (but need not) hold an instance.
fact TypeHierarchy {
  Type in Object.*subtypes
  no t: Type | t in t.^subtypes
  all t: Type | lone t.~subtypes & Class
}
fact TypeSoundness {
  all v: Variable | v.holds.type in v.type.*subtypes
}
These facts say that (1) every type is a direct or indirect subtype of Object; (2) no type is a direct or indirect subtype of itself; (3) every type is a subtype of at most one class; and (4) all instances held by a variable have types that are direct or indirect subtypes of the variable's declared type.
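To see what these four facts demand of a concrete scenario, here is a rough Python rendering that checks them against explicit sets and dictionaries. It is a hand translation for illustration, not output from Alloy; the function names and the sample scenario are assumptions.

```python
def reachable(start, edges, reflexive):
    """Nodes reachable from `start` via `edges` (a dict: node -> set of nodes)."""
    found = {start} if reflexive else set()
    frontier = list(edges.get(start, ()))
    while frontier:
        node = frontier.pop()
        if node not in found:
            found.add(node)
            frontier.extend(edges.get(node, ()))
    return found


def satisfies_facts(types, classes, subtypes, instance_type, holds, var_type):
    # (1) Type in Object.*subtypes
    if not types <= reachable("Object", subtypes, reflexive=True):
        return False
    # (2) no t: Type | t in t.^subtypes
    if any(t in reachable(t, subtypes, reflexive=False) for t in types):
        return False
    # (3) all t: Type | lone t.~subtypes & Class
    for t in types:
        class_parents = {p for p in classes if t in subtypes.get(p, ())}
        if len(class_parents) > 1:
            return False
    # (4) all v: Variable | v.holds.type in v.type.*subtypes
    for var, inst in holds.items():
        allowed = reachable(var_type[var], subtypes, reflexive=True)
        if instance_type[inst] not in allowed:
            return False
    return True


# A small scenario made up for this example.
types = {"Object", "Interface0", "Class0"}
classes = {"Object", "Class0"}
subtypes = {"Object": {"Interface0"}, "Interface0": {"Class0"}}
instance_type = {"Instance0": "Class0"}
holds = {"Variable0": "Instance0"}
var_type = {"Variable0": "Interface0"}

ok = satisfies_facts(types, classes, subtypes, instance_type, holds, var_type)
# Making Object a subtype of Class0 creates a cycle, which fact (2) forbids.
cyclic = dict(subtypes, Class0={"Object"})
bad = satisfies_facts(types, classes, cyclic, instance_type, holds, var_type)
```
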
Alloy will let us examine how the model can behave, up to user-provided constraints. We can tell Alloy to show us scenarios that also meet some additional constraints:

run {
  some Class - Object
  some Interface
  some Variable.type & Interface
} for 4
We read this as: "Find a scenario in which (1) there is a class distinct from Object; (2) there is some interface; and (3) some variable has a declared type that is an interface."
Alloy always checks for scenarios up to a finite bound on each top-level type—4 in this case. Here is the one that Alloy gives:

Alloy's first scenario

This scenario illustrates a possible instance of the Java type system. It's got one class and one interface definition; the interface extends Object (by default), and the class implements the interface. There are two instances of that class. Variable0 and Variable1 both hold Instance1, and Variable2 holds Instance0.

Alloy provides a Next button that will give us more examples, when they exist. If we keep clicking it, we get a parade of hundreds more:

Another Scenario

Yet Another Scenario

Still Another Scenario...

Even a small specification like this one, with a relatively small size bound (like 4), can yield many hundreds of scenarios. That's after Alloy has automatically ruled out lots of superfluous scenarios that would be isomorphic to those already seen. Scenario-overload means that the pathological examples--the ones that the user needs to see--may be buried beneath many normal ones.

In addition, the order in which scenarios appear is up to the internal SAT-solver that Alloy uses. There is no way for a user to direct which scenario they see next without stopping their exploration, returning to the specification, adding appropriate constraints, and starting the parade of scenarios afresh. What if we could reduce the hundreds of scenarios that Alloy gives down to just a few, each of which represented many other scenarios in a precise way? What if we could let a user's goals guide their exploration without forcing a restart? It turns out that we can.

Aluminum: Beyond the Next Button

We have created a modified version of Alloy that we call Aluminum. Aluminum produces fewer, more concise scenarios and gives users additional control over their exploration. Let's look at Aluminum running on the same example as before, versus the first scenario that Alloy returned:

For reference:
The 1st Scenario from Alloy

The 1st Scenario from Aluminum

Compared to the scenario that Alloy produced, this one is quite small. That's not all, however: Aluminum guarantees that it is minimal: nothing can be removed from it without violating the specification! Above and beyond just seeing a smaller concrete example, a minimal scenario embodies necessity and gives users a better sense of the dependencies in the specification.
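The notion of minimality can be illustrated by brute force on a tiny example. The Python sketch below is not how Aluminum works internally (the post does not describe its algorithm); it simply enumerates every scenario of a made-up specification and keeps those with no smaller satisfying scenario inside them. The atoms and the constraint are assumptions chosen for the example.

```python
from itertools import combinations


def all_scenarios(atoms, satisfies):
    """Yield every subset of `atoms` that satisfies the specification."""
    atoms = sorted(atoms)
    for size in range(len(atoms) + 1):
        for chosen in combinations(atoms, size):
            candidate = frozenset(chosen)
            if satisfies(candidate):
                yield candidate


def minimal_scenarios(atoms, satisfies):
    """Keep only scenarios with no satisfying scenario as a proper subset."""
    found = list(all_scenarios(atoms, satisfies))
    return [s for s in found if not any(other < s for other in found)]


# Hypothetical specification: "a" must be present, or both "b" and "c" must be.
def spec(scenario):
    return "a" in scenario or {"b", "c"} <= scenario


everything = list(all_scenarios({"a", "b", "c"}, spec))
minimal = minimal_scenarios({"a", "b", "c"}, spec)
```

Here five scenarios satisfy the specification but only two are minimal, {a} and {b, c}; each of the other three contains one of them.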

There are usually far fewer minimal scenarios than scenarios overall. If we keep clicking Next, we get:

The 2nd Minimal Scenario

The 3rd Minimal Scenario

Asking for a 4th scenario via Next results in an error. That's because in this case, there are only three minimal scenarios.

What if...?

Alloy's random, rich scenarios may contain useful information that minimal scenarios do not. For instance, the Alloy scenarios we saw allowed multiple variables to exist and classes to be instantiated. None of the minimal scenarios illustrate these possibilities.

Moreover, a torrent of scenarios--minimal or otherwise--doesn't encourage a user to really explore what can happen. After seeing a scenario, a user may ask whether various changes are possible. For instance, "Can I add another variable to this scenario?" More subtle (but just as important) is: "What happens if I add another variable?" Aluminum can answer that question.

To find out, we instruct Aluminum to try augmenting the starting scenario with a new variable. Aluminum produces a fresh series of scenarios, each of which illustrates a way that the variable can be added:

The Three Augmented Scenarios
(The newly added variable is Variable0.)

From these, we learn that the new variable must have a type. In fact, these new scenarios cover the three possible types that the variable can receive.

It's worth mentioning that the three augmented scenarios are actually minimal themselves, just over a more restricted space—the scenarios containing the original one, plus an added variable. This ensures that the consequences shown are all truly necessary.

More Information

To learn more about Aluminum, see our paper, watch our video, or download the tool.

ADsafety

Tags: Browsers, JavaScript, Programming Languages, Security, Types, Verification
Posted on 13 September 2011.

A mashup is a webpage that mixes and mashes content from various sources. Facebook apps, Google gadgets, and various websites with embedded maps are obvious examples of mashups. However, there is an even more pervasive use case of mashups on the Web. Any webpage that displays third-party ads is a mashup. It's well known that third-party content can include third-party cookies; your browser can even block these if you're concerned about "tracking cookies". However, third party content can also include third-party JavaScript that can do all sorts of wonderful and malicious things (just some examples).

Is it possible to safely embed untrusted JavaScript on a page? Google Caja, Microsoft Web Sandbox, and ADsafe are language-based Web sandboxes that try to do so. Language-based sandboxing is a programming language technique that restricts untrusted code using static and runtime checks and by rewriting potentially dangerous calls to safe, trusted functions.

Sandboxing JavaScript, with all its corner cases, is particularly hard. A single bug can easily break the entire sandboxing system. JavaScript sandboxes do not clearly state their intended guarantees, nor do they clearly argue why they are safe.

This is how ADsafe works.

Verifying Web Sandboxes

A year ago, we embarked on a project to verify ADsafe, Douglas Crockford's Web sandbox. ADsafe is admittedly the simplest of the aforementioned sandboxes. But, we were also after the shrimp bounty that Doug offers for sandbox-breaking bugs:
Write a program [...] that calls the alert function when run on any browser. If the program produces no errors when linted with the ADsafe option, then I will buy you a plate of shrimp. (link)
A year later, we've produced a USENIX Security paper on our work, which we presented in San Francisco in August. The paper discusses the many common techniques employed by Web sandboxes and discusses the intricacies of their implementations. (TLDR: JavaScript and the DOM are really hard.) Focusing on ADsafe, it precisely states what ADsafety actually means. The meat of the paper is our approach to verifying ADsafe using types. Our verification leverages our earlier work on semantics and types for JavaScript, and also introduces some new techniques:

Check out the ★s and ☠s in our object types; we use them to type-check "scripty" features of JavaScript. ☠ marks a field as "banned" and ★ specifies the type of all other fields.

We also characterize JSLint as a type-checker. The Widget type presented in the paper specifies, in 20 lines, the syntactic restrictions of JSLint's ADsafety checks.

Unlike conventional type systems, ours does not prevent runtime errors. After all, stuck programs are safe because they trivially don't execute any code. If you think type systems only catch "method not found" errors, you should have a look at ours.
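As a rough intuition for the ★ and ☠ annotations mentioned above, the following Python sketch models an object type as a set of named fields plus a default. The dictionary layout, the field names, and the type names are assumptions for illustration; the object types in the paper are considerably richer.

```python
BANNED = "☠"


def field_type(object_type, field):
    """Look up the type of `field`, honoring banned fields and the ★ default."""
    fields = object_type["fields"]
    if field in fields:
        declared = fields[field]
        if declared == BANNED:
            # ☠: any access to this field is a type error.
            raise TypeError(f"field {field!r} is banned")
        return declared
    # ★: the type of every field that is not listed explicitly.
    return object_type["star"]


# Hypothetical object type: one ordinary field, one banned field, ★ = Str.
widget = {"fields": {"length": "Num", "eval": BANNED}, "star": "Str"}
```

Looking up length gives Num, any unlisted field such as a computed name gives Str, and looking up eval is rejected outright.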

We found bugs in both ADsafe and JSLint that manifested as type errors. We reported all of them and they were promptly fixed by Doug Crockford. A big thank you to Doug for his encouragement, for answering our many questions, and for buying us every type of shrimp dish in the house.

Doug Crockford, Joe, Arjun, and seven shrimp dishes

Learn more about ADsafety! Check out:

The paper, code, and proofs;

Video of Arjun presenting at USENIX Security;

ADsafe and JSLint.