In computer science, a large number of students get help from teaching assistants (TAs). A great deal of their real education happens in these hours. While TA hours are an excellent resource, they are also rather opaque to the instructors, who do not really know what happens in them.
How do we construct a mechanism to study what happens in hours? It’s actually not obvious at all:
We could set up cameras to record all the interactions in hours. While this would provide a lot of information, it significantly changes the nature of hours. For many students, hours are private time with a TA, where they can freely speak about their discomfort and get help from a peer; they might ask personal questions; they might also complain about the instructor. One does not install cameras in confessionals.
We could ask TAs to write extensive notes (redacting private information) after the student has left. This also has various flaws:
Their memory may be faulty.
Their recollection may be biased by their own beliefs.
It would slow down processing students, who already confront overly-long lines and waits.
What do we instead want? A process that is non-intrusive, lightweight, and yet informative. We have to also give up on perfect knowledge, and focus on information that is actually useful to the instructor.
Part of the problem is that we as a community lack a systematic method to help students in the first place. If students have no structure to how they approach help-seeking, then it’s hard to find patterns and make sense of what they actually do.
However, this is exactly a problem that the How to Design Programs Design Recipe was addressed to solve. It provides a systematic way for students to structure their problem-solving and help-seeking. TAs are instructed to focus on the steps of the Design Recipe in order, not addressing later steps until students have successfully completed the earlier ones. This provides an “early warning” diagnostic, addressing root causes rather than their (far-removed) manifestations.
Therefore, we decided to use the Design Recipe steps as a lens for obtaining insight into TA hours. We argue that this provides a preliminary tool that addresses our needs: it is lightweight, non-intrusive, and yet useful to the instructor. Read the paper to learn more!
When faced with an unfamiliar programming problem, undergraduate computer science students all-too-often begin their implementations with an incomplete understanding of what the problem is asking, and may not realize until far into their development process (if at all) that they have solved the wrong problem. At best, a student realizes their mistake, suffers from some frustration, and is able to correct it before the final submission deadline. At worst, they might not realize their mistake until they receive feedback on their final submission—depriving them of the intended learning goal of the assignment.
Educators must therefore provide students with some mechanism by which students can evaluate their own understanding of a problem—before they waste time implementing some misconceived variation of that problem. To this end, we provide students with Examplar: an IDE for writing input–output examples that provides on-demand feedback on whether the examples are:
- valid (consistent with the problem), and
- thorough (explore the conceptually interesting corners of the problem).
With its gamification, we believed students would find Examplar compelling to use. Moreover, we believed its feedback would be helpful. Both of these hypotheses were confirmed. We found that students used Examplar extensively—even when they were not required to use it, and even for assignments for which they were not required to submit test cases. The quality of students’ final submissions generally improved over previous years, too. For more information, read the full paper here!
We routinely rely on automated assessment to evaluate our students’ work on programming assignments. In principle, these techniques improve the scalability and reproducibility of our assessments. In actuality, these techniques may make it incredibly easy to perform flawed assessments at scale, with virtually no feedback to warn the instructor. Not only does this affect students, it can also affect the reliability of research that uses it (e.g., that correlates against assessment scores).
To Test a Test Suite
The initial object of our study was simply to evaluate the quality of student test suites. However, as we began to perform our measurements, we wondered how stable they were, and started to use different methods to evaluate stability.
In this group, we take the perspective that test suites are classifiers of implementations. You give a test suite an implementation, and it either accepts or rejects it. Therefore, to measure the quality of a test suite, we can standard metrics for classifiers, true positive rate and true negative rate. However, to actually do this, we need a set of implementations that we know, a priori, to be correct or faulty.
|Accept||True Negative||False Negative|
|Reject||False Positive||True Positive|
A robust assessment of a classifier may require a larger collection of known-correct and known-faulty implementations than the instructor could craft themselves. Additionally, we can leverage all of the implementations that students are submitting—we just need to determine which are correct and which are faulty.
There are basically two ways of doing this in the literature; let’s see how they fare.
The Axiomatic Model
In the first method, the instructor writes a test suite and whatever that test suite’s judgments is used as the ground truth; e.g., if the instructor test suite accepts a given implementation, it is a false positive for a student’s test suite to reject it.
The Algorithmic Model
The second method does this by taking every test suite you have (i.e., both the instructor’s and the students’), running them all against a known-correct implementation, and gluing all the ones that pass it into one big mega test suite that is used to establish ground truth.
A Tale of Two Assessments
We applied each model in turn to classify 38 student implementations and a handful of specially crafted ones (both correct and faulty, in case the student submissions were skewed heavily towards faultiness or correctness), then computed the true-positive and true-negative rate for each student’s test suite.
The choice of underlying implementation classification model substantially impacted the apparent quality of student test suites. Visualized as kernel density estimation plots (akin to smoothed histograms):
The Axiomatic Model:
Judging by this plot, students did astoundingly well at catching buggy implementations. Their success at identifying correct implementations was more varied, but still pretty good.
The Algorithmic Model:
Judging by this plot, students performed astoundingly poorly at detecting buggy implementations, but quite well at identifying correct ones.
Towards Robust Assessments
So which is it? Do students miss half of all buggy implementations, or are they actually astoundingly good? In actuality: neither. These strikingly divergent analysis outcomes are produced by fundamental, theoretical flaws in how these models classify implementations.
We were alarmed to find that these theoretical flaws, to varying degrees, affected the assessments of every assignment we evaluated. Neither model provides any indication to warn instructors when these flaws are impacting their assessments. For more information about these perils, see our paper, in which we present a technique for instructors and researchers that detects and protects against them.
How do you learn a new language? Do you simply read its reference manual, and then you’re good to go? Or do you also explore the language, trying things out by hand to see how they behave?
This skill—learning about something by poking at it—is frequently used but almost never taught. However, we have tried to teach it via what we dub mystery languages, and this post is about how we went about it.
A mystery language is exactly what it sounds like: a programming language whose behavior is a mystery. Each assignment comes with some vague documentation, and an editor in which you can write programs. However, when you run a program it will produce multiple answers! This is because there are actually multiple languages, with the same syntax but different semantics. The answers you get are the results of running your program in all of the different languages. The goal of the assignment is to find programs that tell the languages apart.
As an example, maybe the languages have the syntax
a[i], and the
documentation says that this “accesses an array
a at index
That totally specifies the behavior of this syntax, right?
Not even close. For example, what happens if the index is out of bounds?
Does it raise an error (like Java), or return
valid one (like Python)? And what happens if the index is
(EDIT: Specifically, Python wraps negative indices that are smaller than the length of the array.)
Students engage with the mystery languages in three ways:
- The first part of each assignment is to find a set of programs that distinguishes the different languages (a.k.a. a classifier).
- The second part of the assignment is to describe a theory that explains the different behaviors of the languages. (For example, a theory about C’s behavior could include accessing heap addresses past the end of an array.)
- Finally, after an assignment is due, there’s an in-class discussion and explanation of the mystery languages. This is especially useful to provide terminology for behaviors that students encountered in the languages.
This example is somewhat superficial (the real mystery languages mostly use more significant features than array access), but you get the idea: every aspect of a programming language comes with many possible designs, and the mystery languages have students explore them first-hand.
Why Mystery Languages?
We hope to teach a number of skills through mystery languages:
- Evaluating Design Decisions: The mystery languages are almost entirely based on designs chosen by real languages: either historical choices that have since been settled, or choices that are still up in the air and vary between modern languages. Students get to explore the consequences of these decisions themselves, so that they get a feel for them. And the next day in class they learn about the broader historical context.
- Syntax vs. Semantics: We are frustrated by how frequently discussions about programming languages revolve around syntax. With luck, showing students first-hand that a single syntax can have a variety of semantics will bring their discussions to a somewhat higher level.
- Adversarial Thinking: Notice that the array example was all about exploring edge cases. If you only ever wrote correct, safe, boring programs then you’d never notice the different ways that array access could work. This is very similar to the kind of thinking you need to do when reasoning about security, and we hope that students might pick up on this mindset.
- Experimental Thinking: In each assignment, students are asked not only to find programs that behave differently across the mystery languages, but also to explain the behavior of the different languages. This is essentially a scientific task: brainstorm hypotheses about how the languages might work; experimentally test these hypotheses by running programs; discard any hypothesis that was falsified; and iterate.
Adopting Mystery Languages
If you want to use Mystery Languages in your course, email us and we’ll help you get started!
There are currently 13 mystery languages, and more in development. At Brown, we structured our programming languages course around these mystery languages: about half of the assignments and half of the lectures were about mystery languages. However, mystery languages are quite flexible, and could also be used as a smaller supplement to an existing course. One possibility is to begin with one or two simpler languages to allow students to learn how they work, and from there mix and match any of the more advanced languages. Alternatively, you could do just one or two mystery languages to meet specific course objectives.
No one should have to see the insides of your macros. Yet type errors
often reveal them. For example, here is a very simple
and macro in
Rust (of course you should just use
&& instead, but we’ll use this
as a simple working example):
and the type error message you get if you misuse it:
You can see that it shows you the definition of this macro. In this case it’s not so bad, but other macros can get messy, and you might not want to see their guts. Plus in principle, a type error should only show the erronous code, not correct code that it happend to call. You wouldn’t be very happy with a type checker that sometimes threw an error deep in the body of a (type correct) library function that you called, just because you used it the wrong way. Why put up with one that does the same thing for macros?
The reason Rust does is that that it does not know the type of
As a result, it can only type check after
and has been desugared
(a.k.a., expanded), and so the error occurs in the desugared code.
But what if Rust could automatically infer a type rule for checking
and, using only the its definition? Then the error could be found in
the original program that you wrote (rather than its expansion), and
presented as such. This is exactly what we did—albeit for simpler type
systems than Rust’s—in our recent PLDI’18 paper
Inferring Type Rules for Syntactic Sugar.
We want to automatically derive a type rule for
and, and we want it
to be correct. But what does it mean for it to be correct? Well, the
and is defined by its desugaring:
α and β is synonymous
if α then β else false. Thus they should have the same type
(here the fancy D means “desugar”):
Isurf ||- means “in the surface language type system”,
||- means “in the core language type system”, and the fancy D means
How can we achieve this? The most straightforward to do is to capture
iff with two type rules, one for the forward implication, and
one for the reverse:
The first type rule is useful because it says how to type check
in terms of its desugaring. For example, here’s a derivation that
true and false has type
However, while this
t-and→ rule is accurate, it’s not the canonical
type rule for
and that you’d expect. And worse, it mentions
it’s leaking the implementation of the macro!
However, we can automatically construct a better type rule. The trick
is to look at a more general derivation. Here’s a generic type
derivation for any term
α and β:
Notice Dα and Dβ: these are holes in the derivation, which ought to be filled in with sub-derivations proving that α has type Bool and β has type Bool. Thus, “α : Bool” and “β : Bool” are assumptions we must make for the type derivation to work. However, if these assumptions hold, then the conclusion of the derivation must hold. We can capture this fact in a type rule, whole premises are these assumptions, and whose conclusion is the conclusion of the whole derivation:
And so we’ve inferred the canonical type rule for
and! Notice that
(i) it doesn’t mention
if at all, so it’s not leaking the inside of
the macro, and (ii) it’s guaranteed to be correct, so it’s a good
starting point for fixing up a type system to present errors at the
right level of abstraction. This was a simple example for illustrative
purposes, but we’ve tested the approach on a variety of sugars and