S5: Wat?

Tags: JavaScript, Programming Languages, Semantics

Posted on 31 January 2012.

Gary Bernhardt's Wat talk has been making a well-deserved round of the blogodome in the past few weeks. If you haven't seen it, go give it a watch (you can count it as work time, since you saw it on the Brown PLT Blog, and we're Serious Researchers). The upshot of the second half of the talk is that JavaScript has some less-than-expected behaviors. We happen to have a JavaScript implementation floating around in the form of S5, and we like to claim that it handles the hairy corners of the language. So we decided to throw Gary's examples at it.

The Innocuous +

Gary's first JavaScript example went like this:

failbowl:~(master!?) $ jsc
> [] + []

> [] + {}
[object Object]
> {} + []
0
> {} + {}
NaN

S5 lacks a true REPL (it simply takes JavaScript strings and produces output and answers), so we started by approximating a little bit. We first tried a series of print statements to see if we got the same effect:

$ cat unit-tests/wat-arrays.js 
print([] + []);
print([] + {});
print({} + []);
print({} + {});

$ ./s5 < unit-tests/wat-arrays.js 

[object Object]
[object Object]
[object Object][object Object]
undefined

WAT.

Well, that doesn't seem good at all. Only half of the answers are right, and there's an undefined at the end. What went wrong? It turns out the semantics of REPLs are to blame. If we take the four programs and run them on their own, we get something that looks quite a bit better:

$ ./s5 "[] + []"
""

$ ./s5 "[] + {}"
"[object Object]"

$ ./s5 "{} + []"
0.

$ ./s5 "{} + {}"
nan

There are two issues here:

  1. Why do 0. and nan print like that?
  2. Why did this work, when the previous attempt didn't?

The answer to the first question is pretty straightforward: under the covers, S5 is using OCaml floats and printing OCaml values at the end of its computation, and OCaml makes slightly different decisions than JavaScript in printing numbers. We could change S5 to print answers in JavaScript-printing mode, but the values themselves are the right ones.

The second question is more interesting. Why do we get such different answers depending on whether we evaluate individual strings versus printing the expressions? The answer is in the semantics of JavaScript REPLs. When parsing a piece of JavaScript, the REPL needs to make a choice. Sensible decisions would be to treat each new JavaScript string as a Statement, or as an entire JavaScript Program. Most REPLs choose the Program production.

The upshot is that the parsing of {} + {} is quite different from [] + []. With S5, it's trivial to print the desugared representation and understand the difference. When we parse and desugar, we get very different results for {} + {} and [] + []:

$ ./s5-print "{} + {}"
{undefined;
 %UnaryPlus({[#proto: %ObjectProto,
              #class: "Object",
              #extensible: true,]
             })}

$ ./s5-print "[] + []"
%PrimAdd({
    [#proto: %ArrayProto,
     #class: "Array",
     #extensible: true,]
    'length' : {#value 0., #writable true, #configurable false}
  },
  {
    [#proto: %ArrayProto,
     #class: "Array",
     #extensible: true,]
    'length' : {#value 0., #writable true, #configurable false}
  }
)

It is clear that {} + {} parses as two statements (an undefined followed by a UnaryPlus), and [] + [] as a single statement containing a binary addition expression. What's happening is that in the Program production, for the string {} + {}, the first {} is matched with the Block syntactic form, with no internal statements. The rest of the string, + {}, is then parsed as a UnaryExpression. This is in contrast to [] + [], which can only parse as an ExpressionStatement containing an AdditiveExpression, because [ cannot begin a Block.

In the example where we used successive print statements, every expression in the argument position to print was parsed in the second way, hence the different answers. The lesson? When you're at a REPL, be it Firebug, Chrome, or the command line, make sure the expression you're typing is what you think it is: not being aware of this difference can make it even more difficult to know what to expect!
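The same split can be reproduced in any JavaScript engine without S5. This is a minimal sketch, assuming that eval() is a fair stand-in for a REPL that uses the Program production; the variable names are ours and purely illustrative. Wrapping the input in parentheses forces the expression reading, which is what happened to the arguments of print.

```javascript
// eval parses its argument as a whole Program, so a leading {} is an empty Block.
var asProgramArray = eval("{} + []");    // Block, then unary plus applied to []
var asProgramObject = eval("{} + {}");   // Block, then unary plus applied to {}

// A parenthesis cannot start a Block, so here {} is an object literal
// and + is binary addition, as in the print(...) version of the test.
var asExpressionArray = eval("({} + [])");
var asExpressionObject = eval("({} + {})");

console.log(asProgramArray);      // 0
console.log(asProgramObject);     // NaN
console.log(asExpressionArray);   // [object Object]
console.log(asExpressionObject);  // [object Object][object Object]
```

Some newer REPLs wrap input that starts with { in parentheses before evaluating it, so the answers you see at a prompt may match the second pair rather than the first.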

If You Can't Beat 'Em...

Our first example led us on an interesting excursion into parsing, from which S5 emerged triumphant, correctly modelling the richness and/or weirdness of the addition examples. Next up, Gary showed some straightforward uses of Array.prototype.join():

failbowl:~(master!?) $ jsc
> Array(16)
,,,,,,,,,,,,,,,,
> Array(16).join("wat")
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
> Array(16).join("wat" + 1)
wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1
> Array(16).join("wat" - 1) + " Batman"
NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN Batman

Our results look oh-so-promising, right up until the last line (note: we call String on the first case, because S5 doesn't automatically toString answers, which the REPL does).

$ ./s5 "String(Array(16))"
",,,,,,,,,,,,,,,,"
$ ./s5 "Array(16).join('wat')"
"watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat"
$ ./s5 "Array(16).join('wat' + 1)"
"wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1wat1"
$ ./s5 "Array(16).join('wat' - 1) + ' Batman'"
"nullnullnullnullnullnullnullnullnullnullnullnullnullnullnullnull Batman"

WAT.

Are we really that awful that we somehow yield null rather than NaN? A quick glance at the desugared code shows us that we actually have the constant value null as the argument to join(). How did that happen? Interestingly, the following version of the program works:

$ ./s5 "var wat = 'wat'; Array(16).join(wat - 1) + ' Batman';"
"NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN Batman"

This leads us to our answer. We use SpiderMonkey's very handy Parser API as part of our toolchain. Reflect.parse() takes strings and converts them to JSON structures with rich AST information, which we stringify and pass off to the innards of S5 to do desugaring and evaluation. Reflect.parse() is part of a JavaScript implementation that strives for performance, and to that end it performs constant folding. That is, as an optimization, when it sees the expression "wat" - 1, it automatically converts it to NaN. All good so far.

The issue is that the NaN yielded by constant folding is not quite the same NaN we might expect in JavaScript programs. In JavaScript, the identifier NaN is a property of the global object with the value NaN. The Parser API can't safely fold to the identifier NaN (as was pointed out to us when we reported this bug), because it might be shadowed in a different context. Presumably to avoid this pitfall, the folding yields a JSON structure that looks like:
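To see why folding to the identifier would be unsafe, here is a small sketch of the shadowing problem. The function and variable names are hypothetical, chosen only for illustration.

```javascript
// Inside this function the parameter shadows the global NaN property,
// so the identifier NaN no longer denotes the not-a-number value.
function shadowed(NaN) {
  return NaN;
}

var viaIdentifier = shadowed(42);  // whatever the caller passed in
var viaArithmetic = "wat" - 1;     // the NaN value itself, independent of any binding

console.log(viaIdentifier);                // 42
console.log(Number.isNaN(viaArithmetic));  // true
```

A parser that rewrote "wat" - 1 into the identifier NaN inside shadowed would change the meaning of the program, which is why a literal node is the safer output.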

expression:{type:"Literal", value:NaN}

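But we can't sensibly use JSON.stringify() on this structure, because NaN isn't valid JSON! Any guesses on what SpiderMonkey's JSON implementation turns NaN into? If you guessed null, we owe you a cookie. (This isn't a SpiderMonkey quirk, either: ES5 specifies that JSON.stringify() serializes non-finite numbers as null.)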
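The lossy round trip is easy to reproduce. The sketch below uses a hand-built object shaped like the fragment above rather than real Reflect.parse() output, and the variable names are ours.

```javascript
// A stand-in for the folded AST node; not actual Reflect.parse() output.
var folded = { type: "Literal", value: NaN };

var wire = JSON.stringify(folded);
var decoded = JSON.parse(wire);

console.log(wire);           // {"type":"Literal","value":null}
console.log(decoded.value);  // null
```

Once the string has been produced, nothing distinguishes a folded NaN from a genuine null literal in the source program.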

We have designed a hack based on suggestions from the bug report to get around this (passing a function to stringify to look for NaNs and return a stylized object literal instead). There's a bug open to make constant folding optional in Reflect.parse(), so this will be fixed in Mozilla's parser. (Update) The bug is fixed, and we've updated our version of SpiderMonkey. This example now works happily, thanks to Dave Herman.
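A workaround along those lines might look like the following sketch. The shape of the marker object and the name nanReplacer are assumptions made for illustration; the post does not say what the stylized literal in S5's toolchain actually looked like.

```javascript
// Replacer passed as the second argument to JSON.stringify.
// The marker object below is hypothetical, not S5's real encoding.
function nanReplacer(key, value) {
  // NaN is the only JavaScript value that is not equal to itself.
  if (typeof value === "number" && value !== value) {
    return { type: "SpecialNumber", name: "NaN" };
  }
  return value;
}

// A stand-in for the folded AST node; not actual Reflect.parse() output.
var folded = { type: "Literal", value: NaN };
var encoded = JSON.stringify(folded, nanReplacer);

console.log(encoded);
// {"type":"Literal","value":{"type":"SpecialNumber","name":"NaN"}}
```

The consumer on the other side would then need to recognize the marker and turn it back into a NaN value during desugaring.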

Producing a working JavaScript implementation leads to a whole host of exciting moments and surprising discoveries. Building this semantics and its desugaring gives us much more confidence that our tools say something meaningful about real JavaScript programs. These examples show that getting perfect correspondence is difficult, but we strive to be as close as possible.