Useful syntax tricks
There are a lot of syntax improvements that people miss out on because the standard parser toolkits don’t make it very easy. That’s a pity, because fine-tuning the syntax does help out developers, and significant improvements can be made in just one or two days.
Let me give a few examples, motivate why they are valuable, and show a way to do it using Flex/Bison. I think these would be good candidates for the user manual of parser toolkits as well as for exercises for someone learning to work on compilers.
Significant line endings
If you look at well-formatted code, then statements and declarations will usually end at the end of a line. It’s redundant and a source of traps if the developer is looking at line endings and the compiler is looking at semicolons. The developer has to write the information in two different ways, and if there is a mismatch, it can easily go wrong.
Consider this example:
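Something like the following, where the variable names are placeholders:

```
int total = base_price()
int shipping = shipping_cost();
```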
The developer is initializing two variables, and they have left off a semicolon on the first line. This will cause a compile error, but why? There’s nothing else the developer could have meant except to initialize two variables to two separate expressions.
Basic, Fortran, Ruby, and Python all treat the newline as significant, and the experience of developers in those languages has proven that, across a broad spectrum of language constructs, it works pretty well to use the newline as a statement terminator. Developers in these languages largely don’t think about the issue at all. They just have one less character to type for each statement, and one less token to pay attention to when reading code.
To implement significant newlines with Flex and Bison, the best way I’ve seen is to insert a small function in between the two tools. Have Flex match a newline as its own token type, separate from whitespace. Have Bison treat that newline as a statement terminator, either as the only terminator or as an alternative to a semicolon.
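As a rough sketch, the two sides can look like the following; the token name NL and the rule names are assumptions, not fixed conventions:

```
/* Flex: the newline becomes its own token, separate from other whitespace */
\n          { return NL; }
[ \t]+      { /* skip non-newline whitespace */ }

/* Bison: a statement ends at a newline or a semicolon */
terminator: NL | ';' ;
statements: %empty | statements statement terminator ;
```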
This way already works pretty well, but the trouble is that there is now no way to write a multi-line statement in cases where the developer wants one. Thus, unless you are developing a language where statements just shouldn’t go longer than one line anyway, you want a way to suppress some of the newlines. It’s this part that is difficult in Flex/Bison, because the way you want to suppress newlines is not a good match for how either tool works. It’s very simple to just write the code, however, and maybe parser tools in the future will make this easier, even if it’s only via examples in the documentation.
The common cases where you will want to suppress a newline are as follows:
- Within parentheses, square brackets, or other grouping constructs where it’s just not possible to have multiple statements. This rule can be implemented in Flex by using a mode stack, but it requires some thought to realize you might need to do it.
- After a character that cannot end a statement, or before a character that cannot start a statement. For example, commonly a comma character (,) cannot end a statement, so you may wish to suppress any newlines that occur after a comma token. As another example, if your syntax does not include a unary plus operator (+), then you may wish to suppress newlines that occur before a plus operator (+). Rules like this are very awkward to implement in Flex, because they require looking ahead by a whole token.
These rules are very simple to implement if you process the token stream coming from Flex before passing it to Bison. Start by having Flex recognize the newline character as its own token. Then, configure Bison to call a custom function that you implement rather than calling yylex directly.
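One way to interpose the function (a sketch; the wrapper name here is an assumption) is a #define in the Bison prologue, so that the generated parser calls your wrapper while the wrapper itself, living in a separate .c file, still calls Flex’s real yylex:

```c
/* In the Bison grammar file's prologue */
%{
#define yylex adjust_token_stream   /* the parser now calls the wrapper */
int adjust_token_stream(void);      /* defined in a separate .c file    */
%}
```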
Your custom function will then process the token stream that Flex produces. It works well to structure it as a sequence of processors, each doing one job. The outer layer can remove newlines that occur before or after a token that cannot start or end a statement, respectively:
int previous = -1;
int lookahead = -1;

int adjust_token_stream() {
    if (lookahead >= 0) {
        // There is a lookahead token already stored.
        // Return it.
        previous = lookahead;
        lookahead = -1;
        return previous;
    }
    int next = count_parens();
    if (next != NL) {
        previous = next;
        return previous;
    }
    if (previous == COMMA || previous == COLON) {
        // Skip a NL that follows , or :
        // by returning the token after it instead
        previous = count_parens();
        return previous;
    }
    // Check the next token
    lookahead = count_parens();
    if (lookahead == PLUS) {
        // Skip a NL that precedes a + sign
        previous = lookahead;
        lookahead = -1;
        return previous;
    }
    // Keep this NL token
    previous = next;
    return previous;
}
The next layer removes newlines that are inside parentheses:
int paren_depth = 0;

int count_parens() {
    int next = condense_nls();
    switch (next) {
    case LPAREN:
        paren_depth++;
        return next;
    case RPAREN:
        paren_depth--;
        return next;
    case NL:
        if (paren_depth == 0) {
            // A top-level NL should be returned as is
            return next;
        } else {
            // Skip a NL that is between parentheses. Recurse, so that
            // a bracket right after the skipped NL still adjusts the depth.
            return count_parens();
        }
    default:
        return next;
    }
}
Finally, the innermost layer condenses multiple newlines into just one.
int last_was_nl = 0;

int condense_nls() {
    if (last_was_nl) {
        // Skip tokens until a non-NL token is seen
        last_was_nl = 0;
        while (1) {
            int next = yylex();
            if (next != NL) return next;
        }
    }
    int result = yylex();
    last_was_nl = (result == NL);
    return result;
}
Having gone this far, you can also write a token stream processor that removes comments and whitespace. Flex can do that, but it seems cleaner to me to let Flex focus on demarcating the tokens, which is already a complex job on its own.
Significant indentation
Indentation has a similarity to line endings in that the developer has already expressed their intent by indenting their code the amount that they did. It is therefore redundant and a source of traps to require braces around code that is already indented.
Python has an approach to significant indentation that seems to work well in practice. Track a current indentation level, which to begin with would be column 1 of the file. Whenever a newline token is tokenized and kept, look at the column that the next token starts at, and compare it to the current level of indentation. If it is larger than the current indentation, then push the old indentation onto a stack, switch to the new indentation level, and return a synthetic INDENT token to the caller. Likewise, if it is smaller than the current indentation, pop levels off the stack until one matches, switch to that level, and return a synthetic DEDENT token for each level popped.
The parser can now use INDENT and DEDENT the same way that curly braces ({ }) would be used in a C-like language.
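Here is a sketch of that stack algorithm as one more token-stream layer. The token codes, the 64-level stack limit, and the stubbed lexer that supplies (token, column) pairs are all assumptions for illustration; a real version would read columns from the scanner's position tracking (e.g., Bison's yylloc):

```c
#include <assert.h>

/* Stand-in token codes; INDENT and DEDENT are the synthetic ones. */
enum { END = 0, NL, INDENT, DEDENT, IDENT };

/* Stub lexer: replays (token, start-column) pairs in place of Flex. */
static struct { int tok, col; } script[] = {
    { IDENT, 1 }, { NL, 0 },
    { IDENT, 5 }, { NL, 0 },   /* indented block begins */
    { IDENT, 5 }, { NL, 0 },
    { IDENT, 1 }, { NL, 0 },   /* back to column 1 */
    { END, 1 },
};
static int pos = 0;
static int cur_col;            /* column of the most recently read token */
static int next_token(void) { cur_col = script[pos].col; return script[pos++].tok; }

static int indent_stack[64] = { 1 };  /* current level starts at column 1 */
static int indent_top = 0;
static int pending_dedents = 0;       /* DEDENTs still owed to the parser */
static int pending_indent = 0;
static int buffered = -1;             /* token read while peeking past a NL */

int track_indentation(void) {
    if (pending_dedents > 0) { pending_dedents--; return DEDENT; }
    if (pending_indent)      { pending_indent = 0; return INDENT; }
    if (buffered >= 0)       { int t = buffered; buffered = -1; return t; }
    int tok = next_token();
    if (tok != NL) return tok;
    buffered = next_token();   /* peek at the token after the newline */
    if (cur_col > indent_stack[indent_top]) {
        indent_stack[++indent_top] = cur_col;    /* deeper: push and INDENT */
        pending_indent = 1;
    } else {
        while (indent_top > 0 && cur_col < indent_stack[indent_top]) {
            indent_top--;                        /* shallower: pop and DEDENT */
            pending_dedents++;
        }
    }
    return NL;   /* the newline itself still terminates the statement */
}
```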
These tokens are very straightforward to introduce if you are already writing token-stream processors to begin with, so it seems like token stream processing might be a secret weapon for parsing that is right there in front of everyone.
Precedence parsing
Bison is worse than unhelpful when it comes to precedence parsing. It actively makes things worse by including precedence and associativity declarations that only really achieve “precedence” and “associativity” if you very carefully write your grammar in a very specific way.
I stopped using these declarations when I looked into how they work. They do the right thing if you organize your grammar in exactly the right way, but nothing checks that you have done so, and if you use the generality of Bison to branch out from a very simple definition of operator expressions, these declarations still take effect but become hard to predict. What they actually do is resolve shift-reduce conflicts: Bison compares the precedence assigned to the rule it could reduce against the precedence of the lookahead token, and shifts or reduces based on that comparison. That is a property of the shift-reduce stack machine, not of the grammar you wrote.
It is better to simply code the BNF rules for precedence, therefore giving you a grammar where it’s clear what is meant without having to think about the mechanics of the shift-reduce stack machine that Bison uses in its implementation. If you use Flex and Bison by themselves, then it’s simply worth learning common ways to encode precedence and associativity.
expr: expr PLUS term
| expr MINUS term
| term;
term: term STAR factor
| term SLASH factor
| term PERCENT factor;
factor: MINUS factor
| atom CARET factor
| atom;
atom: variable | literal | LPAREN expr RPAREN;
This encoding lays out three precedence levels: add/subtract, multiply/divide, and negate/exponentiation. The first two are left-associative, and the last one is right-associative.
Even better than this way, though, is to use a separate tool or library from Bison that really can implement precedence and associativity correctly. Here is the main loop of such a technique from Tiark Rompf’s highly entertaining article Just Write the Parser.
function binop(min) {
    let res = factor()
    while (peek in prec && prec[peek] >= min) {
        let nextMin = prec[peek] + assoc[peek] // + 1 for left assoc
        res = eval[next()](res, binop(nextMin))
    }
    return res
}
The technique is based on a 1973 paper by Vaughan Pratt, Top Down Operator Precedence. Tiark’s version is much easier to read and understand, though, and the main idea to me is to write down the precedence in a way that naturally matches how operator expressions work rather than trying to encode precedence into BNF.
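Here is a self-contained C rendition of that loop, using single-character operators and a tiny scanner. The precedence numbers and the choice of ^ as the only right-associative operator are assumptions for illustration:

```c
#include <assert.h>
#include <ctype.h>

static const char *src;   /* cursor into the expression being parsed */

static char peek(void) { while (*src == ' ') src++; return *src; }
static char take(void) { char c = peek(); src++; return c; }

/* Precedence table; -1 marks "not a binary operator". */
static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    case '^':           return 3;
    default:            return -1;
    }
}
static int right_assoc(char op) { return op == '^'; }

/* Integer exponentiation for nonnegative exponents. */
static long ipow(long b, long e) { long r = 1; while (e-- > 0) r *= b; return r; }

static long binop(int min);

/* A factor is a parenthesized expression or a decimal number. */
static long factor(void) {
    if (peek() == '(') { take(); long v = binop(0); take(); return v; }
    long v = 0;
    while (isdigit((unsigned char)peek())) v = v * 10 + (take() - '0');
    return v;
}

static long apply(char op, long a, long b) {
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    case '/': return a / b;
    default:  return ipow(a, b);   /* '^' */
    }
}

/* The main loop: left-associative operators recurse one level tighter
   than their own precedence, right-associative ones at the same level. */
static long binop(int min) {
    long res = factor();
    while (prec(peek()) >= min) {
        char op = take();
        int next_min = prec(op) + (right_assoc(op) ? 0 : 1);
        res = apply(op, res, binop(next_min));
    }
    return res;
}

static long eval_expr(const char *s) { src = s; return binop(0); }
```

Note how associativity costs exactly one line: the adjustment to next_min is all that distinguishes left- from right-associative operators.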
Correct handling of exponents and negation
Exponents and negation are widely handled incorrectly, as overviewed in this Stack Exchange question. I’ve messed it up myself, and in my case, it was because I don’t use exponentiation much, and I just assumed that it would follow certain patterns without stopping to think about it.
The situation is not as ambiguous as it looks. There’s only one useful, non-problematic way to parse exponents and negation, and it’s (1) right to left, and (2) using the same precedence for both of these operators. Here are things that go wrong if you violate either of those rules.
- If you parse left to right, then a^b^c will do the a^b first, and then raise that to the c power. However, this isn’t really a useful form of double exponentiation, because the developer could have already written a^(b*c).
- If you make negation higher precedence than exponentiation, then polynomials cannot be written in their usual way. For example, consider -x^4 + 3*x - 2. In the term -x^4, the exponent should happen first, and then the negation. The other order is the same as leaving the minus sign off entirely.
- If you make negation lower precedence than exponentiation, then an expression like x^-y^z cannot be parsed. You need to make the -y^z part be what x is raised to, but normally the operands of an operator must use only higher- or equal-precedence operands, and we just assumed that unary - is lower precedence than exponentiation.
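These two rules are exactly what the factor production in the earlier grammar encodes: negation and exponentiation sit together on one right-associative level. A small evaluator over that grammar demonstrates the behavior; the single-character scanner and the nonnegative-exponent ipow helper are simplifying assumptions:

```c
#include <assert.h>
#include <ctype.h>

static const char *src;

static char peek(void) { while (*src == ' ') src++; return *src; }
static char take(void) { char c = peek(); src++; return c; }

/* Integer exponentiation for nonnegative exponents. */
static long ipow(long b, long e) { long r = 1; while (e-- > 0) r *= b; return r; }

static long expr(void);

/* atom: number | '(' expr ')' */
static long atom(void) {
    if (peek() == '(') { take(); long v = expr(); take(); return v; }
    long v = 0;
    while (isdigit((unsigned char)peek())) v = v * 10 + (take() - '0');
    return v;
}

/* factor: '-' factor | atom '^' factor | atom
   Negation and exponentiation share one right-associative level. */
static long factor(void) {
    if (peek() == '-') { take(); return -factor(); }
    long v = atom();
    if (peek() == '^') { take(); return ipow(v, factor()); }
    return v;
}

/* term: term ('*'|'/') factor | factor  (left-associative) */
static long term(void) {
    long v = factor();
    while (peek() == '*' || peek() == '/') {
        char op = take();
        long rhs = factor();
        v = (op == '*') ? v * rhs : v / rhs;
    }
    return v;
}

/* expr: expr ('+'|'-') term | term  (left-associative) */
static long expr(void) {
    long v = term();
    while (peek() == '+' || peek() == '-') {
        char op = take();
        long rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

static long eval_expr(const char *s) { src = s; return expr(); }
```

With this grammar, -2^2 evaluates to -4 rather than 4, and a polynomial like -2^4 + 3*2 - 2 comes out the way it reads on paper.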
How to use these tricks
I believe the tricks in this article are widely useful, but even so, you can’t build a good language by piling in every trick or tactic you can think of.
A programming language is a form of user interface: it’s a form of communication from the user to the computer. A good language is customized for the context it will be used in and can be developed with many of the same techniques that are used for user interfaces.
Above all, try out a modified syntax and see how it goes. An extreme example of this can be seen in the Scala group’s exploration of optional braces (“Feedback sought: Optional Braces”), where they redid course work examples with multiple proposals, both in their own work and in coding examples for classrooms, and then fine-tuned the results based on how it went.
Design thinking is a topic of its own and is surveyed in The Reflective Practitioner, by Donald A. Schon. Some highlights include:
- Design proceeds iteratively, where you put together a full design, try it out, and then modify it based on what you found.
- Here are some options for the “try it out” part.
- Try to build something real, yourself, using your design.
- Watch someone use your design to accomplish a problem that they care about.
- Run the design through testing tools such as fuzzers and model checkers.
- Monitor discussion forums to see what people say as they try out your design.
- Capture error messages from your tools, and examine them to find out where people frequently go wrong.
- Scan code repositories such as GitHub to see which parts of your language design are being used and which parts are being avoided.
- Here are some options for “modify it”.
- Use “moves”, as Schon describes it, that are circulated in the field and have proven useful to other designers. I hope that this article can contribute to such efforts.
- Study existing art for ideas. Find out what choices other designers made, and examine how it went for the users of those systems.
- Separate yourself from what you are designing. Try to develop a sense that good designers make bad designs all the time. If you can’t say or accept bad things about your creations, because it stings too much, then you’ll be forever trapped in whatever your first ideas were.