321 — Syntax Error Recovery in Parsing Expression Grammars
Medeiros & Mascarenhas (10.1145/3167132.3167261)
Read on 07 July 2018When I took my automata course years ago, one of the core concepts of formal languages was that a string could either belong to or not belong to a language: There was no in-between. This means that simple syntax-checking can only tell you that your syntax is bad; it can’t recover from a failure.
One system for formally defining a grammar is a PEG, or “Parsing Expression Grammar.” A PEG describes a parser that allows for backtracking in parsing: In other words, PEGs leverage many of the capabilities of regular expressions, but also have an “order” to their choice to avoid ambiguities. When given the choice between two feasible interpretations of a syntax, a PEG has the ‘knowledge’ to prefer one, try it out, and then backtrack and try the other choices if the parsing fails.
But PEGs are no good when it comes to syntaxes that completely fail to parse using any of those choices: Instead, most PEG implementations must be augmented by a separate error-handling system, which is application-specific and tends not to be reusable or composable. Furthermore, most of these ad-hoc systems stop parsing after a failure is encountered and handled. While this works for certain cases, it’s no good for, say, IDE-based syntax highlighting, where text after an error should also be highlighted.
The technique proposed in this paper is an error-recoery technique for PEGs. Specifically, it applies to PEG systems that report farthest-failure: Instead of failing to parse when the system encounters a syntax error, this system includes “recovery expressions” — expressions that can be used after encountering an error in order to find the place where the PEG can continue parsing. For example, if an expression is missing a semicolon, the recovery expression might be the set of all strings that end in a semicolon:
i = 3;
j = 2
k = 1;
m = 0;
In this example, the line j = 2
fails, but a recovery expression matches k = 1;
because it ends with a semicolon. So the PEG can continue matching on the m = 0;
line.
This is of course a simple case, but by including recovery expressions alongside errors in a PEG, it is possible to use a PEG for top-down fast parsing of complex, errorful text.