Sunday, 14 August 2016

Pratt Parsing in Parsec. Perfect.

While working on the Kind interpreter in Haskell, using the Parsec parser combinator library, I came upon a difficulty.

Parsec has a built-in expression parser generator, but unfortunately it has a problem:

Prefix and postfix operators of the same precedence can only occur once (i.e. --2 is not allowed if - is prefix negate).

This restriction sort-of makes sense when you consider '-' as there is little reason to do this, but in languages that have implicit conversion to bool (which Kind will have), !!expr is a common idiom used to convert a value to bool without inverting it, and this means that the Parsec expression parser cannot understand this. Or for C-style pointer handling, being able to do **p (where p is a pointer to a pointer to some value) is pretty-much essential. The Parsec expression parser therefore is not useful for anything other than trivial languages, as far as I can see. I needed a replacement.

The traditional solution to writing infix expression parsers with precedence is to break your grammar productions into a series of different types of term that can include any operators of a given precedence or a value of the next type of term, with atomic values at the innermost level. For example, you might have:

  expr ::= expr ('+'|'-') expr                -- 1
  expr ::= term1                              -- 2
  term1 ::= term1 ('*'|'/') term1             -- 3
  term1 ::= term2                             -- 4
  term2 ::= variable | literal | '(' expr ')' -- 5

An example expression might be parsed like this:

  5 + 17 * 2
  [term2:5] + 17 * 2                 -- apply production 5
  [term1:5] + 17 * 2                 -- apply production 4
  [expr:5] + 17 * 2                  -- apply production 2
  [expr:5 +] 17 * 2                  -- partially apply production 1
  [expr:5 +] [term2:17] * 2          -- apply production 5
  [expr:5 +] [term1:17] * 2          -- apply production 4
  [expr:5 +] [term1:17 *] 2          -- partially apply production 3
  [expr:5 +] [term1:17 *] [term2:2]  -- apply production 5
  [expr:5 +] [term1:17 *] [term1:2]  -- apply production 4
  [expr:5 +] [term1:34]              -- complete production 3 & calculate result
  [expr:5 +] [expr:34]               -- apply production 2
  [expr:39]                          -- complete production 1

This works, and you can implement it using Parsec, but there are problems:

  • If you have many precedence levels, you need many different production types and it makes your parser confusing.
  • Depending on the algorithm used, your parser might internally not like left-recursive definitions, so you may have to adjust how the grammar is laid out according to your parser's internal structure. My understanding is that Parsec doesn't like left-recursive grammars, for example, so the nice logical grammar I give above would need to be made more complicated in order to fix that.
  • Your parser may spend a lot of time just translating terms between different precedence levels.

The Pratt parser algorithm, also known as a "top-down operator precedence parser", is an algorithm for handling expressions with infix operators by using a table associating operators with their precedences rather than having separate productions for them. I won't go into details of the algorithm here, as there are already many excellent articles available; this one is the easiest to understand in my opinion. Surprisingly, there didn't seem to be an implementation of a Pratt parser on Hackage, the Haskell repository for useful libraries. So I set about making one... and the result is here.

The interface is as a parser combinator, i.e. it takes a number of functions that parse individual parts of an expression and produces a function that uses those to parse an entire expression. The package includes an example that parses a simple language into a parse tree, then prints the tree. The example should be fairly simple to understand; here's the gist of it:

-- apply optional trailing whitespace to a parser
wsopt :: ContentStripper String () Identity ()
wsopt = optional spaces

This defines a parser that will ignore any spaces it encounters. This is used to discard any irrelevant input between the tokens of the expression; a more advanced language might also include the ability to strip comments at this point.

-- parse an integer value (with optional trailing whitespace)
parseIntValue :: ExprParser 
parseIntValue = do
    x <- many1 digit
    wsopt
    return (IntValue (read x))

Integer literals: parse one or more digits, skip any whitespace, then return the digits as an integer literal value (wrapped in a container for the parse tree).

-- parse an operator symbol
operator :: OperatorParser String () Identity String
operator = try (string "ifTrue") <|> many1 (oneOf "<>:@~\\/|!$%^&*-_=+")
           <|> string "(" <|> try (string "let")

Operators are either the string "ifTrue", a sequence of any of a set of symbols, an open bracket, or the string "let" (which is used for expressions like let id = expr1 in expr2). Note that we're taking a broad view on what constitutes an "operator" here.

There are other parsers for more involved parts of the language (like the parser for handling "let" expressions as mentioned above), but we'll skip those for now, and jump to a couple that you'll probably want something similar to for just about any language you implement with this library:

-- parse a binary operator with standard semantics
parseStdOp :: LeftDenotation String () Identity Expr String
parseStdOp (OperatorInfo name precedence _) lhs pex = do
    rhs <- pex precedence
    return (BinOp lhs name rhs)

When you register a binary operator with the library, you provide a function that is called to process it. While it can do anything you want, in most cases you'll just want to parse the right hand side of the expression and then wrap it up with the left hand side in a parse tree node, which is what the code above does. The OperatorInfo record is a copy of the information we provided when the operator was registered, which we use here to make a generic function that can handle any operator; if you have a function specifically for each operator you won't need this. lhs is the already-parsed tree for everything on the left of the operator. pex is a function generated by the Pratt parser that you can call to parse the right hand side of the operator, until an operator is found in it whose precedence is too low to bind inside our right hand side; we call it with our operator's precedence and capture the result, then return a tree node with both that subtree and the original left-hand subtree, and the type of operator that was found.

-- parse '(' <expression> ')'
parseBracketExpr :: NullDenotation String () Identity Expr
parseBracketExpr pex = between
    openBracket closeBracket
    (pex (LAssoc 0))
    where openBracket = char '('
          closeBracket = char ')'

Bracketed expressions are parsed as prefix operators (which are essentially the same thing as infix operators except they don't have a left-hand side subtree), but we override the precedence we parse to so that any valid operator is bound into the resulting subexpression. We check that this is followed by a close bracket, and return the generated expression tree. Many languages are likely to include this function verbatim.

-- parse terms
parseTerm :: NullDenotation String () Identity Expr
parseTerm pex = 
    parseIntValue <|>
    parseVarRef <|>
    parseBracketExpr pex

We need to give the parser generator a parser that can handle any individual term of the language. In this case, terms are either integers, variable references, or bracketed sub-expressions.

-- operator data
operatorList :: [OperatorInfo String () Identity Expr String]
operatorList = [
    OperatorInfo "-" (LAssoc 50) parseStdOp,
    OperatorInfo "+" (LAssoc 50) parseStdOp,
    OperatorInfo "|" (LAssoc 40) parseStdOp,
    OperatorInfo "*" (LAssoc 70) parseStdOp,
    OperatorInfo "&" (LAssoc 60) parseStdOp,
    OperatorInfo "/" (LAssoc 70) parseStdOp,
    OperatorInfo "<" (LAssoc 30) parseStdOp,
    OperatorInfo ">" (LAssoc 30) parseStdOp,
    OperatorInfo "<=" (LAssoc 30) parseStdOp,
    OperatorInfo ">=" (LAssoc 30) parseStdOp,
    OperatorInfo "||" (LAssoc 10) parseStdOp,
    OperatorInfo "&&" (LAssoc 20) parseStdOp,
    OperatorInfo "^" (RAssoc 90) parseStdOp,
    OperatorInfo "ifTrue" (RAssoc 5) parseIfOp]

prefixOperatorList :: [PrefixOperatorInfo String () Identity Expr String]
prefixOperatorList = [
    SimplePrefixOperator "-" bindPrefixOp,
    SimplePrefixOperator "!" bindPrefixOp,
    PrefixParserOperator "let" parseLetOp]

Finally, we can produce a table of all the operators understood by our language. Note that "ifTrue" is an infix operator that handles more than one right hand subexpression (it has the form expr ifTrue expr1 else expr2) and "let" is a prefix operator that likewise handles multiple subexpressions (as described above). These show more advanced ways of using the library than just standard prefix & infix code.

With this table in hand, we can generate a parser that can handle the entire language of expressions:

parser :: ExprParser
parser = buildPrattParser operatorList prefixOperatorList wsopt operator parseTerm

Using the resulting parser is as simple as using any Parsec parser.

parseToText :: ExprParser -> String -> String
parseToText parser input = case parse parser "input" input of
       Left error -> show error
       Right expr -> PP.render $ pPrint expr

main::IO()
main = do
   input <- getContents
   putStrLn (parseToText parser input)

The entire language, including parse tree structure and pretty printer, takes only 107 lines of code (excluding blank lines and comments).

No comments:

Post a Comment