Parsing in a dark and grammar-less world

June 23, 2018

Brad Nelson / @flagxor

Why?

    "I think programmers like to write interpreters. They like to do these elaborate difficult things."
    -- Chuck Moore (in Thinking Forth)

Acknowledgments

Bradford J. Rodriguez BNF Parser in Forth http://www.bradrodriguez.com/papers/bnfparse.htm).

Updated at https://github.com/letoh/fina-forth/blob/master/bnf.fs

Parsing

https://en.wikipedia.org/wiki/Parse_tree#/media/File:Parse_tree_1.jpg
under CC BY-SA 3.0

Parsing

  • Terminals
    • Tokens in your input
    • Often represented with lowercase or quoted
  • Non-Terminals
    • Abstract state used during parsing
    • Sometimes like parts of speech
    • Often represented with uppercase or in <>

Chomsky's Hierarchy

  • Type-0 : Recursively Enumerable
    Turing Machine : α → β
  • Type-1 : Context Sensitive
    Bounded Turing Machine : αAβ → αγβ
  • Type-2 : Context Free
    Pushdown Automaton : A → γ
  • Type-3 : Regular
    Finite State Automaton : A → α B or A → α

Approaches to Parsing

  • Bottom-Up
    • DFA
    • Shift Reduce
      • LR, SLR, LALR
  • Top-Down
    • Recursive Descent w/ lookahead
    • Recursive Descent w/ backtracking

Backus–Naur Form

  • John Backus proposes "metalinguistic formulas" to describe ALGOL
  • Peter Naur prefers "Backus normal form", but Knuth disagrees
  • <number> ::= <digit> | <digit> <number>

TMG (TransMoGrify)

  • Robert McClure, extended by Doug McIlroy
  • Predecessor of YACC

expr: <(>/exp1 expr operator expr <)> = { 3 1 2 };
exp1: ident = { < LOAD > 1 };
operator:
op0: <+>/op1 = { < ADD > };
op1: <->/op2 = { < SUB > };
op2: <*>/op3 = { < MPY > };
op3: </>     = { < DIV > };
          

YACC (Yet Another Compiler Compiler)

  • Developed by Stephen C. Johnson
    • Wanted to add XOR to B!
  • LALR Parser Generator
  • Lives on via GNU Bison (Robert Corbett)

exp:      NUM                { $$ = $1;         }
        | exp '+' exp        { $$ = $1 + $3;    }
        | exp '-' exp        { $$ = $1 - $3;    }
        | exp '*' exp        { $$ = $1 * $3;    }
        | exp '/' exp        { $$ = $1 / $3;    }
        | '-' exp  %prec NEG { $$ = -$2;        }
        | exp '^' exp        { $$ = pow ($1, $3); }
        | '(' exp ')'        { $$ = $2;         }
;
          

Goals

  • Provide Lex/Yacc like functionality for Forth
  • Keep it simple

Lexing

  • "Token" boundaries can often be more succinctly expressed with a regular language
  • Operating on tokens is faster
  • Avoids <WHITESPACE> showing up in grammars
  • Can be intermixed with symbol tables
  • But adds complexity to tools

Lex and Parse

  • Treat characters and EOL as tokens
  • Provide shorthand for strings
  • Swallow challenges with whitespace

Traversing Input


: @token ( -- n )
  source nip >in @ = if
    0
  else
    source >in @ /string drop c@
  then ;

: +token ( f -- ) if 1 >in +! then ;
          

Checking Tokens


variable success

: =token ( n -- )
   success @ if
     @token =  dup success ! +token
   else drop then ;
          

Defining Token Checkers

Which should I use?


char + token '+'   char - token '-'

tok + '+'   tok - '-'

:noname    t' + t' * ;

:noname   t" keyword"
          

Defining Token Checkers


: token ( n -- ) create c, does> ( a -- )  c@ =token ;

: t"   [char] " parse 0 ?do
       dup c@ postpone literal
       postpone =token 1+ loop drop ; immediate

: t'   char postpone literal
       postpone =token ; immediate

: tok   char token ;

0 token <EOL>
          

<NUMBER> ::= <DIGIT> <NUMBER> | <DIGIT>

bnf: <NUMBER>   <DIGIT> <NUMBER> | <DIGIT> ;bnf
          

Defining a production rule


: bnf: ( -- colon-sys )
    : postpone recursive
    postpone <bnf ; immediate

: ;bnf ( colon-sys -- )
    postpone bnf>
    postpone ; ; immediate  
          

State and Backtracking

  • Source position (>IN)
  • Output position / Dictionary position (HERE)
  • Use output "wisely"

Keeping state


: <bnf ( -- )
    success @ if
      r> >in @ >r here >r >r  ( stash old >in & here )
    else
      r> drop  ( bail out of caller, failure! )
    then ;

: bnf> ( -- )
    success @ if
      r> r> r> 2drop >r  ( ditch old >in & here )
    else
      r> r> dp! r> >in ! >r ( restore old >in & here )
    then ;
          

<NUMBER> ::= <DIGIT> <NUMBER> | <DIGIT>

bnf: <NUMBER>   <DIGIT> <NUMBER> | <DIGIT> ;bnf
          

Alternatives


: | ( -- )
    success @ if
      r> r> r> 2drop drop  ( bail out of caller, success! )
    else
      r> r> r> 2dup >r >r  ( pull out old state )
      >in ! dp!  ( restore old state )
      1 success ! >r  ( back in action )
    then ;
          

Elimination of Left Recursion

  • A → Aα | β
  • A → βA'
    A' → αA' | ε

Elimination of Left Recursion


E → E + T | T
T → T * F | F
F → ( E ) | id
          

Elimination of Left Recursion


E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
          

Code Actions (in Bison)


exp:      NUM                { $$ = $1;         }
        | exp '+' exp        { $$ = $1 + $3;    }
        | exp '-' exp        { $$ = $1 - $3;    }
        | exp '*' exp        { $$ = $1 * $3;    }
        | exp '/' exp        { $$ = $1 / $3;    }
        | '-' exp  %prec NEG { $$ = -$2;        }
        | exp '^' exp        { $$ = pow ($1, $3); }
        | '(' exp ')'        { $$ = $2;         }
;
          

Handling Actions

  • Execute in order within productions
  • Only execute when known good
  • Track a $$ state variable
  • Store [$$, xt]* pairs in heap

Code Actions


: pow ( x n -- n ) 1 swap 0 ?do over * loop nip ;

bnf: <FACTOR>   <PRIMARY> t' ^ <FACTOR>   {{ pow }}
              | <PRIMARY> ;bnf
          

Code Actions (w/ parse state)


bnf: {DIGIT}   t' 0 | t' 1 | t' 2 | t' 3 | t' 4 |
               t' 5 | t' 6 | t' 7 | t' 8 | t' 9 ;bnf

bnf: <DIGIT>   @token 48 - {DIGIT} $$ ! {{ 10 * $$ @ + }} ;bnf
bnf: <NUMBER'>   <DIGIT> <NUMBER'> | <DIGIT> ;bnf
bnf: <NUMBER> {{ 0 }} <NUMBER'> ;bnf
          

Code Actions


variable $$
: {{   postpone ahead
       chainer3 ! chainer2 ! chainer1 !  ( stash orig )
       postpone ;
       noname : latestxt chainer !
       ; immediate

: }}   postpone ; noname :
       chainer1 @ chainer2 @ chainer3 @  ( restore orig )
       postpone then
       postpone $$ postpone @ postpone ,  ( store state var )
       chainer @ postpone literal postpone , ( store xt )
       ; immediate
          

Demo & Code Tour

What Next?

  • Kleene star/plus
  • Multi-line
  • Nicer invocation + syntax

Source and slides at:
github.com/flagxor

Thank you