Parsing • Chapter 6

polish man

A Polish Nobleman • A typical Polish Notation user

Polish Notation


To try out mpc we're going to implement a simple grammar that resembles a mathematical subset of our Lisp. It's called Polish Notation and is a notation for arithmetic where the operator comes before the operands.

For example...

1 + 2 + 6is+ 1 2 6
6 + (2 * 9)is+ 6 (* 2 9)
(10 * 2) / (4 + 2)is/ (* 10 2) (+ 4 2)

We need to work out a grammar which describes this notation. We can begin by describing it textually and then later formalise our thoughts.

To start, we observe that in polish notation the operator always comes first in an expression, followed by either numbers or other expressions in parentheses. This means we can say "a program is an operator followed by one or more expressions," where "an expression is either a number, or, in parentheses, an operator followed by one or more expressions".

More formally...

Programthe start of input, an Operator, one or more Expression, and the end of input.
Expressioneither a Number or '(', an Operator, one or more Expression, and an ')'.
Operator'+', '-', '*', or '/'.
Numberan optional -, and one or more characters between 0 and 9

Regular Expressions


We should be able to encode most of the above rules using things we know already, but Number and Program might pose some trouble. They contain a couple of constructs we've not learnt how to express yet. We don't know how to express the start or the end of input, optional characters, or range of characters.

These can be expressed, but they require something called a Regular Expression. Regular expressions are a way of writing grammars for small sections of text such as words or numbers. Grammars written using regular expressions can't consist of multiple rules, but they do give precise and concise control over what is matched and what isn't. Here are some basic rules for writing regular expressions.

.Any character is required.
aThe character a is required.
[abcdef]Any character in the set abcdef is required.
[a-f]Any character in the range a to f is required.
a?The character a is optional.
a*Zero or more of the character a are required.
a+One or more of the character a are required.
^The start of input is required.
$The end of input is required.

These are all the regular expression rules we need for now. Whole books have been written on learning regular expressions. For the curious much more information can be found online or from these sources. We will be using them in later chapters, so some basic knowledge will be required, but you won't need to master them for now.

In an mpc grammar we write regular expressions by putting them between forward slashes /. Using the above guide our Number rule can be expressed as a regular expression using the string /-?[0-9]+/.

Installing mpc


Before we work on writing this grammar we first need to include the mpc headers, and then link to the mpc library, just as we did for editline on Linux and Mac. Starting with your code from chapter 4, you can rename the file to parsing.c and download mpc.h and mpc.c from the mpc repo. Put these in the same directory as your source file.

To include mpc put #include "mpc.h" at the top of the file. To link to mpc put mpc.c directly into the compile command. On Linux you will also have to link to the maths library by adding the flag -lm.

On Linux and Mac

cc -std=c99 -Wall parsing.c mpc.c -ledit -lm -o parsing

On Windows

cc -std=c99 -Wall parsing.c mpc.c -o parsing

Hold on, don't you mean #include <mpc.h>?

There are actually two ways to include files in C. One is using angular brackets <> as we've seen so far, and the other is with quotation marks "".

The only difference between the two is that using angular brackets searches the system locations for headers first, while quotation marks searches the current directory first. Because of this system headers such as <stdio.h> are typically put in angular brackets, while local headers such as "mpc.h" are typically put in quotation marks.

Polish Notation Grammar


Formalising the above rules further, and using some regular expressions, we can write a final grammar for the language of polish notation as follows. Read the below code and verify that it matches what we had written textually, and our ideas of polish notation.

/* Create Some Parsers */
mpc_parser_t* Number   = mpc_new("number");
mpc_parser_t* Operator = mpc_new("operator");
mpc_parser_t* Expr     = mpc_new("expr");
mpc_parser_t* Lispy    = mpc_new("lispy");

/* Define them with the following Language */
mpca_lang(MPCA_LANG_DEFAULT,
  "                                                     \
    number   : /-?[0-9]+/ ;                             \
    operator : '+' | '-' | '*' | '/' ;                  \
    expr     : <number> | '(' <operator> <expr>+ ')' ;  \
    lispy    : /^/ <operator> <expr>+ /$/ ;             \
  ",
  Number, Operator, Expr, Lispy);

We need to add this to the interactive prompt we started on in chapter 4. Put this code right at the beginning of the main function before we print the Version and Exit information. At the end of our program we also need to delete the parsers when we are done with them. Right before main returns we should place the following clean-up code.

/* Undefine and Delete our Parsers */
mpc_cleanup(4, Number, Operator, Expr, Lispy);

I'm getting an error undefined reference to `mpc_lang'

That should be mpca_lang, with an a at the end!

Parsing User Input


Our new code creates a mpc parser for our Polish Notation language, but we still need to actually use it on the user input supplied each time from the prompt. We need to edit our while loop so that rather than just echoing user input back, it actually attempts to parse the input using our parser. We can do this by replacing the call to printf with the following mpc code, that makes use of our program parser Lispy.

/* Attempt to Parse the user Input */
mpc_result_t r;
if (mpc_parse("<stdin>", input, Lispy, &r)) {
  /* On Success Print the AST */
  mpc_ast_print(r.output);
  mpc_ast_delete(r.output);
} else {
  /* Otherwise Print the Error */
  mpc_err_print(r.error);
  mpc_err_delete(r.error);
}

This code calls the mpc_parse function with our parser Lispy, and the input string input. It copies the result of the parse into r and returns 1 on success and 0 on failure. We use the address of operator & on r when we pass it to the function. This operator will be explained in more detail in later chapters.

On success an internal structure is copied into r, in the field output. We can print out this structure using mpc_ast_print and delete it using mpc_ast_delete.

Otherwise there has been an error, which is copied into r in the field error. We can print it out using mpc_err_print and delete it using mpc_err_delete.

Compile these updates, and take this program for a spin. Try out different inputs and see how the system reacts. Correct behaviour should look like the following.

Lispy Version 0.0.0.0.2
Press Ctrl+c to Exit

lispy> + 5 (* 2 2)
>
  regex
  operator|char:1:1 '+'
  expr|number|regex:1:3 '5'
  expr|>
    char:1:5 '('
    operator|char:1:6 '*'
    expr|number|regex:1:8 '2'
    expr|number|regex:1:10 '2'
    char:1:11 ')'
  regex
lispy> hello
<stdin>:1:1: error: expected whitespace, '+', '-', '*' or '/' at 'h'
lispy> / 1dog
<stdin>:1:4: error: expected one of '0123456789', whitespace, '-', one or more of one of '0123456789', '(' or end of input at 'd'
lispy>

I'm getting an error <stdin>:1:1: error: Parser Undefined!.

This error is due to the syntax for your grammar supplied to mpca_lang being incorrect. See if you can work out what part of the grammar is incorrect. You can use the reference code for this chapter to help you find this, and verify how the grammar should look.

Reference


#include "mpc.h"

#ifdef _WIN32

static char buffer[2048];

char* readline(char* prompt) {
  fputs(prompt, stdout);
  fgets(buffer, 2048, stdin);
  char* cpy = malloc(strlen(buffer)+1);
  strcpy(cpy, buffer);
  cpy[strlen(cpy)-1] = '\0';
  return cpy;
}

void add_history(char* unused) {}

#else
#include <editline/readline.h>
#include <editline/history.h>
#endif

int main(int argc, char** argv) {
  
  /* Create Some Parsers */
  mpc_parser_t* Number   = mpc_new("number");
  mpc_parser_t* Operator = mpc_new("operator");
  mpc_parser_t* Expr     = mpc_new("expr");
  mpc_parser_t* Lispy    = mpc_new("lispy");
  
  /* Define them with the following Language */
  mpca_lang(MPCA_LANG_DEFAULT,
    "                                                     \
      number   : /-?[0-9]+/ ;                             \
      operator : '+' | '-' | '*' | '/' ;                  \
      expr     : <number> | '(' <operator> <expr>+ ')' ;  \
      lispy    : /^/ <operator> <expr>+ /$/ ;             \
    ",
    Number, Operator, Expr, Lispy);
  
  puts("Lispy Version 0.0.0.0.2");
  puts("Press Ctrl+c to Exit\n");
  
  while (1) {
  
    char* input = readline("lispy> ");
    add_history(input);
    
    /* Attempt to parse the user input */
    mpc_result_t r;
    if (mpc_parse("<stdin>", input, Lispy, &r)) {
      /* On success print and delete the AST */
      mpc_ast_print(r.output);
      mpc_ast_delete(r.output);
    } else {
      /* Otherwise print and delete the Error */
      mpc_err_print(r.error);
      mpc_err_delete(r.error);
    }
    
    free(input);
  }
  
  /* Undefine and delete our parsers */
  mpc_cleanup(4, Number, Operator, Expr, Lispy);
  
  return 0;
}

Bonus Marks


  • › Write a regular expression matching strings of all a or b such as aababa or bbaa.
  • › Write a regular expression matching strings of consecutive a and b such as ababab or aba.
  • › Write a regular expression matching pit, pot and respite but not peat, spit, or part.
  • › Change the grammar to add a new operator such as %.
  • › Change the grammar to recognise operators written in textual format add, sub, mul, div.
  • › Change the grammar to recognize decimal numbers such as 0.01, 5.21, or 10.2.
  • › Change the grammar to make the operators written conventionally, between two expressions.
  • › Use the grammar from the previous chapter to parse Doge. You must add start and end of input.

Navigation

‹ Languages

• Contents •

Evaluation ›