This is the first of a multi-post tutorial on using flex and Bison to write a programming language interpreter, in this case, BASIC. We’ll be using the 1964 BASIC manual from Dartmouth as the starting point language reference. All code is available at this GitHub repository: https://github.com/VisceralLogic/basic. The interpreter will be written in C++.
Why use flex/Bison instead of writing everything from scratch? If you’ve written a parser/grammar interpreter before (e.g., Numerical Expression Solver in Java), you know that even a simple language can get involved. Leveraging flex and Bison lets you focus on the core language implementation rather than re-inventing the wheel for parsing. Bison is a GNU utility for generating programs that parse a programmer-defined language grammar. A manual with tutorials is here: http://dinosaur.compilertools.net/bison/index.html. Flex is a utility for reading input files and converting them into lexical tokens for consumption by Bison. The same site has the manual and tutorial for flex: http://dinosaur.compilertools.net/flex/index.html
Here is a good introduction to both flex and Bison, from which we will start our program: http://aquamentus.com/flex_bison.html
To start, we need to determine which language features we need to implement. The ’64 manual describes a non-interactive language, so that is where we will start: programs are entered one line at a time, with a line number at the beginning of each line, and then the program is run. There is no interactive user input in this language.
Example input:
10 FOR X = 1 TO 100
20 PRINT X, SQR(X)
30 NEXT X
40 END
Here are the 15 commands we will need to be able to interpret to offer a complete implementation:
LET, READ, DATA, PRINT, GOTO, IF-THEN, FOR, NEXT, END, STOP, DEF, GOSUB, RETURN, DIM, REM
We will start with a program that reads just two kinds of lines: a line number followed by a PRINT command, or the RUN statement to execute the stored program. First we will create the flex and Bison input files to make sure we have the basics working.
The flex input file will need to recognize a few different types of tokens: integers for a line number, strings delimited by double quotes, the literals PRINT and RUN, and new line characters. It will ignore other white spaces, and pass anything it doesn’t recognize on to the Bison parser. Here is the initial flex input file:
%option noyywrap
%{
#include <cstdio>
#include <cstring>	// for strdup
using namespace std;
#define YY_DECL extern "C" int yylex()
#include "basic.tab.h"
%}
%%
[ \t]+          ;  // skip white space
[0-9]+          { yylval.iVal = atoi(yytext); return LINE; }
\"[^\"\n]*\"    { yylval.sVal = strdup(yytext); return STRING; }
PRINT           { return PRINT; }
RUN             { return RUN; }
\n              return ENDL;
%%
The first section of the file is for definitions, and comes before the first %%. We denote C declarations with %{ and %}, and everything between them is copied verbatim into the C source file that flex generates. The second section, coming after the first %%, provides the flex rules. Flex will follow whichever rule matches the most text; if several rules match the same amount, it picks the first of those rules.
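To see these matching rules in action, here is a tiny standalone scanner, separate from our interpreter, that you can build with flex demo.l followed by g++ lex.yy.c -o demo (the file name demo.l and the wording of the messages are just for this illustration):

%option noyywrap
%{
#include <cstdio>
%}
%%
PRINT       { printf("keyword PRINT\n"); /* exact keyword */ }
[A-Z]+      { printf("word: %s\n", yytext); /* any run of capital letters */ }
.|\n        ;
%%
int main(){
	yylex();	// scan stdin until end-of-file
	return 0;
}

Typing PRINT matches both patterns at the same length of five characters, so the first rule wins and you see keyword PRINT; typing PRINTER matches the second pattern at seven characters, the longer match, so you see word: PRINTER instead.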
The first rule, [ \t]+ ;, matches one or more spaces or tabs and silently discards them. The second rule, [0-9]+, is designed for reading line numbers: it matches one or more digits, converts the text it has read into an integer using the atoi function, and returns the LINE token (which will be defined in the Bison input file). The next rule, \"[^\"\n]*\", matches strings; it looks for an initial double quote, then reads all subsequent characters other than a newline or double quote, until it reaches the closing double quote. This text is returned as-is, as a STRING token. The rules PRINT and RUN match that text exactly, and return the corresponding token. The final rule, \n, reads a newline character and returns the ENDL token.
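To make the scanner's behavior concrete, here is the token stream it should produce for one sample line (shown as a comment; note that the STRING value keeps its surrounding quotes, since the pattern includes them):

/* Input line:
 *   10 PRINT "hello"
 *
 * Tokens returned by yylex(), in order:
 *   LINE     with yylval.iVal = 10
 *   PRINT
 *   STRING   with yylval.sVal = "\"hello\""  (quotes included)
 *   ENDL     for the trailing newline
 */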
The Bison input file is a bit more complicated. It starts similarly in the first section with C declarations, and then token definitions. The second section contains the grammar rules. The final section contains additional C code, which in our initial iteration includes the main function definition:
// A BASIC grammar interpreter
%{
#include <cstdio>
#include <iostream>
using namespace std;

extern "C" int yylex();
extern "C" int yyparse();
extern "C" FILE *yyin;
void yyerror(const char *s);
%}

// token type definition
%union {
	int iVal;
	char *sVal;
}

// constant tokens
%token PRINT
%token RUN
%token ENDL

// terminal symbols
%token <iVal> LINE
%token <sVal> STRING

%%
/* Grammar rules and actions follow */

input:
	/* empty */
	| input line
	;

line:
	ENDL
	| stmt ENDL
	;

stmt:
	LINE program	{ cout << "> Programming line " << $1 << " ^^^" << endl; }
	| RUN			{ cout << "> running..." << endl; }
	;

program:
	PRINT STRING	{ cout << ">\tPRINT " << $2 << endl; }
	;

%%

int main(int argc, char **argv){
	do {
		yyparse();
	} while( !feof(yyin) );
	return 0;
}

void yyerror(const char *s){
	cout << "Error: " << s << endl;
}
Since we will be parsing tokens of multiple types, we use the %union token definition. To start with, the only value types we need are integers (int iVal) for line numbers and strings (char *sVal). The constant tokens PRINT, RUN, and ENDL do not need values; they represent themselves. The terminal symbols LINE and STRING use the token type values defined previously. These five tokens will be exported to the file basic.tab.h, which is included by the flex input file so that both files share the same token definitions.
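For reference, the generated basic.tab.h will contain, roughly, an enumeration of the token codes and the yylval type built from our %union; the exact layout and numbering vary with the Bison version, but it looks something like this:

/* Rough sketch of what bison -d puts in basic.tab.h; details vary by version */
enum yytokentype {
	PRINT = 258,	/* token codes start above the range of single characters */
	RUN = 259,
	ENDL = 260,
	LINE = 261,
	STRING = 262
};

union YYSTYPE {
	int iVal;		/* from our %union declaration */
	char *sVal;
};

extern YYSTYPE yylval;	/* the scanner stores each token's value here */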
Next is the grammar rules section, where the heart of the Bison file lies. The first rule defined is what Bison will attempt to match. It can be defined in terms of sub-rules. For our BASIC grammar, the primary rule is:
input:
	/* empty */
	| input line
	;
This means the non-terminal symbol input matches either an empty input, or an input followed by the non-terminal symbol line. The rule is recursive, which is what allows multiple lines to be read: a single line matches the second case, because an empty input (the first case) followed by that line is a valid input. That result is itself an input symbol, so it can be followed by another line to match the second case again, and so on.
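To see the recursion at work, here is a sketch of how three lines of input are reduced, step by step, to a single input symbol:

(empty)             reduces to  input    (first alternative)
input  line 1       reduces to  input    (second alternative)
input  line 2       reduces to  input    (second alternative again)
input  line 3       reduces to  input    and so on, for any number of lines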
Our second rule defines the line non-terminal symbol as either a blank line (represented by the terminal symbol ENDL as read by our flex analyzer), or a stmt non-terminal followed by the ENDL terminal symbol. This ensures that we will be reading in one line of input at a time:
line:
	ENDL
	| stmt ENDL
	;
Things start to get a little more interesting with our third rule, for the stmt symbol. We define this as either a line number (the terminal symbol LINE) followed by a program symbol, or the terminal symbol RUN. In either case, we take an action to provide output as to what is being matched, so that we can verify what our program is doing:
stmt:
	LINE program	{ cout << "> Programming line " << $1 << " ^^^" << endl; }
	| RUN			{ cout << "> running..." << endl; }
	;
Our final rule defines the single acceptable pattern to make a program symbol: the terminal symbol PRINT followed by a STRING. If this is matched, we will provide feedback to that effect:
program:
	PRINT STRING	{ cout << ">\tPRINT " << $2 << endl; }
	;
The $N entries in the actions of the last two rules refer to the semantic value of the Nth symbol in the rule being matched. As long as the token was given a type in the %union-based declarations (as LINE and STRING were), its value can be accessed this way.
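The companion to $N is $$, the semantic value of the rule's own left-hand side; we are not using it yet, but it becomes important once rules compute values instead of just printing. A hypothetical rule (NUMBER and sum are not part of our grammar) might look like this:

/* Hypothetical: assumes %token <iVal> NUMBER and %type <iVal> sum */
sum:
	NUMBER '+' NUMBER	{ $$ = $1 + $3; }	/* $2 is the literal '+' character */
	;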
The final section of our Bison input file provides a simple main function that calls the Bison-generated function yyparse() until the input file (which defaults to stdin) reaches the end, and an error function that is called when Bison is unable to successfully match a rule:
int main(int argc, char **argv){
	do {
		yyparse();
	} while( !feof(yyin) );
	return 0;
}

void yyerror(const char *s){
	cout << "Error: " << s << endl;
}
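As written, main always reads from standard input. If you would rather feed a saved program file to the interpreter, one standard flex idiom, not used in the code above, is to open a file named on the command line and point yyin at it before parsing. A sketch of that variation:

int main(int argc, char **argv){
	if( argc > 1 ){
		// open the file named on the command line
		yyin = fopen(argv[1], "r");
		if( !yyin ){
			cout << "Could not open " << argv[1] << endl;
			return 1;
		}
	} else {
		yyin = stdin;	// fall back to interactive input
	}

	do {
		yyparse();
	} while( !feof(yyin) );
	return 0;
}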
The program is now ready to be built and run. Since the flex input file depends on basic.tab.h, we must first run Bison to generate this file. Call bison with the -d option to generate the header file:
> bison -d basic.y
This generates the two files basic.tab.c, which contains the grammar parsing code, and basic.tab.h, which defines the tokens for consumption by flex. Next run flex to generate the output file lex.yy.c:
> flex basic.l
Compile these with the g++ program:
> g++ basic.tab.c lex.yy.c -o basic
Running these commands over and over can get tedious, and it is easy to forget one, especially as we add more files. It will make your life easier to use a makefile, such as this one:
.PHONY: all clean

all: basic.tab.c lex.yy.c
	g++ basic.tab.c lex.yy.c -o basic

basic.tab.c: basic.y
	bison -d basic.y

basic.tab.h: basic.y
	bison -d basic.y

lex.yy.c: basic.tab.h basic.l
	flex basic.l

clean:
	rm basic.tab.c basic.tab.h lex.yy.c basic
Save this as Makefile in the same directory, and you can run it with the command make. Call make clean to remove all the generated files.
Run your BASIC program by calling ./basic at the command line; type Ctrl-D to send an end-of-file signal when you are finished. You should be able to enter lines such as the following, with anything else producing a syntax error message:
10 PRINT "hello" 20 PRINT "world" 00000 PRINT "!" RUN
If you enter the above lines, you should see the following output:
10 PRINT "hello" > PRINT "hello" > Programming line 10 ^^^ 20 PRINT "world" > PRINT "world" > Programming line 20 ^^^ 00000 PRINT "!" > PRINT "!" > Programming line 0 ^^^ RUN > running...
As you can see from the output, the program non-terminal symbol is being matched and evaluated prior to the stmt non-terminal symbol, resulting in the > PRINT "hello" being printed before the > Programming line 10 ^^^. You can also see that the input 00000 is being converted to the integer 0 by flex, and passed to Bison in that way.
All the files used in this example are available for download here: https://github.com/VisceralLogic/basic/tree/part-1
In Part II, we will begin expanding this program.