Rust is a system programming language. A typical task of system programming is processing formal languages. Formal languages are languages specified by well-defined logical rules and used everywhere in computer technology. They can be broadly classified into command, programming, and markup languages.
To process formal languages, the first step is to parse. Parsing means analyzing the grammatical structure of a piece of code to check whether it respects the rules of the grammar it is supposed to use, and then, if the grammar is respected, to generate a data structure that describes the structure of the parsed piece of code, in a way that such code can be further processed.
In this chapter, we will see how to process text written in a formal language, starting from the parsing step and proceeding with several possible outcomesāsimply checking the grammar, interpreting a program, and translating a program into the Rust language.
To show such features, an extremely simple programming language will be defined, and four tools (syntax checker, semantic checker, interpreter, and translator) will be built around it.
In this chapter, you will learn about the following topics:
- Defining a programming language using a formal grammar
- Classifying programming languages into three categories
- Learning two popular techniques for building parsersācompiler-compilers and parser combinators
- Using a parser combinator library for Rust named Nom
- Processing a source code to check its syntax following a context-free grammar, using the Nom library (calc_parser)
- Verifying the consistency of variable declarations and their usage in some source code, and at the same time preparing the required structure for optimal execution of the code (calc_analyzer)
- Executing the preprocessed code, in a process named interpretation (calc_interpreter)
- Translating the preprocessed code into another programming language, in a process named compilation (calc_compiler); as an example, translation to Rust code will be shown
After reading this chapter, you will be able to write the grammar for a simple formal language or understand the grammar for an existing formal language. You will also be able to write an interpreter for any programming language by following its grammar. Also, you will be able to write a translator for a formal language into another formal language, following their grammar.
Technical requirements
To read this chapter, knowledge of the preceding chapters is not required. Some knowledge of formal language theory and techniques is useful but not required, because the required knowledge will be explained in this chapter. The Nom library will be used to build such tools, and so it will be described in this chapter.
The complete source code for this chapter is in the Chapter08 folder of the GitHub repository, located at https://github.com/PacktPublishing/Creative-Projects-for-Rust-Programmers.
Project overview
In this chapter, we will build four projects of increasing complexity, listed as follows:
- The first project (calc_parser) will just be a syntax checker for the Calc language. Actually, it is just a parser, followed by a formatted debugging print of the parsing result.
- The second project (calc_analyzer) uses the parsing result of the first project to add the verification of the consistency of the variable declarations and of their usage, followed by a formatted debugging print of the analysis result.
- The third project (calc_interpreter) uses the analysis result to execute the preprocessed code, in an interactive interpreter.
- The fourth project (calc_compiler) uses the analysis result again to translate the preprocessed code into equivalent Rust code.
Introducing Calc
To make the following explanations, we will first define a toy programming language that we will name Calc (from the calculator). A toy programming language is a programming language used to demonstrate or prove something, not designed to develop real-world software. A simple program written in Calc is shown as follows:
@first
@second
> first
> second
@sum
sum := first + second
< sum
< first * second
The preceding program asks the user to type two numbers and then prints the sum and the product of those numbers on the console. Let's examine one statement at a time, as follows:
- The first two statements (@first and @second) declare two variables. Any variable in Calc represents a 64-bit floating-point number.
- The third and fourth statements (> first and > second) are input statements. Each of these prints a question mark and waits for the user to type a number and press Enter. Such a number, if valid, is stored in the specified variable. If no number or an invalid number is typed before pressing Enter, the value 0 is assigned to the variable.
- The fifth statement declares the sum variable.
- The sixth statement (sum := first + second) is a Pascal-style assignment. It computes the sum of the first and second variables and assigns the result to the sum variable.
- The seventh and eight statements perform output. The seventh statement (< sum) prints on the console the current value of the sum variable. The eighth statement (< first * second) computes the multiplication between the current values of the first and second variables, and then prints on the console the result of such multiplication.
The Calc language has two other operatorsā- (minus) and / (divide)ā to specify subtraction and division, respectively. In addition, the following code shows that the operations can be combined in expressions, and so these are valid assignment statements:
y := m * x + q
a := a + b - c / d
Operations are performed left to right, but multiplication and division have higher precedence than addition and subtraction.
In addition to variables, numeric literals are also allowed. So, you can write the following code:
a := 2.1 + 4 * 5
This statement assigns 22.1 to a, as multiplication is performed before addition. To force different precedence, parentheses are allowed, as illustrated in the following code snippet:
a := (2.1 + 4) * 5
The preceding code snippet assigns 30.5 to a.
In the preceding code snippet, there are no characters that separate a statement from the next one, in addition to the newline characters. Actually, the Calc language has no symbols used to separate statements, and also, it does not need them. So, the first program should be equivalent to this:
@first@second>first>second@sum sum:=first+second<sum<first*second
In the preceding code snippet, there is no ambiguity because the @ character marks the start of a declaration, the > character marks the start of an input operation, the < character marks the start of an output operation, and a variable in a location where the current statement does not allow a variable marks the start of an assignment.
To understand this syntax, some grammatical terms must be explained, as follows:
- The whole text is a program.
- Any program is a sequence of statements. In the first example program, there is exactly one statement for each line.
- In some statements, there can be an arithmetic formula that can be evaluated, such as a * 3 + 2. This formula is an expression.
- Any expression can contain sums or subtractions of simpler expressions. The simpler expressions that contain neither sums nor subtractions are named terms. Therefore, any expression can be a term (if it contains neither sums nor subtractions), or it can be the sum of an expression and a term, or it can be the subtraction of an expression and a term.
- Any term can contain multiplications or divisions of simpler expressions. The simpler expressions that contain neither multiplications nor divisions are named factors. Therefore, any term can be a factor (if it contains neither multiplications nor divisions), or it can be the multiplication of a term and a factor, or it can be the division of a term and a factor. There are three possible kinds of factors, l...