Table Read/Write and BNF

After introducing ExpDesc and modifying the existing syntax analysis in the previous two sections, this section implements table reading and writing.

The index of the table in Lua supports two methods, examples are as follows: t["k"] and t.k, where the latter is a special form of the former. All table read and write operations need to use the table index. Need to add the type of table index in ExpDesc.

The read and write operations of the table itself are not complicated, but it will make other statements suddenly become complicated:

The read operation of the table may have multiple consecutive levels, such as t.x.y, so when parsing the expression, the end cannot be judged immediately, but the next Token needs to be peeked to judge.
The write operation of the table, that is, the assignment statement. The current assignment statement only supports the assignment of "variables", that is, the lvalue only supports one Token::Name. To add support for table indexes, the handling of lvalues needs to be reimplemented. It is not enough to parse only one Token, but to parse an lvalue. So how is it considered a complete lvalue? For example, not all expressions can be used as lvalues, such as function calls or table constructions.
Previously, the assignment statement and function call statement were distinguished based on the second Token. If it is an equal sign =, it is an assignment statement. Now to support the write operation of the table, such as t.k = 123, then the second Token is a dot . instead of the equal sign =, but it is still an assignment statement. The previous judgment method is invalid. So is there any new way to distinguish between assignment statements and function call statements?

The first read operation problem is easy to solve. The next two questions related to write operations are very difficult. We cannot answer them accurately now, but can only guess the answers. This leads to a bigger problem, that is, the previous syntax analysis is based on guesswork! For example, the format of the definition statement of local variables, etc., are guessed based on the experience of using the Lua language, and cannot guarantee its accuracy and completeness. But it was relatively simple before, so you can make a guess. In addition, in order not to interrupt the rhythm of the entire project, I did not delve into this issue. Now to introduce the reading and writing of the table, the statement becomes complicated, and it is impossible to continue to mix it up by guessing. It is necessary to introduce a formal grammatical description.

BNF

The last chapter of the Lua manual is called: The Complete Syntax of Lua, the content is mainly a set of BNF descriptions. We don't need to know the meaning of the term "BNF", we just need to know that this is a formal grammar description method, where the Lua grammar can be described completely and accurately. The grammatical rules of BNF itself are also very simple, and most of them are clear at a glance. Here are only two:

{A} represents 0 or more A
[A] represents optional 1 A

The code segment of Lua is called chunk, so the definition of chunk is used as the entry, and several descriptions are listed:

chunk ::= block

block ::= {stat} [retstat]

stat ::=  ‘;’ | 
     varlist ‘=’ explist | 
     functioncall | 
     label | 
     break | 
     goto Name | 
     do block end | 
     while exp do block end | 
     repeat block until exp | 
     if exp then block {elseif exp then block} [else block] end | 
     for Name ‘=’ exp ‘,’ exp [‘,’ exp] do block end | 
     for namelist in explist do block end | 
     function funcname funcbody | 
     local function Name funcbody | 
     local attnamelist [‘=’ explist]

It can be obtained from these rules: a chunk contains a block. A block contains zero or more stats and an optional retstat. A stat has many types of statements. Among them, we have implemented the two statements functioncall and local, and then implemented the remaining types one by one to complete the entire grammar of Lua (although it is still far from the complete Lua language).

I don't quite understand what is the difference between chunk and block here? Why list two separately?

That is to say, we will implement the interpreter according to this set of specifications in the future, and we no longer need to rely on guesswork! Pick a few and compare them with our previous ones, such as local variable definition statements, and you can find that it should support multiple variables and multiple initializations expression, even without an initialization expression. This shows that our previous statement analysis is very imperfect. Later in this section, we will improve the sentences we already support based on BNF. Now find out the rules related to the table index:

var ::= Name | prefixexp '[' exp ']' | prefixexp '.' Name

exp ::= nil | false | true | Numeral | LiteralString | '...' | functiondef |
prefixexp | tableconstructor | exp binop exp | unop exp

prefixexp ::= var | functioncall | '(' exp ')'

functioncall ::= prefixexp args | prefixexp ':' Name args

At first glance it looks a bit complicated. Take var as an example for analysis. Here var deduces three cases, the first Name is a simple variable, and the latter two are table indexes, which are grammar sugar for general methods and string indexes. It involves prefixexp and exp. Among them, exp is very similar to the exp() function we currently implement, but we still lack some situations, which also need to be added later. In addition, Name is directly in the exp() function, and now it has to be moved to var.

Eliminate Left Recursion

There is a big problem here, the above 3 rules are recursively referenced. for example:

var refers to prefixexp which refers to var;
exp refers to prefixexp which refers to exp.

But these two examples are fundamentally different.

For the first example, after bringing in var and expanding it, it is

prefixexp ::= Name | prefixexp '[' exp ']' | prefixexp '.' Name | prefixexp args | prefixexp ':' Name args | '(' exp ')'

The problem is that the 2nd and 3rd items of the derivation rule start with prefixexp both. Then during syntax analysis, for example, if you read a Name, you can match item 1, or items 2 and 3, so it is impossible to judge which rule should be selected. This was a headache. I spent two days on this problem, and tried various solutions but couldn't solve it. Later, I searched the Internet and found the concept of "eliminating left recursion", and I vaguely recalled that this was a compulsory topic in the course of compiling principles. And there is a standard method for elimination: For rules that contain left recursion, they can be expressed as follows:

A := Aα | β

Then it can be rewritten as follows:

A := βA'
A’ := αA’ | ε

where ε is not matched. This eliminates left recursion. Take the above prefixexp as an example, first apply the above standard form, you can get:

α = '[' exp ']' | '.' Name | args | ':' Name args
β = Name | '(' exp ')'

Then bring in the above rewritten formula to get:

prefixexp := ( Name | '(' exp ')' ) A'
A' := ( '[' exp ']' | '.' Name | args | ':' Name args ) A' | ε

This way we get rules without left recursion.

And the second example at the beginning of this section, about exp, although there are recursive references, but it is not "left" recursion, so there is no such problem.

Read Table and `prefixexp`

The advantage of using BNF rules is that you don't need to think about Lua's grammar, just follow the rules to implement.

After obtaining the above BNF rules, the analysis of prefixexp can be completed:

    fn prefixexp(&mut self, ahead: Token) -> ExpDesc {
        let sp0 = self.sp;

        // beta
        let mut desc = match ahead {
            Token::Name(name) => self.simple_name(name),
            Token::ParL => { // `(` exp `)`
                let desc = self.exp();
                self.lex.expect(Token::ParR);
                desc
            }
            t => panic!("invalid prefixexp {t:?}"),
        };

        // A' = alpha A'
        loop {
            match self.lex.peek() {
                Token::SqurL => { // `[` exp `]`
                    self.lex.next();
                    let itable = self.discharge_if_need(sp0, desc);
                    desc = match self.exp() {
                        ExpDesc::String(s) => ExpDesc::IndexField(itable, self.add_const(s)),
                        ExpDesc::Integer(i) if u8::try_from(i).is_ok() => ExpDesc::IndexInt(itable, u8::try_from(i).unwrap()),
                        key => ExpDesc::Index(itable, self.discharge_top(key)),
                    };

                    self.lex.expect(Token::SqurR);
                }
                Token::Dot => { // .Name
                    self.lex.next();
                    let name = self.read_name();
                    let itable = self.discharge_if_need(sp0, desc);
                    desc = ExpDesc::IndexField(itable, self.add_const(name));
                }
                Token::Colon => todo!("args"), // :Name args
                Token::ParL | Token::CurlyL | Token::String(_) => { // args
                    self.discharge(sp0, desc);
                    desc = self.args();
                }
                _ => { // Epsilon
                    return desc;
                }
            }
        }
    }

The first paragraph of code corresponds to β mentioned above, namely Name | '(' exp ')'.

The loop in the second paragraph corresponds to the above A' := αA' | ε, if it matches the α part, it is '[' exp ']' | '.' Name | args | ':' Name args, then the loop continues after parsing; if there is no match, it corresponds to ε, and the loop exits. Here this loop supports many continuous operations, such as t.f(), which is a table index followed by a function call. Or more sequential operations like t.t.t.k and f()()(). If you follow the native method in the previous chapters and make a function as soon as you think of it, it will be difficult to support this kind of continuous operation, it is difficult to realize and it is difficult to think of it. But according to BNF, it can be realized correctly and completely.

Corresponding to the three types of bytecodes in the construction of the table, that is, the key is a variable on the stack, a string constant and a small integer. There are also three types of ExpDesc here, namely Index, IndexField and IndexInt . When discharging, add 3 corresponding bytecodes, GetTable, GetField and GetInt. This naturally solves the first problem at the beginning of this section, that is, the reading operation of the table is realized, and it is implemented correctly and completely!

Another feature of encoding according to the BNF rule is that you can only understand the processing logic inside each matching branch, but not the overall relationship between each branch. This is like solving a physics application problem. First, analyze the physical principles and list the equations, each of which has a corresponding physical meaning; but when solving the equations, the specific solution steps have been completely separated from the physical correspondence, which is a math tools.

The prefixexp() function is listed above, and the implementation of the exp() function is similar, which is omitted here.

Write Table and Assignment Statement

After implementing prefixexp and exp according to BNF, the problem about table write operation at the beginning of this section can be solved. The problem can be solved by reimplementing the assignment statement according to BNF. What we want to achieve this time is "complete assignment statement", and finally there is no need to emphasize "variable assignment statement".

Although the assignment statement looks similar to the local variable definition statement, it is actually completely different and much more complicated. The assignment statement in BNF is defined as follows:

varlist '=' explist
varlist ::= var {‘,’ var}
var ::= Name | prefixexp '[' exp ']' | prefixexp '.' Name

The left side of the assignment operator = is the var list. var expands to 3 kinds. The first Name is a variable, currently supports local variables and global variables, and will support upvalue after the introduction of closures. The latter two are table indexes. It can be seen from this that only these types of assignment are supported, while other types such as function calls do not support assignment. Look at the right side of =, which is a list of expressions, which can be parsed directly using the completed exp() function.

After reading the BNF grammatical rules of the assignment statement, there are three semantic rules.

First, compare the number of variables on the left of = and the number of expressions on the right side:

If equal, assign values one by one;
If the number of variables is less than the number of expressions, the variable list and the corresponding expression list are assigned one by one, and the extra expressions are ignored;
If the number of variables is greater than the number of expressions, the expression list and the corresponding variable list are assigned one by one, and the extra variables are assigned to nil.

Second, if the last expression on the right side of = has multiple values (such as function calls and variable parameters), it will be expanded as much as possible. However, we don't support these two types yet, so ignore this case for now.

Finally, all expressions to the right of = are evaluated before assignment. Instead of evaluating and assigning values simultaneously. For example, in the following Lua statement, the two expressions b and a on the right should be evaluated first to obtain 2 and 1, and then assigned to a and b on the left respectively. This exchanges the two variables. But if we assign a value while evaluating, we first evaluate b on the right, get 2, and assign it to a. Then evaluate a on the right to get the 2 that was just assigned, and then assign it to b. The end result is that both variables will be 2.

local a, b = 1, 2
a, b = b, a --swap 2 variables!!!

The diagram below depicts the execution process of Error:

            +-------+
    /--(1)--|   a   |<------\
    |       +-------+       |
    \------>|   b   |--(2)--/
            +-------+
            |       |

Since all values must be evaluated first, the obtained value must be stored in one place first, which is naturally the top of the stack as a temporary variable. The diagram below describes the correct execution process:

             +-------+
    /---(1)--|   a   |<-------\
    |        +-------+        |
    |  /-(2)-|   b   |<----\  |
    |  |     +-------+     |  |
    \------->|  tmp1 |-(3)-/  |
       |     +-------+        |
       \---->|  tmp2 |--(4)---/
             +-------+
             |       |

In the figure, (1) and (2) are to evaluate the expression and put it in the temporary position on the top of the stack; (3) and (4) are to assign the value of the temporary position on the top of the stack to the variable.

The functionality of this approach is correct, but the performance is relatively poor. Because each assignment requires 2 operations, first evaluating and then assigning, it requires 2 bytecodes. But in most cases, only one operation is required. For example, assigning a local variable to another local variable requires only one Move bytecode. In particular, the most common assignment statement in a program is the assignment of a single variable. The order of a single variable does not matter, and there is no need to evaluate a temporary variable first. Therefore, the above method of first evaluating to the top of the stack and then assigning a value is for the correctness of a few cases, while sacrificing the performance of most cases. This situation is relatively common in programming. The general solution is to add a quick path to most cases. For example, the following logic can be used in our current situation:

if single variable then
     var = exp // direct assignment, quick path
else // multiple variables
     tmp_vars = exp_list // evaluate all to temporary variables first
     var_list = tmp_vars // assign values uniformly

There is a more elegant solution to this specific problem, though. The key here is that in the case of multiple assignments, the assignment of the last variable does not depend on the assignment of other variables, and it can be assigned directly without first evaluating to a temporary variable. So the new solution is: special treatment (direct assignment) is made to the last variable, and other variables are still evaluated first and then assigned. In this way, for the assignment statement of a single variable (the single variable is naturally the last variable), it degenerates into a direct assignment. In this way, the correctness of multiple variables is guaranteed, and the performance of most cases (single variable) is also guaranteed. Pretty!

The following figure describes this scheme: for the previous variable a, first evaluate to the temporary variable on the top of the stack, and assign the last variable b directly, and then assign the temporary variable on the top of the stack to the corresponding variable in turn.

             +-------+
    /---(1)--|   a   |<------\
    |        +-------+       |
    |        |   b   |--(2)--/
    |        +-------+ <-------\
    \------->|  tmp1 |--(3)----/
             +-------+
             |       |

Since we execute the last expression first, thenthe previous expressions are also assigned in reverse order. In this way, all expressions are assigned in reverse order.

So far, the syntax and semantic rules of assignment statements have been introduced. The next step is to rewrite the assignment() function. The logic of the function body is as follows:

Call prefixexp() to read the lvalue list and save it as ExpDesc;
Call exp() to read the rvalue expression list, the last expression retains ExpDesc, and the remaining expressions are discharged to the top of the stack;
Align the number of lvalues and rvalues;
Assignment, first assign the last expression to the last lvalue, and then assign the temporary variable on the top of the stack to the corresponding lvalue in turn.

The specific code is omitted here. Only step 4, assignment, will be described in detail below.

Execute the Assignment

Assignment statements consist of lvalues and rvalues:

Each lvalue is read by the prefixexp() function, returning an ExpDesc type. However, it can be seen from BNF that the assignment statement only supports variables and table indexes. The variables include local variables and global variables, corresponding to the two ExpDesc types Local and Global respectively, and the table indexes include Index, IndexField and IndexInt. So there are up to a total of five types.
Each rvalue is read by the exp() function, which also returns an ExpDesc type, and supports arbitrary ExpDesc types.

To sum up, there are 5 types on the left and N types on the right (N is the number of all types of ExpDesc), and there are a total of 5*N combinations. A bit much, need to sort out.

First of all, for the case where the lvalue is a local variable, the assignment is equivalent to discharging the expression to the stack location of the local variable. Just call the discharge() function. This function already handles all N types of ExpDesc.

The remaining four lvalue types are a bit more complicated, but these four cases are similar. The following uses the global variable Global type as an example to introduce.

Several combinations of assignments were introduced in the previous Assignment section. For the case where the lvalue is a global variable, the rvalue supports three types of expressions: constant, local variable, and global variable. At that time, for the sake of simplicity, the three expressions SetGlobalConst, SetGlobal, and SetGlobalGlobal were directly generated. Now it can be foreseen that there will be more types of expressions in the future, such as the reading of tables added in this section (such as t.k), and subsequent additions such as UpValue and operations (such as a+b). If a new bytecode is added for each new type, it will become very complicated.

Moreover, expressions such as table indexing and operations require 2 parameters to represent, and the assignment bytecode of this series of global variables cannot be filled with 2 parameters to represent the source expression of the assignment (one bytecode supports up to 3 u8 Type parameter, this series of bytecodes needs 1 parameter to represent the destination address, and it seems that 2 parameters can be used to represent the source expression. But through the output of luac, you can see Lua's official global variable assignment bytecode SETTABUP has 3 parameters. In addition to the 2 parameters representing the source and destination addresses, there is an additional parameter. Although it is not clear what the function of the extra parameter is, let’s assume that we will use it later That parameter, so our series of bytecodes leaves only one parameter position for the source expression). So how to deal with such complex expressions? The answer is to first evaluate these complex expressions to the top of the stack as temporary variables, which are of Local type, and then use SetGlobal to complete the assignment.

Here are two extremes:

The previous practice was to define a bytecode for each source expression type;
The solution just discussed is to discharge all types on the stack first, and then only use one SetGlobal bytecode.

Between these two extremes, we refer to the choice of Lua's official implementation, which is to define a bytecode for the constant type (ExpDesc's String, Float, etc.), while other types are first discharged to the stack and converted to Local type. Although the constant type is actually not a specific type (including multiple types such as String, Float), but the processing method is the same, through the add_const() function to add to the constant table, and use the constant table Index to represent, so when dealing with assignment statements, it can be seen as a type. Thus, our rvalue expressions are simplified to two types: constants and Local variables on the stack! In the official implementation of Lua, the SETTABUP bytecode of global variable assignment uses 1 bit to indicate whether the source expression is a constant or a variable on the stack. The generation of our bytecode is inconvenient to precisely manipulate bits, so a new bytecode SetGlobalConst is added to represent constants.

Why does the official Lua implementation treat constants specially, but not optimize other types (such as global variables, UpValue, table indexes, etc.)? There are two reasons for my personal guess:

If a global variable or UpValue or table index is accessed so frequently that it is necessary to optimize, then you can simply create a local variable to optimize, such as local print = print. For constants, it is inappropriate to assign values to local variables in many cases. For example, changing an assignment statement g = 100 to local h = 100; g = a seems awkward and unnecessary.
Accessing global variables is based on the variable name table lookup, which is a relatively time-consuming operation, and the cost of adding a bytecode is not obvious in comparison. Access to other types is similar. The access constant is directly referenced through the index, and the cost of adding a bytecode is relatively high.

So far, the assignment of global variables has been introduced, and the assignment of table indexes (that is, the write operation of the table) is similar. For the three types Index, IndexField and IndexInt, we define SetTable, SetField, SetInt, SetTableConst, SetFieldConst, SetIntConst 6 bytecodes.

Finally, the code for assignment is as follows:

    // process assignment: var = value
    fn assign_var(&mut self, var: ExpDesc, value: ExpDesc) {
        if let ExpDesc::Local(i) = var {
            // self.sp will be set to i+1 in self.discharge(), which is
            // NOT expected, but it's ok because self.sp will not be used
            // before next statement.
            self.discharge(i, value);
        } else {
            match self.discharge_const(value) {
                ConstStack::Const(i) => self.assign_from_const(var, i),
                ConstStack::Stack(i) => self.assign_from_stack(var, i),
            }
        }
    }

    fn assign_from_stack(&mut self, var: ExpDesc, value: usize) {
        let code = match var {
            ExpDesc::Local(i) => ByteCode::Move(i as u8, value as u8),
            ExpDesc::Global(name) => ByteCode::SetGlobal(name as u8, value as u8),
            ExpDesc::Index(t, key) => ByteCode::SetTable(t as u8, key as u8, value as u8),
            ExpDesc::IndexField(t, key) => ByteCode::SetField(t as u8, key as u8, value as u8),
            ExpDesc::IndexInt(t, key) => ByteCode::SetInt(t as u8, key, value as u8),
            _ => panic!("assign from stack"),
        };
        self.byte_codes.push(code);
    }

    fn assign_from_const(&mut self, var: ExpDesc, value: usize) {
        let code = match var {
            ExpDesc::Global(name) => ByteCode::SetGlobalConst(name as u8, value as u8),
            ExpDesc::Index(t, key) => ByteCode::SetTableConst(t as u8, key as u8, value as u8),
            ExpDesc::IndexField(t, key) => ByteCode::SetFieldConst(t as u8, key as u8, value as u8),
            ExpDesc::IndexInt(t, key) => ByteCode::SetIntConst(t as u8, key, value as u8),
            _ => panic!("assign from const"),
        };
        self.byte_codes.push(code);
    }

So far, according to the BNF re-assignment statement, it naturally supports the read and write operations of the table.

Assignment and Function Call statement

Now look back at the last of the three questions raised at the beginning of this section, that is, how to distinguish between assignment statements and function call statements during syntax analysis.

Let’s start with the BNF representation of the assignment statement:

varlist '=' explist
varlist ::= var {‘,’ var}
var ::= Name | prefixexp '[' exp ']' | prefixexp '.' Name

The beginning of the statement is varlist, after expansion is the variable var, and then it is Name and prefixexp. Name corresponds to Token::Name, but prefixexp still needs to be expanded. Here is its definition:

prefixexp ::= var | functioncall | '(' exp ')'
functioncall ::= prefixexp args | prefixexp ':' Name args

Among them, the first var returns to the beginning of the assignment statement just now, and the circular reference is ignored. The last one starts with (, which is also very simple. After the functioncall in the middle is expanded, it also starts with prefixexp, which is also a circular reference, but this time it cannot be ignored, because functioncall itself is also a complete statement, that is, if a A statement starts with prefixexp, which may be an assignment statement or a function call statement. How to distinguish between these two statements? As explained in the previous section, the left value of an assignment statement can only be a variable or a table index. types, and function calls cannot be used as lvalues. This is the key to the distinction!

In summary, the final parsing logic is: if it starts with Name or (, parse it according to prefixexp, and judge the parsing result:

If it is a function call, it is considered a complete functioncall statement;
Otherwise, it is considered as an assignment statement, and the result of this parsing is only the first var of the assignment statement.

To do this, add a function call type Call in ExpDesc and let the function call statement args() return. In the load() function, this part of the code is as follows:

            match self.lex.next() {
                Token::SemiColon => (),
                t@Token::Name(_) | t@Token::ParL => {
                    // functioncall and var-assignment both begin with
                    // `prefixexp` which begins with `Name` or `(`.
                    let desc = self.prefixexp(t);
                    if desc == ExpDesc::Call {
                        // prefixexp() matches the whole functioncall
                        // statement, so nothing more to do
                    } else {
                        // prefixexp() matches only the first variable, so we
                        // continue the statement
                        self.assignment(desc);
                    }
                }

Summary

In this section, the parsing of the assignment statement is re-analyzed through BNF, and finally the read and write operations of the table are realized. In addition, the statement of local variable definition also needs to be rewritten according to BNF, which is relatively simple, and the introduction is omitted here.

So far, this chapter has completed the basic operations of table definition, construction, reading and writing; and introduce the very important ExpDesc concept and BNF rules.

Build a Lua Interpreter in Rust