Define and Call
Our interpreter initially supported only sequential execution; later we added control structures for conditional jumps, and blocks to delimit the scope of variables. Functions, however, are much more independent in parsing, execution, and scope. To support them, the current framework for syntax analysis and virtual machine execution needs to be modified.
Transform ParseProto
Function definitions can be nested; that is, a function can be defined inside another function. If the entire code is regarded as the main function, then our current syntax analysis supports only this one function. To support nested function definitions, the parsing needs to be modified. Let's start with the data structure.
Currently, the context structure of the parsing process is `ParseProto`, which is also the structure returned to the virtual machine for execution. It is defined as follows:
```rust
pub struct ParseProto<R: Read> {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
    sp: usize,
    locals: Vec<String>,
    break_blocks: Vec<Vec<usize>>,
    continue_blocks: Vec<Vec<(usize, usize)>>,
    gotos: Vec<GotoLabel>,
    labels: Vec<GotoLabel>,
    lex: Lex<R>,
}
```
The specific meaning of each field has been introduced in detail before and is skipped here. What matters now is how each field relates to a single function:

- the last field, `lex`, is used throughout the parsing of the entire code;
- all remaining fields are data internal to the function being parsed.
To support nested function definitions, the global part (the `lex` field) and the per-function part (the other fields) need to be separated. Define a new data structure `PerFuncProto_` for per-function parsing (since we will not adopt this solution in the end, a `_` is appended to the name), containing the fields left after removing `lex` from the original `ParseProto`:
```rust
struct PerFuncProto_ {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
    sp: usize,
    ... // omit more fields
}
```
To support function nesting, multiple per-function parse states must exist at the same time. The most intuitive idea is to keep a list of them:
```rust
struct ParseProto<R: Read> {
    funcs: Vec<PerFuncProto_>, // list of the per-function PerFuncProto_ just defined
    lex: Lex<R>, // global data
}
```
Each time a new nested function is encountered, a new member is pushed onto the `funcs` field and popped after the function is parsed; the last member of `funcs` always represents the current function. This definition is intuitive, but it has a problem: accessing any field of the current function becomes cumbersome. For example, reading the `constants` field requires `self.funcs.last().unwrap().constants`, and writing it requires `self.funcs.last_mut().unwrap().constants`. This is inconvenient, and execution efficiency suffers as well.
In C, this problem would be easy to solve: add a pointer member of type `PerFuncProto_` to `ParseProto`, say `current`, pointing to the last member of `funcs` and updated whenever a function body is pushed or popped. The current function could then be accessed directly through this pointer, as in `self.current.constants`. This approach is very convenient, but Rust considers it unsafe, because the validity of the pointer cannot be guaranteed at the language level. Although there would be only two places updating the pointer, which makes it relatively safe, using Rust means following Rust's rules.
A feasible solution in Rust is to store an index instead of a pointer, say `icurrent`, pointing to the last member of `funcs` and likewise updated whenever a function body is pushed or popped. The current function's data would then be accessed as `self.funcs[icurrent].constants`. While Rust allows this, it is really just a variant of the pointer scheme above, and incorrect updates of the index can still cause bugs: if the index exceeds the length of `funcs`, the program panics; if it is smaller than expected, we get logic bugs that are harder to debug. In addition, Rust's index operation compares the index against the list length at runtime, which also slightly hurts performance.
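For illustration, the manual bookkeeping this variant requires might look like the following sketch (the names `begin_func`/`end_func` and `PerFuncProto_::new()` are hypothetical):

```rust
// Hypothetical sketch of the index-based variant; every push or pop
// must remember to keep `icurrent` in sync by hand.
fn begin_func(&mut self) {
    self.funcs.push(PerFuncProto_::new());
    self.icurrent = self.funcs.len() - 1;
}
fn end_func(&mut self) -> PerFuncProto_ {
    let func = self.funcs.pop().unwrap();
    self.icurrent = self.funcs.len().saturating_sub(1);
    func
}
```

Forgetting either update silently corrupts all subsequent accesses, which is exactly the class of bug we want to rule out.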
There is also a less intuitive solution that avoids all the problems above: recursion. The most natural way to parse nested functions is to call the parsing function recursively. Each call then gets its own independent frame on Rust's call stack, so we can create a fresh function parse state on each call, use it to parse the current Lua function, and return it to the outer caller when the call ends. With this scheme, the parsing code can only see the current function's data, not the outer function's, so the access inconvenience just described simply disappears. For example, accessing constants is still `self.constants`, with no changes to existing code. The only thing left to handle is the global data `Lex`, which can be passed down as a parameter of the parsing function.
With this solution, there is no need to define a new data structure; just change the `lex` field in the original `ParseProto` from the `Lex` type to `&mut Lex`. The syntax analysis function for parsing a Lua function was originally a method of `ParseProto`, defined as:
```rust
impl<R: Read> ParseProto<R> {
    fn chunk(&mut self) {
        ...
    }
}
```
Now it is changed to a normal function, defined as:
```rust
fn chunk(lex: &mut Lex<impl Read>) -> ParseProto {
    ...
}
```
The parameter `lex` is the global data and is passed straight down to the next layer on each recursive call. The return value is the parse result of the current Lua function, created inside `chunk()`.
In addition, the `chunk()` function internally calls the `block()` function to parse the code, and the latter returns the Token that ends the block. Previously, `chunk()` was only used to process the entire code, so the end Token could only be `Token::Eos`; now it may also be used to parse nested functions, where the expected end Token is `Token::End`. Therefore, `chunk()` needs a new parameter indicating the expected end Token. The definition is changed to:
```rust
fn chunk(lex: &mut Lex<impl Read>, end_token: Token) -> ParseProto {
    ...
}
```
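Under this signature, the two call sites implied by the text would look roughly like this (a sketch; the variable names are illustrative):

```rust
// parsing the whole input: the "main function" ends at end of stream
let main_proto = chunk(&mut lex, Token::Eos);

// parsing a nested function body, from inside the parser:
let func_proto = chunk(self.lex, Token::End);
```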
Add FuncProto
We just changed the type of the `lex` field in `ParseProto`. Now let's make a small optimization along the way. The first two fields of `ParseProto`, marked `pub`, are returned to the virtual machine for execution; most of the remaining fields are internal data used only during syntax analysis and do not need to be returned. The two parts can be split apart so that only the part needed by the virtual machine is returned. To do this, add a `FuncProto` data structure:
```rust
// information returned to the virtual machine for execution
pub struct FuncProto {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
}

#[derive(Debug)]
struct ParseProto<'a, R: Read> {
    // information returned to the virtual machine for execution
    fp: FuncProto,

    // syntax analysis internal data
    sp: usize,
    locals: Vec<String>,
    break_blocks: Vec<Vec<usize>>,
    continue_blocks: Vec<Vec<(usize, usize)>>,
    gotos: Vec<GotoLabel>,
    labels: Vec<GotoLabel>,

    // global data
    lex: &'a mut Lex<R>,
}
```
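A constructor matching this layout is implied by the `chunk()` code below; a plausible sketch:

```rust
impl<'a, R: Read> ParseProto<'a, R> {
    // create a fresh parse state for one Lua function,
    // borrowing the shared Lex from the caller
    fn new(lex: &'a mut Lex<R>) -> Self {
        ParseProto {
            fp: FuncProto {
                constants: Vec::new(),
                byte_codes: Vec::new(),
            },
            sp: 0,
            locals: Vec::new(),
            break_blocks: Vec::new(),
            continue_blocks: Vec::new(),
            gotos: Vec::new(),
            labels: Vec::new(),
            lex,
        }
    }
}
```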
Accordingly, the return value of the `chunk()` function changes from `ParseProto` to `FuncProto`. Its full definition is as follows:
```rust
fn chunk(lex: &mut Lex<impl Read>, end_token: Token) -> FuncProto {
    // generate a new ParseProto to parse the current new Lua function
    let mut proto = ParseProto::new(lex);

    // call the block() parsing function
    assert_eq!(proto.block(), end_token);
    if let Some(goto) = proto.gotos.first() {
        panic!("goto {} no destination", &goto.name);
    }

    // return only the FuncProto part
    proto.fp
}
```
In this way, when syntax analysis encounters a nested Lua function, it just calls `chunk(self.lex, Token::End)` recursively. The specific syntax analysis is introduced below.
Syntax Analysis
The general flow of parsing Lua functions was introduced above; now let's look at the specific syntax analysis. By now we should be familiar with syntax analysis, and it can proceed according to the BNF. A function definition can appear in 3 places in Lua:

- a global function;
- a local function;
- an anonymous function, which is one case of the expression `exp`.
The BNF rules are as follows:
```
stat ::=
    `function` funcname funcbody |       # 1. global function
    `local` `function` Name funcbody |   # 2. local function
    # omit other cases

exp ::= functiondef | ...                # omit other cases
functiondef ::= `function` funcbody      # 3. anonymous function

funcbody ::= '(' [parlist] ')' block end # function body
```
These rules show that the three forms differ only at the beginning; they all end with `funcbody`. Here only the simplest of them, the second case of local functions, is introduced.
```rust
fn local_function(&mut self) {
    self.lex.next(); // skip keyword `function`
    let name = self.read_name(); // function name, also the local variable name
    println!("== function: {name}");

    // parameters are not supported yet; skip `()`
    self.lex.expect(Token::ParL);
    self.lex.expect(Token::ParR);

    // call the chunk() parsing function
    let proto = chunk(self.lex, Token::End);

    // put the parsed FuncProto into the constant table
    let i = self.add_const(Value::LuaFunction(Rc::new(proto)));

    // load the function with the LoadConst bytecode
    self.fp.byte_codes.push(ByteCode::LoadConst(self.sp as u8, i as u16));

    // create the local variable
    self.locals.push(name);
}
```
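For comparison, the three definition forms look like this in Lua source; only the second is handled by the code above so far:

```lua
function g() end          -- 1. global function
local function f() end    -- 2. local function (this section)
local h = function() end  -- 3. anonymous function in an expression
```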
The parsing process is simple. What deserves attention is how the function prototype `FuncProto` returned by `chunk()` is handled: it is placed into the constant table as a constant. By analogy, a string is a constant made of a sequence of characters, while a function prototype is a constant made of a constant table plus a sequence of bytecodes. It likewise lives in the constant table, and it is likewise loaded by the `LoadConst` bytecode.
To this end, a new Value variant `LuaFunction` is added to represent Lua functions, and the variant that originally represented Rust functions is renamed from `Function` to `RustFunction`:
```rust
pub enum Value {
    LongStr(Rc<Vec<u8>>),
    LuaFunction(Rc<FuncProto>),
    RustFunction(fn (&mut ExeState) -> i32),
```
The data type associated with `LuaFunction` is `Rc<FuncProto>`; here, too, it resembles a string constant.
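On the virtual machine side, nothing special is needed to load such a constant. Assuming the `LoadConst` handler from earlier chapters, it simply clones the `Value` out of the constant table, and cloning an `Rc<FuncProto>` only bumps a reference count:

```rust
// sketch of the existing LoadConst handler; it already works for
// LuaFunction values, since Value::clone() is cheap for Rc variants
ByteCode::LoadConst(dst, c) => {
    let v = proto.constants[c as usize].clone();
    self.set_stack(dst, v);
}
```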
The above completes the syntax analysis of "defining a function". "Calling a function" is also function-related, but at a call site Lua functions and Rust functions are treated identically; a Lua programmer does not even know how a called function is implemented. Since calls to the Rust function `print()` were already covered by syntax analysis earlier, no extra parsing work is needed for Lua function calls.
Virtual Machine Execution
Like syntax analysis, the previous virtual machine execution supported only one layer of Lua function. To support function calls, the easiest way is to call the virtual machine, namely the `execute()` function, recursively. The code is as follows:
```rust
ByteCode::Call(func, _) => {
    self.func_index = func as usize;
    match &self.stack[self.func_index] {
        Value::RustFunction(f) => { // Rust functions, supported before
            f(self);
        }
        Value::LuaFunction(f) => { // Lua functions, new
            let f = f.clone();
            self.execute(&f); // recursively call the virtual machine!
        }
        f => panic!("invalid function: {f:?}"),
    }
}
```
However, the stack needs special handling. During parsing, the stack pointer (the `sp` field in the `ParseProto` structure) starts from 0 for each newly parsed function, because at parse time the absolute start address of the function's stack frame is unknown. At execution time, therefore, every stack index in the bytecode must be offset by the start address of the current function's frame. For example, take the following Lua code:
```lua
local a, b = 1, 2
local function foo()
    local x, y = 1, 2
end
foo()
```
When the definition of `foo()` is parsed, the stack addresses of the local variables `x` and `y` are 0 and 1, respectively. When the last line is executed and `foo()` is called, the function `foo` sits at absolute stack index 2, so the absolute indexes of `x` and `y` are 3 and 4. The virtual machine must therefore convert the relative addresses 0 and 1 into the absolute addresses 3 and 4.
```
absolute         relative
address          address

        +-----+ <---base of main function
     0  |  a  |  0
        +-----+
     1  |  b  |  1
        +-----+
     2  | foo |  2
        +-----+ <---base of foo()
     3  |  x  |  0
        +-----+
     4  |  y  |  1
        +-----+
        |     |
```
Earlier, when executing the Rust function `print()`, the `func_index` member of `ExeState` was set to the function's address on the stack so that `print()` could read its arguments. Calling a Lua function works the same way, except that `func_index` is renamed to `base` here and points to the address right after the function.
```rust
ByteCode::Call(func, _) => {
    self.base += func as usize + 1; // set the function's absolute stack address
    match &self.stack[self.base-1] {
        Value::RustFunction(f) => {
            f(self);
        }
        Value::LuaFunction(f) => {
            let f = f.clone();
            self.execute(&f);
        }
        f => panic!("invalid function: {f:?}"),
    }
    self.base -= func as usize + 1; // restore
}
```
All previous writes to the stack went through the `set_stack()` method, which now has to add the `self.base` offset:
```rust
fn set_stack(&mut self, dst: u8, v: Value) {
    set_vec(&mut self.stack, self.base + dst as usize, v); // plus self.base
}
```
All previous reads of the stack were direct `self.stack[i]` accesses; these are now extracted into a new function, `get_stack()`, which also adds the `self.base` offset:
```rust
fn get_stack(&self, dst: u8) -> &Value {
    &self.stack[self.base + dst as usize] // plus self.base
}
```
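With both accessors in place, any bytecode handler that touches the stack becomes base-relative automatically. For instance, a stack-to-stack copy like the `Move` bytecode from earlier chapters would read and write through them:

```rust
// sketch: both src and dst are interpreted relative to self.base
ByteCode::Move(dst, src) => {
    let v = self.get_stack(src).clone();
    self.set_stack(dst, v);
}
```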
So far, we have completed the most basic definition and calling of Lua functions. Thanks to the power of recursion, the required code changes were small. But this is just the beginning of full function support: the next section adds parameters and return values.