Define and Call
Our interpreter initially supported only sequential execution; later we added control structures for conditional jumps, and blocks to delimit the scope of variables. Functions, however, are much more independent in parsing, execution, and scope. To support them, the current framework for syntax analysis and virtual machine execution needs to be modified.
Transform ParseProto
Function definitions can be nested; that is, a function can be defined inside another function. If the entire code is regarded as the main function, then our current syntax analysis supports only this one function. To support nested function definitions, the parsing needs to be modified. Let's start with the data structure.
Currently, the context structure of the parsing process is `ParseProto`, which is also the structure returned to the virtual machine for execution. It is defined as follows:
```rust
pub struct ParseProto<R: Read> {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
    sp: usize,
    locals: Vec<String>,
    break_blocks: Vec<Vec<usize>>,
    continue_blocks: Vec<Vec<(usize, usize)>>,
    gotos: Vec<GotoLabel>,
    labels: Vec<GotoLabel>,
    lex: Lex<R>,
}
```
The specific meaning of each field has been introduced in detail before and is skipped here. What matters now is how each field relates to a single function:

- the last field, `lex`, is used throughout the parsing of the entire code;
- all remaining fields are data internal to the function being parsed.
To support nested function definitions, the global part (the `lex` field) and the per-function part (the other fields) need to be separated. Define a new data structure `PerFuncProto_` for per-function parsing (since we will not adopt this solution in the end, a `_` is appended to the name), containing the fields left after removing `lex` from the original `ParseProto`:
```rust
struct PerFuncProto_ {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
    sp: usize,
    ... // omit more fields
}
```
To support function nesting, multiple per-function parse states must exist at the same time. The most intuitive idea is to keep a list of them:
```rust
struct ParseProto<R: Read> {
    funcs: Vec<PerFuncProto_>, // list of the per-function PerFuncProto_ just defined
    lex: Lex<R>, // global data
}
```
Each time a new nested function is encountered, a new member is pushed onto the `funcs` field and popped after the function is parsed; the last member of `funcs` always represents the current function. This definition is intuitive, but it has a problem: accessing any field of the current function becomes cumbersome. For example, reading the `constants` field requires `self.funcs.last().unwrap().constants`, and writing it requires `self.funcs.last_mut().unwrap().constants`. This is inconvenient, and execution efficiency suffers as well.
In C, this problem would be easy to solve: add a pointer member of type `PerFuncProto_` to `ParseProto`, say `current`, pointing to the last member of `funcs` and updated whenever a function body is pushed or popped. The current function could then be accessed directly through this pointer, as in `self.current.constants`. This approach is very convenient, but Rust considers it unsafe, because the validity of the pointer cannot be guaranteed at the language level. Although there would be only two places updating the pointer, which makes it relatively safe, using Rust means following Rust's rules.
A feasible solution in Rust is to store an index instead of a pointer, say `icurrent`, pointing to the last member of `funcs` and likewise updated whenever a function body is pushed or popped. The current function's data would then be accessed as `self.funcs[icurrent].constants`. While Rust allows this, it is really just a variant of the pointer scheme above, and incorrect updates of the index can still cause bugs: if the index exceeds the length of `funcs`, the program panics; if it is smaller than expected, we get logic bugs that are harder to debug. In addition, Rust's index operation compares the index against the list length at runtime, which also slightly hurts performance.
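For illustration, the manual bookkeeping this variant requires might look like the following sketch (the names `begin_func`/`end_func` and `PerFuncProto_::new()` are hypothetical):

```rust
// Hypothetical sketch of the index-based variant; every push or pop
// must remember to keep `icurrent` in sync by hand.
fn begin_func(&mut self) {
    self.funcs.push(PerFuncProto_::new());
    self.icurrent = self.funcs.len() - 1;
}
fn end_func(&mut self) -> PerFuncProto_ {
    let func = self.funcs.pop().unwrap();
    self.icurrent = self.funcs.len().saturating_sub(1);
    func
}
```

Forgetting either update silently corrupts all subsequent accesses, which is exactly the class of bug we want to rule out.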
There is also a less intuitive solution that avoids all the problems above: recursion. The most natural way to parse nested functions is to call the parsing function recursively. Each call then gets its own independent frame on Rust's call stack, so we can create a fresh function parse state on each call, use it to parse the current Lua function, and return it to the outer caller when the call ends. With this scheme, the parsing code can only see the current function's data, not the outer function's, so the access inconvenience just described simply disappears. For example, accessing constants is still `self.constants`, with no changes to existing code. The only thing left to handle is the global data `Lex`, which can be passed down as a parameter of the parsing function.
With this solution, there is no need to define a new data structure; just change the `lex` field in the original `ParseProto` from the `Lex` type to `&mut Lex`. The syntax analysis function for parsing a Lua function was originally a method of `ParseProto`, defined as:
```rust
impl<R: Read> ParseProto<R> {
    fn chunk(&mut self) {
        ...
    }
}
```
Now it is changed to a normal function, defined as:
```rust
fn chunk(lex: &mut Lex<impl Read>) -> ParseProto {
    ...
}
```
The parameter `lex` is the global data and is passed straight down to the next layer on each recursive call. The return value is the parse result of the current Lua function, created inside `chunk()`.
In addition, the `chunk()` function internally calls the `block()` function to parse the code, and the latter returns the Token that ends the block. Previously, `chunk()` was only used to process the entire code, so the end Token could only be `Token::Eos`; now it may also be used to parse nested functions, where the expected end Token is `Token::End`. Therefore, `chunk()` needs a new parameter indicating the expected end Token. The definition is changed to:
```rust
fn chunk(lex: &mut Lex<impl Read>, end_token: Token) -> ParseProto {
    ...
}
```
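Under this signature, the two call sites implied by the text would look roughly like this (a sketch; the variable names are illustrative):

```rust
// parsing the whole input: the "main function" ends at end of stream
let main_proto = chunk(&mut lex, Token::Eos);

// parsing a nested function body, from inside the parser:
let func_proto = chunk(self.lex, Token::End);
```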
Add FuncProto
We just changed the type of the `lex` field in `ParseProto`. Now let's make a small optimization along the way. The first two fields of `ParseProto`, marked `pub`, are returned to the virtual machine for execution; most of the remaining fields are internal data used only during syntax analysis and do not need to be returned. The two parts can be split apart so that only the part needed by the virtual machine is returned. To do this, add a `FuncProto` data structure:
```rust
// information returned to the virtual machine for execution
pub struct FuncProto {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
}

#[derive(Debug)]
struct ParseProto<'a, R: Read> {
    // information returned to the virtual machine for execution
    fp: FuncProto,

    // syntax analysis internal data
    sp: usize,
    locals: Vec<String>,
    break_blocks: Vec<Vec<usize>>,
    continue_blocks: Vec<Vec<(usize, usize)>>,
    gotos: Vec<GotoLabel>,
    labels: Vec<GotoLabel>,

    // global data
    lex: &'a mut Lex<R>,
}
```
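A constructor matching this layout is implied by the `chunk()` code below; a plausible sketch:

```rust
impl<'a, R: Read> ParseProto<'a, R> {
    // create a fresh parse state for one Lua function,
    // borrowing the shared Lex from the caller
    fn new(lex: &'a mut Lex<R>) -> Self {
        ParseProto {
            fp: FuncProto {
                constants: Vec::new(),
                byte_codes: Vec::new(),
            },
            sp: 0,
            locals: Vec::new(),
            break_blocks: Vec::new(),
            continue_blocks: Vec::new(),
            gotos: Vec::new(),
            labels: Vec::new(),
            lex,
        }
    }
}
```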
Accordingly, the return value of the `chunk()` function changes from `ParseProto` to `FuncProto`. Its full definition is as follows:
```rust
fn chunk(lex: &mut Lex<impl Read>, end_token: Token) -> FuncProto {
    // generate a new ParseProto to parse the current new Lua function
    let mut proto = ParseProto::new(lex);

    // call the block() parsing function
    assert_eq!(proto.block(), end_token);
    if let Some(goto) = proto.gotos.first() {
        panic!("goto {} no destination", &goto.name);
    }

    // return only the FuncProto part
    proto.fp
}
```
In this way, when syntax analysis encounters a nested Lua function, it just calls `chunk(self.lex, Token::End)` recursively. The specific syntax analysis is introduced below.
Syntax Analysis
The general flow of parsing Lua functions was introduced above; now let's look at the specific syntax analysis. By now we should be familiar with syntax analysis, and it can proceed according to the BNF. A function definition can appear in 3 places in Lua:

- a global function;
- a local function;
- an anonymous function, which is one case of the expression `exp`.
The BNF rules are as follows:
```
stat ::=
    `function` funcname funcbody |       # 1. global function
    `local` `function` Name funcbody |   # 2. local function
    # omit other cases

exp ::= functiondef | ...                # omit other cases
functiondef ::= `function` funcbody      # 3. anonymous function

funcbody ::= '(' [parlist] ')' block end # function body
```
These rules show that the three forms differ only at the beginning; they all end with `funcbody`. Here only the simplest of them, the second case of local functions, is introduced.
```rust
fn local_function(&mut self) {
    self.lex.next(); // skip keyword `function`
    let name = self.read_name(); // function name, also the local variable name
    println!("== function: {name}");

    // parameters are not supported yet; skip `()`
    self.lex.expect(Token::ParL);
    self.lex.expect(Token::ParR);

    // call the chunk() parsing function
    let proto = chunk(self.lex, Token::End);

    // put the parsed FuncProto into the constant table
    let i = self.add_const(Value::LuaFunction(Rc::new(proto)));

    // load the function with the LoadConst bytecode
    self.fp.byte_codes.push(ByteCode::LoadConst(self.sp as u8, i as u16));

    // create the local variable
    self.locals.push(name);
}
```
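For comparison, the three definition forms look like this in Lua source; only the second is handled by the code above so far:

```lua
function g() end          -- 1. global function
local function f() end    -- 2. local function (this section)
local h = function() end  -- 3. anonymous function in an expression
```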
The parsing process is simple. What deserves attention is how the function prototype `FuncProto` returned by `chunk()` is handled: it is placed into the constant table as a constant. By analogy, a string is a constant made of a sequence of characters, while a function prototype is a constant made of a constant table plus a sequence of bytecodes. It likewise lives in the constant table, and it is likewise loaded by the `LoadConst` bytecode.
To this end, a new Value variant `LuaFunction` is added to represent Lua functions, and the variant that originally represented Rust functions is renamed from `Function` to `RustFunction`:
```rust
pub enum Value {
    LongStr(Rc<Vec<u8>>),
    LuaFunction(Rc<FuncProto>),
    RustFunction(fn (&mut ExeState) -> i32),
```
The data type associated with `LuaFunction` is `Rc<FuncProto>`; here, too, it resembles a string constant.
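On the virtual machine side, nothing special is needed to load such a constant. Assuming the `LoadConst` handler from earlier chapters, it simply clones the `Value` out of the constant table, and cloning an `Rc<FuncProto>` only bumps a reference count:

```rust
// sketch of the existing LoadConst handler; it already works for
// LuaFunction values, since Value::clone() is cheap for Rc variants
ByteCode::LoadConst(dst, c) => {
    let v = proto.constants[c as usize].clone();
    self.set_stack(dst, v);
}
```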
The above completes the syntax analysis of "defining a function". "Calling a function" is also function-related, but at a call site Lua functions and Rust functions are treated identically; a Lua programmer does not even know how a called function is implemented. Since calls to the Rust function `print()` were already covered by syntax analysis earlier, no extra parsing work is needed for Lua function calls.
Virtual Machine Execution
Like syntax analysis, the previous virtual machine execution supported only one layer of Lua function. To support function calls, the easiest way is to call the virtual machine, namely the `execute()` function, recursively. The code is as follows:
```rust
ByteCode::Call(func, _) => {
    self.func_index = func as usize;
    match &self.stack[self.func_index] {
        Value::RustFunction(f) => { // Rust functions, supported before
            f(self);
        }
        Value::LuaFunction(f) => { // Lua functions, new
            let f = f.clone();
            self.execute(&f); // recursively call the virtual machine!
        }
        f => panic!("invalid function: {f:?}"),
    }
}
```
However, the stack needs special handling. During parsing, the stack pointer (the `sp` field in the `ParseProto` structure) starts from 0 for each newly parsed function, because at parse time the absolute start address of the function's stack frame is unknown. At execution time, therefore, every stack index in the bytecode must be offset by the start address of the current function's frame. For example, take the following Lua code:
```lua
local a, b = 1, 2
local function foo()
    local x, y = 1, 2
end
foo()
```
When the definition of `foo()` is parsed, the stack addresses of the local variables `x` and `y` are 0 and 1, respectively. When the last line is executed and `foo()` is called, the function `foo` sits at absolute stack index 2, so the absolute indexes of `x` and `y` are 3 and 4. The virtual machine must therefore convert the relative addresses 0 and 1 into the absolute addresses 3 and 4.
```
absolute         relative
address          address

        +-----+ <---base of main function
     0  |  a  |  0
        +-----+
     1  |  b  |  1
        +-----+
     2  | foo |  2
        +-----+ <---base of foo()
     3  |  x  |  0
        +-----+
     4  |  y  |  1
        +-----+
        |     |
```
Earlier, when executing the Rust function `print()`, the `func_index` member of `ExeState` was set to the function's address on the stack so that `print()` could read its arguments. Calling a Lua function works the same way, except that `func_index` is renamed to `base` here and points to the address right after the function.
```rust
ByteCode::Call(func, _) => {
    self.base += func as usize + 1; // set the function's absolute stack address
    match &self.stack[self.base-1] {
        Value::RustFunction(f) => {
            f(self);
        }
        Value::LuaFunction(f) => {
            let f = f.clone();
            self.execute(&f);
        }
        f => panic!("invalid function: {f:?}"),
    }
    self.base -= func as usize + 1; // restore
}
```
All previous writes to the stack went through the `set_stack()` method, which now has to add the `self.base` offset:
```rust
fn set_stack(&mut self, dst: u8, v: Value) {
    set_vec(&mut self.stack, self.base + dst as usize, v); // plus self.base
}
```
All previous reads of the stack were direct `self.stack[i]` accesses; these are now extracted into a new function, `get_stack()`, which also adds the `self.base` offset:
```rust
fn get_stack(&self, dst: u8) -> &Value {
    &self.stack[self.base + dst as usize] // plus self.base
}
```
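With both accessors in place, any bytecode handler that touches the stack becomes base-relative automatically. For instance, a stack-to-stack copy like the `Move` bytecode from earlier chapters would read and write through them:

```rust
// sketch: both src and dst are interpreted relative to self.base
ByteCode::Move(dst, src) => {
    let v = self.get_stack(src).clone();
    self.set_stack(dst, v);
}
```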
So far, we have completed the most basic definition and calling of Lua functions. Thanks to the power of recursion, the required code changes were small. But this is just the beginning of full function support: the next section adds parameters and return values.