Let's Do It

The previous chapters introduced the basics of compilation principles and defined the two most important concepts, ByteCode and Value. Now we can start coding our interpreter!

All the code in this series is managed by Cargo, which ships with Rust. The project currently uses the binary crate type; it will be changed to the library type later.

The minimal interpreter we are about to implement is very simple, with very little code. I initially wrote all of it in one file, but it is foreseeable that the amount of code will grow as features are added. So, to avoid reshuffling files later, we create multiple files right away:

  • Program entry: main.rs;
  • Three components: lexical analysis lex.rs, syntax analysis parse.rs, and virtual machine vm.rs;
  • Two concepts: byte code bytecode.rs, and value value.rs.

The latter two concepts and their code were introduced before. The other 4 files are described below. Let's start with the program entry.

Program Entry

For the sake of simplicity, our interpreter has only one way of working, which is to accept a parameter as a Lua source code file, and then parse and execute it. Here is the code:

use std::env;
use std::fs::File;

mod value;
mod bytecode;
mod lex;
mod parse;
mod vm;

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage: {} script", args[0]);
        return;
    }
    let file = File::open(&args[1]).unwrap();

    let proto = parse::load(file);
    vm::ExeState::new().execute(&proto);
}

The first 2 lines import from the standard library: env is used to obtain command line arguments, and fs::File is used to open the Lua source file.

The middle lines declare the other file modules with the mod keyword.

Then look at the main() function. The first few lines read the arguments and open the source file. For simplicity, we call unwrap() to terminate the program if the file fails to open. We will improve the error handling later.

The last 2 lines are the core function:

  • First, the syntax analysis module parse (which internally also calls the lexical analysis module lex) parses the file and returns the parsing result proto;
  • Then create a virtual machine and execute proto.

This process differs from Lua's official API (see the complete example):

lua_State *L = lua_open(); // Create the lua_State
luaL_loadfile(L, filename); // Parse, leaving the result on the top of the stack
lua_pcall(L, 0, 0, 0); // Execute the chunk on the top of the stack

This is because the official implementation of Lua is a "library", and the API only exposes the lua_State data structure, which contains both the parsing and executing parts. So you must first create a lua_State and then call parsing and execution through it; the parsing result is also passed through the lua_State's stack. We currently have no such unified state structure, so we simply call the parsing and execution functions separately.

Let's look at the analysis and execution process respectively.

Lexical Analysis

Although the main() function calls the syntax analysis module parse first, parse in turn calls the lexical analysis module lex internally. So let's look at lexical analysis first.

The output of lexical analysis is a Token stream. For the "hello, world!" program, only two Tokens are needed: the name print and the string "hello, world!". For simplicity, we only support these two kinds of Token for the time being. In addition, we also define an Eos to indicate the end of the file:

#[derive(Debug)]
pub enum Token {
    Name(String),
    String(String),
    Eos,
}

Instead of parsing the whole input file at once and returning a complete Token list, we provide an iterator-like function so that the syntax analysis module can call it on demand. To do this, first define a lexical analyzer:

#[derive(Debug)]
pub struct Lex {
    input: File,
}

For now only one member is included, the input file.

It provides 2 APIs: new() creates a parser based on the input file; next() returns the next Token.

impl Lex {
    pub fn new(input: File) -> Self;
    pub fn next(&mut self) -> Token;
}

The specific parsing process is pure and boring string handling, and the code is skipped.
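Still, the core rules are easy to sketch. Here is a minimal, self-contained version of the lexing logic we need; to stay runnable on its own it reads from an in-memory string instead of a File, so treat it as a sketch rather than the real Lex:

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Name(String),
    String(String),
    Eos,
}

// Sketch lexer over an in-memory byte buffer (the real Lex reads a File).
struct SketchLex {
    input: Vec<u8>,
    pos: usize,
}

impl SketchLex {
    fn new(input: &str) -> Self {
        SketchLex { input: input.as_bytes().to_vec(), pos: 0 }
    }

    fn next_byte(&mut self) -> Option<u8> {
        let b = self.input.get(self.pos).copied();
        self.pos += 1;
        b
    }

    fn next(&mut self) -> Token {
        match self.next_byte() {
            None => Token::Eos,
            Some(b' ' | b'\t' | b'\r' | b'\n') => self.next(), // skip whitespace
            Some(b'"') => {
                // string literal: collect bytes until the closing quote
                let mut s = Vec::new();
                loop {
                    match self.next_byte() {
                        Some(b'"') => break,
                        Some(b) => s.push(b),
                        None => panic!("unfinished string"),
                    }
                }
                Token::String(String::from_utf8(s).unwrap())
            }
            Some(b) if b.is_ascii_alphabetic() || b == b'_' => {
                // name: letters, digits and '_' after a leading letter or '_'
                let mut name = vec![b];
                while let Some(&c) = self.input.get(self.pos) {
                    if c.is_ascii_alphanumeric() || c == b'_' {
                        name.push(c);
                        self.pos += 1;
                    } else {
                        break;
                    }
                }
                Token::Name(String::from_utf8(name).unwrap())
            }
            Some(b) => panic!("unexpected byte: {b}"),
        }
    }
}

fn main() {
    let mut lex = SketchLex::new("print \"hello, world!\"");
    assert_eq!(lex.next(), Token::Name("print".to_string()));
    assert_eq!(lex.next(), Token::String("hello, world!".to_string()));
    assert_eq!(lex.next(), Token::Eos);
}
```

Note how each call consumes only as many bytes as the next Token needs, which is exactly what makes the on-demand, iterator-like API possible.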

According to Rust convention, the return value of the next() function should be defined as Option&lt;Token&gt;, where Some(Token) means a new Token has been read and None means the end of the file. But since Token is itself an enum, it seems more convenient to add an Eos variant to it directly. Besides, with the Option&lt;Token&gt; type, the syntax analysis code would need an extra layer of checking, as shown in the following code. So I chose to add the Eos variant.

loop {
    if let Some(token) = lex.next() { // extra check
        match token {
            ... // parse
        }
    } else {
        break
    }
}

Syntax Analysis

The parsing result proto in the main() function connects the parsing and execution phases. Thanks to Rust's type inference, proto's concrete type does not appear in the code above. Now let's define it. As introduced in the bytecode section, the parsing result needs to contain two parts: the bytecode sequence and the constant table. So the parsing result can be defined as follows:

#[derive(Debug)]
pub struct ParseProto {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
}

The constant table constants is a Vec holding Value elements, and the bytecode sequence byte_codes is a Vec holding ByteCode elements. They are both Vec structures with the same functionality but different element types. In ancient C, to support both Value and ByteCode you would either have to write a separate set of code for each type, or resort to complex tricks such as macros or function pointers. Generics in Rust can abstract the same logic over different types. More features of generics will be used in subsequent code.
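As a tiny illustration of that point (using stand-in types, not the real Value and ByteCode), one generic function can serve both element types:

```rust
// Stand-in element types; the real Value and ByteCode were defined earlier.
#[derive(Debug)]
struct DemoValue(i64);
#[derive(Debug)]
struct DemoByteCode(u8);

// One generic function works for any element type.
fn last_index<T>(v: &[T]) -> usize {
    v.len() - 1
}

fn main() {
    let constants = vec![DemoValue(1), DemoValue(2)];
    let byte_codes = vec![DemoByteCode(0)];
    // The same function serves both Vecs, much like the
    // `constants.len() - 1` index computed in parse::load() below.
    assert_eq!(last_index(&constants), 1);
    assert_eq!(last_index(&byte_codes), 0);
}
```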

After defining ParseProto, let's look at the syntax analysis process. We currently only support the statement print "hello, world!", that is, the format Name String. A Name is read from the lexer first, followed by a string constant. If the input is not in this format, an error is reported. The specific code is as follows:

pub fn load(input: File) -> ParseProto {
    let mut constants = Vec::new();
    let mut byte_codes = Vec::new();
    let mut lex = Lex::new(input);

    loop {
        match lex.next() {
            Token::Name(name) => { // `Name LiteralString` as function call
                constants.push(Value::String(name));
                byte_codes.push(ByteCode::GetGlobal(0, (constants.len()-1) as u8));

                if let Token::String(s) = lex.next() {
                    constants.push(Value::String(s));
                    byte_codes.push(ByteCode::LoadConst(1, (constants.len()-1) as u8));
                    byte_codes.push(ByteCode::Call(0, 1));
                } else {
                    panic!("expected string");
                }
            }
            Token::Eos => break,
            t => panic!("unexpected token: {t:?}"),
        }
    }

    dbg!(&constants);
    dbg!(&byte_codes);
    ParseProto { constants, byte_codes }
}

The input is the source file File, and the output is the ParseProto just defined.

The main body of the function is a loop that reads Tokens via the next() function of the lexical analyzer lex created at the start of the function. We currently only support one type of statement, Name LiteralString, whose semantics is a function call. So the analysis logic is also very simple:

  • When Name is encountered, it is considered to be the beginning of a statement:
    • Use Name as a global variable and store it in the constant table;
    • Generate GetGlobal bytecode to load the global variable onto the stack by name. The first parameter is the target stack index; since we currently only support function call statements, the stack is only used for function calls, so the function must be at position 0. The second parameter is the index of the global variable's name in the constant table;
    • Read the next Token, and it is expected to be a string constant, otherwise panic;
    • Add string constants to the constant table;
    • Generate LoadConst bytecode to load constants onto the stack. The first parameter is the target stack index, which is behind the function and is 1; the second parameter is the index of the constant in the constant table;
    • Once the function and parameters are ready, Call bytecode can be generated to call the function. At present, the two parameters are the function position and the number of parameters, which are fixed at 0 and 1 respectively.
  • When Eos is encountered, exit the loop.
  • When encountering other Tokens (currently only of Token::String type), panic.

At the end of the function, the constant table and bytecode sequence are printed via dbg! for debugging. They can be compared with the output of luac.

Finally returns ParseProto.

Virtual Machine Execution

After parsing and generating ParseProto, it is the turn of the virtual machine to execute. According to the previous analysis, the virtual machine currently requires two components: the stack and the global variable table. So define the virtual machine state as follows:

use std::collections::HashMap;

pub struct ExeState {
    globals: HashMap<String, Value>,
    stack: Vec<Value>,
}

When creating the virtual machine, we need to add the print function to the global variable table in advance:

impl ExeState {
    pub fn new() -> Self {
        let mut globals = HashMap::new();
        globals.insert(String::from("print"), Value::Function(lib_print));

        ExeState {
            globals,
            stack: Vec::new(),
        }
    }

The print function is defined as follows:

// "print" function in Lua's std-lib.
// It supports only 1 argument and assumes the argument is at index:1 on stack.
fn lib_print(state: &mut ExeState) -> i32 {
    println!("{:?}", state.stack[1]);
    0
}

Currently the print function only supports one parameter, which is assumed to be at position 1 on the stack. The function prints this parameter. Since it does not need to return any data to the caller, it returns 0.
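What makes this work is that Value's Function variant stores a plain Rust function pointer, so a global name can map directly to native code. Here is a self-contained sketch of that mechanism, with a simplified stack and Value standing in for the real ones:

```rust
use std::collections::HashMap;

struct ExeState {
    stack: Vec<String>, // stand-in stack; the real one holds Value
}

// Value's Function variant holds a plain Rust function pointer
// with the signature fn(&mut ExeState) -> i32.
enum Value {
    Nil,
    Function(fn(&mut ExeState) -> i32),
}

// Same shape as the interpreter's lib_print: print the argument at
// stack position 1 and report 0 return values.
fn lib_print(state: &mut ExeState) -> i32 {
    println!("{:?}", state.stack[1]);
    0
}

fn main() {
    let mut globals: HashMap<String, Value> = HashMap::new();
    globals.insert(String::from("print"), Value::Function(lib_print));

    let mut state = ExeState {
        stack: vec![String::from("<print fn>"), String::from("hello, world!")],
    };

    // Look the function up by name and call it through the stored pointer.
    if let Value::Function(f) = &globals["print"] {
        assert_eq!(f(&mut state), 0);
    }
}
```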

After initialization, here comes the core virtual machine execution function, i.e. the big bytecode dispatch loop: read the bytecode sequence in order and execute the corresponding predefined routine for each. The specific code is as follows:

    pub fn execute(&mut self, proto: &ParseProto) {
        for code in proto.byte_codes.iter() {
            match *code {
                ByteCode::GetGlobal(dst, name) => {
                    let name = &proto.constants[name as usize];
                    if let Value::String(key) = name {
                        let v = self.globals.get(key).unwrap_or(&Value::Nil).clone();
                        self.set_stack(dst, v);
                    } else {
                        panic!("invalid global key: {name:?}");
                    }
                }
                ByteCode::LoadConst(dst, c) => {
                    let v = proto.constants[c as usize].clone();
                    self.set_stack(dst, v);
                }
                ByteCode::Call(func, _) => {
                    let func = &self.stack[func as usize];
                    if let Value::Function(f) = func {
                        f(self);
                    } else {
                        panic!("invalid function: {func:?}");
                    }
                }
            }
        }
    }

Currently only 3 bytecodes are supported. Each routine is straightforward and needs no further explanation.
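One helper used above, set_stack(), does not appear in the listing. A plausible minimal version (a sketch of my own; the real implementation may differ) grows the stack with Nil before writing to a slot that does not exist yet:

```rust
// Minimal stand-ins to make the sketch self-contained.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Nil,
    Integer(i64),
}

struct ExeState {
    stack: Vec<Value>,
}

impl ExeState {
    // Write `v` into stack slot `dst`, filling any gap with Nil
    // if the slot does not exist yet.
    fn set_stack(&mut self, dst: u8, v: Value) {
        let dst = dst as usize;
        if dst >= self.stack.len() {
            self.stack.resize(dst + 1, Value::Nil);
        }
        self.stack[dst] = v;
    }
}

fn main() {
    let mut state = ExeState { stack: Vec::new() };
    state.set_stack(1, Value::Integer(42)); // writing past the end grows the stack
    assert_eq!(state.stack, vec![Value::Nil, Value::Integer(42)]);
}
```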

Test

So far, we have implemented a Lua interpreter with a complete process! Let's see it run:

$ cargo r -q -- test_lua/hello.lua
[src/parse.rs:39] &constants = [
     print,
     hello, world!,
]
[src/parse.rs:40] &byte_codes = [
     GetGlobal(
         0,
         0,
     ),
     LoadConst(
         1,
         1,
     ),
     Call(
         0,
         1,
     ),
]
hello, world!

The output is divided into 3 parts. The first part is the constant table, containing 2 string constants. The second part is the bytecode sequence, which can be compared with the output of luac in the bytecode section. The last line is the result we expected: "hello, world!".

One extra feature: the parsing code handles not just a single statement but a loop of them, so we already support multiple print statements, such as:

print "hello, world!"
print "hello, again..."

There is a small problem: print appears twice in the constant table. This can be optimized: every time a value is added to the constant table, first check whether it already exists. We will do this in the next chapter.
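One possible shape for that check, sketched here with a pared-down Value (the next chapter implements the real version):

```rust
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
}

// Return the index of `v` in the constant table, adding it only if absent.
fn add_const(constants: &mut Vec<Value>, v: Value) -> usize {
    if let Some(i) = constants.iter().position(|c| c == &v) {
        return i; // already present: reuse the existing slot
    }
    constants.push(v);
    constants.len() - 1
}

fn main() {
    let mut constants = Vec::new();
    let a = add_const(&mut constants, Value::String("print".to_string()));
    let _ = add_const(&mut constants, Value::String("hello, world!".to_string()));
    let b = add_const(&mut constants, Value::String("print".to_string()));
    assert_eq!(a, b); // "print" is stored only once
    assert_eq!(constants.len(), 2);
}
```

The linear scan is fine while the constant table is tiny; a HashMap from value to index would be the natural upgrade later.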

Summary

The purpose of this chapter was to implement a minimal but complete Lua interpreter, in order to get familiar with the interpreter architecture. To this end, we first introduced the basics of compilation principles, then the two core concepts of Lua, bytecode and value, and finally put it all together!

We keep emphasizing the "complete process" because the following chapters only need to add features onto this framework. Let's move on!