Let's Do It
The previous chapters introduced the basics of compilation principles and defined the two most important concepts: ByteCode and Value. Now we can start coding our interpreter!
The code for this series of articles is managed with Cargo, Rust's built-in build tool. The project is currently a binary crate; it will be changed to a library crate later.
The minimal interpreter to be implemented now is very simple, with very little code. At first I wrote all of it in one file, but it is foreseeable that the amount of code will grow as features are added. So, to avoid reorganizing files later, we create the file layout up front:
- Program entry: main.rs;
- Three components: lexical analysis lex.rs, syntax analysis parse.rs, and virtual machine vm.rs;
- Two concepts: bytecode byte_code.rs and value value.rs.
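The resulting layout under src/ then looks like this:

src
├── main.rs      -- program entry
├── lex.rs       -- lexical analysis
├── parse.rs     -- syntax analysis
├── vm.rs        -- virtual machine
├── byte_code.rs -- bytecode
└── value.rs     -- value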
The latter two concepts and their code were introduced in the previous chapters. The other 4 files are described below, starting with the program entry.
Program Entry
For the sake of simplicity, our interpreter works in only one way: it accepts a single argument, the path of a Lua source file, and then parses and executes it. Here is the code:
use std::env;
use std::fs::File;

mod value;
mod byte_code;
mod lex;
mod parse;
mod vm;

fn main() {
    let args: Vec<String> = env::args().collect();
    if args.len() != 2 {
        println!("Usage: {} script", args[0]);
        return;
    }
    let file = File::open(&args[1]).unwrap();

    let proto = parse::load(file);
    vm::ExeState::new().execute(&proto);
}
The first 2 lines import two standard library modules: env is used to obtain the command-line arguments, and fs::File is used to open the Lua source file.
The middle lines declare the other file modules with the mod keyword.
Then look at the main() function. The first few lines read the arguments and open the source file. For the sake of simplicity, we use unwrap() to terminate the program if the file fails to open. We will improve the error handling later.
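One possible shape of that later improvement, shown here only as a sketch (the current code keeps the plain unwrap()):

let file = File::open(&args[1]).unwrap_or_else(|e| {
    // report the failure and exit, instead of panicking
    eprintln!("failed to open {}: {e}", args[1]);
    std::process::exit(1);
});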
The last 2 lines are the core:

- First, the syntax analysis module parse (which internally also calls the lexical analysis module lex) parses the file and returns the parsing result proto;
- Then a virtual machine is created to execute proto.
This process is different from Lua's official API (complete example):
lua_State *L = lua_open();   // create lua_State
luaL_loadfile(L, filename);  // parse, and leave the result on the top of the stack
lua_pcall(L, 0, 0, 0);       // execute the chunk on the top of the stack
This is because the official implementation of Lua is a "library" whose API exposes only the lua_State data structure, which carries both the parsing and the execution state. So you must create a lua_State first and then call parsing and execution on it; the parsing result is also passed through the stack inside lua_State. However, we currently have no such unified state structure, so we simply call the parsing and execution functions separately.
Let's look at the parsing and execution processes in turn.
Lexical Analysis
Although the main() function calls the syntax analysis module parse first, parse internally calls the lexical analysis module lex, so let's look at lexical analysis first.
The output of lexical analysis is a token stream. For the "hello, world!" program, only two tokens are needed: the name print and the string "hello, world!". For simplicity, we only support these two kinds of tokens for the time being. In addition, we define Eos to indicate the end of the input:
#[derive(Debug)]
pub enum Token {
    Name(String),
    String(String),
    Eos,
}
Instead of parsing the whole input file at once and returning a complete token list, we provide an iterator-like function so that the syntax analysis module can pull tokens on demand. To do this, first define a lexical analyzer:
#[derive(Debug)]
pub struct Lex {
    input: File,
}
For now only one member is included, the input file.
It provides 2 APIs: new() creates a lexer from the input file, and next() returns the next Token:
impl Lex {
    pub fn new(input: File) -> Self;
    pub fn next(&mut self) -> Token;
}
The specific parsing process is plain, tedious string handling, so the code is skipped here; a rough sketch of what it might look like follows.
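Here is a minimal sketch of next() under the current two-token assumption. It reads the file byte by byte and treats the input as ASCII; read_char() and putback() are hypothetical helper names introduced only for this illustration:

use std::io::{Read, Seek, SeekFrom};

impl Lex {
    // Sketch: return the next Token, or Token::Eos at end of file.
    pub fn next(&mut self) -> Token {
        match self.read_char() {
            '\0' => Token::Eos,
            ' ' | '\r' | '\n' | '\t' => self.next(), // skip whitespace
            '"' => { // string literal: everything up to the closing quote
                let mut s = String::new();
                loop {
                    match self.read_char() {
                        '\0' => panic!("unfinished string"),
                        '"' => break,
                        c => s.push(c),
                    }
                }
                Token::String(s)
            }
            c if c.is_ascii_alphabetic() || c == '_' => { // a name
                let mut name = String::from(c);
                loop {
                    match self.read_char() {
                        c if c.is_ascii_alphanumeric() || c == '_' => name.push(c),
                        '\0' => break, // end of file also ends the name
                        _ => { // read one char too far; step back
                            self.putback();
                            break;
                        }
                    }
                }
                Token::Name(name)
            }
            c => panic!("unexpected char: {c}"),
        }
    }

    // Read one byte from the input; '\0' signals end of file.
    fn read_char(&mut self) -> char {
        let mut buf = [0u8; 1];
        if let Ok(1) = self.input.read(&mut buf) {
            buf[0] as char
        } else {
            '\0'
        }
    }

    // Rewind the input file by one byte.
    fn putback(&mut self) {
        self.input.seek(SeekFrom::Current(-1)).unwrap();
    }
}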
By Rust convention, the return value of the next() function here would be Option<Token>, where Some(Token) means a new token was read and None means the end of the input. But since Token is itself an enum, it seems more convenient to add an Eos variant to it directly. Besides, the Option<Token> type would require an extra layer of checking at the syntax analysis call site, as the following code shows. So I chose to add the Eos variant.
loop {
    if let Some(token) = lex.next() { // extra check
        match token {
            ... // parse
        }
    } else {
        break;
    }
}
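For comparison, with the Eos variant the calling loop needs no extra layer; this is the shape the load() function below actually uses:

loop {
    match lex.next() {
        Token::Eos => break,
        token => {
            ... // parse
        }
    }
}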
Syntax Analysis
The parsing result proto in the main() function connects the parsing and execution phases. Thanks to Rust's type inference, proto needs no explicit type annotation in the code above; now let's define that type. As introduced in the bytecode section, the parsing result must contain two parts: the bytecode sequence and the constant table. So the parsing result can be defined as follows:
#[derive(Debug)]
pub struct ParseProto {
    pub constants: Vec<Value>,
    pub byte_codes: Vec<ByteCode>,
}
The constant table constants is a Vec holding elements of type Value, and the bytecode sequence byte_codes is a Vec holding elements of type ByteCode. They are both Vec structures with the same functionality but different element types. In old-school C, to hold these two types you would either write a separate set of code for each, or resort to complex tricks such as macros or function pointers. Generics in Rust let the same logic be abstracted over different types, as the small illustration below shows. More features of generics will be used in subsequent code.
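As a tiny illustration (not code from this project), a single generic function can serve both vectors:

// one generic function works for any element type that can be debug-printed
fn dump<T: std::fmt::Debug>(items: &[T]) {
    for (i, item) in items.iter().enumerate() {
        println!("{i}: {item:?}");
    }
}

// dump(&constants);  // T is inferred as Value
// dump(&byte_codes); // T is inferred as ByteCode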
Having defined ParseProto, let's look at the syntax analysis process. We currently only support the statement print "hello, world!", which has the form Name String. A Name is read from the lexer first, followed by a string constant; any other sequence is an error. The specific code is as follows:
use std::fs::File;
use crate::lex::{Lex, Token};
use crate::value::Value;
use crate::byte_code::ByteCode;

pub fn load(input: File) -> ParseProto {
    let mut constants = Vec::new();
    let mut byte_codes = Vec::new();
    let mut lex = Lex::new(input);

    loop {
        match lex.next() {
            Token::Name(name) => { // `Name LiteralString` as function call
                constants.push(Value::String(name));
                byte_codes.push(ByteCode::GetGlobal(0, (constants.len()-1) as u8));

                if let Token::String(s) = lex.next() {
                    constants.push(Value::String(s));
                    byte_codes.push(ByteCode::LoadConst(1, (constants.len()-1) as u8));
                    byte_codes.push(ByteCode::Call(0, 1));
                } else {
                    panic!("expected string");
                }
            }
            Token::Eos => break,
            t => panic!("unexpected token: {t:?}"),
        }
    }

    dbg!(&constants);
    dbg!(&byte_codes);

    ParseProto { constants, byte_codes }
}
The input is the source file File, and the output is the ParseProto just defined.
The main body of the function is a loop that reads tokens through the next() function provided by the lexical analyzer lex created at the beginning of the function. We currently support only one type of statement, Name LiteralString, whose semantics is a function call. So the analysis logic is very simple:
- When a Name is encountered, it is taken as the beginning of a statement:
  - Add the Name, as the name of a global variable, to the constant table;
  - Generate GetGlobal bytecode to load the global variable onto the stack by name. The first parameter is the target stack index: since we currently only support function call statements, the stack is used only for the call, so the function must be at position 0. The second parameter is the index of the name in the constant table;
  - Read the next token, which is expected to be a string constant; otherwise panic;
  - Add the string constant to the constant table;
  - Generate LoadConst bytecode to load the constant onto the stack. The first parameter is the target stack index, 1, right after the function; the second parameter is the index of the constant in the constant table;
  - With the function and the argument in place, generate Call bytecode to call the function. Its two parameters, the function position and the number of arguments, are fixed at 0 and 1 respectively.
- When Eos is encountered, exit the loop.
- When any other token is encountered (currently only Token::String), panic.
At the end of the function, the constant table and bytecode sequence are printed with dbg! for debugging; they can be compared against the output of luac in the bytecode section.
Finally, the ParseProto is returned.
Virtual Machine Execution
After parsing generates the ParseProto, it is the virtual machine's turn to execute it. According to the previous analysis, the virtual machine currently needs two components: the stack and the global variable table. So the virtual machine state is defined as follows:
use std::collections::HashMap;
use crate::value::Value;

pub struct ExeState {
    globals: HashMap<String, Value>,
    stack: Vec<Value>,
}
When creating a virtual machine, the print function needs to be added to the global variable table in advance:
impl ExeState {
    pub fn new() -> Self {
        let mut globals = HashMap::new();
        globals.insert(String::from("print"), Value::Function(lib_print));

        ExeState {
            globals,
            stack: Vec::new(),
        }
    }
}
The print function is defined as follows:
// "print" function in Lua's std-lib.
// It supports only 1 argument and assumes the argument is at index:1 on stack.
fn lib_print(state: &mut ExeState) -> i32 {
println!("{:?}", state.stack[1]);
0
}
Currently the print function supports only one argument and assumes that this argument is at position 1 of the stack. The function simply prints this argument. Because it does not need to return any data to the caller, it returns 0.
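For Value::Function(lib_print) above to type-check, the Value type from the value chapter must have a variant holding a Rust function pointer matching lib_print's signature. As a reminder, its shape is roughly this (other variants omitted):

pub enum Value {
    Nil,
    String(String),
    Function(fn (&mut ExeState) -> i32), // Rust function pointer, e.g. lib_print
}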
After initialization comes the core virtual machine execution function, i.e. the big bytecode dispatch loop: read the bytecode sequence in order and execute the corresponding predefined routine for each. The specific code is as follows:
pub fn execute(&mut self, proto: &ParseProto) {
    for code in proto.byte_codes.iter() {
        match *code {
            ByteCode::GetGlobal(dst, name) => {
                let name = &proto.constants[name as usize];
                if let Value::String(key) = name {
                    let v = self.globals.get(key).unwrap_or(&Value::Nil).clone();
                    self.set_stack(dst, v);
                } else {
                    panic!("invalid global key: {name:?}");
                }
            }
            ByteCode::LoadConst(dst, c) => {
                let v = proto.constants[c as usize].clone();
                self.set_stack(dst, v);
            }
            ByteCode::Call(func, _) => {
                let func = &self.stack[func as usize];
                if let Value::Function(f) = func {
                    f(self);
                } else {
                    panic!("invalid function: {func:?}");
                }
            }
        }
    }
}
Currently only 3 bytecodes are supported, and each handler above is straightforward. The only helper not shown is set_stack(), which writes a value to a given stack position; a possible implementation is sketched below.
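A sketch of set_stack(), since the text does not show it: write v at stack index dst, growing the stack by one slot when dst is just past the current top:

use std::cmp::Ordering;

impl ExeState {
    // write v into the stack at index dst
    fn set_stack(&mut self, dst: u8, v: Value) {
        let dst = dst as usize;
        match dst.cmp(&self.stack.len()) {
            Ordering::Equal => self.stack.push(v), // append at the top
            Ordering::Less => self.stack[dst] = v, // overwrite in place
            Ordering::Greater => panic!("fail in set_stack"),
        }
    }
}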
Test
So far, we have implemented a Lua interpreter with a complete process! Let's see it run:
$ cargo r -q -- test_lua/hello.lua
[src/parse.rs:39] &constants = [
    print,
    hello, world!,
]
[src/parse.rs:40] &byte_codes = [
    GetGlobal(
        0,
        0,
    ),
    LoadConst(
        1,
        1,
    ),
    Call(
        0,
        1,
    ),
]
hello, world!
The output is divided into 3 parts. The first part is the constant table, containing 2 string constants. The second part is the bytecode sequence, which can be compared with the output of luac in the bytecode section. The last line is the result we expected: "hello, world!".
As a bonus, the parsing part handles not just a single statement but a loop over statements, so multiple print statements are already supported, such as:
print "hello, world!"
print "hello, again..."
There is a small problem: print then appears twice in the constant table. This can be optimized by checking, each time a value is added to the constant table, whether it already exists. We will do this in the next chapter; the idea is sketched below.
Summary
The purpose of this chapter was to implement a minimal Lua interpreter with a complete process, in order to get familiar with the interpreter's architecture. To this end, we first introduced the basics of compilation principles, then introduced Lua's two core concepts, bytecode and value, and finally put it all together!
We keep emphasizing the "complete process" because the following chapters only need to add features onto this framework. Let's move on!