Bytecode
As a beginner, it is natural to feel lost and unsure of how to start implementing an interpreter. No way to start.
Fortunately, the previous section introduces the bytecode at the end, and divides the entire interpreter process into two stages: parsing and execution. Then we can start with the bytecode:
- Determine the bytecode first,
- then let the parsing process (lexical analysis and parsing) try to generate this set of bytecodes,
- Then let the execution process (virtual machine) try to execute this set of bytecodes.
generate execute
parse -------> bytecode <------- virtual machine
But what does bytecode look like? How to define? What type? We can refer to the official implementation of Lua.
Output of luac
For the convenience of description, the object code is listed again here:
print "hello, world!"
The official implementation of Lua comes with a very useful tool, luac
, namely Lua Compiler, which translates the source code into bytecode and outputs it. It is our right-hand man in this project. Take a look at its output to the "hello, world!" program:
$ luac -l hello_world.lua
main <hello_world.lua:0,0> (5 instructions at 0x600000d78080)
0+ params, 2 slots, 1 upvalue, 0 locals, 2 constants, 0 functions
1 [1] VARARGPREP 0
2 [1] GETTABUP 0 0 0 ; _ENV "print"
3 [1] LOADK 1 1 ; "hello, world!"
4 [1] CALL 0 2 1 ; 1 in 0 out
5 [1] RETURN 0 1 1 ; 0 out
The first 2 lines of the output are incomprehensible, so ignore them now. The following should be the bytecode, and there are comments, which is great. But still do not understand. Check out Lua's official manual, but I can't find any explanation about bytecode. It turns out that the Lua language standard only defines the characteristics of the language, while the bytecode belongs to the "concrete implementation" part, just like the variable naming in the interpreter code, which does not belong to the definition scope of the Lua standard. In fact, the Luajit project, that is fully compatible with Lua 5.1, uses a completely different bytecode. We can even implement an interpreter without bytecode. Since the manual does not explain it, we can only check the the comment in source code. Here we only introduce the 5 bytecodes that appear above:
- VARARGPREP, temporarily unused, ignored.
- GETTABUP, this is a bit complicated, it can be temporarily understood as: loading global variables onto the stack. The three parameters are the stack index (0) as the target address, (ignore the second one,) and the index (0) of the global variable name in the constant table. The global variable name listed in the comments later is "print".
- LOADK, load constants onto the stack. The two parameters are the stack index (1) as the destination address, and the constant index (1) as the loading source. The value of the constant listed in the comment below is "hello, world!".
- CALL, function call. The three parameters are the stack index (0) of the function, the number of parameters, and the number of return values. The following comment indicates that there is 1 parameter and 0 return value.
- RETURN, temporarily unused, ignored.
Take a look at it together:
- First load the global variable named
print
into the stack (0); - Then load the string constant
"hello, world!"
into the stack (1); - Then execute the function at the stack (0) position, and take the stack (1) position as a parameter.
The stack diagram during execution is as follows:
+-----------------+
0 | print | <- function
+-----------------+
1 | "hello, world!" |
+-----------------+
| |
We currently only need to implement the above three bytecodes 2, 3, and 4.
Bytecode Definition
Now define the bytecode format.
First refer to the format definition of Lua's official implementation. Source code has comments on the bytecode format:
We assume that instructions are unsigned 32-bit integers.
All instructions have an opcode in the first 7 bits.
Instructions can have the following formats:
3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
iABC C(8) | B(8) |k| A(8) | Op(7) |
iABx Bx(17) | A(8) | Op(7) |
iAsBx sBx (signed)(17) | A(8) | Op(7) |
iAx Ax(25) | Op(7) |
isJ sJ(25) | Op(7) |
A signed argument is represented in excess K: the represented value is
the written unsigned value minus K, where K is half the maximum for the
corresponding unsigned argument.
The bytecode is represented by a 32bit unsigned integer. The first 7 bits represent the command, and the following 25 bits represent the parameters. There are 5 formats of bytecode, and the parameters of each format are different. If you like this sense of precise bit control, you may immediately think of various bit operations, and you may already be excited. But don’t worry, let's look at Luajit’s bytecode format first:
A single bytecode instruction is 32 bit wide and has an 8 bit opcode field and
several operand fields of 8 or 16 bit. Instructions come in one of two formats:
+---+---+---+---+
| B | C | A | OP|
| D | A | OP|
+---+---+---+---+
It is also a 32bit unsigned integer, but the division of fields is only accurate to bytes, and there are only 2 formats, which is much simpler than the official Lua implementation. In C language, by defining matching struct and union, bytecode can be constructed and parsed more conveniently, thus avoiding bit operations.
Since the Lua language does not specify the bytecode format, we can also design our own bytecode format. For different types of commands like this, where each command has unique associated parameters, it is very suitable to use Rust's enum: use tags as commands, and use associated values as parameters. Let's define the bytecodes like this:
#[derive(Debug)]
pub enum ByteCode {
GetGlobal(u8, u8),
LoadConst(u8, u8),
Call(u8, u8),
}
Luajit's bytecode definition can avoid bit operations, and using Rust's enum can go a step further, where you don't even need to care about the memory layout of each bytecode. You can use the enum creation syntax to construct bytecode, such as ByteCode::GetGlobal(1,2)
; use pattern matching match
to parse bytecode. The parsing and virtual-matchine modules in Section 1.4 construct and parse bytecodes respectively.
But also pay attention to ensure that the enum does not exceed 32bit, so we still need to understand the layout of the enum. The size of the enum tag in Rust is in bytes and is allocated on demand. So as long as there are less than 2^8=256 kinds of bytecodes, the tag only needs 1 byte. Only 7 bits are used to indicate the command type in Lua's official bytecode, so 256 is enough. Then there is still 3 bytes of space to store parameters. In the two bytecode types of Luajit, the parameters only occupy 3 bytes, which is enough. This article introduces the method of static checking, but due to the need for third-party libraries or macros, we don't use it here for the time being.
Rust's enum is really nice!
Two Tables
As you can see from analysis above, we also need two tables except the bytecodes.
First, we need a constant table to store all the constants during the parsing process. The generated bytecodes refer to the corresponding constant through the index parameter. And during the execution process, the virtual machine reads the constants from this table through the bytecodes' parameter. In this example, there are two constants, one is the name of the global variable print
, and the other is the string constant "hello, world!". This is the meaning of 2 constants
in the second line of the above luac
output.
Then we need a global variable table to save global variables according to variable names. During execution the virtual matchine first queries the global variable name in the constant table through the parameters in the bytecodes, and then queries the global variable table according to the name. The global variable table is only used (add, read, modify) during execution, and has nothing to do with the parsing process.
The specific definition of these two tables needs to rely on the concept of Lua's "value", which will be introduced in the next section.