More Types

This section adds simple types, including boolean, integer, and float. Other types such as Table and UserData will be implemented in subsequent chapters.

We first improve the lexical analysis to support the tokens corresponding to these types, and then generate the corresponding bytecodes through the syntax analysis, and add support for these bytecodes in the virtual machine. Finally we modify the function call to support printing these types.

Improve Lexical Analysis

The lexical analysis in the previous chapter only supports 2 tokens. So now no matter what features are added, the lexical analysis must be changed first to add the corresponding Tokens. In order to avoid adding tokens piecemeal in each chapter in the future, it is now added here in one go.

The Lua official website lists the complete lexical conventions. It includes:

  • Name, which has been implemented before, is used for variables, etc.

  • Constants, including string, integer, and floating-point constants.

  • Keywords:

      and break do else else if end
      false for function goto if in
      local nil not or repeat return
      then true until while
    
  • Symbols:

      + - * / % ^ #
      & ~ | << >> //
      == ~= <= >= < > =
      ( ) { } [ ] ::
      ; : , . .. ...
    

The corresponding Token is defined as:

#[derive(Debug, PartialEq)]
pub enum Token {
    // keywords
    And,    Break,  Do,     Else,   Elseif, End,
    False,  For,    Function, Goto, If,     In,
    Local,  Nil,    Not,    Or,     Repeat, Return,
    Then,   True,   Until,  While,

 // +       -       *       /       %       ^       #
    Add,    Sub,    Mul,    Div,    Mod,    Pow,    Len,
 // &       ~       |       <<      >>      //
    BitAnd, BitXor, BitOr,  ShiftL, ShiftR, Idiv,
 // ==       ~=     <=      >=      <       >        =
    Equal,  NotEq,  LesEq,  GreEq,  Less,   Greater, Assign,
 // (       )       {       }       [       ]       ::
    ParL,   ParR,   CurlyL, CurlyR, SqurL,  SqurR,  DoubColon,
 // ;               :       ,       .       ..      ...
    SemiColon,      Colon,  Comma,  Dot,    Concat, Dots,

    // constant values
    Integer(i64),
    Float(f64),
    String(String),

    // name of variables or table keys
    Name(String),

    // end
    Eos,
}

The specific implementation is nothing more than tedious string parsing, which is skipped here. For the sake of simplicity, this implementation only supports most simple types, but does not support complex types such as long strings, long comments, string escapes, hexadecimal numbers, and floating point numbers only support scientific notation Law. These do not affect the main features to be added later.

Type of Values

After lexical analysis supports more types, we add these types to Value:

#[derive(Clone)]
pub enum Value {
    Nil,
    Boolean(bool),
    Integer(i64),
    Float(f64),
    String(String),
    Function(fn (&mut ExeState) -> i32),
}

impl fmt::Debug for Value {
    fn fmt(&self, f: &mut fmt::Formatter) -> Result<(), fmt::Error> {
        match self {
            Value::Nil => write!(f, "nil"),
            Value::Boolean(b) => write!(f, "{b}"),
            Value::Integer(i) => write!(f, "{i}"),
            Value::Float(n) => write!(f, "{n:?}"),
            Value::String(s) => write!(f, "{s}"),
            Value::Function(_) => write!(f, "function"),
        }
    }
}

One of the special places is that the debug mode is used for the output of floating-point numbers: {:?}. Because Rust's common output format {} for floating-point numbers is integer + decimal format, and a more reasonable way should be to choose a more suitable one between "integer decimal" and "scientific notation", corresponding to %g in C language's printf. For example, it is unreasonable to output "0.000000" for the number 1e-10. This seems to be a historical issue of Rust. For compatibility and other reasons, only the debug mode {:?} can be used to correspond to %g. I don't get into it here.

In addition, in order to facilitate the distinction between "integer" and "floating point number without decimal part", in Lua's official implementation, .0 will be added after the latter. For example, 2 will be output as 2.0 for the floating point number. The code is as follows. This is so sweet. And this is also the default behavior of Rust's {:?} mode, so we don't need special handling for this.

     if (buff[strspn(buff, "-0123456789")] == '\0') { /* looks like an int? */
       buff[len++] = lua_getlocaledecpoint();
       buff[len++] = '0'; /* adds '.0' to result */
     }

Before Lua 5.3, Lua has only one numeric type, the default is floating point. I understand this because Lua was originally intended for configuration files, for users rather than programmers. For ordinary users, the concepts of integer and floating point numbers are not distinguished, and there is no difference between configuring 10 seconds and 10.0 seconds; in addition, for some calculations, such as 7/2, the result is obviously 3.5 and Not 3. However, with the expansion of Lua's use, for example, as a glue language between many large programs, the demand for integers has become increasingly strong, so integers and floating-point numbers are distinguished at the language level.

Syntax Analysis

Now we add support for these types in the parser. Since currently only the function call statement is supported, that is, the format of function parameter; and the "function" only supports global variables, so this time only the "parameter" part needs to support these new types. For function calls in Lua voice, if the parameter is a string constant or a table structure, then the parentheses () can be omitted, as in the "hello, world!" example in the previous chapter. But for other cases, such as several new types added this time, brackets () must be required. So the modification of the parameter part is as follows:

Token::Name(name) => {
     // function, global variable only
     let ic = add_const(&mut constants, Value::String(name));
     byte_codes.push(ByteCode::GetGlobal(0, ic as u8));

     // argument, (var) or "string"
     match lex. next() {
         Token::ParL => { // '('
             let code = match lex. next() {
                 Token::Nil => ByteCode::LoadNil(1),
                 Token::True => ByteCode::LoadBool(1, true),
                 Token::False => ByteCode::LoadBool(1, false),
                 Token::Integer(i) =>
                     if let Ok(ii) = i16::try_from(i) {
                         ByteCode::LoadInt(1, ii)
                     } else {
                         load_const(&mut constants, 1, Value::Integer(i))
                     }
                 Token::Float(f) => load_const(&mut constants, 1, Value::Float(f)),
                 Token::String(s) => load_const(&mut constants, 1, Value::String(s)),
                 _ => panic!("invalid argument"),
             };
             byte_codes. push(code);

             if lex.next() != Token::ParR { // ')'
                 panic!("expected `)`");
             }
         }
         Token::String(s) => {
             let code = load_const(&mut constants, 1, Value::String(s));
             byte_codes. push(code);
         }
         _ => panic!("expected string"),
     }
}

This code first parses the function. Like the code in the previous chapter, it still only supports global variables. Then parse the parameters. In addition to the support for string constants, a more general way of parentheses () is added. Which handles various type constants:

  • Floating-point constants, similar to string constants, call the load_const() function, put it in the constant table at compile time, and then load it through LoadConst bytecode during execution.

  • Nil and Boolean types, there is no need to put Nil, true and false in the constant table. It is more convenient to encode directly into bytecode, and it is faster at execution time (because there is one less memory read). So LoadNil and LoadBool bytecodes are added.

  • Integer constants combine the above two approaches. Because a bytecode has 4 bytes, the opcode occupies 1 byte, the destination address occupies 1 byte, and there are 2 bytes left, which can store the integer of i16. Therefore, for numbers in the range of i16 (this is also a high probability event), it can be directly encoded into the bytecode, and the LoadInt bytecode is added for this purpose; if it exceeds the range of i16, it is stored in the constant table . This is also the official implementation of Lua for reference. From this we can see Lua's pursuit of performance, it adds a bytecode and process codes in order to reduce a memory access only. We will see many such cases in the future.

Since only the function call statment is currently supported, the function is fixed at the 0 position of the stack during execution, and the parameter is fixed at the 1 position. The target addresses of the above bytecodes are also fixedly filled with 1.

The main code has been introduced. The definition of the function load_const() used to generate LoadConst bytecode is listed below:

fn add_const(constants: &mut Vec<Value>, c: Value) -> usize {
     constants. push(c)
}

fn load_const(constants: &mut Vec<Value>, dst: usize, c: Value) -> ByteCode {
     ByteCode::LoadConst(dst as u8, add_const(constants, c) as u8)
}

Test

So far, the parsing process has completed the support for new types. The rest of the virtual machine execution part just supports the newly added bytecode LoadInt, LoadBool and LoadNil. Skip it here.

Then you can test the following code:

print(nil)
print(false)
print(123)
print(123456)
print(123456.0)

The output is as follows:

[src/parse.rs:64] &constants = [
     print,
     print,
     print,
     print,
     123456,
     print,
     123456.0,
]
byte_codes:
   GetGlobal(0, 0)
   LoadNil(1)
   Call(0, 1)
   GetGlobal(0, 0)
   LoadBool(1, false)
   Call(0, 1)
   GetGlobal(0, 0)
   LoadInt(1, 123)
   Call(0, 1)
   GetGlobal(0, 0)
   LoadConst(1, 1)
   Call(0, 1)
   GetGlobal(0, 0)
   LoadConst(1, 2)
   Call(0, 1)
nil
false
123
123456
123456.0

There is a small problem left over from the last chapter, that is, print appears many times in the constant table. This needs to be modified to check whether it already exists every time a constant is added.

Add Constants

Modify the add_const() function above as follows:

fn add_const(constants: &mut Vec<Value>, c: Value) -> usize {
     constants.iter().position(|v| v == &c)
         .unwrap_or_else(|| {
             constants. push(c);
             constants.len() - 1
         })
}

constants.iter().position() positions the index. Its parameter is a closure, which needs to compare two Value, for which Value needs to be implemented PartialEq trait:

impl PartialEq for Value {
    fn eq(&self, other: &Self) -> bool {
        // TODO compare Integer vs Float
        match (self, other) {
            (Value::Nil, Value::Nil) => true,
            (Value::Boolean(b1), Value::Boolean(b2)) => *b1 == *b2,
            (Value::Integer(i1), Value::Integer(i2)) => *i1 == *i2,
            (Value::Float(f1), Value::Float(f2)) => *f1 == *f2,
            (Value::String(s1), Value::String(s2)) => *s1 == *s2,
            (Value::Function(f1), Value::Function(f2)) => std::ptr::eq(f1, f2),
            (_, _) => false,
        }
    }
}

Here we think that two integers and floating point numbers that are numerically equal are different, such as Integer(123456) and Float(123456.0), because these are indeed two values, and the two cannot be combined when dealing with constant tables value, otherwise in the test code in the previous section, the last line will also load the integer 123456.

But during Lua execution, these two values are equal, that is, the result of 123 == 123.0 is true. We will deal with this issue in a later chapter.

Going back to the position() function, its return value is Option<usize>, Some(i) means found, and returns the index directly; while None means not found, you need to add a constant first, and then return the index. According to the programming habit of C language, it is the following if-else judgment, but here we try to use a more functional way. Personally, I feel that this method is not clearer, but since you are learning Rust, try to use the Rust method first.

     if let Some(i) = constants.iter().position(|v| v == &c) {
         i
     } else {
         constants. push(c);
         constants.len() - 1
     }

After completing the transformation of the add_const() function, duplicate values can be avoided in the constant table. The relevant output is intercepted as:

[src/parse.rs:64] &constants = [
     print,
     123456,
     123456.0,
]

Although the above will check for duplicates when adding constants, the check is done by traversing the array. The time complexity of adding all constants is O(N^2). If a Lua code segment contains a lot of constants, such as 1 million, the parsing will be too slow. For this we need a hash table to provide fast lookups. TODO.