Unicode and UTF-8
The previous sections of this chapter refined the string-related code, clarifying some issues but also introducing some confusion. For example, in the definition of the three string types in Value, some variants are of [u8] type and some are of String type:
pub enum Value {
    ShortStr(u8, [u8; SHORT_STR_MAX]),   // [u8] type
    MidStr(Rc<(u8, [u8; MID_STR_MAX])>), // [u8] type
    LongStr(Rc<String>),                 // String type
    // other variants omitted
}
Another example is the mixed use of "byte" and "character" in the previous section. The same is true for the lexical analysis code, which reads bytes of type u8 from the input stream, but converts them to characters of type char via as:
fn read_char(&mut self) -> char {
    match self.input.next() {
        Some(Ok(ch)) => ch as char, // u8 -> char
The reason these confusions haven't caused problems so far is that our test programs involve only ASCII characters. Problems arise as soon as other characters appear. For example, take the following Lua code:
print "你好"
The execution result is wrong:
$ cargo r -q -- test_lua/nihao.lua
constants: [print, ä½ å¥½]
byte_codes:
GetGlobal(0, 0)
LoadConst(1, 1)
Call(0, 1)
ä½ å¥½
The output is not the expected 你好, but ä½ å¥½. Let's explain the reason for this "garbled code" and fix the problem.
Unicode and UTF-8 Concepts
These are both very broad topics; only the most basic introduction is given here.
Unicode assigns a unified code to most of the characters in the world. To stay compatible with ASCII, the codes assigned to the ASCII character set are kept identical. For example, the English letter p is 0x70 in ASCII and U+0070 in Unicode notation; the Unicode code of the Chinese character 你 is U+4F60.
Unicode only numbers a character; how the computer stores it is another matter. The simplest way is to store the Unicode code directly. Since Unicode currently defines more than 140,000 characters (and is still growing), at least 3 bytes are needed per character, so the English letter p would be 00 00 70 and the Chinese 你 would be 00 4F 60. The problem with this scheme is that even the ASCII part needs 3 bytes, which (for English text) is wasteful. So there are other encoding schemes, and UTF-8 is one of them. UTF-8 is a variable-length encoding: each ASCII character occupies only 1 byte, so the letter p is still encoded as 0x70, written \x70 in UTF-8 notation; each Chinese character occupies 3 bytes, for example the UTF-8 encoding of 你 is \xE4\xBD\xA0. The more detailed encoding rules of UTF-8 are omitted here. Here are a few examples:
char | Unicode | UTF-8
-----+---------+---------------
p | U+0070 | \x70
r | U+0072 | \x72
你 | U+4F60 | \xE4\xBD\xA0
好 | U+597D | \xE5\xA5\xBD
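These values are easy to verify in Rust itself, since a Rust char is a Unicode code point and string literals are stored as UTF-8. A minimal sketch:

fn main() {
    // A Rust char is a Unicode code point.
    assert_eq!('p' as u32, 0x70);
    assert_eq!('你' as u32, 0x4F60);

    // Rust string literals are stored as UTF-8 bytes.
    assert_eq!("p".as_bytes(), &[0x70]);
    assert_eq!("你好".as_bytes(), &[0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD]);
}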
Garbled Code
After introducing these encoding concepts, let's analyze the reason for the garbled characters produced by the Lua test code at the beginning of this section. Use hexdump to view the source file:
$ hexdump -C test_lua/nihao.lua
00000000 70 72 69 6e 74 20 22 e4 bd a0 e5 a5 bd 22 0a |print "......".|
# p r i n t " |--你---| |--好---| "
The last line is a comment I added, marking each Unicode character. You can see that the encodings of p and 你 match the UTF-8 encodings introduced above, which indicates that this file is UTF-8 encoded. (How a file is encoded depends on the text editor and operating system used.)
Our current lexical analysis reads "bytes" one by one, so it sees the Chinese character 你 as 3 independent bytes, namely e4, bd and a0, and converts each to char with as. Rust's char is a Unicode code point, so this yields 3 Unicode characters: looking them up, they are ä, ½ and a blank character (the last one is U+00A0). This is exactly the first half of the "garbled characters" we saw at the beginning; the following 好 corresponds to the second half in the same way. The 6 characters represented by these 6 bytes are pushed one by one into Token::String (a Rust String), and finally printed by println!. Rust's String is UTF-8 encoded, but that does not affect the output.
To summarize the process that produces the garbled characters:
- The source file is UTF-8 encoded;
- Read byte by byte, at this time UTF-8 encoding has been fragmented;
- Each byte is interpreted as Unicode, resulting in garbled characters;
- Storage and printing.
You can also verify this with the following Rust code:
fn main() {
    let s = String::from("print \"你好\""); // Rust's String is UTF-8 encoded, so it can simulate the Lua source file
    println!("string: {}", &s); // normal output
    println!("bytes in UTF-8: {:x?}", s.as_bytes()); // view the UTF-8 encoding

    print!("Unicode: ");
    for ch in s.chars() { // read "characters" one by one, and check their Unicode codes
        print!("{:x} ", ch as u32);
    }
    println!("");

    let mut x = String::new();
    for b in s.as_bytes().iter() { // read "bytes" one by one
        x.push(*b as char); // as char: each byte is interpreted as Unicode, producing garbled characters
    }
    println!("wrong: {}", x);
}
Run this code to see the result for yourself.
The core of the garbled problem lies in the conversion from "byte" to "char". So there are 2 workarounds:
- When reading the source code, change it to read one "character" (char) at a time. This solution has bigger problems:
  - The input type of the Lex we introduced in the previous section is the Read trait, which only supports reading by "byte". To read by "character", the input would first have to be converted to the String type, which requires the BufRead trait; that places stricter requirements on the input, and for example a Cursor<T> wrapped around a string does not support it.
  - If the source input is UTF-8 encoded and Rust's final storage is also UTF-8 encoded, then reading by Unicode "character" would require two meaningless conversions, from UTF-8 to Unicode and back to UTF-8.
  - The most important reason, which will be discussed shortly: a Lua string can contain arbitrary data, not necessarily legal UTF-8 content, so it may not be convertible to "characters" at all.
- When reading the source code, still read byte by byte; and when saving, do not convert to "characters" anymore, but save the "bytes" directly. This makes it impossible to keep using Rust's String type for storage; the specific solution is shown below.
It is obvious (it only seems obvious now; at the beginning I was confused and fumbled for a long time) that we should choose the second option.
String Definition
Now let's look at the difference between string contents in the Lua and Rust languages.
Lua's introduction to strings says: "We can specify any byte in a short literal string." In other words, a Lua string can represent arbitrary data. Rather than a string, it is better to think of it as a sequence of contiguous data whose content is not interpreted.
And the Rust documentation introduces the String type as: "A UTF-8–encoded, growable string." Easy to understand, with two features: UTF-8 encoded, and growable. Lua strings are immutable and Rust's are growable, but that distinction is beyond the scope of this discussion. The focus here is on the first feature: UTF-8 encoding means a Rust string cannot store arbitrary data. This can be observed in the definition of Rust's String:
pub struct String {
    vec: Vec<u8>,
}
You can see that String is an encapsulation of Vec<u8>. It is through this encapsulation that the data in vec is guaranteed to be legal UTF-8, with no arbitrary data mixed in. If arbitrary data were allowed, the alias type String = Vec<u8>; would have sufficed.
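This guarantee is easy to observe: String::from_utf8() takes ownership of a Vec<u8>, but refuses bytes that are not valid UTF-8, while the Vec<u8> itself accepts anything. A minimal sketch:

fn main() {
    // Valid UTF-8: the Vec<u8> is taken over as-is, with no copy.
    let ok = String::from_utf8(vec![0xE4, 0xBD, 0xA0]);
    assert_eq!(ok.unwrap(), "你");

    // A truncated UTF-8 sequence: the encapsulation rejects it.
    let bad = String::from_utf8(vec![0xE4, 0xBD]);
    assert!(bad.is_err());

    // A plain Vec<u8> holds arbitrary bytes, just like a Lua string.
    let raw: Vec<u8> = vec![0xE4, 0xBD, 0x00, 0xFF];
    assert_eq!(raw.len(), 4);
}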
To sum up, Rust's String is only a subset of Lua's string; the Rust type corresponding to the Lua string type is not String, but Vec<u8>, which can store arbitrary data.
Modify the Code
Now that we have figured out the cause of the garbled characters and analyzed the difference between Rust and Lua strings, we can start modifying the interpreter code. The places that need to be modified are as follows:
- The type associated with Token::String in the lexical analysis is changed from String to Vec<u8>, to support arbitrary data rather than only legal UTF-8 encoded data.
- Correspondingly, the type associated with Value::LongStr is also changed from String to Vec<u8>. This makes it consistent with the other two string types, ShortStr and MidStr.
- In the lexical analysis, the original reading functions peek_char() and read_char() are renamed peek_byte() and next_byte(), and their return type is changed from "char" to "byte". Although the old names said char, they actually read "bytes" one at a time all along, so the function bodies need no change this time.
- Character constants matched in the code, such as 'a', are changed to byte constants, such as b'a'.
- The original read_char() returned \0 at the end of input, because \0 was regarded as a special character back then. Now a Lua string may contain any value, including \0, so \0 can no longer mark the end of input. Rust's Option is needed here, and the return type becomes Option<u8>. (A sketch of the two reading functions follows this list.)
  This makes the function slightly inconvenient to call: pattern matching (if let Some(b) =) is needed every time to get the byte out. Fortunately, this function is called in only a few places. The other function, peek_byte(), is called in many places, and in principle its return type should also become Option<u8>; but in practice the byte it returns is only used to "take a look": as long as it matches none of the possible paths, it has no effect. So at the end of input this function can still return \0, because \0 matches no possible path; if the input really has ended, the following next_byte() call will handle it.
  It is precisely the inconvenience of Option (it must be matched to get the value out) that shows its value. In my C programming experience, such a special case in a function's return value is generally represented by a special value, such as NULL for pointer types and 0 or -1 for int types. This brings two problems: first, the caller may fail to handle the special value, leading directly to bugs; second, the special value may later become an ordinary value (our \0 this time is a typical example), and then every place that calls the function must be modified. Rust's Option solves both problems perfectly.
- In the lexical analysis, strings support escape sequences. This part is all tedious character handling, and its introduction is omitted here.
- Add impl From<Vec<u8>> for Value to convert the string constants in Token::String(Vec<u8>) to the Value type. This involves many details of Vec and strings, which are cumbersome and have little to do with the main line; the next two sections are devoted to them.
Conversion from &str, String, &[u8] and Vec<u8> to Value
The conversions from String and &str to Value were implemented before. Now add the conversions from Vec<u8> and &[u8] to Value. The relationship between these 4 types is as follows:
          slice
&[u8] <---------> Vec<u8>
                     ^
                     | encapsulate
          slice      |
&str  <---------> String
- String is an encapsulation of Vec<u8>; the encapsulated Vec<u8> can be taken back out with into_bytes().
- &str is a slice of String (it can be considered a reference).
- &[u8] is a slice of Vec<u8>.
So String and &str can be handled through Vec<u8> and &[u8] respectively. It also seems that Vec<u8> and &[u8] could depend on each other, i.e. only one of the two would need a direct conversion to Value. However, that would cost performance. The analysis is as follows:
- Source type: Vec<u8> owns its data, while &[u8] does not.
- Destination type: Value::ShortStr/MidStr only need to copy the string content (into the Value itself and into an Rc, respectively), without taking ownership of the source data; Value::LongStr needs to take ownership of a Vec.
2 source types and 2 destination types give 4 conversion combinations:
          | Value::ShortStr/MidStr  | Value::LongStr
----------+-------------------------+----------------------------------
 &[u8]    | 1. copy string content  | 2. create a Vec, allocating memory
 Vec<u8>  | 3. copy string content  | 4. transfer ownership
If we directly implement only the conversion from Vec<u8>, then a &[u8] must first be turned into a Vec<u8> with .to_vec() and converted to Value indirectly. But in case 1 above, only the string content needs to be copied, so the Vec created by .to_vec() is wasted.
If we directly implement only the conversion from &[u8], then a Vec<u8> must first be converted to &[u8] by reference and then converted to Value indirectly. In case 4 above, that means first taking a reference to get &[u8], and then creating a new Vec with .to_vec() to obtain ownership: one unnecessary allocation.
So for efficiency, it is better to directly implement the conversions from both Vec<u8> and &[u8] to Value. The compiler may well optimize all of this away, making the above considerations moot; but the analysis helps us understand the two types Vec<u8> and &[u8], and Rust's concept of ownership, more deeply. The final conversion code is as follows:
// convert &[u8], Vec<u8>, &str and String into Value
impl From<&[u8]> for Value {
    fn from(v: &[u8]) -> Self {
        vec_to_short_mid_str(v).unwrap_or(Value::LongStr(Rc::new(v.to_vec())))
    }
}
impl From<&str> for Value {
    fn from(s: &str) -> Self {
        s.as_bytes().into() // &[u8]
    }
}
impl From<Vec<u8>> for Value {
    fn from(v: Vec<u8>) -> Self {
        vec_to_short_mid_str(&v).unwrap_or(Value::LongStr(Rc::new(v)))
    }
}
impl From<String> for Value {
    fn from(s: String) -> Self {
        s.into_bytes().into() // Vec<u8>
    }
}

fn vec_to_short_mid_str(v: &[u8]) -> Option<Value> {
    let len = v.len();
    if len <= SHORT_STR_MAX {
        let mut buf = [0; SHORT_STR_MAX];
        buf[..len].copy_from_slice(v);
        Some(Value::ShortStr(len as u8, buf))
    } else if len <= MID_STR_MAX {
        let mut buf = [0; MID_STR_MAX];
        buf[..len].copy_from_slice(v);
        Some(Value::MidStr(Rc::new((len as u8, buf))))
    } else {
        None
    }
}
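With these four impls in place, every source type converts through the same into() call, and which variant comes out depends only on the length relative to the SHORT_STR_MAX and MID_STR_MAX thresholds. A hypothetical usage sketch:

let v1: Value = "hi".into();                  // &str -> ShortStr
let v2: Value = vec![0xE4, 0xBD].into();      // Vec<u8> -> ShortStr; arbitrary bytes are fine
let v3: Value = String::from("hello").into(); // String -> ShortStr, via Vec<u8>
let v4: Value = "a".repeat(100).into();       // long String -> LongStr, transferring ownership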
Reverse Conversion
The conversions from Value to String and &str were implemented before. Now add the conversion to &[u8]. First the code:
impl<'a> From<&'a Value> for &'a [u8] {
    fn from(v: &'a Value) -> Self {
        match v {
            Value::ShortStr(len, buf) => &buf[..*len as usize],
            Value::MidStr(s) => &s.1[..s.0 as usize],
            Value::LongStr(s) => s,
            _ => panic!("invalid string Value"),
        }
    }
}

impl<'a> From<&'a Value> for &'a str {
    fn from(v: &'a Value) -> Self {
        std::str::from_utf8(v.into()).unwrap()
    }
}

impl From<&Value> for String {
    fn from(v: &Value) -> Self {
        String::from_utf8_lossy(v.into()).to_string()
    }
}
- Since all three string variants of Value are contiguous u8 sequences, converting to &[u8] is easy.
- The conversion to &str feeds the &[u8] just obtained into std::str::from_utf8(). This function involves no new memory allocation; it only checks that the bytes are valid UTF-8. If they are not, the conversion fails, and here we simply panic via unwrap().
- The conversion to String feeds the &[u8] just obtained into String::from_utf8_lossy(). This function also checks UTF-8 validity, but when the check fails it substitutes the special character U+FFFD for the illegal data. The original data cannot be modified in place, so a new string is created for this; when the check succeeds, no new data needs to be created, and a reference to the original data is returned. The return type of this function, Cow, is also worth learning.
The two functions above handle failure differently because &str carries no ownership: it cannot create new data, so it can only report an error. It is clear that ownership is a key concept in the Rust language.
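The behavior of from_utf8_lossy(), and its Cow return type, can be observed with a standalone sketch:

use std::borrow::Cow;

fn main() {
    // Valid UTF-8: no allocation; the original bytes are borrowed.
    let ok = String::from_utf8_lossy(&[0xE4, 0xBD, 0xA0]);
    assert!(matches!(ok, Cow::Borrowed(_)));
    assert_eq!(ok, "你");

    // Invalid UTF-8: a new String is created, with U+FFFD substituted in.
    let bad = String::from_utf8_lossy(&[0xE4, 0xBD]);
    assert!(matches!(bad, Cow::Owned(_)));
    assert_eq!(bad, "\u{FFFD}");
}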
The conversion from Value to String is currently needed only when setting the global variable table. Note that it always calls .to_string() to create a new string, which defeats the string optimizations of this chapter (mainly those of the first section). Later, after the Lua table structure is introduced, the index type of the global variable table will be changed from String to Value, and the operations on the global variable table will no longer need this conversion. The conversion is still used elsewhere, however.
Test
So far, Lua's string support is more complete, and the test code at the beginning of this section now produces the correct output. Escape sequences are also handled, which can be verified with the following test code:
print "tab:\thi" -- tab
print "\xE4\xBD\xA0\xE5\xA5\xBD" -- 你好
print "\xE4\xBD" -- invalid UTF-8
print "\72\101\108\108\111" -- Hello
print "null: \0." -- '\0'
Summary
This chapter has studied Rust's string types, touching on ownership, memory allocation, and Unicode and UTF-8 encoding, and it gives a deep appreciation of what "The Rust Programming Language" says: Rust strings are complicated because strings themselves are complicated. Through all this, Lua's string type was optimized, and generics and the From trait came up along the way. Although it did not add new features to our Lua interpreter, we still gained a lot.