Unicode and UTF-8
The previous sections of this chapter refined the string-related code, clarifying some issues but also introducing some confusion. For example, in the definition of the three string types in Value, some variants are of [u8] type and some are of String type:
pub enum Value {
    ShortStr(u8, [u8; SHORT_STR_MAX]),   // [u8] type
    MidStr(Rc<(u8, [u8; MID_STR_MAX])>), // [u8] type
    LongStr(Rc<String>),                 // String type
    // other variants omitted
}
Another example is the mixed use of "byte" and "character" in the previous section. The same is true for the lexical analysis code, which reads bytes of type u8 from the input stream, but converts them to characters of type char via as:
fn read_char(&mut self) -> char {
    match self.input.next() {
        Some(Ok(ch)) => ch as char, // u8 -> char
The reason these confusions haven't caused problems so far is that our test programs involve only ASCII characters. Problems arise as soon as other characters appear. For example, take the following Lua code:
print "你好"
The execution result is wrong:
$ cargo r -q -- test_lua/nihao.lua
constants: [print, ä½ å¥½]
byte_codes:
GetGlobal(0, 0)
LoadConst(1, 1)
Call(0, 1)
ä½ å¥½
The output is not the expected 你好, but ä½ å¥½. Let's explain the reason for this "garbled code" and fix the problem.
Unicode and UTF-8 Concepts
These are both very broad topics; only the most basic introduction is given here.
Unicode assigns a unified code to most of the characters in the world. To stay compatible with ASCII, the codes assigned to the ASCII character set are kept identical. For example, the English letter p is 0x70 in ASCII and U+0070 in Unicode notation; the Unicode code of the Chinese character 你 is U+4F60.
Unicode only numbers a character; how the computer stores it is another matter. The simplest way is to store the Unicode code directly. Since Unicode currently defines more than 140,000 characters (and is still growing), at least 3 bytes are needed per character, so the English letter p would be 00 00 70 and the Chinese 你 would be 00 4F 60. The problem with this scheme is that even the ASCII part needs 3 bytes, which (for English text) is wasteful. So there are other encoding schemes, and UTF-8 is one of them. UTF-8 is a variable-length encoding: each ASCII character occupies only 1 byte, so the letter p is still encoded as 0x70, written \x70 in UTF-8 notation; each Chinese character occupies 3 bytes, for example the UTF-8 encoding of 你 is \xE4\xBD\xA0. The more detailed encoding rules of UTF-8 are omitted here. Here are a few examples:
char | Unicode | UTF-8
-----+---------+---------------
p | U+0070 | \x70
r | U+0072 | \x72
你 | U+4F60 | \xE4\xBD\xA0
好 | U+597D | \xE5\xA5\xBD
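These values are easy to verify in Rust itself, since a Rust char is a Unicode code point and string literals are stored as UTF-8. A minimal sketch:

fn main() {
    // A Rust char is a Unicode code point.
    assert_eq!('p' as u32, 0x70);
    assert_eq!('你' as u32, 0x4F60);

    // Rust string literals are stored as UTF-8 bytes.
    assert_eq!("p".as_bytes(), &[0x70]);
    assert_eq!("你好".as_bytes(), &[0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD]);
}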
Garbled Code
After introducing these encoding concepts, let's analyze the reason for the garbled characters produced by the Lua test code at the beginning of this section. Use hexdump to view the source file:
$ hexdump -C test_lua/nihao.lua
00000000 70 72 69 6e 74 20 22 e4 bd a0 e5 a5 bd 22 0a |print "......".|
# p r i n t " |--你---| |--好---| "
The last line is a comment I added, marking each Unicode character. You can see that the encodings of p and 你 match the UTF-8 encodings introduced above, which indicates that this file is UTF-8 encoded. (How a file is encoded depends on the text editor and operating system used.)
Our current lexical analysis reads "bytes" one by one, so it sees the Chinese character 你 as 3 independent bytes, namely e4, bd and a0, and converts each to char with as. Rust's char is a Unicode code point, so this yields 3 Unicode characters: looking them up, they are ä, ½ and a blank character (the last one is U+00A0). This is exactly the first half of the "garbled characters" we saw at the beginning; the following 好 corresponds to the second half in the same way. The 6 characters represented by these 6 bytes are pushed one by one into Token::String (a Rust String), and finally printed by println!. Rust's String is UTF-8 encoded, but that does not affect the output.
To summarize the process that produces the garbled characters:
- The source file is UTF-8 encoded;
- Read byte by byte, at this time UTF-8 encoding has been fragmented;
- Each byte is interpreted as Unicode, resulting in garbled characters;
- Storage and printing.
You can also verify this with the following Rust code:
fn main() {
    let s = String::from("print \"你好\""); // Rust's String is UTF-8 encoded, so it can simulate the Lua source file
    println!("string: {}", &s); // normal output
    println!("bytes in UTF-8: {:x?}", s.as_bytes()); // view the UTF-8 encoding

    print!("Unicode: ");
    for ch in s.chars() { // read "characters" one by one, and check their Unicode codes
        print!("{:x} ", ch as u32);
    }
    println!("");

    let mut x = String::new();
    for b in s.as_bytes().iter() { // read "bytes" one by one
        x.push(*b as char); // as char: each byte is interpreted as Unicode, producing garbled characters
    }
    println!("wrong: {}", x);
}
Run this code to see the result for yourself.
The core of the garbled problem lies in the conversion from "byte" to "char". So there are 2 workarounds:
- When reading the source code, change it to read one "character" (char) at a time. This solution has bigger problems:
  - The input type of the Lex we introduced in the previous section is the Read trait, which only supports reading by "byte". To read by "character", the input would first have to be converted to the String type, which requires the BufRead trait; that places stricter requirements on the input, and for example a Cursor<T> wrapped around a string does not support it.
  - If the source input is UTF-8 encoded and Rust's final storage is also UTF-8 encoded, then reading by Unicode "character" would require two meaningless conversions, from UTF-8 to Unicode and back to UTF-8.
  - The most important reason, which will be discussed shortly: a Lua string can contain arbitrary data, not necessarily legal UTF-8 content, so it may not be convertible to "characters" at all.
- When reading the source code, still read byte by byte; and when saving, do not convert to "characters" anymore, but save the "bytes" directly. This makes it impossible to keep using Rust's String type for storage; the specific solution is shown below.
It is obvious (it only seems obvious now; at the beginning I was confused and fumbled for a long time) that we should choose the second option.
String Definition
Now let's look at the difference between string contents in the Lua and Rust languages.
Lua's introduction to strings says: "We can specify any byte in a short literal string." In other words, a Lua string can represent arbitrary data. Rather than a string, it is better to think of it as a sequence of contiguous data whose content is not interpreted.
And the Rust documentation introduces the String type as: "A UTF-8–encoded, growable string." Easy to understand, with two features: UTF-8 encoded, and growable. Lua strings are immutable and Rust's are growable, but that distinction is beyond the scope of this discussion. The focus here is on the first feature: UTF-8 encoding means a Rust string cannot store arbitrary data. This can be observed in the definition of Rust's String:
pub struct String {
    vec: Vec<u8>,
}
You can see that String is an encapsulation of Vec<u8>. It is through this encapsulation that the data in vec is guaranteed to be legal UTF-8, with no arbitrary data mixed in. If arbitrary data were allowed, the alias type String = Vec<u8>; would have sufficed.
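This guarantee is easy to observe: String::from_utf8() takes ownership of a Vec<u8>, but refuses bytes that are not valid UTF-8, while the Vec<u8> itself accepts anything. A minimal sketch:

fn main() {
    // Valid UTF-8: the Vec<u8> is taken over as-is, with no copy.
    let ok = String::from_utf8(vec![0xE4, 0xBD, 0xA0]);
    assert_eq!(ok.unwrap(), "你");

    // A truncated UTF-8 sequence: the encapsulation rejects it.
    let bad = String::from_utf8(vec![0xE4, 0xBD]);
    assert!(bad.is_err());

    // A plain Vec<u8> holds arbitrary bytes, just like a Lua string.
    let raw: Vec<u8> = vec![0xE4, 0xBD, 0x00, 0xFF];
    assert_eq!(raw.len(), 4);
}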
To sum up, Rust's String is only a subset of Lua's string; the Rust type corresponding to the Lua string type is not String, but Vec<u8>, which can store arbitrary data.
Modify the Code
Now that we have figured out the cause of the garbled characters and analyzed the difference between Rust and Lua strings, we can start modifying the interpreter code. The places that need to be modified are as follows:
- The type associated with Token::String in the lexical analysis is changed from String to Vec<u8>, to support arbitrary data rather than only legal UTF-8 encoded data.
- Correspondingly, the type associated with Value::LongStr is also changed from String to Vec<u8>. This makes it consistent with the other two string types, ShortStr and MidStr.
- In the lexical analysis, the original reading functions peek_char() and read_char() are renamed peek_byte() and next_byte(), and their return type is changed from "char" to "byte". Although the old names said char, they actually read "bytes" one at a time all along, so the function bodies need no change this time.
- Character constants matched in the code, such as 'a', are changed to byte constants, such as b'a'.
- The original read_char() returned \0 at the end of input, because \0 was regarded as a special character back then. Now a Lua string may contain any value, including \0, so \0 can no longer mark the end of input. Rust's Option is needed here, and the return type becomes Option<u8>. (A sketch of the two reading functions follows this list.)
  This makes the function slightly inconvenient to call: pattern matching (if let Some(b) =) is needed every time to get the byte out. Fortunately, this function is called in only a few places. The other function, peek_byte(), is called in many places, and in principle its return type should also become Option<u8>; but in practice the byte it returns is only used to "take a look": as long as it matches none of the possible paths, it has no effect. So at the end of input this function can still return \0, because \0 matches no possible path; if the input really has ended, the following next_byte() call will handle it.
  It is precisely the inconvenience of Option (it must be matched to get the value out) that shows its value. In my C programming experience, such a special case in a function's return value is generally represented by a special value, such as NULL for pointer types and 0 or -1 for int types. This brings two problems: first, the caller may fail to handle the special value, leading directly to bugs; second, the special value may later become an ordinary value (our \0 this time is a typical example), and then every place that calls the function must be modified. Rust's Option solves both problems perfectly.
- In the lexical analysis, strings support escape sequences. This part is all tedious character handling, and its introduction is omitted here.
- Add impl From<Vec<u8>> for Value to convert the string constants in Token::String(Vec<u8>) to the Value type. This involves many details of Vec and strings, which are cumbersome and have little to do with the main line; the next two sections are devoted to them.
Conversion from &str, String, &[u8] and Vec<u8> to Value
The conversions from String and &str to Value were implemented before. Now add the conversions from Vec<u8> and &[u8] to Value. The relationship between these 4 types is as follows:
          slice
&[u8] <---------> Vec<u8>
                     ^
                     | encapsulate
          slice      |
&str  <---------> String
- String is an encapsulation of Vec<u8>; the encapsulated Vec<u8> can be taken back out with into_bytes().
- &str is a slice of String (it can be considered a reference).
- &[u8] is a slice of Vec<u8>.
So String and &str can be handled through Vec<u8> and &[u8] respectively. It also seems that Vec<u8> and &[u8] could depend on each other, i.e. only one of the two would need a direct conversion to Value. However, that would cost performance. The analysis is as follows:
- Source type: Vec<u8> owns its data, while &[u8] does not.
- Destination type: Value::ShortStr/MidStr only need to copy the string content (into the Value itself and into an Rc, respectively), without taking ownership of the source data; Value::LongStr needs to take ownership of a Vec.
2 source types and 2 destination types give 4 conversion combinations:
          | Value::ShortStr/MidStr  | Value::LongStr
----------+-------------------------+----------------------------------
 &[u8]    | 1. copy string content  | 2. create a Vec, allocating memory
 Vec<u8>  | 3. copy string content  | 4. transfer ownership
If we directly implement only the conversion from Vec<u8>, then a &[u8] must first be turned into a Vec<u8> with .to_vec() and converted to Value indirectly. But in case 1 above, only the string content needs to be copied, so the Vec created by .to_vec() is wasted.
If we directly implement only the conversion from &[u8], then a Vec<u8> must first be converted to &[u8] by reference and then converted to Value indirectly. In case 4 above, that means first taking a reference to get &[u8], and then creating a new Vec with .to_vec() to obtain ownership: one unnecessary allocation.
So for efficiency, it is better to directly implement the conversions from both Vec<u8> and &[u8] to Value. The compiler may well optimize all of this away, making the above considerations moot; but the analysis helps us understand the two types Vec<u8> and &[u8], and Rust's concept of ownership, more deeply. The final conversion code is as follows:
// convert &[u8], Vec<u8>, &str and String into Value
impl From<&[u8]> for Value {
    fn from(v: &[u8]) -> Self {
        vec_to_short_mid_str(v).unwrap_or(Value::LongStr(Rc::new(v.to_vec())))
    }
}
impl From<&str> for Value {
    fn from(s: &str) -> Self {
        s.as_bytes().into() // &[u8]
    }
}
impl From<Vec<u8>> for Value {
    fn from(v: Vec<u8>) -> Self {
        vec_to_short_mid_str(&v).unwrap_or(Value::LongStr(Rc::new(v)))
    }
}
impl From<String> for Value {
    fn from(s: String) -> Self {
        s.into_bytes().into() // Vec<u8>
    }
}

fn vec_to_short_mid_str(v: &[u8]) -> Option<Value> {
    let len = v.len();
    if len <= SHORT_STR_MAX {
        let mut buf = [0; SHORT_STR_MAX];
        buf[..len].copy_from_slice(v);
        Some(Value::ShortStr(len as u8, buf))
    } else if len <= MID_STR_MAX {
        let mut buf = [0; MID_STR_MAX];
        buf[..len].copy_from_slice(v);
        Some(Value::MidStr(Rc::new((len as u8, buf))))
    } else {
        None
    }
}
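With these four impls in place, every source type converts through the same into() call, and which variant comes out depends only on the length relative to the SHORT_STR_MAX and MID_STR_MAX thresholds. A hypothetical usage sketch:

let v1: Value = "hi".into();                  // &str -> ShortStr
let v2: Value = vec![0xE4, 0xBD].into();      // Vec<u8> -> ShortStr; arbitrary bytes are fine
let v3: Value = String::from("hello").into(); // String -> ShortStr, via Vec<u8>
let v4: Value = "a".repeat(100).into();       // long String -> LongStr, transferring ownership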
Reverse Conversion
The conversions from Value to String and &str were implemented before. Now add the conversion to &[u8]. First the code:
impl<'a> From<&'a Value> for &'a [u8] {
    fn from(v: &'a Value) -> Self {
        match v {
            Value::ShortStr(len, buf) => &buf[..*len as usize],
            Value::MidStr(s) => &s.1[..s.0 as usize],
            Value::LongStr(s) => s,
            _ => panic!("invalid string Value"),
        }
    }
}

impl<'a> From<&'a Value> for &'a str {
    fn from(v: &'a Value) -> Self {
        std::str::from_utf8(v.into()).unwrap()
    }
}

impl From<&Value> for String {
    fn from(v: &Value) -> Self {
        String::from_utf8_lossy(v.into()).to_string()
    }
}
- Since all three string variants of Value are contiguous u8 sequences, converting to &[u8] is easy.
- The conversion to &str feeds the &[u8] just obtained into std::str::from_utf8(). This function involves no new memory allocation; it only checks that the bytes are valid UTF-8. If they are not, the conversion fails, and here we simply panic via unwrap().
- The conversion to String feeds the &[u8] just obtained into String::from_utf8_lossy(). This function also checks UTF-8 validity, but when the check fails it substitutes the special character U+FFFD for the illegal data. The original data cannot be modified in place, so a new string is created for this; when the check succeeds, no new data needs to be created, and a reference to the original data is returned. The return type of this function, Cow, is also worth learning.
The two functions above handle failure differently because &str carries no ownership: it cannot create new data, so it can only report an error. It is clear that ownership is a key concept in the Rust language.
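The behavior of from_utf8_lossy(), and its Cow return type, can be observed with a standalone sketch:

use std::borrow::Cow;

fn main() {
    // Valid UTF-8: no allocation; the original bytes are borrowed.
    let ok = String::from_utf8_lossy(&[0xE4, 0xBD, 0xA0]);
    assert!(matches!(ok, Cow::Borrowed(_)));
    assert_eq!(ok, "你");

    // Invalid UTF-8: a new String is created, with U+FFFD substituted in.
    let bad = String::from_utf8_lossy(&[0xE4, 0xBD]);
    assert!(matches!(bad, Cow::Owned(_)));
    assert_eq!(bad, "\u{FFFD}");
}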
The conversion from Value to String is currently needed only when setting the global variable table. Note that it always calls .to_string() to create a new string, which defeats the string optimizations of this chapter (mainly those of the first section). Later, after the Lua table structure is introduced, the index type of the global variable table will be changed from String to Value, and the operations on the global variable table will no longer need this conversion. The conversion is still used elsewhere, however.
Test
So far, Lua's string support is more complete, and the test code at the beginning of this section now produces the correct output. Escape sequences are also handled, which can be verified with the following test code:
print "tab:\thi" -- tab
print "\xE4\xBD\xA0\xE5\xA5\xBD" -- 你好
print "\xE4\xBD" -- invalid UTF-8
print "\72\101\108\108\111" -- Hello
print "null: \0." -- '\0'
Summary
This chapter has studied Rust's string types, touching on ownership, memory allocation, and Unicode and UTF-8 encoding, and it gives a deep appreciation of what "The Rust Programming Language" says: Rust strings are complicated because strings themselves are complicated. Through all this, Lua's string type was optimized, and generics and the From trait came up along the way. Although it did not add new features to our Lua interpreter, we still gained a lot.