String Definition

This section does not add new features for now, but stops to discuss and optimize the string type.

Ownership section in the book "Rust Programming Language" introduces the heap using strings as an example of stack and heap, and the relationship with ownership; String section says the strings are complex. We will now use strings to explore Rust's allocation of heaps and stacks, and initially experience the complexity of strings.

Heap and Stack

"Rust Programming Language" uses strings as an example to introduce the concept of heap and stack, and the relationship between stack and ownership. Here is a brief recap. Rust's String consists of two parts:

  1. Metadata, generally located on the stack, includes 3 fields: a pointer to the memory block, the length of the string, and the capacity of the memory block. The following are represented by buffer, len and cap respectively.
  2. The private memory block used to store the string content is applied on the heap. Owned by the string, so is freed when the string ends. Because of this piece of memory on the heap, String is not Copy, which in turn leads to Value not being Copy. In order to copy Value, it can only be defined as Clone.

For example, for a string whose content is "hello, world!", the memory layout is as follows. On the left is the metadata on the stack, where buffer points to the memory block on the heap, len is the length of the string, which is 13, and cap is the capacity of the memory block, which is likely to be aligned to 16. On the right is the block of memory on the heap that stores the contents of the string.

    stack             heap
    +--------+
    | buffer +------->+----------------+
    |--------|        |hello, world!   |
    | len=13 |        +----------------+
    |--------|
    | cap=16 |
    +--------+

What needs to be explained here is that the metadata "generally" located on the stack above is for simple types. But for complex types, such as Vec<String>, the String metadata part is also stored on the heap as the content of the array (similar to the memory block part of String). Below is an array Vec with 2 string members. The metadata of the array itself is on the stack, but the metadata of the string is on the heap.

    stack             heap
    +--------+
    | buffer +------->+-------------+-------------+----
    |--------|        | buf|len|cap | buf|len|cap | ...
    | len=2  |        +--+----------+--+----------+----
    |--------|           |             V
    | cap=4  |           V             +----------------+
    +--------+        +--------+       |hello, world!   |
                      |print   |       +----------------+
                      +--------+

In this case, although the metadata array part is on the heap, it still has the characteristics of the stack, including last-in-first-out, fast access through indexes, fixed known size, and no need for management (allocation and free). In fact, the stack of the virtual machine of our Lua interpreter is a similar Vec<Value> type. Similarly, although its data is on the heap, it has the characteristics of a stack. The term "stack" has two meanings here: the stack at the Rust level, and the stack at the Lua virtual machine. The latter is on the heap at the Rust level. The "stack" mentioned below in this article is the latter meaning, that is, the stack of the Lua virtual machine. But it doesn't matter if you understand it as a Rust stack.

Use String

Currently the string type of Value uses String in the Rust standard library directly:

#[derive(Clone)]
struct Value {
     String(String),

The biggest problem with this definition is that if you want to copy the Value of a string, you must deeply copy the string content, that is, Clone. The following diagram represents the memory layout for copying a string:

    stack              heap
    |        |
    +--------+
    |t|      |
    |-+------|
    | buffer +------->+----------------+
    |--------|        |hello, world!   |
    | len=13 |        +----------------+
    |--------|
    | cap=16 |
    +--------+
    :        :
    :        :
    +--------+
    |t|      |
    |-+------|
    | buffer +------->+----------------+
    |--------|        |hello, world!   |
    | len=13 |        +----------------+
    |--------|
    | cap=16 |
    +--------+
    |        |

The left side of the figure is the stack of the Lua virtual machine, and each line represents a word. Since we are developing based on a 64-bit system, a word is 8 bytes.

The t in line 1 represents the tag of enum Value. Since our Value type is less than 256 types, 1 byte can be represented, so t occupies 1 byte. The next 3 lines buffer, len and cap form a Rust standard library String. Each field occupies one word. buffer is 8-byte aligned, so there are 7 bytes empty between t and this part is unusable. These 4 lines (the rectangle surrounded by four + in the figure) constitute a value of string type in total.

There is no default layout for enums in Rust (although it can be specified). We only list one layout possibility here. This does not affect the discussion in this section.

To deeply copy the string Value, you need to copy the metadata on the stack and the memory block on the heap, which is a great waste of performance and memory. The most straightforward way to solve this problem in Rust is to use Rc.

Use Rc<String>

In order to quickly copy the string String, it is necessary to allow multiple owners of the string at the same time. Rust's Rc provides this feature. Encapsulate Rc outside String, and only need to update the Rc count when copying. It is defined as follows:

#[derive(Clone)]
struct Value {
     String(Rc<String>),

The memory layout is as follows:

    stack             heap
    |        |
    +--------+
    |t|      |
    |-+------|
    |   Rc   +----+-->+--------+--------+--------+--------+--------+
    +--------+    |   |count=2 | weak=0 | buffer | len=13 | cap=16 |
    :        :    |   +--------+--------+-+------+--------+--------+
    :        :    |                       |
    +--------+    |                       V
    |t|      |    |                       +----------------+
    |-+------|    |                       |hello, world!   |
    |   Rc   +----/                       +----------------+
    +--------+ 
    |        |

The count and weak on the right side of the figure are the packages of Rc. Since there are currently 2 Values pointing to this string, count is 2.

Using Rc directly causes the interpreter to use reference counting to implement garbage collection. The subsection below is devoted to this high-impact decision.

Although this solution solves the problem of copying, it also brings a new problem, that is, accessing the content of the string requires 2 pointer jumps. This wastes memory and affects execution performance. Some optimization schemes are introduced below.

Use Rc<str>

Strings in Lua have a feature that they are read-only! If you want to process the string, such as truncation, connection, replacement, etc., a new string will be generated. Rust's String is designed for mutable strings, so it is a bit wasteful to represent read-only strings. For example, the cap field in metadata can be removed, and there is no need to reserve memory for possible modifications. For example, in the above example, the length of "hello, world!" is only 13, but a memory block of 16 is allocated for. In Rust, it is more suitable to represent a read-only string is &str, which is the slice of String. But &str is a reference, and does not have ownership of the string, but needs to be attached to a string. However, it has a reference that is not a string (the reference of a string is &String). Intuitively, it should be a reference to str. What is str? It's as if it never appeared alone.

For example. For code like this:

let s = String::from("hello, world!"); // String
let r = s[7..12]; // &str

Where r is &str type, and the memory layout is as follows:

    stack             heap
s:  +--------+
    | buffer +------->+----------------+
    |--------|        |hello, world!   |
    | len=13 |        +-------^--------+
    |--------|                |
    | cap=16 |                |
    +--------+                |
                              |
r:  +--------+                |
    | buffer +----------------/
    |--------|
    | len=5  |
    +--------+

Then dereferencing &str, what you get is the "world" memory. However, a general reference is an address, but length information is also added here, indicating that str includes length information in addition to memory. The length information is not on the original data like String, but follows the reference together. In fact, str does not exist independently, it must follow a reference (such as &str) or a pointer (such as Box(str)). This is dynamic size type.

And Rc is also a pointer, so Rc<str> can be defined. It is defined as follows:

#[derive(Clone)]
struct Value {
     String(Rc<str>),

The memory layout is as follows:

    stack             heap
    |        |
    +--------+
    |t|      |
    |-+------|
    |   Rc   +----+-->+--------+--------+-------------+
    |--------|    |   |count=2 | weak=0 |hello, world!|
    | len=13 |    |   +--------+--------+-------------+
    +--------+    |
    :        :    |
    :        :    |
    +--------+    |
    |t|      |    |
    |-+------|    |
    |   Rc   +----/
    +--------+
    | len=13 |
    +--------+
    |        |

Among them, "hello, world!" is the original data, which is encapsulated by Rc. The length information len=13 is stored on the stack along with Rc.

This scheme looks very good! Compared with the Rc<String> scheme above, this scheme removes the useless cap field, does not need to reserve memory, and also saves a layer of pointer jumps. But this solution also has 2 problems:

First, the content needs to be copied when creating the string value. The previous solution only needs to copy the metadata part of the string, which is only 3 words long. And this scheme should copy the string content to the newly created Rc package. Imagine creating a 1M long string, this copying affects performance a lot.

Secondly, it occupies 2 words of space on the stack. Although the problem was more serious in the earliest scheme of directly using String to occupy 3 characters, it can be understood that our current standard has been improved. At present, other types in Value only occupy a maximum of 1 word (plus tag, a total of 2 words). What can be spoiled is that the types of tables and UserData to be added in the future also only occupy 1 word, so it is a waste to change the size of Value from 2 to 3 just because of the string type. Not only does it take up more memory, but it's also unfriendly to the CPU cache.

The key to these problems is that len follows Rc, not the data. It would be perfect if we could put len on the heap, say between weak and "hello, world!" in the picture. This is trivial for C, but Rust doesn't support it. The reason is that str is a dynamically sized type. So if you choose a fixed size type, can it be realized? Such as arrays.

Use Rc<(u8, [u8; 47])>

Arrays in Rust have intrinsic size information. For example, the sizes of [u8; 10] and [u8; 20] are 10 and 20 respectively. This size is known at compile time, and there is no need to store it following the pointer. Two arrays with different lengths are different types, such as [u8; 10] and [u8; 20] are different types. Therefore, the array is a fixed-size type, which can solve the problem in the previous section, that is, only one word is needed on the stack.

Since it is a fixed length, it can only store strings smaller than this length, so this solution is incomplete and can only be a supplementary solution for performance optimization. However, most of the strings encountered in Lua are very short, at least in my experience, so this optimization is still very meaningful. To do this, we need to define 2 string types, one is a fixed-length array, which is used to optimize short strings, and the other is the previous Rc<String> scheme, which is used to store long strings. The first byte of the fixed-length array is used to represent the actual length of the string, so the array can be split into two parts. Let's first assume that an array with a total length of 48 is used (1 byte represents the length, and 47 bytes store the string content), then the definition is as follows:

struct Value {
     FixStr(Rc<(u8, [u8; 47])>), // len<=47
     String(Rc<String>), // len>47

The memory layout for short strings is as follows:

    stack              heap
    |        |
    +--------+
    |t|      |
    |-+------|
    |   Rc   +----+-->+--------+--------+----------------------------+
    +--------+    |   |count=2 | weak=0 |len|hello, world!           |
    :        :    |   +--------+--------+----------------------------+
    :        :    |
    +--------+    |
    |t|      |    |
    |-+------|    |
    |   Rc   +----/
    +--------+
    |        |

The first byte len at the beginning of the array part on the right in the figure indicates the actual length of the following string. The next 47 bytes can be used to store string content.

This solution is the same as Rc<str> mentioned above, it needs to copy the string content, so it is not suitable for long strings. This is not a big problem. Originally, this solution was designed to optimize short strings. Then even if it is a short string, the selection of the array length is also critical. If it is very long, the space waste is serious for short strings; if it is very short, the coverage ratio is not high. However, this solution can continue to be optimized, using arrays with multi-level lengths, such as 16, 32, 48, 64, etc. However, this also creates some complications.

In addition, the selection of the array length also depends on the memory management library used by Rust. For example, if we choose the length to be 48, plus the two counting fields encapsulated by Rc of 16 bytes, then the length of the memory block on the right heap in the above figure is 64 bytes, which is a very "regular" length. For example, the memory management library jemalloc manages small memory blocks into lengths of 16, 32, 48, 64, and 128, so the above-mentioned memory application with a total length of 64 is not wasted. If we choose the array length to be 40, the total length of the memory block is 56, and it will still be matched to the category of 64, and 64-56=8 bytes will be wasted. Of course, it is very bad behavior to rely on the specific implementation of other libraries to make decisions, but fortunately, the impact is not great.

Here we choose an array length of 48, that is, only strings with lengths from 0 to 47 can be represented.

Then compare it with the Rc<String> scheme to see how the optimization works. First of all, the biggest advantage of this solution is that only one memory allocation is required, and only one pointer jump is required during execution.

Second, compare the allocated memory size. In the Rc<String> scheme, you need to apply for 2 blocks of memory: one is the Rc count and string metadata, fixed 2+3=5 words, 40 bytes, according to the memory strategy of jemalloc, it will occupy 48 bytes of memory ; The second is the string content. The memory size is related to the length of the string, and also depends on the memory management strategy of Rust String and the implementation of the underlying library. For example, a string with a length of 1 may occupy 16 bytes of memory; A character string with a length of 47 may occupy 48 bytes or 64 bytes of memory. The two blocks of memory together occupy 64 to 112 bytes, which is greater than or equal to this fixed-length array solution.

Let's look at the next solution along the line of "optimizing short strings".

Use Inline Arrays

Compared with Rc<String>, the previous solution reduces one layer of pointer jumps. The following solution goes a step further, directly removing the storage on the heap, and storing the string completely on the stack.

We want the size of the Value type to be 2 words, or 16 bytes. One of them is used for tag, and one is used for string length, so there are 14 bytes remaining, which can be used to store strings with a length less than 14. This scheme is also a supplementary scheme, and it must also be used in conjunction with a long string definition. details as follows:

// sizeof(Value) - 1(tag) - 1(len)
const INLSTR_MAX: usize = 14;

struct Value {
     InlineStr(u8, [u8; INLSTR_MAX]), // len<=14
     String(Rc<String>), // len>14

The short string InlineStr is associated with two parameters: the string length of the u8 type, and the u8 array with a length of 14, which also makes full use of the 7 bytes behind t that have been wasted before. hollow. The long string String still uses the Rc<String> scheme.

The memory layout for short strings is as follows:

     stack
    |        |
    +vv------+
    |tlhello,|
    |--------|
    | world! |
    +--------+
    :        :
    :        :
    +vv------+
    |tlhello,|
    |--------|
    | world! |
    +--------+
    |        |

The t and l pointed to by the arrow v on the grid represent the tag and length of 1 byte, respectively. The actual string content spans 2 words. If you draw the stack horizontally, it looks clearer:

stack:
    --+-+-+--------------+......+-+-+--------------+--
      |t|l|hello, world! |      |t|l|hello, world! |
    --+------------------+......+------------------+--

This solution has the best performance, but the worst ability, and can only handle strings with a length no greater than 14. There are three usage scenarios of string type in Lua: global variable name, table index, and string value. Most of the first two are not larger than 14 bytes, so it should be able to cover most cases.

It can be further optimized by adding another type to store strings with a length of 15. Since the length is known, one byte originally used to store the length can also be used to store the content of the string. However, the optimization brought by this solution is not obvious and less than the complexity brought, so it is not used. It is defined as follows.

struct Value {
     InlineStr(u8, [u8; INLSTR_MAX]), // len<=14
     Len15Str([u8; 15]), //len=15
     String(Rc<String>), // len>15

Summary and Choice

In this section, we used and analyzed String, Rc<String>, Rc<str>, Rc<(u8, [u8; 47])> and inline (u8, [u8; 14]) and several other schemes. Each has advantages and disadvantages. A reasonable approach is to treat long and short strings differently, use short strings to optimize, and use long strings to cover the bottom line. 3 options are available:

  • In order to guarantee the length of the Value type as 2 words, only Rc<String> can be used for long strings.
  • For short strings, the final inlining solution does not use heap memory at all, and the optimization effect is the best.
  • The penultimate fixed-length array scheme is a compromise between the above two schemes, which is slightly tasteless. However, there is only one disadvantage, which is to introduce greater complexity, and strings need to deal with 3 types. In the next section, generics are used to shield these three types, which solves this shortcoming.

The final solution is as follows:

const SHORT_STR_MAX: usize = 14; // sizeof(Value) - 1(tag) - 1(len)
const MID_STR_MAX: usize = 48 - 1;

struct Value {
     ShortStr(u8, [u8; SHORT_STR_MAX]),
     MidStr(Rc<(u8, [u8; MID_STR_MAX])>),
     LongStr(Rc<Vec<u8>>),

The original InlineStr and FixStr both represent specific implementation solutions, and the characteristics of external performance are long and short, so they are renamed ShortStr, MidStr and LongStr, which are more intuitive.

In this way, most cases (short strings) can be processed quickly, and for a small number of cases (long strings), although slow, they can also be processed correctly, and do not affect the overall situation (for example, Rc<str> takes up 2 word, directly makes Value larger, even if it affects the overall situation), and ultimately improves the overall processing efficiency. This is a very common and effective optimization idea. Our scheme achieves optimization by distinguishing between two sets of definitions, which is a typical example. It would be even more beautiful if this goal can be achieved with only one set of definitions and one set of algorithms without distinguishing definitions. We will encounter such an example later in the syntax analysis of assignment statement.

After distinguishing between long and short strings, it also brings two new problems:

  1. When generating the string type Value, we need to choose ShortStr, MidStr or LongStr according to the length of the string. This choice should be implemented automatically, not by the caller, otherwise it will be troublesome and may make mistakes. For example, the self.add_const(Value::String(var)) statement that appears many times in the syntax analysis code needs to be improved.

  2. Strings are composed of "characters", but ShortStr and MidStr are both composed of u8, what is the difference? How does u8 express Unicode correctly? How to deal with illegal characters?

The next few sections discuss these two issues.