Input Type

In the previous section we defined a function with generics. In fact, we "use" more generic types than "define". This chapter discusses another "use" example, which is the input type of the entire interpreter, that is, the lexical analysis module reads the source code.

Currently only reading source code from files is supported, and Rust's file type std::fs::File does not even include standard input. The lexical analysis data structure Lex is defined as follows:

pub struct Lex {
     input: File,
     // omit other members

The method read_char() for reading characters is defined as follows:

impl Lex {
     fn read_char(&mut self) -> char {
         let mut buf: [u8; 1] = [0];
         self.input.read(&mut buf).unwrap();
         buf[0] as char
     }

Here we only focus on the self.input.read() call.

Use Read

The official implementation of Lua supports two types of input files (including standard input) and strings as source code. According to the idea of Rust generics, the input we want to support may not be limited to some specific types, but a type that supports certain features (ie traits). In other words, as long as it is a character stream, you can read characters one by one. This feature is so common that the std::io::Read trait is provided in the Rust standard library. So modify the definition of Lex as follows:

pub struct Lex<R> {
     input: R,

There are two changes here:

  • Changed the original Lex to Lex<R>, indicating that Lex is based on the generic type R,
  • Change the original field input type File to R.

Correspondingly, the implementation part should also be changed:

impl<R: Read> Lex<R> {

Added <R: Read>, indicating that the constraint of <R> is Read, that is, the type R must support the Read trait. This is because the input.read() function is used in the read_char() method.

The read_char() method itself does not need to be modified, and the input.read() function can still be used normally, but its meaning has changed slightly:

  • When the input used the File type before, the read() function called was a method of the File type that implemented the Read trait;
  • The read() function is now called on all types that implement the Read trait.

The statement here is rather convoluted, so you can ignore it if you don’t understand it.

In addition, generic definitions must be added to other places where Lex is used. For example, the definition of ParseProto is modified as follows:

pub struct ParseProto<R> {
     lex: Lex<R>,

The parameter of its load() method is also changed from File to R:

     pub fn load(input: R) -> Self {

load() supports R just to create Lex<R>, and ParseProto does not use R directly. But <R> still needs to be added to the definition of ParseProto, which is a bit long-winded. What's more verbose is that if there are other types that need to include ParseProto, then <R> should also be added. This is called generic type propagate. This problem can be circumvented by defining dyn, which will also bring some additional performance overhead. However, here ParseProto is an internal type and will not be exposed to other types, so <R> in Lex is equivalent to only spreading one layer, which is acceptable, and dyn will not be adopted.

Now that Read is supported, types other than files can be used. Next look at using stdin like and string types.

Use Standard Input

The standard input std::io::Stdin type implements the Read trait, so it can be used directly. Modify the main() function to use standard input:

fn main() {
     let input = std::io::stdin(); // standard input
     let proto = parse::ParseProto::load(input);
     vm::ExeState::new().execute(&proto);
}

Test source code from standard input:

echo 'print "i am from stdin!"' | cargo r

Use String

The string type does not directly support the Read trait, because the string type itself does not have the function of recording the read position. Read can be realized by encapsulating std::io::Cursor type, which is used to encapsulates AsRef<[u8]> to support recording position. Its definition is clear:

pub struct Cursor<T> {
     inner: T,
     pos: u64,
}

This type naturally implements the Read trait. Modify the main() function to use strings as source code input:

fn main() {
     let input = std::io::Cursor::new("print \"i am from string!\""); // string+Cursor
     let proto = parse::ParseProto::load(input);
     vm::ExeState::new().execute(&proto);
}

Use BufReader

Reading and writing files directly is a performance-intensive operation. The above implementation only reads one byte at a time, which is very inefficient for file types. This frequent and small amount of file reading operation requires a layer of cache outside. The std::io::BufReader type in the Rust standard library provides this functionality. This type naturally also implements the Read trait, and also implements the BufRead trait using the cache, providing more methods.

I originally defined Lex's input field as BufReader<R> type, instead of R type above. But later it was found to be wrong, because when BufReader reads data, it first reads from the source to the internal cache, and then returns. Although it is very practical for file types, while the internal cache is unnecessary for string types, and there is one more unnecessary memory copy. And also found that the standard input std::io::Stdin also has its own cache already, so no need to add another layer. Therefore, BufReader is not used inside Lex, but let the caller add it according to the needs (for example, for File type).

Let’s modify the main() function to encapsulate BufReader outside the original File type:

fn main() {
     // omit parameter handling
     let file = File::open(&args[1]).unwrap();

     let input = BufReader::new(file); // encapsulate BufReader
     let proto = parse::ParseProto::load(input);
     vm::ExeState::new().execute(&proto);
}

Give Up Seek

At the beginning of this section, we only require that the input type supports character-by-character reading. In fact, it is not true, we also require that the read position can be modified, that is, the Seek trait. This is what the original putback_char() method requires, using the input.seek() method:

     fn putback_char(&mut self) {
         self.input.seek(SeekFrom::Current(-1)).unwrap();
     }

The application scenario of this function is that in lexical analysis, sometimes it is necessary to judge the type of the current character according to the next character. For example, after reading the character -, if the next character is still -, it is a comment; otherwise it is Subtraction, at this time the next character will be put back into the input source as the next Token. Previously introduced that the same is true for reading Token in syntax analysis, and the current statement type must be judged according to the next Token. At that time, the peek() function was added to Lex, which could "peek" at the next Token without consuming it. The peek() here and the putback_char() above are two ways to deal with this situation. The pseudo codes are as follows:

// Method 1: peek()
if input.peek() == xxx then
     input.next() // Consume the peek just now
     handle(xxx)
end

// Method 2: put_back()
if input.next() == xxx then
     handle(xxx)
else
     input.put_back() // plug it back and read it next time
end

When using the File type before, because the seek() function is supported, it is easy to support the put_back function later, so the second method is adopted. But now the input has been changed to Read type, if input.seek() is still used, then the input is also required to have std::io::Seek trait constraints. Among the three types we have tested above, the cached file BufReader<File> and the string Cursor<String> both support Seek, but the standard input std::io::Stdin does not support it, and there may be other input types that support Read but not Seek (such as std::net::TcpStream). If we add Seek constraints here, the road will be narrowed.

Since Seek cannot be used, there is no need to use the second method. You can also consider the first method, which is at least consistent with Token's peek() function.

The more straightforward approach is to add an ahead_char: char field in Lex to save the character peeked to, similar to the peek() function and the corresponding ahead: Token field. It's simpler to do this, but there's a more general way of doing it in the Rust standard library, using Peekable. Before introducing Peekable, let's look at the Bytes type it depends on.

Use Bytes

The implementation of the read_char() function listed at the beginning of this section is a bit complicated relative to its purpose (reading a character). I later discovered a more abstract method, the bytes() method of the Read triat, which returns an iterator Bytes, and each call to next() returns a byte. Modify the Lex definition as follows:

pub struct Lex<R> {
     input: Bytes::<R>,

Modify the constructor and read_char() function accordingly.

impl<R: Read> Lex<R> {
     pub fn new(input: R) -> Self {Lex {
             input: input.bytes(), // generate iterator Bytes
             ahead: Token::Eos,
         }
     }
     fn read_char(&mut self) -> char {
         match self.input.next() { // just call next(), simpler
             Some(Ok(ch)) => ch as char,
             Some(_) => panic!("lex read error"),
             None => '\0',
         }
     }

The code for read_char() does not seem to be reduced here. But its main body is just input.next() call, and the rest is the processing of the return value. After the error handling is added later, these judgment processing will be more useful.

Use Peekable

The peekable() method in the Bytes document, which returns the Peekable type, is exactly what we need. It based on the iterator, and we can "peek" a piece of data forward. Its definition is clear:

pub struct Peekable<I: Iterator> {
     iter: I,
     /// Remember a peeked value, even if it was None.
     peeked: Option<Option<I::Item>>,
}

To this end, modify the definition of Lex as follows:

pub struct Lex<R> {
     input: Peekable::<Bytes::<R>>,

Modify the constructor accordingly, and add the peek_char() function:

impl<R: Read> Lex<R> {
     pub fn new(input: R) -> Self {
         Lex {
             input: input.bytes().peekable(), // generate iterator Bytes
             ahead: Token::Eos,
         }
     }
     fn peek_char(&mut self) -> char {
         match self. input. peek() {
             Some(Ok(ch)) => *ch as char,
             Some(_) => panic!("lex peek error"),
             None => '\0',
         }
     }

Here input.peek() is basically the same as input.next() above, the difference is that the return type is a reference. This is the same as the reason why the Lex::peek() function returns &Token, because the owner of the returned value is still input, and it does not move out, but just "peek". But here we are of char type, which is Copy, so directly dereference *ch, and finally return char type.

Summary

So far, we have completed the optimization of the input type. From the beginning, only the File type is supported, and finally the Read trait is supported. There is not much content to sort out, but in the process of realization and exploration at the beginning, it took a lot of effort to bump into things. In this process, I also thoroughly figured out some basic types in the standard library, such as Read, BufRead, BufReader, also discovered and learned the Cursor and Peekable types, and also learned more about the official website documents way of organization. Learning the Rust language by doing is the ultimate goal of this project.