Input Type
In the previous section we defined a generic function. In practice, we use generic types far more often than we define them. This section discusses another example of using them: the input type of the whole interpreter, that is, the type from which the lexical analysis module reads the source code.
Currently only reading source code from files is supported, and Rust's file type std::fs::File does not even cover standard input. The lexical analysis data structure Lex is defined as follows:
pub struct Lex {
input: File,
// omit other members
The method read_char() for reading characters is defined as follows:
impl Lex {
fn read_char(&mut self) -> char {
let mut buf: [u8; 1] = [0];
self.input.read(&mut buf).unwrap();
buf[0] as char
}
Here we only focus on the self.input.read() call.
Use Read
The official implementation of Lua supports two kinds of input as source code: files (including standard input) and strings. Following the idea of Rust generics, the input we want to support should not be limited to a few specific types, but should be any type that supports certain features (i.e. traits). In other words, as long as it is a byte stream from which characters can be read one by one, it will do. This feature is so common that the Rust standard library provides the std::io::Read trait for it. So modify the definition of Lex as follows:
pub struct Lex<R> {
input: R,
There are two changes here:
- Changed the original Lex to Lex<R>, indicating that Lex is generic over the type R;
- Changed the type of the input field from File to R.
Correspondingly, the implementation part should also be changed:
impl<R: Read> Lex<R> {
<R: Read> is added, indicating that R is constrained by Read, that is, the type R must implement the Read trait. This is needed because the read_char() method calls input.read().
The read_char() method itself does not need to be modified, and input.read() can still be called as before, but its meaning has changed slightly:
- When input was of type File, the read() being called was the File type's implementation of the Read trait;
- Now the read() being called is the Read implementation of whatever type R is instantiated with.
The statement here is rather convoluted, so you can ignore it if you don’t understand it.
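For readers who prefer code to prose, the following is a minimal sketch of the same idea. MiniLex and read_n are hypothetical names used only for this illustration (they are not part of the project), and the sketch assumes that end of input may be reported as '\0'. The same read_char() body works for two unrelated input types, because the compiler picks the read() of whichever type R turns out to be.

```rust
use std::io::{Cursor, Read};

// Hypothetical, stripped-down stand-in for the chapter's Lex<R>.
pub struct MiniLex<R> {
    input: R,
}

impl<R: Read> MiniLex<R> {
    pub fn new(input: R) -> Self {
        MiniLex { input }
    }

    // Same body as the chapter's read_char(). A read of 0 bytes leaves
    // buf untouched, so end of input shows up as '\0'.
    pub fn read_char(&mut self) -> char {
        let mut buf: [u8; 1] = [0];
        self.input.read(&mut buf).unwrap();
        buf[0] as char
    }
}

// Helper for the demo: call read_char() `n` times on any reader.
pub fn read_n<R: Read>(input: R, n: usize) -> String {
    let mut lex = MiniLex::new(input);
    (0..n).map(|_| lex.read_char()).collect()
}

fn main() {
    // &[u8] and Cursor<&str> are different types, yet both satisfy R: Read.
    println!("{}", read_n(&b"print"[..], 5));
    println!("{}", read_n(Cursor::new("hello"), 5));
}
```

Each call site produces its own specialized copy of read_n and MiniLex, so no runtime dispatch is involved.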
In addition, generic definitions must be added to other places where Lex is used. For example, the definition of ParseProto is modified as follows:
pub struct ParseProto<R> {
lex: Lex<R>,
The parameter of its load() method is also changed from File to R:
pub fn load(input: R) -> Self {
load() takes R only in order to create Lex<R>; ParseProto does not use R directly. But <R> still has to be added to the definition of ParseProto, which is a bit long-winded. What's more verbose is that any other type that contains ParseProto must also add <R>. This is called generic type propagation. It can be avoided by using dyn (a trait object), at the cost of some additional runtime overhead. However, ParseProto is an internal type and is not embedded in other types, so the <R> in Lex propagates only one level. That is acceptable, so dyn is not adopted.
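For comparison, here is a rough sketch of what the dyn alternative could look like. It is not what this project does; DynLex, DynParseProto and first_char() are hypothetical names invented for the illustration. The reader is stored as a boxed trait object, so neither struct carries a type parameter.

```rust
use std::io::Read;

// Hypothetical variant of Lex: the reader is a trait object, so the
// struct has no type parameter and nothing propagates to its users.
pub struct DynLex {
    input: Box<dyn Read>,
}

impl DynLex {
    pub fn new(input: impl Read + 'static) -> Self {
        DynLex { input: Box::new(input) }
    }

    pub fn read_char(&mut self) -> char {
        let mut buf = [0u8; 1];
        // This call is dispatched through a vtable at run time.
        self.input.read(&mut buf).unwrap();
        buf[0] as char
    }
}

// Hypothetical variant of ParseProto: no <R> is needed here.
pub struct DynParseProto {
    lex: DynLex,
}

impl DynParseProto {
    pub fn load(input: impl Read + 'static) -> Self {
        DynParseProto { lex: DynLex::new(input) }
    }

    // Illustration only: expose the first character read by the lexer.
    pub fn first_char(&mut self) -> char {
        self.lex.read_char()
    }
}

fn main() {
    let mut proto = DynParseProto::load(std::io::Cursor::new("x = 1"));
    println!("{}", proto.first_char());
}
```

The price is one heap allocation for the Box and an indirect call for every read(), which is the extra overhead mentioned above.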
Now that Read is supported, types other than files can be used. Next, let's look at using standard input and strings.
Use Standard Input
The standard input type std::io::Stdin implements the Read trait, so it can be used directly. Modify the main() function to use standard input:
fn main() {
let input = std::io::stdin(); // standard input
let proto = parse::ParseProto::load(input);
vm::ExeState::new().execute(&proto);
}
Test source code from standard input:
echo 'print "i am from stdin!"' | cargo r
Use String
The string type does not implement the Read trait directly, because a string by itself does not record a read position. Read can be obtained by wrapping the string in std::io::Cursor, a type that wraps anything implementing AsRef<[u8]> and records a position on top of it. Its definition is clear:
pub struct Cursor<T> {
inner: T,
pos: u64,
}
This type naturally implements the Read trait. Modify the main() function to use a string as source code input:
fn main() {
let input = std::io::Cursor::new("print \"i am from string!\""); // string+Cursor
let proto = parse::ParseProto::load(input);
vm::ExeState::new().execute(&proto);
}
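The role of the pos field can be observed directly. The sketch below uses a hypothetical helper, position_after(), to read a few single bytes through a Cursor and then ask for its position() afterwards.

```rust
use std::io::{Cursor, Read};

// Hypothetical helper: attempt `n` single-byte reads through a Cursor
// and report the position it has recorded afterwards.
pub fn position_after(src: &str, n: usize) -> u64 {
    let mut cur = Cursor::new(src);
    let mut buf = [0u8; 1];
    for _ in 0..n {
        // At the end of the data, read() returns Ok(0) and the position stays put.
        cur.read(&mut buf).unwrap();
    }
    cur.position()
}

fn main() {
    println!("{}", position_after("print", 2));
}
```

The string itself is never modified; only the position stored next to it advances.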
Use BufReader
Reading a file directly is an expensive operation. The above implementation reads only one byte at a time, which is very inefficient for file types. Such frequent, small reads call for a layer of cache in between. The std::io::BufReader type in the Rust standard library provides exactly this. It naturally implements the Read trait, and it also uses its cache to implement the BufRead trait, which provides more methods.
I originally defined Lex's input field as BufReader<R> instead of the R above. That turned out to be wrong: when BufReader reads data, it first reads from the source into its internal cache and then returns data from there. This is very useful for files, but for strings the internal cache is unnecessary and costs one extra memory copy. I also found that the standard input std::io::Stdin already has its own cache, so there is no need to add another layer. Therefore BufReader is not used inside Lex; the caller adds it where needed (for example, for the File type).
Let's modify the main() function to wrap the original File in a BufReader:
fn main() {
// omit parameter handling
let file = File::open(&args[1]).unwrap();
let input = BufReader::new(file); // encapsulate BufReader
let proto = parse::ParseProto::load(input);
vm::ExeState::new().execute(&proto);
}
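The effect of the extra cache layer can be made visible with a small experiment. The following is a sketch only: CountingReader and count_calls() are hypothetical names introduced for this illustration, and an in-memory byte slice stands in for a file so that the example is self-contained. It counts how many times the underlying read() is called when the data is consumed one byte at a time, with and without BufReader.

```rust
use std::io::{self, BufReader, Read};

// Hypothetical wrapper that counts calls to the underlying read().
pub struct CountingReader<R> {
    inner: R,
    pub calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

// Consume `data` one byte at a time, either directly or through a
// BufReader, and return the number of underlying read() calls.
pub fn count_calls(data: &[u8], buffered: bool) -> usize {
    let counter = CountingReader { inner: data, calls: 0 };
    let mut byte = [0u8; 1];
    if buffered {
        let mut reader = BufReader::new(counter);
        while reader.read(&mut byte).unwrap() == 1 {}
        reader.get_ref().calls
    } else {
        let mut reader = counter;
        while reader.read(&mut byte).unwrap() == 1 {}
        reader.calls
    }
}

fn main() {
    let data = vec![b'x'; 100];
    println!("direct:   {} read() calls", count_calls(&data, false));
    println!("buffered: {} read() calls", count_calls(&data, true));
}
```

Without the buffer, every byte costs one call on the source, plus a final call that reports the end of input. With BufReader, the source is asked far less often because whole chunks are fetched into the cache at once. For a real File, each of those underlying calls would be a request to the operating system.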
Give Up Seek
At the beginning of this section, we only required that the input type support reading character by character. In fact, that is not the whole truth: we also require that the read position can be modified, that is, the Seek trait. This is what the original putback_char() method needs, since it calls input.seek():
fn putback_char(&mut self) {
self.input.seek(SeekFrom::Current(-1)).unwrap();
}
This function is needed because, in lexical analysis, the type of the current character sometimes has to be judged by the character that follows it. For example, after reading the character -, if the next character is also -, it is a comment; otherwise it is a subtraction, and the next character must be put back into the input source so that it is read as part of the next Token. As introduced earlier, the same is true for reading Tokens in syntax analysis, where the type of the current statement must be judged by the next Token. At that time, the peek() function was added to Lex, which could "peek" at the next Token without consuming it. The peek() there and the putback_char() above are two ways of dealing with this situation. In pseudo code:
// Method 1: peek()
if input.peek() == xxx then
input.next() // Consume the peek just now
handle(xxx)
end
// Method 2: put_back()
if input.next() == xxx then
handle(xxx)
else
input.put_back() // plug it back and read it next time
end
When the File type was used, seek() was available, so it was easy to support put_back, and the second method was adopted. But now that the input is only constrained by Read, continuing to use input.seek() would require the additional std::io::Seek trait constraint. Among the three types we have tested above, the cached file BufReader<File> and the string Cursor<String> both support Seek, but the standard input std::io::Stdin does not, and there may be other input types that support Read but not Seek (such as std::net::TcpStream). Adding the Seek constraint here would narrow the range of inputs we can accept.
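To see what the extra constraint would look like, here is a minimal sketch of the second method with a Read + Seek bound. SeekLex and read_putback_read() are hypothetical names used only for this illustration. It works with a Cursor, but the methods would not be callable on a SeekLex holding std::io::Stdin, because Stdin does not implement Seek.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

// Hypothetical lexer skeleton that keeps the put-back approach.
pub struct SeekLex<R> {
    input: R,
}

// The bound now needs Seek as well as Read.
impl<R: Read + Seek> SeekLex<R> {
    pub fn read_char(&mut self) -> char {
        let mut buf = [0u8; 1];
        self.input.read(&mut buf).unwrap();
        buf[0] as char
    }

    // Step the read position back by one byte.
    pub fn putback_char(&mut self) {
        self.input.seek(SeekFrom::Current(-1)).unwrap();
    }
}

// Read one character, put it back, and read again.
pub fn read_putback_read(src: &str) -> (char, char) {
    let mut lex = SeekLex { input: Cursor::new(src) };
    let first = lex.read_char();
    lex.putback_char();
    (first, lex.read_char())
}

fn main() {
    let (a, b) = read_putback_read("-x");
    println!("{} {}", a, b);
}
```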
Since Seek cannot be used, there is no reason to insist on the second method. We can consider the first method instead, which is at least consistent with Token's peek() function.
The most straightforward approach is to add an ahead_char: char field to Lex to save the peeked character, similar to the peek() function and its corresponding ahead: Token field. That would be simple enough, but the Rust standard library offers a more general way of doing it: Peekable. Before introducing Peekable, let's look at the Bytes type it depends on.
Use Bytes
The implementation of the read_char() function listed at the beginning of this section is a bit complicated relative to its purpose (reading one character). I later discovered a more abstract way: the bytes() method of the Read trait, which returns an iterator, Bytes, whose next() returns one byte per call. Modify the Lex definition as follows:
pub struct Lex<R> {
input: Bytes::<R>,
Modify the constructor and the read_char() function accordingly:
impl<R: Read> Lex<R> {
pub fn new(input: R) -> Self {
Lex {
input: input.bytes(), // generate iterator Bytes
ahead: Token::Eos,
}
}
fn read_char(&mut self) -> char {
match self.input.next() { // just call next(), simpler
Some(Ok(ch)) => ch as char,
Some(_) => panic!("lex read error"),
None => '\0',
}
}
The code for read_char() does not seem to have become any shorter. But its core is now just the input.next() call, and the rest is handling of the return value. Once error handling is added later, these match arms will become more useful.
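The shape of what bytes() yields is easiest to see in isolation. The sketch below uses a hypothetical helper, chars_then_nul(), to drain any reader through the Bytes iterator. Each item is a Result, because every step may fail with an I/O error, and None marks the end of the input, which is mapped to '\0' here in the same way as in read_char().

```rust
use std::io::Read;

// Hypothetical helper: drain a reader through bytes(), ending the
// result with '\0' once the iterator is exhausted.
pub fn chars_then_nul<R: Read>(input: R) -> Vec<char> {
    let mut bytes = input.bytes();
    let mut out = Vec::new();
    loop {
        match bytes.next() {
            Some(Ok(b)) => out.push(b as char),
            Some(Err(e)) => panic!("lex read error: {}", e),
            None => {
                out.push('\0');
                break;
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", chars_then_nul(&b"a+b"[..]));
}
```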
Use Peekable
The documentation of Bytes lists a peekable() method, which returns the Peekable type. This is exactly what we need: it wraps an iterator and lets us "peek" one item ahead. Its definition is clear:
pub struct Peekable<I: Iterator> {
iter: I,
/// Remember a peeked value, even if it was None.
peeked: Option<Option<I::Item>>,
}
To this end, modify the definition of Lex as follows:
pub struct Lex<R: Read> { // the bound is needed here: Peekable<I> requires I: Iterator
input: Peekable::<Bytes::<R>>,
Modify the constructor accordingly, and add the peek_char() function:
impl<R: Read> Lex<R> {
pub fn new(input: R) -> Self {
Lex {
input: input.bytes().peekable(), // wrap the Bytes iterator in Peekable
ahead: Token::Eos,
}
}
fn peek_char(&mut self) -> char {
match self.input.peek() {
Some(Ok(ch)) => *ch as char,
Some(_) => panic!("lex peek error"),
None => '\0',
}
}
Here input.peek() is basically the same as input.next() above, except that it returns a reference. The reason is the same as for Lex::peek() returning &Token: the returned value is still owned by input and is not moved out; we only "peek" at it. But the item here is a u8, which is Copy, so we dereference it directly with *ch and finally return a char.
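Putting read_char() and peek_char() together, the - versus -- case described earlier can be handled with the first method (peek). The following is a simplified sketch rather than the project's actual lexer: PeekLex, Kind and next_kind() are hypothetical names, and a real lexer would go on to skip the rest of the comment instead of returning a token for it.

```rust
use std::io::{Bytes, Read};
use std::iter::Peekable;

// Hypothetical, reduced token type for the illustration.
#[derive(Debug, PartialEq)]
pub enum Kind {
    Sub,
    Comment,
    Other(char),
    Eos,
}

pub struct PeekLex<R: Read> {
    input: Peekable<Bytes<R>>,
}

impl<R: Read> PeekLex<R> {
    pub fn new(input: R) -> Self {
        PeekLex { input: input.bytes().peekable() }
    }

    fn read_char(&mut self) -> char {
        match self.input.next() {
            Some(Ok(b)) => b as char,
            Some(Err(e)) => panic!("lex read error: {}", e),
            None => '\0',
        }
    }

    fn peek_char(&mut self) -> char {
        match self.input.peek() {
            Some(Ok(b)) => *b as char,
            Some(Err(e)) => panic!("lex peek error: {}", e),
            None => '\0',
        }
    }

    pub fn next_kind(&mut self) -> Kind {
        match self.read_char() {
            '\0' => Kind::Eos,
            '-' => {
                if self.peek_char() == '-' {
                    self.read_char(); // consume the second '-'
                    Kind::Comment
                } else {
                    Kind::Sub // the peeked character stays in the input
                }
            }
            c => Kind::Other(c),
        }
    }
}

fn main() {
    let mut lex = PeekLex::new(&b"-1"[..]);
    println!("{:?} {:?}", lex.next_kind(), lex.next_kind());
}
```

No Seek bound is needed anywhere, so the same code would accept standard input as well.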
Summary
So far, we have completed the optimization of the input type: we started with support for the File type only and ended with support for the Read trait. Written up like this there is not much content, but during the original implementation and exploration it took a lot of bumping around to get here. In the process I thoroughly figured out some basic types in the standard library, such as Read, BufRead and BufReader, discovered and learned the Cursor and Peekable types, and also came to understand better how the official documentation is organized. Learning the Rust language by doing is the ultimate goal of this project.