Compilation principle

The principle of compilation is a very profound and mature subject. It is not necessary or capable to give a complete or accurate introduction here. It is just a simple concept introduction according to the subsequent implementation process. If you are familiar with the concept of an interpreter, you can skip this section.

Compiled and Interpreted language

Regardless of the programming language, before the source code is handed over to the computer for execution, a translation process is necessary to translate the source code into a computer-executable language. According to the timing of this translation, programming languages can be roughly divided into two types:

  • Compiled type, that is, the compiler first compiles the source code into a computer language and generates an executable file. This file is subsequently executed directly by the computer. For example, under Linux, use the compiler gcc to compile the C language source code into an executable file.

  • Interpreted type requires an interpreter, which loads and parses the source program in real time, and then maps the parsed results to pre-compiled subroutine and executes them. This interpreter is generally implemented by the above compiled language.

                                                                    interprete
+-------------+  compile  +-----------------+      +-------------+  & execute   +-----------------+
| source file | --------> | executable file |      | source file | -----------> | Lua interpreter |
|  bar.c      |           |  bar.exe        |      |  bar.lua    |              |  lua.exe        |
+-------------+           +-----------------+      +-------------+              +-----------------+
                               ^                                                     ^
                               | execute                                             | execute
                               | machine code                                        | machine code
                           +----------------+                                    +----------------+
                           |    computer    |                                    |    computer    |
                           +----------------+                                    +----------------+

                 Compiled                                           Interpreted

The figure above roughly shows the two types of translation and execution processes. Lua is an interpreted language, and our goal is to implement a Lua interpreter.

Parse and Execute

The general compilation principle process is as follows:

            Lexical Analysis        Syntax Analysis       Semantic Analysis
Character Stream --------> Token Stream --------> Syntax Tree --------> Intermediate Code ...
  • The character stream corresponds to the source code, that is, the source code is treated as a character stream.

  • Lexical analysis, which splits the character stream into tokens supported by the language. For example, the above Lua code is split into two Tokens: "identification print" and "string "hello, world!"". Lua ignores whitespace characters.

  • Syntax analysis, which parses the Token stream into a syntax tree according to grammer rules. For example, the two tokens just now are recognized as a function call statement, in which "identity print" is the function name, and "string "hello, world!"" is the parameter.

  • Semantic analysis, which generates the corresponding intermediate code from the statement of this function call, these codes indicate where to find the function body, where to load the parameters and so on.

After intermediate code is generated, compiled and interpreted languages diverge. The compiled language moves on, eventually generating machine code that can be executed directly, and packaged as an executable file. For the interpreted language, this is the end, the generated intermediate code (generally called bytecode) is the result of compilation; and the execution of the bytecode is the task of the virtual machine.

The virtual machine converts the bytecode into a corresponding series of precompiled subroutine, and then executes them. For example, to execute the bytecode generated above, the virtual machine first finds the corresponding function, namely print, which is a function in the Lua standard library; then loads the parameters, namely "hello, world"; finally calls the print function which outputs "hello, world!".

The above just describes a general process. Specific to each language or each interpreter process may be different. For example, some interpreters may not generate bytecode, but let the virtual machine directly execute the syntax tree. The official implementation of Lua omits the syntax tree, and the bytecode is directly generated by syntax analysis. Each of these options has advantages and disadvantages, but they are beyond the scope of our topic and will not be discussed here. Our interpreter is a full reference to the official Lua implementation in the main process, so the final process is as follows:

            Lexical Analysis         Syntax Analysis
Character Stream --------> Token Stream --------> Bytecode
                                                     ^
                                                     |
                                                 virtual machine

From this we can clarify the main functional components of our interpreter: lexical analysis, syntax analysis and virtual machine. The combination of lexical analysis and syntax analysis can be called the "parsing" process, and the virtual machine is the "execution" process, then the bytecode is the link connecting the two processes. The two processes of parsing and execution are relatively independent. Next, we use the bytecode as a breakthrough to start implementing our interpreter.