Section 1: Introduction to WebAssembly
This section gives a brief introduction to LLVM and Emscripten. You will learn what they are and why you need to understand them before learning about WebAssembly. This section ends with an introduction to the WebAssembly module and WebAssembly text format.
You will understand what WebAssembly is and how it works after this section.
This section comprises the following chapters:
- Chapter 1, Understanding LLVM
- Chapter 2, Understanding Emscripten
- Chapter 3, Exploring WebAssembly Modules
Chapter 1: Understanding LLVM
JavaScript is one of the most popular programming languages. However, JavaScript has two main disadvantages:
- Unpredictable performance
JavaScript executes inside the environment and runtime provided by JavaScript engines. There are various JavaScript engines (V8, WebKit, and Gecko). All of them were built differently and run the same JavaScript code in a different way. Added to that, JavaScript is dynamically typed. This means JavaScript engines should guess the type while executing the JavaScript code. These factors lead to unpredictable performance in JavaScript execution. The optimizations for one type of JavaScript engine may cause undesirable side effects on other types of JavaScript engines. This leads to unpredictable performance.
The JavaScript engine waits until it downloads the entire JavaScript file before parsing and executing. The larger the JavaScript file, the longer the wait will be. This will degrade your application's performance. Bundlers such as webpack help to minimize the bundle size. But when your application grows, the bundle size grows exponentially.
Is there a tool that provides native performance and comes in a much smaller size? Yes, WebAssembly.
WebAssembly is the future of web and node development. WebAssembly is statically typed and precompiled, and thus it provides better performance than JavaScript. Precompilation of the binary provides an option to generate tiny binary bundles. WebAssembly allows languages such as Rust, C, and C++ to be compiled into binaries that run inside the JavaScript engine along with JavaScript. All WebAssembly compilers use LLVM underneath to convert the native code into WebAssembly binary code. Thus, it is important to understand what LLVM is and how it works.
In this chapter, we will learn what the various components of a compiler are and how they work. Then, we will explore what LLVM is and how it helps the compiled languages. Finally, we will see how the LLVM compiler compiles native code. We will cover the following topics in this chapter:
- Understanding compilers
- Exploring LLVM
- LLVM in action
Technical requirements
We will make use of Clang, which is a compiler that compiles C/C++ code into native code.
For Linux and Mac users, Clang should be available out of the box.
For Windows users, Clang can be installed from the following link: https://llvm.org/docs/GettingStarted.html?highlight=installing%20clang%20windows#getting-the-source-code-and-building-llvm to install Clang.
You can find the code files present in this chapter on GitHub at https://github.com/PacktPublishing/Practical-WebAssembly
Understanding compilers
Programming languages are broadly classified into compiled and interpreted languages.
In the compiled world, the code is first compiled into target machine code. This process of converting the code into binary is called compilation. The software program that converts the code into target machine code is called a compiler. During the compilation, the compiler runs a series of checks, passes, and validation on the code written and generates an efficient and optimized binary. A few examples of compiled languages are C, C++, and Rust.
In the interpreted world, the code is read and executed in a single pass. Since the compilation happens at runtime, the generated machine code is not as optimized as its compiled counterpart. Interpreted languages are significantly slower than compiled ones, but they provide dynamic typing and a smaller program size.
In this book, we will focus only on compiled languages.
Compiled languages
A compiler is a translator that translates source code into machine code (or in a more abstract way, converts the code from one programming language to another). A compiler is complicated because it should understand the language in which the source code is written (its syntax, semantics, and context); it should also understand the target machine code (its syntax, semantics, and context) and should create a representation that maps the source code into the target machine code.
A compiler has the following components:
- Frontend – The frontend is responsible for handling the source language.
- Optimizer – The optimizer is responsible for optimizing the code.
- Backend – The backend is responsible for handling the target language.
Figure 1.1 – Components of a compiler
Frontend
The frontend focuses on handling the source language. The frontend parses the code upon receiving it. The code is then checked for any grammar or syntax issues. After that, the code is converted (mapped) into an intermediate representation (IR). Consider IR as a format that represents the code that the compiler processes. The IR is the compiler's version of your code.
Optimizer
The second component in the compiler is the optimizer. This is optional, but as the name indicates, the optimizer analyzes the IR and transforms it into a much more efficient one. Few compilers have multiple IRs. The compiler efficiently optimizes the code on every pass over the IR. The optimizer is an IR-to-IR transformer. The optimizer analyzes, runs passes, and rewrites the IR. The optimizations here include removing redundant computations, eliminating dead code (code that cannot be reached), and various other optimizing options, which will be explored in future chapters. It is important to note that the optimizers need not be language-specific. Since they act on the IR, they can be built as a generic component and reused with multiple languages.
Backend
The backend focuses on producing the target language. The backend receives the generated (optimized) IR and converts it into another language (such as machine code). It is also possible to chain multiple backends that convert the code into some other languages. The backend is responsible for generating the target machine code from the IR. This machine code is the actual code that runs on the bare metal. In order to produce efficient machine code, the backend should understand the architecture...