Section 1: Introduction – Recent Developments in the Field, Installations, and Hello World Applications
In this section, you will learn about all aspects of Transformers at an introductory level. You will write your first hello-world program with Transformers by loading community-provided pre-trained language models and running the related code, with or without a GPU. Installing and using the TensorFlow, PyTorch, conda, transformers, and sentence-transformers libraries will also be explained in detail in this section.
This section comprises the following chapters:
- Chapter 1, From Bag-of-Words to the Transformer
- Chapter 2, A Hands-On Introduction to the Subject
Chapter 1: From Bag-of-Words to the Transformer
In this chapter, we will discuss what has changed in Natural Language Processing (NLP) over the last two decades. We experienced different paradigms and finally entered the era of Transformer architectures. Each paradigm helped us gain better representations of words and documents for problem-solving. Distributional semantics describes the meaning of a word or a document with a vectorial representation, looking at distributional evidence in a collection of articles. Vectors are used to solve many problems in both supervised and unsupervised pipelines. For language-generation problems, n-gram language models have been leveraged as a traditional approach for years. However, these traditional approaches have many weaknesses that we will discuss throughout the chapter.
We will further discuss classical Deep Learning (DL) architectures such as Recurrent Neural Networks (RNNs), Feed-Forward Neural Networks (FFNNs), and Convolutional Neural Networks (CNNs). These architectures have improved performance on problems in the field and have overcome the limitations of traditional approaches. However, these models have had their own problems too. Recently, Transformer models have gained immense interest because of their effectiveness across NLP tasks, from text classification to text generation. Their main success, however, has been in effectively improving performance on multilingual and multi-task NLP problems, as well as on monolingual, single-task ones. These contributions have made Transfer Learning (TL) more feasible in NLP, which aims to make models reusable for different tasks or different languages.
Starting with the attention mechanism, we will briefly discuss the Transformer architecture and the differences from previous NLP models. In parallel with the theoretical discussions, we will show practical examples with popular NLP frameworks. For the sake of simplicity, we will choose introductory code examples that are as short as possible.
In this chapter, we will cover the following topics:
- Evolution of NLP toward Transformers
- Understanding distributional semantics
- Leveraging DL
- Overview of the Transformer architecture
- Using TL with Transformers
Technical requirements
We will be using Jupyter Notebook to run our coding exercises, which require Python 3.6.0 or later, along with the following packages, which need to be installed with the pip install command:
- scikit-learn
- nltk==3.5.0
- gensim==3.8.3
- fasttext
- keras>=2.3.0
- transformers>=4.0.0
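For example, all of the packages above can be installed in one go with pip; the version pins are those listed, so adjust them as needed for your environment:

```shell
# Install the required packages (quote version specifiers so the shell
# does not interpret ">" as output redirection)
pip install scikit-learn "nltk==3.5.0" "gensim==3.8.3" fasttext "keras>=2.3.0" "transformers>=4.0.0"
```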
All notebooks with coding exercises are available at the following GitHub link: https://github.com/PacktPublishing/Advanced-Natural-Language-Processing-with-Transformers/tree/main/CH01.
Check out the following link to see the Code in Action video: https://bit.ly/2UFPuVd
Evolution of NLP toward Transformers
We have seen profound changes in NLP over the last 20 years. During this period, we experienced different paradigms and finally entered a new era dominated mostly by the Transformer architecture. This architecture did not come out of nowhere: building on various neural-based NLP approaches, it gradually evolved into an attention-based encoder-decoder architecture, and it keeps evolving. The architecture and its variants have been successful thanks to the following developments in the last decade:
- Contextual word embeddings
- Better subword tokenization algorithms for handling unseen words or rare words
- Injecting additional memory tokens into sentences, such as Paragraph ID in Doc2vec or a Classification (CLS) token in Bidirectional Encoder Representations from Transformers (BERT)
- Attention mechanisms, which overcome the problem of forcing input sentences to encode all information into one context vector
- Multi-head self-attention
- Positional encoding to encode word order
- Parallelizable architectures that make for faster training and fine-tuning
- Model compression (distillation, quantization, and so on)
- TL (cross-lingual, multitask learning)
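To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation behind multi-head self-attention; the toy matrices and dimensions are illustrative only and are not taken from the chapter:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of the rows of V,
    with weights given by the similarity of queries to keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

# Toy self-attention: 3 tokens, 4-dimensional representations,
# with Q, K, and V all derived from the same sequence
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)                                  # each token gets a context-mixed vector
```

Because every token attends to every other token in one matrix multiplication, this operation is easy to parallelize, which is one reason Transformers train faster than recurrent models.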
For many years, we used traditional NLP approaches such as n-gram language models, TF-IDF-based information retrieval models, and one-hot encoded document-term matrices. All these approaches have contributed a lot to the solution of many NLP problems, such as sequence classification, language generation, and language understanding. On the other hand, these traditional NLP methods have their own weaknesses—for instance, falling short in coping with sparsity, representing unseen or rare words, tracking long-term dependencies, and more. In order to cope with these weaknesses, we developed DL-based approaches such as the following:
- RNNs
- CNNs
- FFNNs
- Several variants of RNNs, CNNs, and FFNNs
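The sparsity weakness mentioned above is easy to see in practice: a count-based document-term matrix built with scikit-learn is mostly zeros even for a tiny corpus. The toy sentences below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "transformers changed natural language processing",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term matrix
n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)        # fraction of non-zero entries
print(n_terms, round(density, 2))           # → 12 0.42
```

Even with only three short documents, under half of the matrix entries are non-zero, and the vocabulary dimension grows with every new word—one motivation for the dense, fixed-size embeddings discussed next.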
In 2013, Word2vec, a two-layer FFNN word-encoder model, sorted out the dimensionality problem by producing short, dense representations of words, called word embeddings. This early model managed to produce fast and efficient static word embeddings. It turned unsupervised textual data into a supervised learning problem (self-supervised learning) by either predicting the target word from its context or predicting neighboring words within a sliding window. GloVe, another widely used and popular model, argued that count-based models can be better than neural models. It leverages both the global and local statistics of a corpus to learn embeddings from word-word co-occurrence statistics. It performed well on some syntactic and semantic tasks, as shown in the following screenshot: