1 Introduction

1.1 Background
Online transaction processing (OLTP) systems have been the core component of enterprise IT infrastructure for several decades, and they have been widely deployed in industries such as finance, telecommunications, and e-commerce. The IBM System-R project, begun in 1974, still influences the system architecture and implementation of most traditional OLTP systems. Compared with the computing environment of four decades ago, there has been huge progress in CPU, storage, and network technology. The cost of RAM continues to decrease, and a machine with large memory can now hold the whole working set of an OLTP application. Since it has become difficult to improve the performance of a single CPU core, multi-core processors were introduced to increase computing power. However, without adjustments to the design of OLTP systems, we cannot make full use of the power of this new hardware.
New application scenarios also pose big challenges to traditional OLTP systems. In recent years, we have witnessed the proliferation of smartphones in people's daily lives. With the prevalence of mobile applications and 4G networks, people can access what they want at any time and from anywhere. With a smartphone, a user can reach a service at any moment and trigger the back-end system to start processing a transaction for buying tickets, online shopping, stock trading, etc. In these applications, OLTP systems must provide continuous service because Internet users may access the system at any time. A service outage would drive away many users and thus reduce revenue. In a nutshell, most enterprise applications or services must consider how to provide flexible and stable transaction processing for an uncertain number of users. In the setting of a cluster of commodity servers, an OLTP system must scale horizontally while also guaranteeing reliability and availability.
Traditional OLTP systems are often hosted on a single machine or on multiple machines with large shared storage such as Storage Area Networks (SANs) and Network Attached Storage (NAS). System designers treat this hardware as sufficiently reliable, assuming that faults rarely happen, so OLTP systems mainly depend on expensive hardware for high availability. To increase transaction throughput, the common choice is to replace the original hardware with more advanced, high-end hardware. In addition to hardware upgrades, deploying a new version of the OLTP software or fixing bugs also requires shutting the system down and stopping the service for hours or days. Generally speaking, these procedures require database administrators (DBAs) or system vendors to plan each step in detail, and even with much effort the process remains risky. Yet it is now unimaginable for an online shopping website to be unable to serve user requests; today's Internet users are accustomed to shopping at any time.
A multitude of NoSQL systems has emerged to meet these requirements, and such systems have been widely used in Internet applications such as social media and web content caching [13]. Although NoSQL systems have better scalability and performance, they abandon many virtues of traditional OLTP systems. In particular, NoSQL systems offer weak support for transaction processing, and the ACID properties are not fully supported. As a result, application programmers must do much work to handle data inconsistency arising from concurrent operations over the data store. Furthermore, mission-critical applications need not only NoSQL-style scale-out but also full ACID support. Therefore, a next generation of OLTP systems, referred to as NewSQL systems, has been proposed and implemented. To achieve new features such as scale-out, online upgrade, and fault tolerance over commodity machines, NewSQL systems redesign the whole architecture. Concurrency control and recovery, the most important components of OLTP systems, have been revisited and redesigned.
1.2 New OLTP Architectures
New OLTP systems have optimized the design of key components such as storage, logging, and concurrency control, and their architectures provide unique features that differ from those of traditional DBMSs. In this section, we classify these systems according to how the database is partitioned and how concurrent transactions are executed.
The first kind of OLTP system partitions the whole database and distributes the data across multiple nodes. The design goal of these systems is to scale out transaction processing capacity by adding more computing nodes. The design philosophy of H-Store was proposed at VLDB 2007 by Michael Stonebraker et al. [89]. In the original H-Store prototype, each partition can be held in memory, so there is no disk latency during transaction execution. Furthermore, H-Store requires that the entire code of each transaction be submitted at once, so the system never waits for user input and there is no mid-transaction network round trip between the client and the OLTP server. Since there is no latency from disk, network, or user input, there is no need to parallelize transaction processing within a single partition. In H-Store, each partition is accessed by a single thread, and if there are no cross-partition transactions, throughput scales linearly with the number of cores or nodes. Transactions on a local partition need no concurrency control scheme because they are executed one by one. For cross-partition transactions, H-Store adopts a two-phase commit protocol [29] to coordinate execution among the partition nodes.
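This execution model can be summarized in a few lines. The following is a minimal sketch, not H-Store's actual implementation: one worker thread owns each partition and drains a queue of transactions serially, so single-partition work needs no locks or latches (all names are illustrative).

```python
import queue
import threading
import time

class Partition:
    """One partition, owned exclusively by one worker thread."""
    def __init__(self, pid):
        self.pid = pid
        self.store = {}              # in-memory data: no disk I/O on the fast path
        self.inbox = queue.Queue()   # transactions routed to this partition
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            txn = self.inbox.get()   # transactions run one by one:
            txn(self.store)          # no concurrency control inside a partition

    def submit(self, txn):
        self.inbox.put(txn)          # the whole transaction is submitted at once

p0 = Partition(0)
p0.submit(lambda store: store.update({"acct:1": 100}))
time.sleep(0.1)
print(p0.store)  # {'acct:1': 100}; a cross-partition txn would need a 2PC coordinator
```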
Although partitioning and replication are the main approaches to horizontally scaling transaction throughput, the high cost of distributed transactions using 2PC is prohibitive in applications where the database cannot be well partitioned. CalvinDB regards dynamic locking as the root cause of high-cost concurrency control. Thus, it reduces the cost of executing distributed transactions using a deterministic concurrency control strategy. The architecture of CalvinDB comprises three layers. Before executing transactions, CalvinDB collects the inputs of incoming transaction requests at the sequencing layer, which determines the serialization order of transactions; all replicas follow this order. The second layer, the scheduling layer, orchestrates transaction execution using a deterministic locking method. The third layer, the storage layer, manages the physical data layout and provides a CRUD interface.
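The following sketch (with illustrative names, not CalvinDB's code) shows the essence of deterministic locking: the sequencer fixes a global order, lock requests are enqueued strictly in that order before execution, and a transaction runs once it is at the head of every queue it waits in, so every replica produces the same interleaving.

```python
from collections import deque

def run_deterministic(ordered_txns, store):
    """ordered_txns: (txn_id, lock_keys, body) tuples in sequencer order."""
    lock_queues = {}
    # Phase 1: enqueue every lock request in the global sequence order.
    for txn_id, keys, _ in ordered_txns:
        for key in keys:
            lock_queues.setdefault(key, deque()).append(txn_id)
    # Phase 2: run any transaction that heads all of its lock queues.
    pending = list(ordered_txns)
    while pending:
        for txn in pending:
            txn_id, keys, body = txn
            if all(lock_queues[k][0] == txn_id for k in keys):
                body(store)                    # execute with all locks held
                for k in keys:
                    lock_queues[k].popleft()   # release in order
                pending.remove(txn)
                break

store = {"x": 0, "y": 0}
run_deterministic([
    ("t1", ["x"], lambda s: s.__setitem__("x", s["x"] + 1)),
    ("t2", ["x", "y"], lambda s: s.__setitem__("y", s["x"])),
], store)
print(store)  # identical on every replica that receives the same input order
```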
Unlike most cluster-based OLTP systems, where each machine owns some partitions, Hyder does not partition the database; instead, each machine locally stores a snapshot of the whole database. A shared log collects update information from the different machines, and each local snapshot is refreshed by a log-melding component. This log-meld procedure is conducted on each machine to ensure that update operations are applied in the same order everywhere. Since each transaction can access all data from the local snapshot, Hyder circumvents the problem of distributed transactions.
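As a rough sketch (invented names; real log melding also detects conflicts between intention records before applying them), each machine simply rolls its full local snapshot forward by applying the shared log in the single agreed order:

```python
class Replica:
    """A machine holding the whole database snapshot, Hyder-style."""
    def __init__(self):
        self.snapshot = {}
        self.applied_upto = 0              # position reached in the shared log

    def meld(self, shared_log):
        # Apply unseen records in log order; every replica does the same,
        # so all snapshots converge without partitioning the database.
        for key, value in shared_log[self.applied_upto:]:
            self.snapshot[key] = value
        self.applied_upto = len(shared_log)

shared_log = [("a", 1), ("b", 2), ("a", 3)]   # updates appended by all machines
r1, r2 = Replica(), Replica()
r1.meld(shared_log)
r2.meld(shared_log)
assert r1.snapshot == r2.snapshot == {"a": 3, "b": 2}
```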
Generally speaking, the transaction processing model and the data access method are tightly coupled in traditional OLTP implementations. The Deuteronomy architecture separates the transaction processing function from the data access and storage components. Its main motivation is to scale performance by adding more storage nodes or more transaction processing nodes. However, this architecture creates challenges for concurrency control, because concurrent transactions do not know which records actually exist in the database until they receive results from the storage layer.
Compared to traditional centralized DBMSs, key-value systems (e.g., Cassandra, HBase, and Redis) have advantages such as low latency, scalability, and high availability, and many of them have been widely adopted in industry projects. These systems achieve high performance by sacrificing the ACID properties. Owing to this lack of transaction support, however, programmers must spend much effort handling concurrent conflicts, which motivates equipping key-value systems with a transaction processing function. Both Percolator and Omid belong to this kind of OLTP system, and both support cross-row and cross-table transactions. Percolator is built on Google's Bigtable, which supports only single-row transactions; it stores locking information alongside the data and uses a two-phase commit protocol to coordinate distributed transactions. Omid has a transaction manager component that is independent of the underlying storage layer, HBase, which is regarded as the open-source implementation of Bigtable.
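A hedged sketch of Percolator-style commit follows (heavily simplified; the flat dictionary store and all names are ours, and real Percolator keeps timestamped versions in Bigtable columns). Lock metadata lives next to the data itself, secondaries point at a designated primary row, and committing the primary is the atomic commit point:

```python
class PercolatorTxn:
    def __init__(self, store, start_ts):
        self.store, self.start_ts = store, start_ts
        self.writes = {}                       # key -> value, buffered locally

    def prewrite(self, primary):
        """Phase 1: lock every written row, with the lock stored in the row."""
        for key, value in self.writes.items():
            row = self.store.setdefault(key, {})
            if "lock" in row or row.get("write_ts", -1) >= self.start_ts:
                return False                   # locked, or a newer committed write
            row["lock"] = primary              # secondaries point at the primary
            row["data"] = value
        return True

    def commit(self, commit_ts, primary):
        """Phase 2: committing the primary row first is the commit point."""
        for key in [primary] + [k for k in self.writes if k != primary]:
            row = self.store[key]
            del row["lock"]
            row["write_ts"] = commit_ts

store = {}
t = PercolatorTxn(store, start_ts=10)
t.writes = {"k1": "v1", "k2": "v2"}
if t.prewrite(primary="k1"):
    t.commit(commit_ts=11, primary="k1")
```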
OceanBase (OB), developed by Alibaba, partitions the database according to whether the data is being updated. OB uses a single update server (UPS) to store the active data currently being updated by transactions, while the static data are distributed and replicated across many chunk servers (CSs). This design is based on the observation that the volume of data updated within a fixed period (e.g., one day) is limited in most OLTP settings. When the memory usage of the UPS exceeds a defined threshold, OB merges the active data from the UPS into the CSs. This architecture avoids distributed transactions because all update transactions run on the single UPS. OB provides high availability through log replication between the primary UPS and a backup UPS, and read-only transactions can be routed to the backup UPS to reduce the load on the primary.
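The read/write split can be pictured with a short sketch (our own simplification; real OB versions records and replicates chunks): writes always hit the single UpdateServer, reads merge the UPS delta with the ChunkServer baseline, and a merge folds deltas back when memory runs low.

```python
class OceanBaseSketch:
    def __init__(self):
        self.chunk_servers = {}            # static baseline data (replicated CSs)
        self.update_server = {}            # in-memory deltas on the single UPS

    def write(self, key, value):
        self.update_server[key] = value    # every update transaction hits the UPS

    def read(self, key):
        # A delta on the UPS, if present, overrides the static baseline.
        if key in self.update_server:
            return self.update_server[key]
        return self.chunk_servers.get(key)

    def merge(self):
        # Triggered when UPS memory exceeds its threshold: fold deltas into CSs.
        self.chunk_servers.update(self.update_server)
        self.update_server.clear()

ob = OceanBaseSketch()
ob.write("order:42", "paid")
assert ob.read("order:42") == "paid"
ob.merge()                                 # active data becomes static baseline data
assert ob.read("order:42") == "paid"
```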
1.3 Concurrency Control
The transaction is one of the most important concepts in OLTP systems. In real applications, it typically captures a business action such as transferring money between accounts, buying a ticket, or placing a new order. A transaction consists of a sequence of operations over data objects, and OLTP systems may interleave the operations of different transactions. Without appropriate concurrency control, however, such interleaving can produce anomalies including dirty reads, lost updates, and non-repeatable reads. Concurrency control has therefore been a hot topic over the past decades and is a key component of OLTP systems. Generally speaking, it aims to guarantee data consistency on the one hand and to let as many transactions as possible execute concurrently on the other. To keep the database correct, each transaction should obey the ACID properties, i.e., Atomicity, Consistency, Isolation, and Durability. To achieve this goal, three kinds of concurrency control schemes have been proposed in the database research community: two-phase locking (2PL), optimistic concurrency control, and timestamp ordering.
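For instance, the lost-update anomaly mentioned above can be reproduced in a few lines: two transfers read the same balance before either writes, so the later write silently overwrites the earlier one.

```python
balance = {"acct": 100}

t1_read = balance["acct"]                  # T1 reads 100
t2_read = balance["acct"]                  # T2 reads 100, before T1 writes
balance["acct"] = t1_read + 10             # T1 writes 110
balance["acct"] = t2_read + 20             # T2 writes 120, clobbering T1's update

assert balance["acct"] == 120              # T1's +10 is lost; serial result is 130
```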
Locking is the most widely implemented and deployed concurrency control mechanism, used in many commercial and open-source DBMSs including DB2, Oracle, MySQL, and PostgreSQL. The well-known 2PL protocol schedules potential data access conflicts pessimistically. All lock requests happen in the first (growing) phase: a transaction locks a record or a set of records before reading or updating it. The transaction enters the second (shrinking) phase once it releases any lock, after which further lock requests are disallowed. Furthermore, to ensure serializable schedules, the strict 2PL protocol requires that all locks held by a transaction be released only after the transaction commits.
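A minimal single-threaded sketch of strict 2PL follows (exclusive locks only; a real lock manager would block or run deadlock detection instead of raising). The point is the shape of the protocol: acquire during the growing phase, release everything only at commit.

```python
class Strict2PLTxn:
    def __init__(self, lock_table):
        self.lock_table = lock_table       # key -> owning txn, shared by all txns
        self.held = set()

    def lock(self, key):
        """Growing phase: acquire before any read or update."""
        owner = self.lock_table.get(key)
        if owner not in (None, self):
            raise RuntimeError("conflict: a real system would block here")
        self.lock_table[key] = self
        self.held.add(key)

    def commit(self):
        """Strict 2PL: every lock is released only after commit."""
        for key in self.held:
            del self.lock_table[key]
        self.held.clear()

locks = {}
t = Strict2PLTxn(locks)
t.lock("row:7")                            # read/update row:7 while holding the lock
t.commit()                                 # the shrinking phase happens all at once
```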
Under optimistic concurrency control [53], the lifetime of each transaction is divided into three phases: the read, validation, and write phases. In the read phase, a transaction writes data only tentatively, to local versions, and is essentially unaware of other transactions. Before the transaction commits, the validation phase detects conflicts with committed transactions whose lifetimes overlapped its own. If validation fails, the transaction being validated is aborted. Although many pessimistic and optimistic methods have been proposed, no single method always outperforms all the others: under high contention, locking-based pessimistic methods perform better, while under low contention, optimistic methods win.
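The three phases can be sketched as follows (a simplified, single-threaded rendering of the scheme in [53]; the global committed list stands in for the real validation machinery):

```python
class OCCTxn:
    committed = []                         # (commit_ts, write_set) of finished txns

    def __init__(self, start_ts):
        self.start_ts = start_ts
        self.read_set, self.write_buf = set(), {}

    def read(self, store, key):
        self.read_set.add(key)
        return self.write_buf.get(key, store.get(key))

    def write(self, key, value):
        self.write_buf[key] = value        # read phase: writes stay in local versions

    def commit(self, store, commit_ts):
        # Validation phase: abort if a transaction that committed after we
        # started wrote anything we read.
        for ts, wset in OCCTxn.committed:
            if ts > self.start_ts and wset & self.read_set:
                return False               # conflict detected: the caller retries
        store.update(self.write_buf)       # write phase: install local versions
        OCCTxn.committed.append((commit_ts, set(self.write_buf)))
        return True

store = {"x": 0}
t = OCCTxn(start_ts=1)
t.write("x", t.read(store, "x") + 1)
t.commit(store, commit_ts=2)               # succeeds: no overlapping committed writer
```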
As hardware has made great progress and OLTP application scenarios continue to evolve beyond those of past decades, both pessimistic and optimistic methods have been revisited and have received much attention from the research community. In the following, we summarize related work on optimizing their implementations for multi-core machines and on adapting existing concurrency control schemes to modern OLTP applications.
The traditional implementation of the 2PL protocol has a lock manager component that uses a hash map as a central data structure shared by multiple threads. In this hash map, a key represents a locked item, and its value is typically a linked list recording the transaction holding the lock and the transactions waiting for its release. Locking overhead is reported to account for up to 16–25% of the cost of transaction execution. To reduce this overhead and improve the scalability of locking on multi-core machines, existing works abandon the central data structure and co-locate locking information with the raw data items. For hot data items in particular, frequent lock acquisition and release leads to expensive memory allocation and deallocation; lock inheritance resolves this by transferring the allocated lock structure directly to another transaction. In Section 2.2, we introduce more related work on locking.
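The traditional structure looks roughly like this (exclusive mode only, names illustrative): a shared hash map from each locked item to an entry recording the current holders and a FIFO of waiters, which is exactly the central, contended data structure that the co-location techniques above try to eliminate.

```python
from collections import deque

class LockEntry:
    def __init__(self):
        self.holders = set()               # txn ids currently holding the lock
        self.waiters = deque()             # txn ids queued for the lock, FIFO

class LockManager:
    def __init__(self):
        self.table = {}                    # key -> LockEntry, shared by all threads

    def acquire(self, txn_id, key):
        entry = self.table.setdefault(key, LockEntry())
        if not entry.holders or txn_id in entry.holders:
            entry.holders.add(txn_id)      # granted immediately
            return True
        entry.waiters.append(txn_id)       # queue behind the current owner
        return False

    def release(self, txn_id, key):
        entry = self.table[key]
        entry.holders.discard(txn_id)
        if not entry.holders and entry.waiters:
            entry.holders.add(entry.waiters.popleft())   # hand off to next waiter
```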
Because read and write operations do not block each other under multi-version concurrency control (MVCC), it has received much attention in recent years. Snapshot isolation based on MVCC has been implemented in most DBMS products, including SQL Server, Oracle, and PostgreSQL. Although the snapshot isolation level removes most anomalous phenomena, it can still produce non-serializable executions because of the write skew problem. Write skew happens when two transactions read the same set of records but modify different records in that set: although each transaction alone appears to obey the consistency constraints, the final database state is inconsistent. Serializable snapshot isolation was proposed to resolve this problem. The key idea is to track the read and write sets of each running transaction and build a run-time dependency graph; a cycle in this graph indicates a non-serializable execution. However, exact detection is costly, because maintaining all access records of every running transaction is resource-intensive. The authors of Ref. [79] therefore introduced an approximate solution that keeps only rw-dependency in-conflict and out-conflict flags for each transaction. If a transaction both is depended on by other transactions and itself depends on another transaction, it is a joint transaction, and a dependency cycle is possible. Since the existence of a joint transaction does not guarantee a dependency cycle, this detection may produce false positives.
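The classic on-call example often used in the SSI literature makes write skew concrete: both transactions read the same snapshot, each writes a different record, there is no write-write conflict, yet the combined result violates the constraint that at least one doctor stays on call.

```python
snapshot = {"alice_on_call": True, "bob_on_call": True}

# T1 and T2 read the same snapshot: two doctors are on call.
t1_sees = dict(snapshot)
t2_sees = dict(snapshot)

db = dict(snapshot)
if sum(t1_sees.values()) >= 2:
    db["alice_on_call"] = False            # T1 removes alice; its write set is...
if sum(t2_sees.values()) >= 2:
    db["bob_on_call"] = False              # ...disjoint from T2's, so SI sees no conflict

assert not any(db.values())                # constraint violated: nobody is on call
```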
1.4 Crash Recovery
OLTP systems must avoid data loss or inconsistency in the face of unpredictable failures, including transaction aborts, disk corruption, OS crashes, etc. The recovery mechanism of an OLTP system aims to guarantee the atomicity and durability of the ACID properties [33]. Under this mechanism, each committed transaction has corresponding log records containing the full details of its update operations. Write-Ahead Logging (WAL) is the widely used logging protocol whose key idea is that the transaction log must be flushed to persistent storage before the transaction commits [11]. When an OLTP system crashes, the latest committed data in volatile memory may be lost, since it may not yet have been flushed to non-volatile media. By replaying the committed log, the database can be recovered to its state at the moment of the crash, making all committed data available again.
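The WAL rule itself fits in a few lines (a sketch, with os.fsync standing in for the durability barrier and a JSON line per commit record): log records reach stable storage before the commit is acknowledged, so replaying the log after a crash reconstructs every committed update.

```python
import json
import os

def commit(log_path, txn_id, updates, table):
    """Force the commit record to stable storage, then apply in memory."""
    with open(log_path, "a") as log:
        log.write(json.dumps({"txn": txn_id, "updates": updates}) + "\n")
        log.flush()
        os.fsync(log.fileno())             # WAL: durable before commit returns
    table.update(updates)                  # in-memory state may lag the log

def recover(log_path):
    """Replay committed records in log order to rebuild the database state."""
    table = {}
    with open(log_path) as log:
        for line in log:
            table.update(json.loads(line)["updates"])
    return table

table = {}
commit("wal.log", 1, {"acct:1": 90}, table)
assert recover("wal.log") == table         # the log alone reproduces committed state
```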
Traditional database systems often rely on high-end, expensive hardware to provide high availability, and they regard hardware faults as abnormal events. To recover quickly from a system crash, conventional database systems adopt replication techniques that copy the transaction log from the primary DBMS instance to a backup instance. When the primary node goes down because of a failure, the backup node takes over as primary and continues to handle client requests. Although classic primary-backup replication is easy to understand and implement, it cannot achieve availability and consistency at the same time. If we require instantaneous consistency between the primary and the backup, the system loses availability when certain failures occur: to keep consistency, each transaction cannot be committed until the backup has received the log.