Aerospike distributed ACID transaction design
Discover the design of distributed ACID transactions in Aerospike 8. Ideal for developers seeking rigorous transactional systems.
This blog outlines the design of distributed ACID transactions (transactions in short henceforth) in Aerospike 8. It assumes no prior knowledge of Aerospike and is not intended to guide application development using transactions. The target audience includes developers and architects familiar with database systems who wish to understand Aerospike's approach to implementing transactions to assess the design's rigor and form a high-level opinion on the expected performance of transactions. Consequently, we omit low-level details about client APIs, which are covered in Aerospike's official documentation. We begin with a brief overview of Aerospike for unfamiliar readers, which those already knowledgeable can skip.
Quick Aerospike background
Aerospike applications link to an Aerospike client library (referred to as "client") that interacts with the Aerospike server cluster. Each record in Aerospike is uniquely identified by a key, owned by one node in the cluster, and may be replicated on other nodes. The number of replicas is configurable. For each record, the Aerospike client knows the master node and any nodes containing replicas.
A consistency mode called strong consistency ensures commands on a single record are linearizable. Transactions require the Aerospike cluster to operate in a strongly consistent mode. The client communicates only with the node containing the primary copy for any commands on the record, while the server cluster ensures replica consistency without client involvement.
Aerospike uses a primary index to quickly map a key to its disk location and supports secondary indexes for efficient query processing. Additionally, Aerospike supports namespace (table) scans, but neither scans nor queries are part of transactions.
Distributed ACID transaction design goals
The primary design goals for the transaction subsystem were:
Ensure near-zero impact on read/write command performance when no transaction is running, meaning that the code path remains unaffected except for data contention.
Support transactions that consist of an arbitrary sequence of reads and writes of records within the same namespace, submitted by the Aerospike client, bracketed by a begin and end transaction request. The records being read or written do not need to be known beforehand, enabling applications like graph traversals or linked data structures to function seamlessly.
Guarantee strict serializability for applications using transactions.
Exclude scans and queries from transactions.
High-level client API for distributed ACID transactions
The client library exposes several commands to the application, with detailed API specifications available in our documentation. The key “conceptual” commands are :
Begin transaction (timeout): Starts a transaction with a timeout parameter and returns a transaction ID, which must be included in all subsequent requests within this transaction.
Read (transaction_id, X): Reads the record with key X as part of the transaction corresponding to transaction_id.
Write (transaction_id, X): Writes the record with key X as part of the transaction corresponding to transaction_id.
Abort transaction (transaction_id): Requests the abortion of the transaction.
Commit transaction (transaction_id): Requests the commitment of the transaction.
Note: the actual mechanisms to achieve some of the conceptual commands below will depend on the client language being used.
Client library as the coordinator for distributed transactions
The Aerospike client is the primary coordinator of a transaction. The state of the transaction is tracked in a replicated monitor record. This allows the server to take over as coordinator if the client fails. To illustrate the above conceptual commands, we consider a running example transaction with four steps: a Begin Transaction, a read, a write, and a commit.
Figures 1-4 depict the workflow between the application, the client library, and the server nodes, corresponding to these four steps applied one after the other. Figure 1 depicts what happens when the application issues a Begin Transaction command. The client assigns a transaction_id of 23 to this transaction.
Aerospike uses a combination of optimistic concurrency control for read commands and strict two-phase locking for write commands. While write locks are held until the transaction ends, reads are verified for correctness when the transaction ends. This method ensures that there are no cycles in the read-write precedence graph.
data:image/s3,"s3://crabby-images/fb696/fb696401af1ae71bfb2a034651d6d09a48b94c15" alt="distributed-acid-transactions-figure-1-blog"
Dual records in Aerospike 8: Atomicity and lock primitive
The server must ensure two things for a record written as part of a transaction:
1. Lock the record, and
2. Guarantee that it can either roll back this record or complete the transaction by rolling the record forward as directed by the coordinator.
Aerospike 8 achieves both using a simple and effective technique. When a transaction updates a record, the primary index entry for that record holds two versions (called dual records henceforth):
One for the prior committed version (v1), and
Another for the currently updated version (v2)
The presence of these two versions acts as a lock on the record. If the transaction successfully commits, the primary index entry will have only one version corresponding to the currently updated version (v2). If the transaction fails, the primary index will have only one version corresponding to the prior committed version (v1).
This dual record concept is integral to the Aerospike design and code, ensuring that replication is aware of dual records. When a record's primary copy has a dual record, it is replicated to the replica copies. Conversely, when the primary copy goes back to a single version of the record, replication using strong consistency ensures the replicas also go back to a single version of the record.
Transaction reads
At the client, when a transaction reads an item, a read request is sent to the appropriate node. There are three possible cases:
If there is only one entry at the server, it is returned along with its version number.
If the server has a dual record and the same transaction is updating the item, the current version and version number are returned.
The server node fails the request if another transaction is currently updating the dual record.
Upon a failed read, the client notifies the application, which can choose to retry the read or abort the transaction. If the read succeeds, the client records the version number before passing the record to the application. The client tracks all items read by the transaction and their version numbers for validation at the end of the transaction in a Read Set. If the item being read is also written, it is not tracked in the Read Set as this item is locked, and there is no need for validation. Figure 2 illustrates a transaction read with a read (23,X) command as the second command in our running example.
data:image/s3,"s3://crabby-images/85cb4/85cb4cdfa9b649fa95c130ecb1d19269b357fbbd" alt="distributed-acid-transactions-figure-2"
Transaction writes
When the client writes an item, the client first logs the write in the monitor record (the server automatically creates the monitor record if it does not exist), and then a write request is sent to the appropriate node. Again, there are three cases possible:
If there is only one primary index entry at the server node, a second entry is created, locking the record (if this item has been read by this transaction before, the write succeeds only if the version read matches the current version).
If there is a dual record and the same transaction is updating the item, the record is updated and the version is also updated.
The server node fails the request if another transaction is currently updating the record. The client informs the application, which can choose to retry the write or abort the transaction.
The client tracks all items written by the transaction. Figure 3 illustrates transaction writes this with a write (23,Y), i.e., transaction_id 23 issues a write for key Y following read(23,X). Although the figure depicts what would happen in all three cases above, we assume we are in Case 1 for the purpose of this running example, and version v2 for item Y was written by this transaction.
data:image/s3,"s3://crabby-images/5aea0/5aea0aaba9ac79575bd19a759c20ebd91dd4c667" alt="distributed-acid-transactions-figure-3"
Transaction abort
When the application requests an abort, it processes the list of writes for the transaction and instructs the server nodes to undo the writes, restoring the primary index to its pre-transaction state. The client ensures all writes are undone, retrying until successful. Reads do not require any action.
Transaction commit
The client first verifies that all transaction reads are still valid by checking current version numbers with the server nodes. If any read is invalid, the commit is aborted, and the application is informed. If all reads are valid, the client updates a commit bit on the monitor record to indicate the commit status of the transaction. If the client fails during a transaction, then the commit bit informs us of the state of the transaction. The client lets the application know of the status of the commit. If the commit is successful, the client then instructs the server nodes to finalize the writes, making the newer version the only record version.
If the client successfully rolls forward in the case of a commit or rolls back in the case of an abort, then the client will delete the monitor record. The server periodically scans all the monitor records and either rolls back or rolls forward the transaction based on the commit bit's state. Figure 4 illustrates what happens when transaction 23 commits the transaction.
data:image/s3,"s3://crabby-images/e1600/e1600367244c8d862727ed144e2e6118ba54b8de" alt="distributed-acid-transactions-figure-4"
Read/write command behavior in Aerospike 8
This section discusses the behavior of read/write commands on single records when they are not part of a transaction. The client API for read/write commands remains unchanged, as does most of the client-side implementation. The server undergoes a minor change to ensure these single record commands are serializable with transactions.
Read of a record
All reads succeed. If the primary index has only one entry, it is returned, consistent with the read behavior before transactions were introduced. If the primary index has a dual record (indicating another transaction is updating this record), the previously committed version is returned. This ensures the read always succeeds and precedes the ongoing transaction in the serializable schedule. The transition from a dual record to a single record occurs after a transaction successfully commits. This leads to a small window where a read that starts after a transaction commits may still encounter a dual record but return the previously committed version. This makes the reads not strictly serializable with the transactions.
Write of a record
If there is only one primary index entry, writes create a newer record version. If the primary index has two entries, the write fails immediately, and the client is informed of the failure.
Behavior of scans and queries
Scans and Queries are not part of transactions. If there is only one entry in the primary index, the scan/query will just read that record. If there are two entries in the primary index, the scan/query will return the previously committed version.
Explore transactions with Aerospike 8
Aerospike 8 is a huge step forward for customers who want atomicity and isolation for their transactions that span multiple records. Strict serializability is the strongest guarantee any system can provide. Distributed ACID transactions can achieve it with an elegant design that leverages our strong consistency algorithm and intelligently blends optimistic concurrency control for reads and a lock-based mechanism for writes.