eXtremeDB Cluster is a distributed database implementation. Instances of applications can run on different nodes in a network, and all database modifications will be automatically replicated between all instances. For example, the following diagram illustrates a 5-node cluster.
![]()
All nodes in a cluster hold the same copy of the database. All nodes in a cluster are equal, i.e. eXtremeDB Cluster has no dedicated master or replica nodes (unlike eXtremeDB High Availability). Each node can perform `READ_WRITE` and `READ_ONLY` operations. `READ_ONLY` transactions are local, i.e. they don't require network operations, and run as fast as with a standalone database.
Theory of operation
To update a database, a two-phase commit protocol is used. All operations are local and involve no network communication until the transaction is committed. The update scenario is as follows:
1. The thread running on the cluster node (called the initiator) calls the C/C++ function `mco_trans_commit()`, or the Java/C# `ClusterConnection.CommitTransaction()` method, for a `READ_WRITE` transaction.
2. Internally, the eXtremeDB Cluster implementation of the transaction commit invokes `commit_phase1()` for the initiator database copy. If `commit_phase1()` fails, the transaction is aborted and rolled back locally. (No network communication happens.)
3. If `commit_phase1()` succeeds, the transaction is serialized and sent to all other nodes (called participants) in the cluster.
4. Each participant opens a transaction, applies the serialized data from the initiator and performs `commit_phase1()`.
5. The results of these operations (`SUCCESS` or `FAIL`) are sent back to the initiator.
6. The initiator gathers replies from all participants and decides how to complete the global transaction:
   - If all participants report `SUCCESS`, the initiator completes the local transaction with `commit_phase2()` and sends `CONFIRM` messages to the participants.
   - If at least one participant replies with `FAIL`, the initiator rolls back the transaction and sends `REJECT` messages to the participants.
   - Depending on the type of message (`CONFIRM` or `REJECT`), participants perform `commit_phase2()` or `rollback()` to complete the transaction.

Accordingly, database changes are either applied to all nodes, or rolled back. Since pessimistic locking is extremely expensive and subject to unpredictable failure points when implemented in a network environment, eXtremeDB Cluster employs only the MVCC transaction manager.
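
The commit steps above can be illustrated with a small in-memory simulation. This is only a sketch of the message flow, not the eXtremeDB implementation: the `Node` class, the `cluster_commit()` function and the `fail_phase1` switch are hypothetical names invented for the example, and the "network" is just a loop over Python objects.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for one cluster node holding a full database copy."""
    name: str
    data: dict = field(default_factory=dict)   # committed state
    pending: Optional[dict] = None              # changes staged by phase 1
    fail_phase1: bool = False                   # hypothetical switch to force a FAIL

    def commit_phase1(self, changes: dict) -> bool:
        """Stage the changes; return True for SUCCESS, False for FAIL."""
        if self.fail_phase1:
            return False
        self.pending = dict(changes)
        return True

    def commit_phase2(self) -> None:
        """Make the staged changes part of the committed state."""
        self.data.update(self.pending)
        self.pending = None

    def rollback(self) -> None:
        """Discard any staged changes."""
        self.pending = None


def cluster_commit(initiator: Node, participants: list, changes: dict) -> bool:
    """Run one global transaction; return True if it was applied on all nodes."""
    # Step 2: phase 1 on the initiator. A failure here stays purely local.
    if not initiator.commit_phase1(changes):
        initiator.rollback()
        return False
    # Steps 3-5: "send" the changes to every participant and collect the replies.
    replies = [p.commit_phase1(changes) for p in participants]
    # Step 6: CONFIRM only if every participant reported SUCCESS, else REJECT.
    if all(replies):
        initiator.commit_phase2()
        for p in participants:
            p.commit_phase2()
        return True
    initiator.rollback()
    for p in participants:
        p.rollback()
    return False
```

In this sketch a single `FAIL` reply is enough to leave every copy unchanged, which mirrors the all-or-nothing outcome described above.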
The eXtremeDB Cluster implementation automatically monitors the availability of nodes using timeouts and keep-alive messages. If enough nodes become unavailable that there is no quorum, the processing of `READ_WRITE` transactions may be suspended.
The schema for a C/C++ application using a database that participates in the cluster must include the `auto_oid` declaration.
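
As a rough illustration of timeout-based availability tracking with keep-alive messages, the sketch below marks a peer as unavailable once it has been silent for longer than a timeout. It is a generic pattern, not the eXtremeDB mechanism: the `KeepAliveMonitor` class, its method names and the timeout value are all assumptions made for this example, and timestamps are passed in explicitly so the behavior is deterministic.

```python
class KeepAliveMonitor:
    """Track which peers count as available, based on their last keep-alive."""

    def __init__(self, peers, timeout):
        self.timeout = timeout                     # tolerated silence, in seconds (assumed)
        self.last_seen = {p: 0.0 for p in peers}   # peer -> time of last keep-alive

    def on_keep_alive(self, peer, now):
        """Record that a keep-alive message from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def available(self, now):
        """Return the set of peers heard from within the timeout window."""
        return {p for p, t in self.last_seen.items() if now - t <= self.timeout}
```

The set returned by `available()` is the kind of input a quorum check would need in order to decide whether `READ_WRITE` processing should continue.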
Terminology
The term Node Address refers to a character string containing the transport-layer-dependent address of a node. For example, a node address for the TCP/IP transport could be `<hostname>:<port>` or `<IP-address>:<port>`. (See examples below.)
The term Node Quorum Rank refers to an integer value that is used to determine whether enough interconnected nodes are online to work. If the sum of the `qrank` values for node `A` and all nodes connected to `A` is more than half of the sum of the `qrank` values for all cluster nodes (online and offline), then node `A` (and the other online nodes connected to it) will continue to work.
It is helpful to consider two node quorum rank example scenarios:
1. All nodes have the same `qrank` value: in this case the cluster will work if more than half of the nodes are active and connected to each other. If the cluster is split exactly in half, the cluster will not work.
2. One node (call it `main`) has a `qrank` value of 1, all other nodes have a `qrank` value of 0: in this case the group of nodes connected to `main` will always work. If the `main` node dies, the cluster ceases to work.

Note that in any scenario, if the cluster nodes split into several disconnected parts, `WRITE` operations will be allowed in at most one of these parts.
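
The quorum rule can be written as a single comparison. The sketch below is a direct transcription of the definition above; the function name `has_quorum()` and the node names are hypothetical and chosen only for this example.

```python
def has_quorum(qranks, connected):
    """Return True if the group of nodes in `connected` may keep working.

    qranks:    mapping of node name -> qrank for ALL cluster nodes,
               online and offline
    connected: the node itself plus every node it is currently connected to
    """
    reachable = sum(qranks[n] for n in connected)
    total = sum(qranks.values())
    # "More than half" is strict, so an exact half is not enough.
    return 2 * reachable > total


# Scenario 1: four nodes with equal qrank values.
equal = {"A": 1, "B": 1, "C": 1, "D": 1}
print(has_quorum(equal, {"A", "B", "C"}))   # True: 3 of 4
print(has_quorum(equal, {"A", "B"}))        # False: exactly half

# Scenario 2: only `main` carries rank.
ranked = {"main": 1, "n1": 0, "n2": 0, "n3": 0}
print(has_quorum(ranked, {"main"}))             # True: main alone is enough
print(has_quorum(ranked, {"n1", "n2", "n3"}))   # False: main is missing
```

Because the comparison is strict, two disjoint groups can never both satisfy it, which is why writes are possible in at most one part of a split cluster.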
Overview
eXtremeDB Cluster is built on the eXtremeDB High Availability technology. However, while eXtremeDB High Availability provides the synchronization / replication support in a master-replica scenario, eXtremeDB Cluster provides a simple interface for managing a peer-to-peer network of “equal” nodes.
As a new node is connected to the cluster, it is synchronized in “hot” mode, i.e. the newly connected node receives synchronization data and transactions from other nodes. Other currently connected cluster nodes can still perform `READ_WRITE` transactions during the startup synchronization of the new node. (The new node selects the active node with the largest number of completed transactions at the moment of connection to perform the synchronization. If all the cluster nodes are equally loaded with completed transactions, then the synchronizer node is selected randomly from the active cluster nodes.)
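
The selection rule for the synchronizer node could be sketched as follows. This is an illustration under assumptions, not the product's code: `pick_synchronizer()` and its input format are hypothetical, and the sketch picks randomly among all nodes tied for the highest count, which covers the "all nodes equally loaded" case described above as well as partial ties.

```python
import random


def pick_synchronizer(completed, rng=random):
    """Choose the node a newly connected node should synchronize from.

    completed: mapping of active node name -> number of completed transactions
    rng:       object providing choice(); defaults to the random module
    """
    if not completed:
        raise ValueError("no active nodes to synchronize from")
    best = max(completed.values())
    # Every node tied for the highest count is an equally good candidate.
    candidates = sorted(n for n, c in completed.items() if c == best)
    return rng.choice(candidates)
```

For example, `pick_synchronizer({"A": 10, "B": 42, "C": 7})` returns `"B"`, while a call with equal counts returns one of the tied nodes at random.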