Cluster Binary Schema Evolution
The Cluster runtime supports the Binary Schema Evolution (BSE) feature of eXtremeDB, thus nodes in the single cluster are able to have different database schemas. To enable binary evolution mode the application must set the
BINARY_EVOLUTION
flag in themode_mask
field on all nodes.Using the C/C++ API, this is done as follows:
mco_cluster_params_t cl_params; ... mco_cluster_params_init(&cl_params); ... cl_params.mode_mask |= MCO_CLUSTER_MODE_BINARY_EVOLUTION; /* use BSE */In the Java API:
Database.Parameters params = new Database.Parameters(); ... params.clusterParams = new Database.ClusterParams(...); ... params.clusterParams.mode |= Database.CLUSTER_MODE_BINARY_EVOLUTION;In the C# API:
Database.Parameters parameters = new Database.Parameters(); ... parameters.ClusterParameters = new Database.ClusterParams(...); ... parameters.ClusterParameters.Mode |= Database.CLUSTER_MODE_BINARY_EVOLUTION;Note that if nodes
A
andB
set theCLUSTER_MODE_BINARY_EVOLUTION
flag but have exactly the same schema, the objects between these nodes are sent without any conversion. In this case the only overhead for using the binary evolution flag is the additional space allocated in the database header (at database creation time). This space is used to store the dictionaries from other nodes.Saving and Loading a snapshot of a cluster database
A snapshot of a cluster database can be saved by calling function
mco_db_save()
. The ability to load a cluster database from a snapshot (image file) is enabled through the following two fields in themco_cluster_params_t
structure passed tomco_cluster_db_open()
:void * stream_handle; mco_stream_read input_stream_reader;These fields correspond to two the first two parameters of the
mco_db_load()
API, wherestream_handle
is a handle to theinput stream
andinput_stream_reader
is the handler function called by the runtime to read the input.If the values for these two parameters are
NULL
, thenmco_cluster_db_open()
callsmco_db_open_dev()
, otherwisemco_db_load()
is called. By default (set inmco_cluster_params_init()
) these values are set toNULL
.For example, with code like the following, a cluster-based in-memory database can be opened (created) or loaded from a saved database snapshot (image file):
mco_size_sig_t file_reader(void* stream_handle /* FILE* */, void* to, mco_size_t max_nbytes) { return (mco_size_t) fread(to, 1, max_nbytes, (FILE*)stream_handle); } mco_size_sig_t file_writer(void* stream_handle /* FILE* */, const void* from, mco_size_t max_nbytes) { return (mco_size_t) fwrite(from, 1, max_nbytes, (FILE*)stream_handle); } main() { mco_db_h db; mco_cluster_params_t cl_params; mco_cluster_params_init(&cl_params); <fill cl_params> FILE *rfile = fopen("<image_file>", "r"); if (rfile) /* if image exists, load database */ { cl_params.stream_handle = rfile; cl_params.input_stream_reader = &file_reader; } mco_cluster_db_open("<db_name>",...., &cl_params); <start mco_cluster_listen() thread> mco_db_connect("<db_name>", &db); <... Do some work ...> mco_cluster_stop(db); /* save database */ FILE *wfile = fopen("<image_file>", "w"); mco_db_save(wfile, &file_writer, db); <close database> }Cluster Transaction Window
The transaction window in the eXtremeDB Cluster allows an application to merge transaction data from several commit threads into a single
send()
operation reducing the number of system calls. For small transactions this can significantly improve the overall performance in the cluster database. The C APIsmco_cluster_get_window_params()
andmco_cluster_set_window_params()
are used to retrieve the current window parameters and to set new values.The transaction window has three threshold values specified in structure
mco_cluster_window_t
:bsize
,length
andtimeout
. The first one (bsize
) defines the maximum size of the "buffered" transaction data in bytes,length
is the maximum number of buffered transactions andtimeout
is the maximum delay ( in milliseconds) before sending the buffered data to other nodes. The buffered data will be sent as soon as any of these thresholds is reached. So the number of buffered transactions is never more thanmco_cluster_window_t::length
and the delay is never more thanmco_cluster_window_t::timeout
.The initial parameters of the window are set via the
mco_cluster_params_t::window
passed to the C APImco_cluster_db_open()
. The default values arelength=0
,bsize=0
andtimeout=1
, which means that the window is not used because the maximum number of buffered transactions (length
) is zero. If one of the cluster nodes sets the windowlength
to a value greater than 0, all other nodes also must use thelength
greater than 0, but the values may be different on the different nodes. (Note that a windowlength = 1
is roughly the same aslenght = 0
). The value of the windowbsize
can be 0, which means "unlimited data size".Note that normally the window
length
should not exceed the number of active write threads - otherwise the data will always be sent by thetimeout
threshold. For example, the window is absolutely useless if the application has only one write thread. Usually the best results are reached if the windowlength
is in the range of 50%-75% times the number of writers. The value of the windowbsize
can be used to limit the size of buffered transaction data in the case of "large" transactions.Cluster and Transaction Logging
In some cases it may seem desirable to optimize eXtremeDB Cluster performance for persistent databases by instead using the eXtremeDB Transaction Logging feature which provides a possible alternative to using persistent database classes. In order to enable eXtremeDB Transaction Logging (TL), the cluster mode flag
MCO_CLUSTER_MODE_START_DETACHED
must be used to create a cluster database that is not immediately connected to other nodes. This makes it possible to add processing between the moment the database is created and the moment the node is connected to the cluster. Specifically, when TL is enabled, the transaction log can be applied to the cluster database by callingmco_translog_apply()
.For example, the following code opens a cluster database in
detached
mode, applies the transaction log, and then connects to other cluster nodes:main() { mco_db_h db; mco_cluster_params_t cl_params; mco_cluster_params_init(&cl_params); <fill cl_params> FILE *rfile = fopen(image_path, "r"); if (rfile) /* load image if exists */ { cl_params.stream_handle = rfile; cl_params.input_stream_reader = &file_reader; } /* start in detached mode */ cl_params.mode_mask |= MCO_CLUSTER_MODE_START_DETACHED; mco_cluster_db_open(db_name,...., &cl_params); mco_db_connect(db_name, &db); /* Apply transaction log */ mco_translog_apply(db, log_path, MCO_TRANSLOG_ALL_LABELS); /* Connect to other cluster nodes */ mco_cluster_attach(db, &cl_params); <start mco_cluster_listen() thread> /* Start transaction logging */ mco_translog_start(db, log_path, tl_start_data); /* Save snapshot to ensure all remote transactions are saved */ FILE *wfile = fopen(image_path, "w"); mco_translog_db_save(wfile, &file_writer, db)); <... Do usual work ...> mco_cluster_stop(db); mco_translog_stop(db); <close database> }
Distributed Databases - Sharding
Sharding and clustering are two different concepts that exist for different purposes and are architecturally different. Sharding implements database partitioning, while a Cluster, in its default configuration, does not partition the database but rather replicates everything.
Cluster can be used to implement redundancy in a sharding database. In brief, each shard can be not a single node, but a cluster. The implementation is similar to HA Replication as described in the eXtremeDB High Availability User's Guide, with the difference that any node can act as a
master
; i.e. receive SQL statements and replicate the changes to other nodes in the shard.When an eXtremeDB database is distributed using eXtremeDB Cluster the following conditions apply:
- All nodes in a cluster are equal, each node can perform
READ_WRITE
andREAD_ONLY
operationsRead-only
transactions are always local (no network access)Write
transactions are distributed by the Cluster runtime to all (or some) nodes on the cluster
Cluster Runtime Open Flags
Following are explanations for 3 flags that can be specified in the
mco_cluster_params.mode_mask
to affect specific Cluster runtime behavior:1)
MCO_CLUSTER_MODE_EARLY_DATA_SEND
Normally, the cluster runtime sends transaction data to other nodes after the first phase of a commit on the initiator has completed successfully. This
allows the runtime to reduce network traffic for transactions that failed locally (for example, due to a uniqueness violation). On the other hand, sending data earlier, before the local phase_1, can reduce the total time to perform distributed transaction. Flag
MCO_CLUSTER_MODE_EARLY_DATA_SEND
forces the cluster runtime to send data before phase_1.
2.
MCO_CLUSTER_MODE_BINARY_EVOLUTION
- this flag allows nodes to join with different schemas into a single cluster. Also note that this flag prohibits any DDLstatements in the cluster.
3.
MCO_CLUSTER_MODE_START_DETACHED
Normally,
mco_cluster_db_open()
creates a database, connects to other cluster's nodes and synchronizes the database content. TheMCO_CLUSTER_MODE_START_DETACHED
flag tellsmco_cluster_db_open()
to just create the database and skip any network operations. To add this node to the cluster, the application must callmco_cluster_attach()
later.