Sharding is a way of distributing database data across multiple servers or splitting a database into multiple instances (“shards”) within a single process. For multiple servers, sharding is accomplished via the McoDistributedSqlEngine. Creating multiple shards within a single process is managed through the SqlAggregator class. xSQLcan be used to manage database shards in shared memory. For example, two shards can be created in two different console windows as follows:
Console 1: ./xsql -shared -dbname xsqldb0 -size 100M -i Console 2: ./xsql -shared -dbname xsqldb1 -size 100M -iThen the two shards can be managed by a third instance of xSQL launched with the command line option -shards. For example:
Console 3: ./xsql xsqldb%d -shards 2The configuration file parameter
sql_n_shards
can also be used to start xSQL to manage database shards.When a database is created and different shards created on separate network nodes, the McoDistributedSqlEngine sends a query to one of the network nodes, or broadcasts queries to all nodes and merges results accordingly.
To demonstrate, suppose we have database schema
create table Orders ( orderId int primary key, product string, customer string, price double, volume double )and data in CSV format:
orderId|product|customer|price|volume 1|A|james|10.0|100 2|B|bob|50.0|200 3|A|paul|11.0|300 4|C|paul|100.0|150 5|B|bob|52.0|100 6|B|bob|49.0|500 7|A|james|11.0|100 8|C|paul|105.0|300 9|A|bob|12.0|400 10|C|james|90.0|200First of all, we need to create several databases and run xSQL as a server on them. We will do this by using the following command-line parameters:
xsql -size 10m -p 10001This creates an in-memory database with size 10 Mbytes and starts xSQL as a server listening on port number 10001. Then we start two other server instances on ports 10002 and 10003:
xsql -size 10m -p 10002 xsql -size 10m -p 10003Now we can connect to all three servers as a client using the McoDistributedSqlEngine by starting xSQL with the addresses of these servers as command-line options:
xsql @127.0.0.1:10001 @127.0.0.1:10002 @127.0.0.1:10003Alternatively the addresses could be stored in a config file:
File client.cfg: { remote_client : [ "127.0.0.1:10001", "127.0.0.1:10002","127.0.0.1:10003"] }Then we would simply start xSQL with the following command:
xsql -c client.cfgAfter connecting to servers, xSQL goes to interactive mode. First we create a table on all nodes with command:
XSQL>create table Orders (orderId int primary key, product string, customer string, price double, volume double);By default, the distributed SQL engine sends queries to all nodes. So the create table statement will be executed by all three nodes. Then we can distribute data across the servers using the following SQL statement:
XSQL>insert into Orders select * from foreign table (path='order.csv', skip=1,nostop=1) as Orders where mod(orderId, %#)=%@;The pseudo-parameters
%#
and%@
refer to the total number of nodes and the zero-based node Id respectively. In our case%#
equals 3 and%@
is 0 for the first server (running on port 10001), 1 for the second server and 2 for the third server. For example, on the second server the statement will be equivalent toXSQL>insert into Orders select * from foreign table (path='order.csv', skip=1,nostop=1) as Orders where mod(orderId, 3)=1;As a result orders with Ids 1, 4, 7 and 10 will be inserted on node 2. (Note that if you start servers on different hosts, the file
order.csv
must be accessible on all hosts.) To check the data we select records from all nodes:XSQL>select * from Orders order by orderId; orderId product customer price volume ------------------------------------------------------------------------------------ 1 A james 10 100 2 B bob 50 200 3 A paul 11 300 4 C paul 100 150 5 B bob 52 100 6 B bob 4 9 500 7 A james 11 100 8 C paul 105 300 9 A bob 12 400 10 C james 90 200 Selected records: 10And select records from the second server only:
XSQL>2:select * from Orders; orderId product customer price volume ------------------------------------------------------------------------------------ 1 A james 10 100 4 C paul 100 150 7 A james 11 100 10 C james 90 200Sorting result rows
When the data is distributed in such a way that the merged query results don't need to be sorted in any particular order, it is recommended to use the specialized syntax with the “
*:
” prefix in the SQL statement. For example:XSQL>*:select * from Orders;instead of
XSQL>select * from Orders;The “
*:
” prefix instructs the SQL engine to not resort the combined result set, which is beneficial from a performance standpoint. Both statements will create a result set that includes the results from the first shard concatenated with the result set from the second shard and so on for each shard; and both final results are correct. But if the statement requires that the final results are sorted, then only the first form should be used. In other words, the followingselect
returns a sorted result set:XSQL>select * from Orders order by orderId; orderId product customer price volume ------------------------------------------------------------------------------ 1 A james 10 100 2 B bob 50 200 3 A paul 11 300 4 C paul 100 150 5 B bob 52 100 6 B bob 49 500 7 A james 11 100 8 C paul 105 300 9 A bob 12 400 10 C james 90 200 Selected records: 10However, the following
select
, though faster, returns an unordered (incorrect!) result set:XSQL>*:select * from Orders order by orderId; orderId product customer price volume ------------------------------------------------------------------------------ 3 A paul 11 300 6 B bob 49 500 9 A bob 12 400 1 A james 10 100 4 C paul 100 150 7 A james 11 100 10 C james 90 200 2 B bob 50 200 5 B bob 52 100 8 C paul 105 300 Selected records: 10For another example consider the
select
statement:XSQL>select count(*) from Orders ; #1 ------------------------------------------------------------------------------ 10 Selected records: 1This statement returns a single row with the correct aggregate value of 10 - the total number of rows from all of the shards. The SQL engine merges the result sets from each shard. But the following unsorted form of the statement returns three rows which display the correct count for each shard:
XSQL>*:select count(*) from Orders; #1 ------------------------------------------------------------------------------ 3 4 3 Selected records: 3Here the remote SQL engine does not merge the results sets.
(Note that the “
*:
” prefix is simply an extension of the shard number “N:
” notation that instructs the SQL engine to execute the statement on only the specified single shard; naturally the*
indicates "all shards").More elaborate distributed queries
We can also perform more elaborate queries on distributed shards, using
group by
andorder by
clauses. For example:XSQL>select product, sum(price*volume) as s from Orders group by product; product s ----------------------------------------------------------------------- A 10200.000000 B 39700.000000 C 64500.000000 Selected records: 3and:
XSQL>select customer, sum(price*volume) as s from Orders group by customer order by s; customer s ----------------------------------------------------------------------- james 20100.000000 bob 44500.000000 paul 49800.000000 Selected records: 3Here the
sum()
function is applied to the merged result sets from all shards.