Sharding with xSQL

Sharding is a way of distributing database data across multiple servers or splitting a database into multiple instances (“shards”) within a single process. For multiple servers, sharding is accomplished via the McoDistributedSqlEngine. Creating multiple shards within a single process is managed through the SqlAggregator class. xSQLcan be used to manage database shards in shared memory. For example, two shards can be created in two different console windows as follows:

 
    Console 1:
     
    ./xsql -shared -dbname xsqldb0 -size 100M -i
     
    Console 2:
     
    ./xsql -shared -dbname xsqldb1 -size 100M -i
     

Then the two shards can be managed by a third instance of xSQL launched with the command line option -shards. For example:

 
    Console 3:
     
    ./xsql xsqldb%d -shards 2
     

The configuration file parameter sql_n_shards can also be used to start xSQL to manage database shards.

When a database is created and different shards created on separate network nodes, the McoDistributedSqlEngine sends a query to one of the network nodes, or broadcasts queries to all nodes and merges results accordingly.

To demonstrate, suppose we have database schema

 
    create table Orders ( orderId int primary key,
                product string,
                customer string,
                price double,
                volume double )
                 

and data in CSV format:

     
    orderId|product|customer|price|volume
    1|A|james|10.0|100
    2|B|bob|50.0|200
    3|A|paul|11.0|300
    4|C|paul|100.0|150
    5|B|bob|52.0|100
    6|B|bob|49.0|500
    7|A|james|11.0|100
    8|C|paul|105.0|300
    9|A|bob|12.0|400
    10|C|james|90.0|200
     

First of all, we need to create several databases and run xSQL as a server on them. We will do this by using the following command-line parameters:

 
    xsql -size 10m -p 10001
     

This creates an in-memory database with size 10 Mbytes and starts xSQL as a server listening on port number 10001. Then we start two other server instances on ports 10002 and 10003:

     
    xsql -size 10m -p 10002
     
    xsql -size 10m -p 10003
     

Now we can connect to all three servers as a client using the McoDistributedSqlEngine by starting xSQL with the addresses of these servers as command-line options:

 
    xsql @127.0.0.1:10001 @127.0.0.1:10002 @127.0.0.1:10003
     

Alternatively the addresses could be stored in a config file:

 
    File client.cfg:
 
    {
        remote_client : [ "127.0.0.1:10001", "127.0.0.1:10002","127.0.0.1:10003"]
    }
     

Then we would simply start xSQL with the following command:

     
    xsql -c client.cfg
     

After connecting to servers, xSQL goes to interactive mode. First we create a table on all nodes with command:

 
    XSQL>create table Orders (orderId int primary key, product string, 
                    customer string, price double, volume double);
     

By default, the distributed SQL engine sends queries to all nodes. So the create table statement will be executed by all three nodes. Then we can distribute data across the servers using the following SQL statement:

 
    XSQL>insert into Orders select * from foreign table (path='order.csv', skip=1,nostop=1) 
        as Orders where mod(orderId, %#)=%@;
     

The pseudo-parameters %# and %@ refer to the total number of nodes and the zero-based node Id respectively. In our case %# equals 3 and %@ is 0 for the first server (running on port 10001), 1 for the second server and 2 for the third server. For example, on the second server the statement will be equivalent to

 
    XSQL>insert into Orders select * from foreign table (path='order.csv', skip=1,nostop=1) 
        as Orders where mod(orderId, 3)=1;
         

As a result orders with Ids 1, 4, 7 and 10 will be inserted on node 2. (Note that if you start servers on different hosts, the file order.csv must be accessible on all hosts.) To check the data we select records from all nodes:

 
    XSQL>select * from Orders order by orderId;
    orderId 	product 	customer    price   	     volume
    ------------------------------------------------------------------------------------
    1       	A      	 james   	10      	100
    2      	B       	bob     	50      	200
    3       	A       	paul    	11      	300
    4       	C       	paul    	100     	150
    5       	B       	bob     	52      	100
    6       	B       	bob     	4
    9      	500
    7       	A       	james   	11      	100
    8       	C       	paul    	105     	300
    9       	A       	bob     	12      	400
    10      	C       	james   	90      	200
 
    Selected records: 10
     

And select records from the second server only:

 
    XSQL>2:select * from Orders;
    orderId 	product 	customer    price 	volume
    ------------------------------------------------------------------------------------
    1       	A       	james   	10       	100
    4       	C       	paul    	100	150
    7       	A       	james   	11	100
    10      	C       	james   	90       	200
     

Sorting result rows

When the data is distributed in such a way that the merged query results don't need to be sorted in any particular order, it is recommended to use the specialized syntax with the “*:” prefix in the SQL statement. For example:

 
    XSQL>*:select * from Orders;
     

instead of

 
    XSQL>select * from Orders;
     

The “*:” prefix instructs the SQL engine to not resort the combined result set, which is beneficial from a performance standpoint. Both statements will create a result set that includes the results from the first shard concatenated with the result set from the second shard and so on for each shard; and both final results are correct. But if the statement requires that the final results are sorted, then only the first form should be used. In other words, the following select returns a sorted result set:

 
    XSQL>select * from Orders  order by orderId;
    orderId 	product 	customer	 price 	volume
    ------------------------------------------------------------------------------
    1       	A      	 james   	10      	100
    2      	B       	bob     	50      	200
    3       	A       	paul    	11      	300
    4       	C       	paul    	100     	150
    5       	B       	bob     	52      	100
    6       	B       	bob     	49      	500
    7       	A       	james   	11      	100
    8       	C       	paul    	105     	300
    9       	A       	bob     	12      	400
    10      	C       	james   	90      	200
     
    Selected records: 10
     

However, the following select, though faster, returns an unordered (incorrect!) result set:

 
    XSQL>*:select * from Orders  order by orderId;
    orderId 	product 	customer  	price   	volume
    ------------------------------------------------------------------------------
    3      	A       	paul    	11      	300
    6      	B       	bob     	49      	500
    9       	A       	bob     	12      	400
    1       	A       	james   	10      	100
    4       	C       	paul    	100     	150
    7       	A       	james   	11      	100
    10      	C       	james   	90      	200
    2       	B       	bob     	50      	200
    5       	B       	bob     	52      	100
    8       	C       	paul    	105     	300
     
    Selected records: 10
     

For another example consider the select statement:

 
    XSQL>select  count(*) from Orders ;
    #1
    ------------------------------------------------------------------------------
    10
     
    Selected records: 1
     

This statement returns a single row with the correct aggregate value of 10 - the total number of rows from all of the shards. The SQL engine merges the result sets from each shard. But the following unsorted form of the statement returns three rows which display the correct count for each shard:

 
    XSQL>*:select count(*) from Orders;
    #1
    ------------------------------------------------------------------------------
    3
    4
    3
     
    Selected records: 3
     

Here the remote SQL engine does not merge the results sets.

(Note that the “*:” prefix is simply an extension of the shard number “N:” notation that instructs the SQL engine to execute the statement on only the specified single shard; naturally the * indicates "all shards").

More elaborate distributed queries

We can also perform more elaborate queries on distributed shards, using group by and order by clauses. For example:

 
    XSQL>select product, sum(price*volume) as s from Orders group by product;
    product s
    -----------------------------------------------------------------------
    A       10200.000000
    B       39700.000000
    C       64500.000000
     
    Selected records: 3
     

and:

     
    XSQL>select customer, sum(price*volume) as s from Orders group by customer order by s;
    customer        s
    -----------------------------------------------------------------------
    james   20100.000000
    bob     44500.000000
    paul    49800.000000
     
    Selected records: 3
     

Here the sum() function is applied to the merged result sets from all shards.