To use an MPI channel in a Cluster application, it must be linked with the mcoclmpi library (instead of mcocltcp). The MPI standard requires that the first MPI call in an MPI program must be MPI_Init() or MPI_Init_thread(), and the last call must be MPI_Finalize(). Another requirement is that MPI_Init() or MPI_Init_thread() and MPI_Finalize() must be called only once per process.

The difference between MPI_Init() and MPI_Init_thread() is that MPI_Init_thread() allows the application to request the desired level of thread support. The Cluster runtime requires the MPI_THREAD_MULTIPLE level, which allows the runtime to call MPI functions from different threads without additional synchronization.
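As an illustration, the following minimal sketch (not taken from the product samples) requests MPI_THREAD_MULTIPLE and verifies the level actually provided before proceeding; the error handling shown is an assumption for illustration only:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;

        /* Request full multithreading support, as required by the Cluster runtime */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        /* The MPI thread levels are ordered, so a simple comparison is sufficient */
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE is not supported by this MPI library\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... open the cluster database and run the application ... */

        MPI_Finalize();
        return 0;
    }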
In view of the above, there are two ways to initialize MPI in cluster applications: explicitly or implicitly. In the first case, the application explicitly calls MPI_Init_thread() / MPI_Finalize(), and passes to mco_cluster_db_open() an opaque MPI object called a communicator. A communicator defines the communication context; in other words, it defines which processes are involved in communications.
The MPI runtime automatically creates the communicator MPI_COMM_WORLD, which includes all the processes. There is also an MPI API that allows creation of derived communicators. The communicator is passed to mco_cluster_db_open() via the communicator field in the mco_clnw_mpi_params_t structure. If this field is not set (or is NULL), MPI_COMM_WORLD is used.
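For example, with explicit initialization a derived communicator can restrict Cluster traffic to a subset of processes. The following fragment is only a sketch: the MPI_Comm_split() color/key values and the choice of which ranks participate are assumptions, and the cast mirrors the (void *)MPI_COMM_WORLD assignment shown in the sample below:

    int world_rank;
    MPI_Comm cluster_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Hypothetical split: only even-ranked processes join the cluster.
       The color selects the group; the key keeps the original rank ordering. */
    MPI_Comm_split(MPI_COMM_WORLD,
                   (world_rank % 2 == 0) ? 0 : MPI_UNDEFINED,
                   world_rank,
                   &cluster_comm);

    if (cluster_comm != MPI_COMM_NULL) {
        /* Pass the derived communicator to the Cluster runtime
           (cl_params is the mco_cluster_params_t variable from the sample below) */
        cl_params.nw.mpi.communicator = (void *)cluster_comm;
    }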
In the second case (implicit MPI initialization), MPI_Init_thread() and MPI_Finalize() are called automatically inside the cluster runtime and the MPI_COMM_WORLD communicator is used. Implicit initialization is a little easier for developers, and the application can still be linked with mcoclmpi, or with mcocltcp if the MPI channel is not used, without any changes.
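With implicit initialization the application simply makes no MPI calls of its own. A rough sketch, assuming the same placeholders as the explicit sample below (database name, dictionary, devices and parameters are elided):

    mco_cluster_params_t cl_params;
    mco_db_params_t      db_params;

    mco_cluster_init();
    mco_cluster_params_init(&cl_params);

    /* No MPI_Init_thread() here: the Cluster runtime initializes MPI itself.
       cl_params.nw.mpi.communicator is left NULL, so MPI_COMM_WORLD is used. */

    <initialize db_params and devs>

    mco_cluster_db_open(dbName, cldb_get_dictionary(), devs, n_dev,
                        &db_params, &cl_params);

    /* ... connect, run transactions, stop the cluster, close the database ... */

    /* No MPI_Finalize() here either: the runtime finalizes MPI */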
Explicit initialization allows the user to control the lifetime of MPI, for example when MPI is also used for custom communications unrelated to eXtremeDB Cluster. The following sample illustrates explicit MPI initialization:

    int main(int argc, char **argv)
    {
        mco_cluster_params_t cl_params;
        mco_db_params_t db_params;
        mco_db_h db;
        mco_device_t devs[N];
        int provided;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        mco_cluster_init();
        mco_cluster_params_init(&cl_params);
        cl_params.nw.mpi.communicator = (void *)MPI_COMM_WORLD;

        <initialize db_params and devs>

        mco_cluster_db_open(dbName, cldb_get_dictionary(), devs, n_dev,
                            &db_params, &cl_params);

        /* start listener threads */
        sample_start_connected_task(&listener_task, ClusterListener, dbName, 0);

        mco_db_connect(dbName, &db);

        ...

        mco_cluster_stop(db);
        sample_join_task(&listener_task);
        mco_db_disconnect(db);
        mco_db_close(dbName);

        MPI_Finalize();
        return 0;
    }
This is very similar to a normal Cluster application using TCP, with the exception of the MPI calls and setting the communicator parameter:

    cl_params.nw.mpi.communicator = (void *)MPI_COMM_WORLD;

Some fields of the mco_cluster_params_t structure are not mandatory for the MPI channel:
uint2 n_nodes - the total number of nodes in the cluster. MPI provides the function MPI_Comm_size(), which returns the number of processes in the communicator, so this Cluster parameter is optional (see the sketch after this list). If n_nodes is specified and does not match MPI_Comm_size(), the error code MCO_E_NW_INVALID_PARAMETER is returned.
uint2 node_id - the ID of the node. MPI provides the function MPI_Comm_rank(), which returns the process ID (rank) in the communicator, so this Cluster parameter is also optional. If it is specified and does not match MPI_Comm_rank(), MCO_E_NW_INVALID_PARAMETER is returned.
mco_cluster_node_params_t * nodes - the node list. This list is also optional for MPI. It can still be used to specify the names of the processes (the addr field in the mco_cluster_node_params_t structure).
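The following sketch shows why these parameters are redundant for the MPI channel: the same information is available from the communicator itself. The fragment assumes the cl_params variable from the explicit sample above; assigning the fields is optional, as explained in the list:

    int n_procs, rank;

    /* MPI already knows the cluster size and this node's identity */
    MPI_Comm_size(MPI_COMM_WORLD, &n_procs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* If set explicitly, these values must agree with MPI,
       otherwise MCO_E_NW_INVALID_PARAMETER is returned */
    cl_params.n_nodes = (uint2)n_procs;
    cl_params.node_id = (uint2)rank;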
Also, the MPI channel currently doesn't use the check_quorum_func and check_quorum_param fields, since MPI doesn't allow dynamic connection / disconnection of processes.
To compile applications that explicitly call MPI functions (and link with the mcoclmpi library), the wrapper scripts mpicc (for C) and mpicxx (for C++) must be used. These scripts, which are part of the MPI package, call gcc/g++ or another standard compiler and know how to build MPI applications, supplying the proper include / library paths, macro definitions and other compiler options.
For UNIX-based systems, specifying PRJ_F_CLUSTER = MPI in the makefile will cause header.mak to automatically set CC = mpicc and CXX = mpicxx, and link the application with the mcoclmpi library.
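A project makefile might then contain something like the following fragment. This is only an illustrative assumption: the include path and the exact layout of the sample makefiles vary by installation; only PRJ_F_CLUSTER = MPI and header.mak come from the build system described above:

    # Illustrative fragment only: select the MPI cluster channel,
    # which makes header.mak use mpicc/mpicxx and link with mcoclmpi
    PRJ_F_CLUSTER = MPI
    include ../header.mak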
To start MPI applications, the mpirun or mpiexec command is used. These commands are also part of the MPI package. The general form is:

    mpirun <mpirun arguments> <program> <program arguments>

The most important mpirun arguments are:

-n <nprocs> - the number of processes (nodes) in the cluster
-machinefile <filename> - the name of a text file that contains the list of nodes, one node per line
For example, to run the application mpi_test on nodes nodeA and nodeB, create the file nodes with the lines:

    nodeA
    nodeB

Then execute the command:

    mpirun -n 2 -machinefile ./nodes ./mpi_test
As mentioned above, n_nodes and node_id are not mandatory for the MPI channel, so mpi_test can be started without command line arguments.

If an MPI channel is used, the underlying transport (TCP, InfiniBand, shared memory, etc.) is determined by MPI tools. Usually the transport is selected by a command line option for mpirun or by an environment variable, but this depends on the MPI implementation. Often MPI automatically determines the "best" transport. For example, if multiple processes run on a single physical host, MPI will use IPC (shared memory). If the application runs on different hosts that have InfiniBand, MPI will use the IB transport; otherwise (without InfiniBand) it will use TCP.

The limitation of MPI is that it uses a static process model. There is no convenient and standard way to handle a node's failure or to dynamically connect new nodes. MPI implementations may provide tools or APIs for network-level fault tolerance, migration, etc., but these features are not covered by the standard, so they depend heavily on the MPI library in use.