To use an MPI channel in a Cluster application, the application must be linked with the mcoclmpi library (instead of mcocltcp). The MPI standard requires that the first MPI call in an MPI program be MPI_Init() or MPI_Init_thread(), and the last call be MPI_Finalize(). Another requirement is that MPI_Init() or MPI_Init_thread() and MPI_Finalize() must be called only once per process.
The difference between MPI_Init() and MPI_Init_thread() is that MPI_Init_thread() allows the application to set the desired level of thread support. The Cluster runtime requires the MPI_THREAD_MULTIPLE level, which allows the runtime to call MPI functions from different threads without additional synchronization.
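For illustration, the following is a minimal, Cluster-independent sketch of requesting this level and checking what the MPI library actually provides (the error handling shown here is illustrative, not part of the Cluster API):

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int provided;

        /* request the thread support level required by the Cluster runtime */
        MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        if (provided < MPI_THREAD_MULTIPLE) {
            /* the MPI library cannot provide MPI_THREAD_MULTIPLE */
            fprintf (stderr, "MPI_THREAD_MULTIPLE is not supported by this MPI library\n");
            MPI_Finalize ();
            return 1;
        }

        /* ... open the cluster database and run the application ... */

        MPI_Finalize ();
        return 0;
    }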
In view of the above, there are two ways to initialize MPI in cluster applications: explicitly or implicitly. In the first case, the application explicitly calls MPI_Init_thread() / MPI_Finalize() and passes to mco_cluster_db_open() an opaque MPI object called a communicator. The communicator defines the communication context; in other words, it defines which processes are involved in communications.

The MPI runtime automatically creates the communicator MPI_COMM_WORLD, which includes all processes. There is also an MPI API that allows the creation of derived communicators. The communicator is passed to mco_cluster_db_open() via the communicator field in the mco_clnw_mpi_params_t structure. If this field is not set (or is NULL), MPI_COMM_WORLD is used.
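As an illustration (not taken from the product samples), a derived communicator created with the standard MPI_Comm_dup() call could be passed in the same way; the cast mirrors the one used with MPI_COMM_WORLD in the sample further below:

    MPI_Comm cluster_comm;

    /* derive a communicator from MPI_COMM_WORLD; MPI_Comm_split() or
       MPI_Comm_create() could be used in the same manner */
    MPI_Comm_dup (MPI_COMM_WORLD, &cluster_comm);

    mco_cluster_params_init (&cl_params);
    cl_params.nw.mpi.communicator = (void *)cluster_comm;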
In the second case (implicit MPI initialization), MPI_Init_thread() and MPI_Finalize() are called automatically inside the cluster runtime and the MPI_COMM_WORLD communicator is used. Implicit initialization is a little easier for developers, and the application can still be linked with mcoclmpi, or with mcocltcp if the MPI channel is not used, without any changes.
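For comparison, a rough sketch of implicit initialization (same structure as the explicit sample below, minus the MPI calls and the communicator assignment) might look like this:

    int main (int argc, char *argv[])
    {
        mco_cluster_params_t cl_params;
        mco_db_params_t      db_params;
        mco_db_h             db;
        mco_device_t         devs [N];

        /* no MPI_Init_thread() here: the cluster runtime initializes MPI
           itself and uses the MPI_COMM_WORLD communicator */
        mco_cluster_init ();
        mco_cluster_params_init (&cl_params);

        <initialize db_params and devs>

        mco_cluster_db_open (dbName, cldb_get_dictionary (), devs, n_dev, &db_params, &cl_params);
        mco_db_connect (dbName, &db);
        ...
        mco_cluster_stop (db);
        mco_db_disconnect (db);
        mco_db_close (dbName);

        /* MPI_Finalize() is also called by the cluster runtime */
        return 0;
    }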
Explicit initialization allows the user to define the lifetime of MPI, for example when MPI is also used for custom communications unrelated to eXtremeDB Cluster. The following sample illustrates explicit MPI initialization:

    int main (int argc, char *argv[])
    {
        mco_cluster_params_t cl_params;
        mco_db_params_t      db_params;
        mco_db_h             db;
        mco_device_t         devs [N];
        int                  provided;

        MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        mco_cluster_init ();
        mco_cluster_params_init (&cl_params);
        cl_params.nw.mpi.communicator = (void *)MPI_COMM_WORLD;

        <initialize db_params and devs>

        mco_cluster_db_open (dbName, cldb_get_dictionary (), devs, n_dev, &db_params, &cl_params);

        /* start listener threads */
        sample_start_connected_task (&listener_task, ClusterListener, dbName, 0);

        mco_db_connect (dbName, &db);
        ...
        mco_cluster_stop (db);
        sample_join_task (&listener_task);

        mco_db_disconnect (db);
        mco_db_close (dbName);

        MPI_Finalize ();
        return 0;
    }
This is very similar to a normal Cluster application using TCP, with the exception of the MPI calls and setting the communicator parameter:
cl_params.nw.mpi.communicator = (void *)MPI_COMM_WORLD;

Some fields of the mco_cluster_params_t structure are not mandatory for the MPI channel:
uint2 n_nodes - the total number of nodes in the cluster. MPI provides the function MPI_Comm_size(), which returns the number of processes in the communicator, so this Cluster parameter is optional. If n_nodes is specified and is not the same as the MPI_Comm_size() value, the error code MCO_E_NW_INVALID_PARAMETER is returned.
uint2 node_id - the ID of the node. MPI provides the function MPI_Comm_rank(), which returns the process ID (rank) in the communicator, so this Cluster parameter is also optional. If it is specified and is not the same as the MPI_Comm_rank() value, MCO_E_NW_INVALID_PARAMETER is returned. (A sketch of filling both fields from MPI follows this list.)
mco_cluster_node_params_t * nodes - the node list. This list is also optional for MPI. It can still be used to specify the names of the processes (the addr field in the mco_cluster_node_params_t structure).
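If an application nevertheless fills these fields (for instance, to share setup code with the TCP channel), one safe approach is a sketch like the following, which takes the values from MPI itself so they always match what the channel expects; cl_params is assumed to be the same mco_cluster_params_t variable as in the samples above:

    int n_procs, rank;

    /* query the size and rank of the communicator used by the cluster */
    MPI_Comm_size (MPI_COMM_WORLD, &n_procs);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    /* values taken from MPI cannot trigger MCO_E_NW_INVALID_PARAMETER */
    cl_params.n_nodes = (uint2)n_procs;
    cl_params.node_id = (uint2)rank;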
Also, the MPI channel currently doesn't use the check_quorum_func and check_quorum_param fields, since MPI doesn't allow dynamic connection / disconnection of processes.
To compile applications that explicitly call MPI functions (and link with the mcoclmpi library), the wrapper scripts mpicc (for C) and mpicxx (for C++) must be used. These scripts, which are part of the MPI package, call gcc/g++ or another standard compiler and know how to build MPI applications, specifying the proper include / library paths, macro definitions and other compiler options.

For UNIX-based systems, specifying PRJ_F_CLUSTER = MPI in the makefile will cause header.mak to automatically set CC = mpicc and CXX = mpicxx, and link the application with the mcoclmpi library.
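For instance, a manual build outside of the supplied makefiles might look roughly like the following; the include and library directories are placeholders for the actual eXtremeDB installation paths, and additional eXtremeDB runtime libraries may need to be listed depending on the application:

    mpicc -o mpi_test mpi_test.c -I<eXtremeDB include dir> -L<eXtremeDB lib dir> -lmcoclmpi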
To start MPI applications, the mpirun or mpiexec commands are used. These are also part of the MPI package. The command format is:

    mpirun <mpirun arguments> <program> <program arguments>

The most important mpirun arguments are:

-n <nprocs> - the number of processes (nodes) in the cluster

-machinefile <filename> - the name of a text file that contains the list of nodes, one node per line
For example, to run the application mpi_test on nodes nodeA and nodeB, create the file nodes with the lines:

    nodeA
    nodeB

Then execute the command:

    mpirun -n 2 -machinefile ./nodes ./mpi_test
As mentioned above, n_nodes and node_id are not mandatory for the MPI channel, so mpi_test can be started without command line arguments.

If an MPI channel is used, then the underlying transport (TCP, InfiniBand, shared memory, etc.) is determined by MPI tools. Usually this is a command line option for mpirun or an environment variable, but it depends on the MPI implementation. Often MPI automatically determines the "best" transport. For example, if multiple processes run on a single physical host, MPI will use IPC (shared memory). If the application runs on different hosts that have InfiniBand, MPI will use the IB transport; otherwise (without InfiniBand) it will use TCP.

The limitation of MPI is that it uses a static process model. There is no convenient and standard way to handle a node's failure or to dynamically connect new nodes. MPI implementations may provide tools or APIs for network-level fault tolerance, migration, etc., but these features are not covered by the standard, so they depend heavily on the MPI library in use.