Jabir Hussain
MPI-04: MPI Communicators
Conceptual Focus
MPI scales by domain decomposition: each process owns a subdomain of the global problem, and exchanges boundary (“halo”) data with neighbours. The lecture’s 2D periodic grid sketch shows 4 MPI tasks laid out in a (2×2) grid; each task needs “up/down/left/right” halo values from adjacent ranks.
To implement this cleanly you need answers to:
- Where am I in the logical processor grid?
- Which ranks are my neighbours?
- How many processes exist along each dimension?
- How do I structure communication so only the relevant processes participate?
MPI’s solution is communicators, plus optional topologies attached to them.
1) Communicators (what they are and why you care)
1.1 Definitions
- A communicator is a context in which a set of processes can communicate.
- Default is MPI_COMM_WORLD (all processes).
- In C, communicator handles have type MPI_Comm and are opaque (you don't inspect internals; you pass them to MPI calls).
Key operational implication:
- Point-to-point and collective operations only occur within a communicator.
- Creating sub-communicators lets you structure complex codes (e.g., row-wise collectives, neighbour exchange groups) without global coordination overhead.
1.2 Splitting a communicator: MPI_Comm_split
Use case: arrange (P=N²) ranks into an (N×N) logical grid, then create per-row communicators.
MPI_Comm_split(existing_comm, split_id, split_key, &split_comm);
Arguments (lecture’s terminology):
- existing_comm: communicator being split (e.g. MPI_COMM_WORLD)
- split_id: "color" (which subgroup this rank belongs to)
- split_key: ordering within the new communicator
- split_comm: output communicator handle for this rank
Row communicator example:
int world_rank, P;
MPI_Comm row_comm;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &P);
int N = (int)(sqrt((double)P)); // assuming P is a perfect square (sqrt needs <math.h>)
int row_id = world_rank / N; // which row in an N×N grid
MPI_Comm_split(MPI_COMM_WORLD, row_id, world_rank, &row_comm);
/* ... communicate using row_comm ... */
MPI_Comm_free(&row_comm);
The mapping table shows: ranks with the same row_id are grouped, and using world_rank as the key preserves the intuitive left-to-right ordering inside each row.
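A quick way to verify this mapping is to compare each rank's world rank with its rank inside row_comm. A minimal sketch continuing the snippet above (assuming <stdio.h> is included):
int row_rank, row_size;
MPI_Comm_rank(row_comm, &row_rank);   /* position within this row (0..N-1) */
MPI_Comm_size(row_comm, &row_size);   /* should equal N */
printf("world %d -> row %d, local rank %d of %d\n",
       world_rank, row_id, row_rank, row_size);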
1.3 Splitting by hardware locality: MPI_Comm_split_type
Motivation: processes on the same node share physical memory; intra-node MPI can be much faster than inter-node communication. Creating “shared-memory node” communicators enables hybrid designs (shared-memory optimisations, per-node leaders, etc.).
MPI_Comm_split_type(MPI_COMM_WORLD,
                    MPI_COMM_TYPE_SHARED,   /* group ranks that share memory (one comm per node) */
                    split_key,              /* ordering key, e.g. the world rank */
                    MPI_INFO_NULL,
                    &node_comm);
Notes:
- Must be called by all ranks in MPI_COMM_WORLD.
- Some implementations offer vendor-specific split types (e.g., NUMA-aware).
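A minimal sketch of how such a node communicator might be used, assuming the convention (not mandated by MPI) that node-local rank 0 acts as the per-node leader:
int world_rank, node_rank;
MPI_Comm node_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                    world_rank, MPI_INFO_NULL, &node_comm);

MPI_Comm_rank(node_comm, &node_rank);   /* rank within this shared-memory node */
if (node_rank == 0) {
    /* per-node leader: e.g. node-level I/O or inter-node coordination */
}
MPI_Comm_free(&node_comm);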
2) Groups and Communicators
2.1 What is a group?
A group is an ordered set of processes (still opaque). A communicator is effectively:
communicator = group + communication context.
Groups allow nontrivial selections like “processes 2, 4, 8, 13 but not 6”.
2.2 Create a group from MPI_COMM_WORLD
MPI_Group world_grp;
MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
2.3 Include specific ranks: MPI_Group_incl
The example in the lecture uses a "column" subset:
int column[] = {1, 5, 9, 13};
MPI_Group column_grp;
MPI_Group_incl(world_grp, 4, column, &column_grp);
Then free groups when done:
MPI_Group_free(&column_grp);
MPI_Group_free(&world_grp);
(Freeing matters in long-running codes; MPI_Finalize will also clean up.)
2.4 Turn a group into a communicator: MPI_Comm_create
MPI_Comm column_comm;
MPI_Comm_create(MPI_COMM_WORLD, column_grp, &column_comm);
Important behaviour: ranks not in the group get MPI_COMM_NULL.
So you must guard communicator usage:
if (column_comm != MPI_COMM_NULL) {
/* safe to call collectives/pt2pt on column_comm */
MPI_Comm_free(&column_comm);
}
The lecture's example broadcasts within column_comm, which in world-rank terms is a broadcast from rank 1 to ranks 5, 9, and 13; a sketch follows.
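Putting Section 2 together, a sketch of that guarded broadcast, assuming at least 14 world ranks and a hypothetical integer payload:
int world_rank, value = 0;
int column[] = {1, 5, 9, 13};
MPI_Group world_grp, column_grp;
MPI_Comm column_comm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
if (world_rank == 1) value = 42;                 /* hypothetical payload set on the root */

MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
MPI_Group_incl(world_grp, 4, column, &column_grp);
MPI_Comm_create(MPI_COMM_WORLD, column_grp, &column_comm);

if (column_comm != MPI_COMM_NULL) {
    /* world rank 1 is rank 0 in column_comm because group order is preserved */
    MPI_Bcast(&value, 1, MPI_INT, 0, column_comm);
    MPI_Comm_free(&column_comm);
}
MPI_Group_free(&column_grp);
MPI_Group_free(&world_grp);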
3) Topologies
3.1 What a topology gives you
A topology attaches a logical addressing scheme to a communicator so ranks can answer:
- “What are my coordinates?”
- "Who are my neighbours?"
The topology is virtual (a logical layout, not necessarily the physical interconnect).
This lecture focuses on Cartesian (grid) topologies.
3.2 Create a 2D Cartesian topology: MPI_Cart_create
For N=4, P=16:
int dims[2] = {4, 4};
int wrap[2] = {1, 1}; // periodic in both dimensions
int reorder = 1; // MPI may reorder ranks for performance
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, wrap, reorder, &cart_comm);
Meaning:
- dims[d]: number of processes along dimension d.
- wrap[d] = 1: periodic boundaries (torus); 0: non-periodic.
- reorder = 1: MPI may remap ranks in cart_comm to better match the hardware.
3.3 Rank ↔ coordinates: MPI_Cart_coords and MPI_Cart_rank
int cart_rank, coords[2];
MPI_Comm_rank(cart_comm, &cart_rank);
MPI_Cart_coords(cart_comm, cart_rank, 2, coords);
Inverse:
MPI_Cart_rank(cart_comm, coords, &cart_rank);
The lecture notes that ranks in cart_comm are row-major in the conceptual grid.
3.4 Neighbours: MPI_Cart_shift
This is the workhorse for halo exchange patterns.
int src, dst;
MPI_Cart_shift(cart_comm, dir, 1, &src, &dst);
- dir: which dimension (0..ndims-1)
- shift = 1: immediate neighbour
- returns the neighbour ranks src (negative direction) and dst (positive direction)
Examples in the lecture:
- For process 9 in a (4×4) grid, shifting in dir=0 by 1 gives src=5, dst=13.
- Shifting in dir=1 by 1 gives src=8, dst=10.
- Negative shifts behave as expected (they swap the two directions).
- On a boundary with non-periodic wrap, a neighbour may be MPI_PROC_NULL. Sending to / receiving from MPI_PROC_NULL is defined as a no-op, which simplifies edge handling.
Practical pairing: MPI_Cart_shift is commonly used to set up safe neighbour exchanges via MPI_Sendrecv (from Lecture 11); a minimal example is sketched below.
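The sketch below assumes cart_comm from above and exchanges a single double per halo in dimension 0; real codes exchange whole boundary strips, often with derived datatypes:
int src, dst;
double send_up = 1.0;      /* hypothetical boundary value for this rank */
double recv_down = 0.0;    /* halo value arriving from the negative neighbour */

MPI_Cart_shift(cart_comm, 0, 1, &src, &dst);

/* Every rank sends "up" (positive direction) and receives from "below";
   with non-periodic boundaries an MPI_PROC_NULL neighbour makes this a no-op. */
MPI_Sendrecv(&send_up,   1, MPI_DOUBLE, dst, 0,
             &recv_down, 1, MPI_DOUBLE, src, 0,
             cart_comm, MPI_STATUS_IGNORE);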
3.5 Grid slicing: MPI_Cart_sub
This is the Cartesian-topology analogue of MPI_Comm_split, used to build row/column communicators for collectives.
int remain_dims[2] = {0, 1}; // keep dim 1 (columns vary) => row slices
MPI_Comm row_comm;
MPI_Cart_sub(cart_comm, remain_dims, &row_comm);
int remain_dims2[2] = {1, 0}; // keep dim 0 => column slices
MPI_Comm col_comm;
MPI_Cart_sub(cart_comm, remain_dims2, &col_comm);
The diagram/table shows how cart_comm ranks map into row_comm (each row has ranks 0..3 locally) and col_comm (each column has ranks 0..3 locally).
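For example, a per-row sum over row_comm, as a sketch continuing the MPI_Cart_sub calls above with a hypothetical per-rank contribution:
double local = 1.0, row_sum = 0.0;   /* hypothetical per-rank contribution */

/* Only the N ranks sharing this row_comm participate; each row reduces independently. */
MPI_Allreduce(&local, &row_sum, 1, MPI_DOUBLE, MPI_SUM, row_comm);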
4) Architecture of Collective Communications
4.1 Why discuss it?
MPI collectives are “built-in and optimised”, but understanding typical implementations helps you reason about:
- latency vs bandwidth regimes,
- algorithmic step counts,
- and when you may need custom communication patterns.
4.2 Broadcast / reduce structures
The lecture sketches three patterns:
- Simple loop (root sends to each rank): (P-1) steps
- Ring pass: still (P-1) steps, but structured
- Tree/graph: (log₂ P) steps (a hand-rolled sketch follows)
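A hand-rolled binomial-tree broadcast illustrating the (log₂ P) step count; this is a sketch of the pattern, not how MPI_Bcast is actually implemented, and assumes the data originates on rank 0:
#include <mpi.h>

/* In round k (step = 2^k), every rank below `step` already holds the data
   and forwards it to rank + step, doubling the informed set each round. */
void tree_bcast_int(int *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int step = 1; step < size; step *= 2) {
        if (rank < step) {
            if (rank + step < size)
                MPI_Send(buf, count, MPI_INT, rank + step, 0, comm);
        } else if (rank < 2 * step) {
            MPI_Recv(buf, count, MPI_INT, rank - step, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}
The loop runs ⌈log₂ P⌉ rounds, matching the tree step count above.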
4.3 Ring-pass allreduce (chunked, pipelined)
A ring can implement allreduce by splitting data into (P) chunks to pipeline communication.
Mechanics:
- Each rank knows its ring neighbours via MPI_Cart_shift(cart_comm, dir, 1, &src, &dst).
- Over (P-1) rounds, chunks circulate and partial reductions accumulate.
- Another (P-1) rounds then distribute the fully reduced chunks, giving a total of 2(P-1) passes for the full allreduce (sketched after the key idea below).
Key idea:
- Good for large data because it pipelines bandwidth and tends to use “nearby” links in the logical topology.
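A sketch of this chunked ring allreduce for summing doubles, assuming n is divisible by P and that left/right are the negative/positive ring neighbours (e.g. from MPI_Cart_shift on a 1D periodic topology); this is the textbook algorithm, not the library's MPI_Allreduce:
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void ring_allreduce_sum(double *data, int n, int left, int right, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);

    int chunk = n / P;
    double *tmp = malloc(chunk * sizeof(double));

    /* Phase 1: reduce-scatter. After P-1 rounds, rank r holds the fully
       reduced chunk (r + 1) % P. */
    for (int t = 0; t < P - 1; t++) {
        int send_idx = (rank - t + P) % P;
        int recv_idx = (rank - t - 1 + P) % P;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     tmp,                     chunk, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < chunk; i++)
            data[recv_idx * chunk + i] += tmp[i];   /* accumulate partial sums */
    }

    /* Phase 2: allgather. Another P-1 rounds circulate the reduced chunks. */
    for (int t = 0; t < P - 1; t++) {
        int send_idx = (rank + 1 - t + P) % P;
        int recv_idx = (rank - t + P) % P;
        MPI_Sendrecv(data + send_idx * chunk, chunk, MPI_DOUBLE, right, 0,
                     tmp,                     chunk, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        memcpy(data + recv_idx * chunk, tmp, chunk * sizeof(double));
    }
    free(tmp);
}
This gives the 2(P-1) passes counted above, with each pass moving only n/P elements.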
4.4 Butterfly (hypercube) allreduce
For P = 2^N processes, the butterfly reduces a scalar in N steps (and arrays in 2N steps, as per the lecture) by having each rank exchange with a partner whose rank differs in exactly one bit at each stage (a sketch follows below).
The hypercube mapping slide explicitly motivates converting ranks to binary so pairing is “flip bit (k)”.
Key idea:
- Fewer rounds than ring → good when latency dominates (small messages, many ranks).
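A recursive-doubling sketch of this pattern for a scalar sum, assuming the communicator size is an exact power of two; the function name butterfly_allreduce_sum is illustrative, not a library routine:
#include <mpi.h>

/* Each stage flips one bit of the rank to find the partner; after log2(P)
   exchanges every rank holds the global sum. */
double butterfly_allreduce_sum(double local, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);          /* assumed to be an exact power of two */

    double acc = local, recv;
    for (int mask = 1; mask < P; mask <<= 1) {
        int partner = rank ^ mask;    /* flip bit k of the rank */
        MPI_Sendrecv(&acc,  1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        acc += recv;                  /* combine partial sums each stage */
    }
    return acc;
}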
4.5 Latency vs bandwidth trade-off
The final slide states the practical performance heuristic:
- Butterfly: fewer messages/rounds → best for small data (minimise latency).
- Ring-pass: more rounds but predictable/pipelined transfers → best for large data (maximise bandwidth).
MPI libraries typically choose the algorithm internally at runtime for MPI_Allreduce, based on message size and process count.
Summary
- Communicators define who can communicate; they structure complex MPI programs beyond MPI_COMM_WORLD.
- MPI_Comm_split partitions a communicator using a color/key mechanism; MPI_Comm_split_type can split by hardware locality (e.g., shared-memory nodes).
- Groups allow arbitrary subsets; to communicate you must create a communicator, and excluded ranks receive MPI_COMM_NULL (must guard).
- Cartesian topologies (MPI_Cart_create) let you compute coordinates and neighbours cleanly (MPI_Cart_shift) and slice communicators for row/column collectives (MPI_Cart_sub).
- Collective operations can be implemented with loop, ring, or tree/butterfly patterns; performance depends on the latency vs bandwidth regime.
PX457 Implementation Checklist (C)
- Assume (P=N²). Build cart_comm with periodic boundaries and query each rank's (i, j) coordinates using MPI_Cart_coords.
- Use MPI_Cart_shift to compute the four neighbours; implement halo exchange with MPI_Sendrecv and handle edges via MPI_PROC_NULL if non-periodic.
- Use MPI_Cart_sub to create row_comm and run MPI_Allreduce per row; verify that only the row's ranks participate.
- Recreate the "row communicator" using MPI_Comm_split and confirm it matches the MPI_Cart_sub result.
- Micro-benchmark MPI_Allreduce on small vs large buffers; interpret the results using the lecture's butterfly-vs-ring latency/bandwidth discussion.