Jabir Hussain
MPI-03: Collective Communications
Conceptual Focus
Point-to-point messaging (MPI_Send/MPI_Recv) is flexible but verbose and easier to get wrong (ordering, deadlocks, complexity). MPI therefore provides collective operations that implement common communication patterns efficiently and portably.
Collectives are:
- Called by all ranks in a communicator (typically MPI_COMM_WORLD) with compatible arguments.
- Implemented by the MPI library in a way that is often topology- and hardware-optimised.
- A mix of data movement and often implicit synchronisation (because everyone must participate).
Why collectives matter (I/O use case)
In many HPC environments, you should assume rank 0 is the only rank that reliably reads from standard input. If every rank tries to read, behaviour is implementation-dependent and can become chaotic. Hence: rank 0 reads, then broadcasts configuration/inputs to everyone else.
1) Basic Collective Communications
1.1 Broadcast: MPI_Bcast
Goal: Copy a buffer from one “root” rank to all ranks.
MPI_Bcast(a, n, MPI_FLOAT, root, MPI_COMM_WORLD);
Semantics
- Every rank calls MPI_Bcast with the same n, datatype, root, and communicator.
- On root: buffer a is the source.
- On non-root ranks: buffer a is overwritten with root's data.
- Blocking: returns when a is safe to use on that rank.
Mental model: rank 1 holds [4,2] and after MPI_Bcast(... root=1 ...) all ranks have [4,2].
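A minimal sketch of the rank-0-reads-then-broadcasts pattern described above (compile with mpicc, run with e.g. mpirun -np 4 ./a.out); the variable name n_global is illustrative, not from the lecture:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, n_global = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Only rank 0 reads from standard input. */
        if (scanf("%d", &n_global) != 1)
            n_global = 0;
    }

    /* Every rank calls MPI_Bcast; on return, all ranks hold rank 0's value. */
    MPI_Bcast(&n_global, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d: n_global = %d\n", rank, n_global);

    MPI_Finalize();
    return 0;
}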
1.2 Reduce: MPI_Reduce
Goal: Combine per-rank buffers into one result on the root, using an associative operator.
MPI_Reduce(a, a_sum, n, MPI_FLOAT, MPI_SUM, root, MPI_COMM_WORLD);
Semantics
- Each rank provides input buffer a (length n).
- Root receives output buffer a_sum (length n).
- Combination is elementwise across ranks (e.g. sums per element).
- Not allowed: a and a_sum being the same variable.
- Blocking: completes once the rank's input has been contributed; root has the output.
Operations shown: MPI_SUM, also MPI_PROD, MPI_MAX, MPI_MIN, logical ops, and location-aware variants MPI_MAXLOC/MPI_MINLOC.
Mental model: four ranks each contribute 2 ints; root gets the elementwise sum [16,20].
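A minimal sketch, with per-rank values (2*rank + 1, 2*rank + 2) chosen illustratively so that 4 ranks reproduce the [16, 20] picture above:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 2;
    float a[2], a_sum[2] = {0.0f, 0.0f};   /* input and output must be distinct buffers */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a[0] = (float)(2 * rank + 1);
    a[1] = (float)(2 * rank + 2);

    /* Elementwise sum across ranks; only rank 0 receives the result. */
    MPI_Reduce(a, a_sum, n, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("root: a_sum = [%g, %g]\n", a_sum[0], a_sum[1]);

    MPI_Finalize();
    return 0;
}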
1.3 Allreduce: MPI_Allreduce
Goal: Like Reduce, but every rank receives the reduced result.
MPI_Allreduce(a, a_sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
Semantics
- Functionally similar to MPI_Reduce followed by MPI_Bcast, but implementations can do better than naïve composition.
- Blocking: every rank returns with a_sum populated.
Mental model: every rank ends up with the same [16,20].
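The same sketch with MPI_Allreduce, showing that only the call changes while the result now appears on every rank (useful for the Reduce-vs-Allreduce comparison in the checklist at the end); values are illustrative as before:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    const int n = 2;
    float a[2], a_sum[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a[0] = (float)(2 * rank + 1);
    a[1] = (float)(2 * rank + 2);

    /* Same elementwise sum as MPI_Reduce, but no root: every rank gets a_sum. */
    MPI_Allreduce(a, a_sum, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: a_sum = [%g, %g]\n", rank, a_sum[0], a_sum[1]);

    MPI_Finalize();
    return 0;
}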
2) More Advanced Collective Communications
2.1 Gather: MPI_Gather
Goal: Collect fixed-size blocks from all ranks onto the root, concatenated in rank order.
MPI_Gather(a, n, MPI_FLOAT,
           b, n, MPI_FLOAT,
           root, MPI_COMM_WORLD);
Semantics
- Each rank sends n elements from a.
- Root receives n * P elements in b, laid out as: b[0..n-1] from rank 0, b[n..2n-1] from rank 1, etc.
- Root must allocate b with capacity ≥ n*P.
Mental model: with n=2, P=4, root ends up with an 8-element array formed by concatenating each rank’s 2-element a.
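A minimal sketch of the n=2 layout above; the per-rank values (2*rank, 2*rank + 1) are illustrative:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, P;
    const int n = 2;
    float a[2], *b = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Each rank fills its own 2-element block. */
    a[0] = (float)(2 * rank);
    a[1] = (float)(2 * rank + 1);

    /* Only the root needs the n*P receive buffer. */
    if (rank == 0)
        b = malloc((size_t)n * P * sizeof(float));

    MPI_Gather(a, n, MPI_FLOAT, b, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < n * P; i++)
            printf("b[%d] = %g\n", i, b[i]);   /* concatenated in rank order */
        free(b);
    }

    MPI_Finalize();
    return 0;
}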
2.2 Allgather: MPI_Allgather
Goal: Like Gather, but the gathered buffer is replicated on all ranks.
MPI_Allgather(a, n, MPI_FLOAT,
              b, n, MPI_FLOAT,
              MPI_COMM_WORLD);
Semantics
- Equivalent “in effect” to MPI_Gather followed by MPI_Bcast, but optimised.
- Every rank allocates b of size n*P.
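A minimal sketch, using the same illustrative per-rank values as the Gather sketch; the only changes are that every rank allocates b and there is no root argument:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, P;
    const int n = 2;
    float a[2], *b;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    a[0] = (float)(2 * rank);
    a[1] = (float)(2 * rank + 1);

    /* Every rank allocates the full n*P receive buffer. */
    b = malloc((size_t)n * P * sizeof(float));

    MPI_Allgather(a, n, MPI_FLOAT, b, n, MPI_FLOAT, MPI_COMM_WORLD);

    /* Every rank now holds the same concatenation, in rank order. */
    printf("rank %d: b[0] = %g, b[last] = %g\n", rank, b[0], b[n * P - 1]);

    free(b);
    MPI_Finalize();
    return 0;
}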
2.3 Scatter: MPI_Scatter
Goal: Root splits a large buffer into rank-ordered chunks and distributes them.
MPI_Scatter(a, n, MPI_FLOAT,
            b, n, MPI_FLOAT,
            root, MPI_COMM_WORLD);
Semantics
- Root has a with at least n*P elements.
- Rank r receives chunk a[r*n : (r+1)*n] into b.
- Non-root ranks' a argument is not read (it is significant only at the root), so a common convention is to pass NULL on non-root ranks.
Mental model: an 8-element array on root is split into 4 chunks of length 2, one per rank.
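A minimal sketch of the 8-element split above (with 4 ranks); filling a with 0..n*P-1 on the root is illustrative:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, P;
    const int n = 2;
    float *a = NULL, b[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Only the root fills the full n*P send buffer; other ranks pass NULL. */
    if (rank == 0) {
        a = malloc((size_t)n * P * sizeof(float));
        for (int i = 0; i < n * P; i++)
            a[i] = (float)i;
    }

    /* Rank r receives a[r*n .. (r+1)*n - 1] into b. */
    MPI_Scatter(a, n, MPI_FLOAT, b, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    printf("rank %d: b = [%g, %g]\n", rank, b[0], b[1]);

    if (rank == 0)
        free(a);
    MPI_Finalize();
    return 0;
}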
2.4 All-to-all: MPI_Alltoall
Goal: Every rank sends a distinct chunk to every other rank (full exchange).
MPI_Alltoall(a, n, MPI_FLOAT,
             b, n, MPI_FLOAT,
             MPI_COMM_WORLD);
Semantics
- Each rank’s send buffer a is logically partitioned into P chunks of size n.
- Chunk i is sent to rank i.
- Receive buffer b stores the chunk received from each rank, in rank order.
- Both a and b need capacity ≥ n*P.
Mental model: with n=1, P=4, each rank contributes one element to each other rank; each rank’s b ends up containing 4 values—one from each sender.
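A minimal sketch of the n=1 exchange above; the value 10*rank + i is illustrative, chosen so that both the sender and the intended receiver are visible in each element:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, P;
    const int n = 1;
    float *a, *b;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Both buffers need n*P elements on every rank. */
    a = malloc((size_t)n * P * sizeof(float));
    b = malloc((size_t)n * P * sizeof(float));

    /* Chunk i of a (a single element here) is destined for rank i. */
    for (int i = 0; i < P; i++)
        a[i] = (float)(10 * rank + i);

    MPI_Alltoall(a, n, MPI_FLOAT, b, n, MPI_FLOAT, MPI_COMM_WORLD);

    /* b[i] now holds the element that rank i sent to this rank (10*i + rank). */
    for (int i = 0; i < P; i++)
        printf("rank %d: b[%d] = %g\n", rank, i, b[i]);

    free(a);
    free(b);
    MPI_Finalize();
    return 0;
}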
3) Collective vs Point-to-Point: Ordering Rules
3.1 Point-to-point ordering is not enough
MPI guarantees that point-to-point messages between the same sender and receiver arrive in the order they were sent, but the receiver can still choose to receive them in a different order by specifying different tags, provided buffering allows it. The lecture’s example shows why this can be “unsafe”: if a send behaves synchronously (no buffering), waiting for the out-of-order receive risks deadlock.
3.2 Collective calls must match in the same order
Collectives have no tags. Therefore the sequence of collective calls is matched strictly by call order.
The lecture gives an explicit “wrong program” pattern: rank 0 broadcasts a then b, rank 1 calls in the same order, but another rank calls b then a. Result: that rank receives swapped values (or you trigger undefined behaviour depending on the collective and implementation). The “Before/After” table on the last slide shows rank 2 ending with (a,b)=(10,5) rather than (5,10) due purely to mismatched call order.
Rule: every rank must execute collectives in the same order on the same communicator, with argument compatibility.
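A minimal sketch of the mismatched-order bug described above, using the (5, 10) values from the lecture's Before/After table; because the counts and types still line up, the program runs but rank 2 silently ends with swapped values:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    int a = 0, b = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        a = 5;
        b = 10;
    }

    if (rank != 2) {
        /* Intended order: broadcast a first, then b. */
        MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&b, 1, MPI_INT, 0, MPI_COMM_WORLD);
    } else {
        /* BUG: rank 2 swaps the calls. Collectives match purely by call order,
           so rank 2's b receives root's a (5) and its a receives root's b (10). */
        MPI_Bcast(&b, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&a, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    printf("rank %d: a = %d, b = %d\n", rank, a, b);

    MPI_Finalize();
    return 0;
}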
Summary
- Collectives are standard communication patterns implemented efficiently by MPI libraries.
- Core operations:
  - MPI_Bcast: root → all (replicate data).
  - MPI_Reduce: all → root (combine via operator).
  - MPI_Allreduce: all → all (combine + replicate).
  - MPI_Gather / MPI_Allgather: concatenate rank blocks to root / to all.
  - MPI_Scatter: root splits and distributes blocks.
  - MPI_Alltoall: full exchange of fixed-size blocks.
- Ordering constraint: collectives must be called in the same order everywhere; there are no tags to “disambiguate.”
PX457 Implementation Checklist (C)
- Rank 0 reads N and broadcasts it with MPI_Bcast.
- Each rank computes a local sum; combine with MPI_Reduce and MPI_Allreduce and compare.
- Create a distributed array with MPI_Scatter, locally modify, then MPI_Gather to reconstruct on root.
- Demonstrate MPI_Alltoall by transposing a block-partitioned vector/matrix layout.
- Add a deliberate collective-order bug (swap two MPI_Bcast calls on one rank) and explain the observed mismatch, referencing the lecture's example.