Jabir Hussain
OpenMP 04: Loop Scheduling and Parallel Sections
Conceptual Focus
The previous lectures dealt with correctness (data scope & synchronisation).
Now, the emphasis moves to performance tuning — how iterations are assigned to threads and how we can parallelise different tasks (not just data loops).
Two dimensions:
- Loop scheduling → improving load balance for irregular workloads.
- Parallel sections → exploiting functional rather than data parallelism.
Loop Scheduling and Load Balancing
Motivation
Even if each thread gets the same number of iterations, the work per iteration may differ, causing load imbalance and idle threads.
OpenMP exposes several scheduling policies via the schedule clause.
schedule(static[, chunk])
- Iteration assignment is fixed before the loop executes (no runtime negotiation).
- Default mode on most implementations.
- Divides the iteration space into (near-)equal-sized chunks.
```c
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; ++i) work(i);
```
| Case | Meaning |
|---|---|
| schedule(static) | each thread gets ~N/P consecutive iterations |
| schedule(static, C) | iterations split into chunks of size C, assigned round-robin |
Lowest overhead; good for uniform workloads.
Poor load balance if iteration cost varies.
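To see the assignment pattern directly, here is a minimal sketch (the iteration count 16 is arbitrary) that prints which thread executes each iteration; swapping in schedule(static, 2) shows the round-robin pattern:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    // With plain static, each thread receives one contiguous block.
    // Change to schedule(static, 2) to see chunks dealt out round-robin.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 16; ++i)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}
```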
schedule(dynamic[, chunk])
- Iterations assigned at runtime.
- When a thread finishes a chunk, it grabs the next available one.
- Reduces idle time for uneven workloads.
```c
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i) heavy_work(i);
```
Balances irregular work (e.g. variable loop cost).
Higher runtime overhead (contention on the work queue).
schedule(guided[, chunk])
- Hybrid: large chunks first, smaller later.
- Each thread starts with a big block; chunk sizes shrink (roughly geometrically) as the loop drains, but never below chunk.
```c
#pragma omp parallel for schedule(guided, 8)
for (int i = 0; i < N; ++i) heavy_work(i);
```
Useful when early iterations dominate cost.
Less predictable assignment pattern; the chunk parameter is easy to over-tune.
schedule(runtime)
Schedule determined at runtime via environment variable or API.
```bash
export OMP_SCHEDULE="static,4"
```
or in code:
```c
omp_set_schedule(omp_sched_static, 4);
omp_get_schedule(&kind, &chunk);
```
Excellent for benchmarking and tuning without recompiling.
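A self-contained sketch of the pattern (heavy_work is a hypothetical stand-in kernel whose cost grows with i): compile once, then re-run under different OMP_SCHEDULE settings to compare policies.

```c
#include <omp.h>
#include <stdio.h>

// Hypothetical kernel: cost grows with i, so scheduling matters.
double heavy_work(int i) {
    double s = 0.0;
    for (int k = 0; k < i * 100; ++k) s += k * 0.5;
    return s;
}

int main(void) {
    const int N = 2000;
    double total = 0.0;

    double t0 = omp_get_wtime();
    // schedule(runtime) defers the choice to OMP_SCHEDULE / omp_set_schedule.
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (int i = 0; i < N; ++i)
        total += heavy_work(i);

    printf("total = %f, time = %.4f s\n", total, omp_get_wtime() - t0);
    return 0;
}
```

Running e.g. OMP_SCHEDULE="dynamic,16" ./a.out tests a policy without recompiling.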
Load Imbalance Example
Triangular matrix initialisation:
```c
// Loop index i is implicitly private; j, declared inside the loop body,
// is private too, so no explicit private clause is needed (and private(i,j)
// would not compile here, since neither name exists in the enclosing scope).
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    for (int j = i; j < N; ++j)
        A[i][j] = heavy_fn(i, j);
```
- Later i values do less work (the inner loop runs N - i iterations) → threads handling small i dominate the runtime.
| Schedule | Behaviour | Relative Runtime |
|---|---|---|
| static | contiguous ranges | slow |
| static,1 | round-robin | balanced, faster |
| dynamic | adaptive | similar or better, depends on cost distribution |
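For the lab work below, per-thread timings make the imbalance visible. A sketch, assuming A is an allocated N×N matrix and using a cheap stand-in for heavy_fn; the nowait clause lets each thread record its own time before the end-of-loop barrier:

```c
#include <omp.h>
#include <stdio.h>

#define MAX_THREADS 64

void timed_triangular(int N, double **A) {
    double t_thread[MAX_THREADS];  // per-thread elapsed time
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();

        double t0 = omp_get_wtime();
        // nowait: measure work time, not time spent waiting at the barrier.
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < N; ++i)
            for (int j = i; j < N; ++j)
                A[i][j] = (double)(i + j);  // stand-in for heavy_fn(i, j)
        t_thread[tid] = omp_get_wtime() - t0;
    }

    for (int t = 0; t < nthreads; ++t)
        printf("thread %d: %.4f s\n", t, t_thread[t]);
}
```

Swapping schedule(static) for schedule(static,1) or schedule(dynamic) in the same harness reproduces the table above.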
Summary Table
| Clause | When to Use | Overhead | Load Balance |
|---|---|---|---|
| static | Uniform iteration cost | Low | Poor (irregular loops) |
| static, C | Moderate variation | Low | Medium |
| dynamic | Irregular workloads | High | Excellent |
| guided | Work decreases over i | Medium | Good |
| runtime | Tune experimentally | Depends | Variable |
Parallel Sections (Functional Parallelism)
While for loops distribute data, sections distribute independent functions or code blocks.
```c
#pragma omp parallel sections
{
    #pragma omp section
    f1 = fun1();
    #pragma omp section
    f2 = fun2();
    #pragma omp section
    f4 = fun4();
}
f3 = fun3(f1, f2);
```
- Each section runs in parallel (possibly on different threads).
- Remaining threads idle if there are fewer sections than threads.
- Execution order not guaranteed.
Use when different tasks can execute concurrently (I/O + compute + post-processing).
Not for dependent computations (e.g. f3 depends on f1,f2).
Section Clauses
| Clause | Meaning |
|---|---|
| private(x) | separate copy per section |
| firstprivate(x) | copy initialised with the master’s value |
| lastprivate(x) | value from the lexically last section copied back |
| reduction(op:x) | combine results across sections |
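A minimal sketch of reduction over sections (partial_a and partial_b are hypothetical independent sub-computations):

```c
#include <stdio.h>

// Hypothetical independent sub-computations.
double partial_a(void) { return 1.5; }
double partial_b(void) { return 2.5; }

int main(void) {
    double sum = 0.0;

    // Each section accumulates into a private copy of sum;
    // the copies are combined with + when the construct ends.
    #pragma omp parallel sections reduction(+:sum)
    {
        #pragma omp section
        sum += partial_a();
        #pragma omp section
        sum += partial_b();
    }

    printf("sum = %f\n", sum);  // 4.0 regardless of execution order
    return 0;
}
```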
Rules & Gotchas
- All threads must encounter the same work-sharing construct in identical order.
- Each section must be a self-contained block.
- Cannot use branching logic to selectively enter sections or for regions:

```c
// Wrong: work-sharing constructs must be encountered by all threads
if (tid == 0) {
    #pragma omp sections
    ...
} else {
    #pragma omp for
    ...
}
```

→ results in undefined behaviour.
Other Control Directives
single
Executed by exactly one thread (whichever arrives first).
Implicit barrier at end.
```c
#pragma omp parallel
{
    #pragma omp single
    printf("Initialisation...\n");
}
```
master / masked
Executed only by master thread (thread 0).
No implicit barrier at end.
masked (OpenMP 5.1) supersedes master and allows selecting the executing thread with a filter clause (see the sketch below).
```c
#pragma omp master
printf("Master thread only\n");
```
Function Safety
When calling functions inside parallel regions:
- Ensure they don’t modify global state unless protected.
- Stack variables inside the function are private by default.
- Global variables can be made thread-local using threadprivate.
Thread-safe = function produces correct results under concurrent calls.
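A short sketch of threadprivate (call_count is a hypothetical per-thread counter):

```c
#include <omp.h>
#include <stdio.h>

// Each thread gets its own persistent copy of call_count.
static int call_count = 0;
#pragma omp threadprivate(call_count)

void tracked_fn(void) {
    call_count++;  // no data race: the counter is thread-local
}

int main(void) {
    #pragma omp parallel
    {
        tracked_fn();
        tracked_fn();
        printf("thread %d: %d calls\n", omp_get_thread_num(), call_count);
    }
    return 0;
}
```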
Summary
| Concept | Key Idea | Example |
|---|---|---|
| Loop scheduling | Controls iteration-to-thread mapping | schedule(dynamic, 4) |
| Static vs Dynamic | pre-assigned vs run-time allocation | static = cheap, dynamic = balanced |
| Guided | Decreasing chunk size | adaptive |
| Runtime | Controlled via env var or API | tuning knob |
| Parallel sections | Functional parallelism | concurrent function calls |
| Single / Master | Serial portions in parallel region | controlled I/O or init |
PX457 Practical Checklist
- Benchmark static vs dynamic for a triangular matrix kernel.
- Measure thread imbalance using per-thread timing (omp_get_wtime()).
- Profile chunk size to find the optimal trade-off.
- Demonstrate functional parallelism by executing 3–4 expensive functions concurrently.
- Include commentary on scheduling overheads vs balance in your lab report.