Jabir Hussain
OpenMP 05: Dependences and Nested Parallelism
Conceptual Focus
Parallelisation isn’t only about dividing work — you must ensure independent execution.
This lecture covers:
- How data dependences break safe parallelism and how to remove them.
- When a data race is harmless (relaxed algorithms).
- How nested parallelism allows hybrid combinations of data- and function-level concurrency.
Data Dependences in Loops
1. Flow (true) dependence
Earlier iteration writes, later iteration reads the same memory.
Must preserve order → usually cannot parallelise directly.
for (int i=1; i<n; ++i) {
    b[i] += a[i-1];   // read a[i-1], written in the previous iteration
    a[i] += c[i];     // write current element
}
Fix via loop skewing:
b[1] += a[0];
for (int i=1; i<n-1; ++i) {
    a[i] += c[i];
    b[i+1] += a[i];
}
a[n-1] += c[n-1];
Each iteration now uses data computed within that same iteration.
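As a sanity check, the skewed loop can carry a work-sharing directive and be compared against the sequential original. A minimal sketch (the function names update_seq and update_skewed are ours, not from the lecture):

```c
/* Original loop: flow dependence, b[i] reads a[i-1] written one
 * iteration earlier, so the iterations must run in order. */
void update_seq(double *a, double *b, const double *c, int n)
{
    for (int i = 1; i < n; ++i) {
        b[i] += a[i-1];
        a[i] += c[i];
    }
}

/* Skewed loop: iteration i touches only a[i], b[i+1] and c[i],
 * so no iteration depends on another and the middle loop is safe
 * to parallelise. */
void update_skewed(double *a, double *b, const double *c, int n)
{
    b[1] += a[0];                 /* peeled: first b update */
    #pragma omp parallel for
    for (int i = 1; i < n-1; ++i) {
        a[i]   += c[i];
        b[i+1] += a[i];           /* uses a[i] from this same iteration */
    }
    a[n-1] += c[n-1];             /* peeled: last a update */
}
```

Because every array element receives exactly the same operations on the same operands in both versions, the results agree bitwise, which makes for an easy regression test.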
2. Anti-dependence
Earlier iteration reads, later iteration writes to the same location.
Not a true dependence — just memory reuse.
for (int i=0; i<n; ++i) {
    x = (b[i] + c[i]) / 2;
    a[i] = a[i+1] + x;   // a later iteration will overwrite a[i+1]
}                        // (note: a must hold at least n+1 elements)
Fix: duplicate the array being read.
for (int i=0; i<n; ++i) a2[i] = a[i+1];   // this copy loop is itself parallelisable
#pragma omp parallel for
for (int i=0; i<n; ++i) a[i] = a2[i] + (b[i]+c[i])/2;
With x substituted away, no private clause is needed.
3. Output dependence
Multiple iterations write to the same location.
for (int i=0; i<n; ++i) {
    x = (b[i] + c[i]) / 2;
    a[i] += x;
    d[0] = 2*x;   // shared target: every iteration writes d[0]
}
Fix: use a temporary and lastprivate.
#pragma omp parallel for private(x) lastprivate(tmp)
for (int i=0; i<n; ++i) {
    x = (b[i] + c[i]) / 2;
    a[i] += x;
    tmp = 2*x;   // lastprivate keeps the value from the final iteration
}
d[0] = tmp;
(tmp must not appear in both private and lastprivate — lastprivate already makes it private.)
Scalar Expansion and Loop Fission
Sometimes a variable accumulates over iterations and is later reused:
for (int i=0; i<n; ++i) {
    y += a[i];
    b[i] = (b[i]+c[i]) * y;
}
Split (fission) and expand:
double *y2 = malloc(n * sizeof *y2);
y2[0] = y + a[0];
for (int i=1; i<n; ++i)           // sequential prefix sum carries the recurrence
    y2[i] = y2[i-1] + a[i];
y = y2[n-1];                      // preserve the final value of y
#pragma omp parallel for shared(b,c,y2)
for (int i=0; i<n; ++i)
    b[i] = (b[i]+c[i]) * y2[i];
free(y2);
→ The dependence on y is removed; the second loop is parallelisable.
Pointer Pitfalls and Data Races
- Declaring a pointer as private does not copy its contents — just the pointer value.
- Threads still reference the same memory → data race.
- Allocate inside the parallel region for truly independent buffers.
- Always free() within the same region to avoid leaks.
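A sketch of the safe pattern, assuming a hypothetical helper scaled_copy that needs per-thread scratch space (the broken private-pointer variant is shown only in the comment; error handling for malloc is omitted for brevity):

```c
#include <stdlib.h>
#include <string.h>

/* WRONG: 'buf' declared private copies only the POINTER, so every
 * thread's private copy holds the same address and all threads
 * scribble on one shared buffer:
 *
 *     double *buf = malloc(m * sizeof *buf);
 *     #pragma omp parallel private(buf)   // duplicates the pointer
 *     { ... buf[j] = ...; ... }           // data race on *buf
 *
 * RIGHT: allocate (and free) inside the parallel region, so each
 * thread owns an independent buffer. */
void scaled_copy(const double *src, double *dst, int n, int m)
{
    #pragma omp parallel
    {
        double *buf = malloc(m * sizeof *buf);   /* per-thread scratch  */
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < m; ++j)
                buf[j] = 2.0 * src[i*m + j];     /* private work space  */
            memcpy(dst + i*m, buf, m * sizeof *buf);
        }
        free(buf);                               /* free in same region */
    }
}
```

Compiled without OpenMP the pragmas are ignored and the same code runs serially with a single buffer, so the pattern degrades gracefully.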
Relaxed Algorithms (When Races Don’t Matter)
Sometimes correctness ≠ bitwise reproducibility.
Example 1 — Flagging condition
int found = 0;
#pragma omp parallel for
for (int i=0; i<n; ++i)
    if (a[i] > threshold) found = 1;   // race, but benign: every writer stores the same value
Any thread setting found = 1 suffices — exact timing irrelevant.
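Strictly, the unsynchronised store is still a data race under the C/OpenMP memory model; if that matters (e.g. when running under a thread sanitizer), an atomic write makes the pattern well-defined at essentially no cost. A sketch, wrapped in a hypothetical helper any_above:

```c
/* Race-free variant of the flag pattern: the store is atomic, so the
 * result is well-defined even under the strict memory model. */
int any_above(const double *a, int n, double threshold)
{
    int found = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        if (a[i] > threshold) {
            #pragma omp atomic write
            found = 1;               /* all writers store the same value */
        }
    return found;
}
```
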
Example 2 — Iterative relaxation
for (int i=1; i<n-1; ++i)
    for (int j=1; j<n-1; ++j)
        a[i][j] = (a[i][j] + a[i][j-1] + a[i][j+1]
                 + a[i-1][j] + a[i+1][j]) / 5.0;
Parallel updates cause small differences but converge to the same steady state.
Used in Jacobi or heat-diffusion methods.
Do not apply this to non-convergent schemes (e.g. explicit integrators).
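A relaxed parallel sweep of the stencil above might look as follows (a sketch; relax_sweep is our name, and the grid is assumed square). The outer iterations race on neighbouring rows, so individual updates may mix old and new values; that is exactly the imprecision the method tolerates:

```c
/* Chaotic ("relaxed") sweep: parallel iterations race on rows i-1 and
 * i+1, so updates may read a mixture of old and new values.  This is
 * acceptable only because the iteration converges to the same fixed
 * point regardless of update order. */
void relax_sweep(int n, double a[n][n])
{
    #pragma omp parallel for
    for (int i = 1; i < n-1; ++i)
        for (int j = 1; j < n-1; ++j)
            a[i][j] = (a[i][j] + a[i][j-1] + a[i][j+1]
                     + a[i-1][j] + a[i+1][j]) / 5.0;
}
```

One race-proof property to test: a uniform field is an exact fixed point of the stencil (the average of five equal values is that value), so a constant grid must come back unchanged even when the sweep runs in parallel.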
Nested Parallelism
Concept
OpenMP allows parallel regions inside parallel regions.
#pragma omp parallel num_threads(2)
{
    work_A();
    #pragma omp parallel num_threads(4)
    work_B();   // each of the 2 outer threads spawns a team of 4 (8 threads in total)
}
Each outer thread becomes a team leader for its own sub-team.
Useful for combining functional and data parallelism.
Control of Nesting Depth
| Method | Example | Meaning |
|---|---|---|
| Environment variable | export OMP_MAX_ACTIVE_LEVELS=2 | allow 2 nested levels |
| API call | omp_set_max_active_levels(2); | same, programmatic |
| Deprecated | OMP_NESTED=TRUE, omp_set_nested() | legacy, avoid |
Beyond the allowed depth, additional parallel regions run serially.
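The depth limit can be exercised from code. A sketch (inner_level is our name; the #ifdef fallbacks let the file also build without OpenMP, in which case it reports level 0):

```c
#ifdef _OPENMP
#include <omp.h>
#else   /* serial fallbacks so the sketch also builds without OpenMP */
static int  omp_get_active_level(void)       { return 0; }
static void omp_set_max_active_levels(int n) { (void)n; }
#endif

/* Return the active nesting level observed inside two nested regions:
 * 2 if both levels really ran in parallel, less if the runtime
 * serialised a level, 0 in a serial build. */
int inner_level(void)
{
    int level = 0;
    omp_set_max_active_levels(2);        /* permit two active levels */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(4)
        {
            #pragma omp critical
            if (omp_get_active_level() > level)
                level = omp_get_active_level();
        }
    }
    return level;
}
```

Note the distinction: omp_get_level() counts all enclosing parallel regions, active or serialised, while omp_get_active_level() counts only those actually executed by more than one thread, so the latter is the one that reveals whether nesting really happened.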
Summary
| Topic | Idea | Fix / Usage |
|---|---|---|
| Flow dep. | later reads earlier write | loop skewing |
| Anti dep. | later writes earlier read | duplicate read array |
| Output dep. | multiple writes | temporary + lastprivate |
| Scalar expansion | convert scalar recurrence to array | parallel second loop |
| Pointer races | private pointer ≠ private data | allocate per-thread |
| Relaxed algos | data races tolerated | e.g. convergence loops |
| Nested parallelism | parallel within parallel | control via OMP_MAX_ACTIVE_LEVELS |
PX457 Implementation Advice
- Demonstrate dependency removal with before/after code and runtime.
- Quantify performance vs. correctness (float drift, convergence).
- Experiment with nested regions: run num_threads(2) + num_threads(4) and measure total utilisation.
- Use omp_get_level() to verify the current nesting depth during experiments.
- Document any race-tolerant algorithm clearly — justify why it is mathematically safe.