Jabir Hussain



Text: Week 3

1. Basic Serial (SISD) Architecture

Flynn’s Taxonomy — A Foundation

Computer architectures can be classified by how instructions and data streams interact:

Type | Meaning | Description
SISD | Single Instruction, Single Data | One instruction operates on one data element at a time (classic CPU).
SIMD | Single Instruction, Multiple Data | The same instruction is applied to multiple data elements in parallel.
MISD | Multiple Instruction, Single Data | Multiple instructions operate on the same data stream (rare).
MIMD | Multiple Instruction, Multiple Data | Multiple programs operate on multiple data streams (modern clusters).

SISD: The Classic Model

  • One program → one processor.
  • Processor fetches and executes instructions sequentially.
  • Each instruction operates on one piece of data at a time.
  • Includes fetching/writing data to and from main memory.

Example: Your laptop CPU running a single-threaded Python script.


2. Memory Hierarchy — Latency and Locality

Hierarchy Overview

Memory is hierarchical — smaller, faster memory near the CPU; larger, slower memory farther away.

Level | Typical Size | Location | Access Time (cycles) | Description
Registers | ~100–500 bytes per core | On-chip | 1 | Data the CPU is actively working on
L1 Cache | 32–64 KB | Per core | 1–3 | Fast SRAM
L2 Cache | 256 KB | Per core | 5–10 | Medium-speed buffer
L3 Cache | 2–40 MB | Shared across cores | 15–25 | Shared “last-level cache”
Main Memory (RAM) | 8–64 GB | Off-chip | 30–300 | DRAM (dynamic RAM)

Access time grows by orders of magnitude as you move down the hierarchy.

Performance depends critically on cache hit rates (minimising “cache misses”).
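
To see why hit rates matter so much, here is a back-of-envelope sketch (the 3-cycle hit time and 200-cycle miss penalty are illustrative figures, not measurements of any particular CPU):

#include <stdio.h>

/* Average access time = hit_time + miss_rate * miss_penalty.
 * Even a small miss rate is expensive when the miss penalty is large. */
int main(void) {
    const double hit_time = 3.0;        /* cycles, illustrative */
    const double miss_penalty = 200.0;  /* cycles, illustrative */
    for (int pct = 90; pct <= 100; pct += 2) {
        double miss_rate = 1.0 - pct / 100.0;
        printf("hit rate %d%% -> %.1f cycles on average\n",
               pct, hit_time + miss_rate * miss_penalty);
    }
    return 0;
}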


3. Cache Mechanics — The “Toy Model”

Direct-Mapped Cache (Simplified Example)

Imagine:

  • Main memory: thousands of 64-bit words (8 bytes each).
  • Cache: 4 “lines” each storing 16 words.

When CPU requests A[i]:

  1. Check if data is already in cache (cache hit).
  2. If not → load the entire cache line containing A[i] from main memory.
  3. If the target line already holds other data → evict it and load the new line (write-back vs. write-through policies decide when modified data is written back to memory).

Implication: Accessing adjacent elements (good spatial locality) is essentially free once the line is cached.
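
A minimal sketch of the toy model's mapping rule (the 4-line, 16-words-per-line parameters come from above; direct-mapped means each address has exactly one possible line, and for simplicity the arrays are assumed to start at a line boundary):

#include <stdio.h>

#define WORDS_PER_LINE 16   /* 16 x 8-byte words = one cache line */
#define NUM_LINES       4   /* the toy cache has 4 lines          */

/* Direct-mapped rule: line = (word offset / words per line) mod lines. */
static int toy_cache_line(int word_offset) {
    return (word_offset / WORDS_PER_LINE) % NUM_LINES;
}

int main(void) {
    /* Two 64-double arrays laid out back to back, as in the examples below:
     * word offsets 0..63 belong to A, 64..127 to B.                        */
    for (int i = 0; i < 64; i += 16)
        printf("A[%2d] -> line %d, B[%2d] -> line %d\n",
               i, toy_cache_line(i), i, toy_cache_line(64 + i));
    return 0;
}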


4. Cache Performance in Practice

Example 1 – “A Happy Cache”

double A[32], B[32];
for (int i = 0; i < 32; i++) {
    A[i] = A[i] * B[i];
}
  • A[i] and B[i] map to different cache lines.
  • Both arrays fit comfortably within cache.
  • Only 4 reads from main memory needed.
  • Reuse within cache → high temporal locality.

Example 2 – “Cache Thrashing”

double A[64], B[64];
for (int i = 0; i < 64; i++) {
    A[i] = A[i] * B[i];
}
  • Now A[i] and B[i] map to the same cache line.
  • Each iteration evicts the previous data.
  • Results in 128 main memory reads instead of 4.
  • Performance collapses due to repeated cache misses.

This is cache conflict thrashing — the cache is “fighting” itself due to address overlap.


Fix 1 – Padding

double A[64], C[16], B[64];   // padding array C (16 doubles = one cache line) shifts B's alignment
for (int i = 0; i < 64; i++) {
    A[i] = A[i] * B[i];
}
  • C acts as a buffer to shift memory alignment.
  • Now A and B occupy different cache lines again.
  • 8 reads instead of 128 — massive performance gain.

Padding like this is rarely portable: modern caches are set-associative, so each address can be placed in any of several lines and conflicts of this kind are less severe. Still, the idea is conceptually important.


5. Spatial and Temporal Locality

Cache logic assumes:

  • Spatial locality:

    If you access A[i], you’ll soon access A[i+1], A[i+2], etc.

  • Temporal locality:

    If you use A[i] now, you’ll likely reuse it soon.

Lesson: Write code that accesses contiguous memory and reuses data often.

Avoid:

  • Strided access (skipping elements)
  • Random memory access patterns

6. Interleaving — Data Structure Design

Separate Arrays

double A[64], B[64];
for (int i = 0; i < 64; i++) {
    A[i] = A[i] * B[i];
}
  • Two separate blocks in memory: A[i] and B[i] sit far apart and (as in the thrashing example) can map to the same cache line.

Interleaved Structs

struct my_type {
    double A;
    double B;
};
struct my_type d[64];
for (int i = 0; i < 64; i++) {
    d[i].A = d[i].A * d[i].B;
}
  • Each A[i] is stored next to its corresponding B[i].
  • When a cache line loads, both values arrive together.
  • 8 main memory reads → same as padded case, but cleaner.

7. Access Patterns — Even/Odd Example

Bad (strided) access

for (int i = 0; i < 127; i += 2)
    A[i] = A[i] * A[i];
for (int i = 1; i < 128; i += 2)
    A[i] = A[i] * A[i] * A[i];
  • Two passes → double the reads.
  • Only half the cache line used per iteration.
  • 16 reads total.

Good (unit-stride) access

for (int i = 0; i < 127; i += 2) {
    A[i]   = A[i] * A[i];
    A[i+1] = A[i+1] * A[i+1] * A[i+1];
}
  • All elements processed in one contiguous sweep.
  • 8 reads total, 2× faster due to full cache-line utilisation.

8. Multidimensional Arrays in C

Poor Access (Column-wise)

double A[4][32];
double sumcol[32];

for (int i = 0; i < 32; i++) {
    sumcol[i] = 0.0;
    for (int j = 0; j < 4; j++)
        sumcol[i] += A[j][i];
}
  • Accesses A[j][i] with stride 32 → jumps across memory.
  • Each inner loop triggers a new cache line load.
  • Total: 128 reads from main memory.

Fixed Access (Row-wise)

double A[32][4];
double sumcol[32];

for (int i = 0; i < 32; i++) {
    sumcol[i] = 0.0;
    for (int j = 0; j < 4; j++)
        sumcol[i] += A[i][j];
}
  • The data layout is transposed (A[32][4] instead of A[4][32]), so each sum now reads four contiguous elements.
  • 8 reads total: up to 16× faster in realistic systems.

9. The Programmer’s Mantra

For C (Row-Major Storage):

“When addressing arrays in C, the right-most index must vary most rapidly.”

A[i][j]   // i = row, j = column → j must be inner loop

For Fortran (Column-Major Storage):

“When addressing arrays in Fortran, the left-most index must vary most rapidly.”

A(i, j)   ! i is the inner loop index

Mixing conventions (e.g., porting a loop nest from C to Fortran without swapping the loop order) often causes catastrophic slowdowns due to cache inefficiency.
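
A quick way to convince yourself of the C rule (a small, self-contained check; the 4×32 shape is just an example): in row-major storage, A[i][j] lives i*32 + j elements from the start, so incrementing j moves to the adjacent address.

#include <stdio.h>

int main(void) {
    double A[4][32];
    double *base = &A[0][0];
    int i = 2, j = 5;
    /* Row-major: the offset of A[i][j] is i * (number of columns) + j. */
    printf("offset of A[%d][%d] = %td elements (expected %d)\n",
           i, j, &A[i][j] - base, i * 32 + j);
    printf("offset of A[%d][%d] = %td elements (the next address)\n",
           i, j + 1, &A[i][j + 1] - base);
    return 0;
}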


10. Memory Optimisation Summary

Principle | Explanation
Use unit stride | Access contiguous memory sequentially.
Exploit spatial locality | Touch data near what’s already in cache.
Exploit temporal locality | Reuse data soon after loading it.
Avoid cache-line multiples | Power-of-2 array sizes often cause aliasing issues.
Minimise passes | Do more work per memory read.
Respect array layout | Row-major (C) vs column-major (Fortran).
Legacy code caution | Older code predates caching and may perform poorly.

11. Beyond SISD — Alternative Architectures

MISD (Multiple Instruction, Single Data)

  • Multiple programs process the same data stream with different algorithms.
  • Not used in modern HPC (rare exceptions: fault-tolerant systems like the Space Shuttle).

SIMD (Single Instruction, Multiple Data)

  • One instruction applied across many data points simultaneously.
  • Example: a vectorised operation like:

      A = B * C;  // performed on multiple elements in parallel
    
  • Historically used in vector supercomputers (Cray, Fujitsu VP).
  • Reappeared via CPU vector extensions (SSE, AVX, NEON).

12. Vectorisation in Modern CPUs

What Is Vectorisation?

Executing a single instruction across multiple data points (SIMD inside a CPU core).

Evolution:

Extension | Width | Year | Notes
SSE | 128-bit | ~1999 | Operates on 2 doubles or 4 floats
AVX | 256-bit | ~2011 | Doubles the vector width
AVX2 | 256-bit | ~2013 | Adds integer operations
AVX-512 | 512-bit | ~2017 | Can process 8 doubles or 16 floats per instruction

Enabling Vectorisation:

  • Short, simple loops
  • Avoid branching (if, switch) inside hot loops
  • Avoid loop-carried dependencies (one iteration using another iteration's result)
  • Let the compiler auto-vectorise

Helpful GCC flags:

-fopt-info-vec-optimized
-fopt-info-vec-missed

These report which loops were successfully vectorised and why others failed.
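
As a sketch of a loop that normally meets these conditions (the function name and the restrict qualifiers are illustrative, not from the notes; restrict promises the compiler that x and y do not overlap, which removes a common reason loops end up in the "missed" report):

/* No branches, no loop-carried dependence, unit stride: a good candidate
 * for auto-vectorisation with SSE/AVX instructions.                      */
void scale_add(double *restrict y, const double *restrict x,
               double a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}

A typical invocation would be something like gcc -O3 -fopt-info-vec-optimized -c scale_add.c; at -O3 GCC enables its auto-vectoriser, and the flag reports which loops it transformed.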

Pitfall:

Manual loop unrolling can actually prevent compiler vectorisation — compilers are sophisticated enough to do this themselves optimally.


13. MIMD — The Modern Parallel Model

Definition:

Multiple Instruction streams → Multiple Data streams

Each processor (core or node) executes its own program, often on different data.

This is the dominant model for:

  • Multicore CPUs
  • HPC clusters
  • Cloud computing infrastructures

14. Hardware Models

Model | Description | Example
Shared Memory (SMP) | All cores access the same physical memory; synchronisation via threads. | Multicore CPU
Distributed Memory | Each processor has its own local memory; communication via network. | HPC cluster

Hybrid model: Modern supercomputers combine both (clusters of SMP nodes).


15. Programming Models for MIMD

Paradigm | Model | API / Standard | Notes
Shared Memory | Threads sharing one memory space | OpenMP | Easy to parallelise loops; ideal for multicore CPUs.
Distributed Memory | Tasks on different nodes communicating via messages | MPI (Message Passing Interface) | Explicit control over data communication; used in clusters.
Hybrid | Mix of both | OpenMP + MPI | Real-world supercomputing approach.
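
For the shared-memory row of this table, a minimal OpenMP sketch (the function and array names are illustrative): one pragma asks the runtime to split the loop iterations across the cores of a node.

#include <omp.h>

/* Each thread gets a chunk of the iteration space; all threads read and
 * write the same shared arrays, which is exactly the SMP model above.   */
void axpy(double *y, const double *x, double a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

Built with something like gcc -fopenmp, this runs the loop on all available cores; the MPI equivalent would instead give each rank its own slice of the arrays and exchange any shared data with explicit messages.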

Summary

Key Takeaways

  • Memory hierarchy is the bottleneck in modern computing — not CPU speed.
  • Cache awareness, stride control, and data locality dominate performance.
  • Always design data access patterns to exploit contiguity.
  • Understand architecture-level parallelism (SIMD) and system-level parallelism (MIMD).
  • Use compiler analysis to verify vectorisation.
  • Programming model choice (OpenMP vs MPI) depends on your hardware.