The Gray Scott School

Day 1 — Foundations

The vocabulary and principles. Optimizing means understanding how the hardware works and locating where time is lost.

1. The CPU

A processor is made of several independent cores. Clock frequency has stagnated since ~2005 (thermal limit); performance now comes from adding cores, which makes parallelism unavoidable.

Performance comes from the number of cores, not raw frequency.

[Figure: CPU (socket) with four cores, each holding registers, an ALU, a SIMD unit and L1/L2 caches; a shared L3 cache; RAM reached over the slow memory bus]
A processor: independent cores, each with its registers, ALU, vector unit and caches

2. Compilation

Compilation translates C++ into machine code ahead of execution, optimizing along the way. It has four stages: preprocessing, compilation proper (translation to assembly, with optimization), assembling (into object files), and linking. Optimization and vectorization both happen at the compilation stage.

[Figure: source.cpp → preprocessor (#include, macros) → compilation (optimization) → assembling (.o) → linking (+ libs) → executable; flags shown: -O3 -march=native]
The four stages of compilation — optimization and vectorization are born at the compile stage

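Recommended flags: -O3 -march=native. -O3 enables aggressive optimization, including auto-vectorization; -march=native lets the compiler use every instruction set of the build machine, at the cost of portability to older CPUs. Without optimization (-O0), code can be an order of magnitude slower.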

3. Vectorization (SIMD)

Single Instruction, Multiple Data: one instruction processes several values at once. The compiler generates it automatically when the loop is regular (contiguous access, no dependencies).
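
To make "regular loop" concrete, here is a minimal sketch contrasting a loop that compilers can typically auto-vectorize with one they cannot vectorize as simply. The function names add_arrays and prefix_sum_in_place are hypothetical, chosen for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: out[i] = a[i] + b[i] for every i.
// Contiguous access and no dependency between iterations: the regular
// loop shape that compilers can typically auto-vectorize when optimizing.
void add_arrays(const std::vector<float>& a,
                const std::vector<float>& b,
                std::vector<float>& out) {
    assert(a.size() == b.size() && a.size() == out.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Counter-example: each iteration reads the value written by the previous
// one (a loop-carried dependency), so the iterations cannot simply be
// packed side by side into one vector instruction.
void prefix_sum_in_place(std::vector<float>& x) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        x[i] += x[i - 1];
    }
}
```

To check what the compiler actually did, GCC can report vectorized loops with -fopt-info-vec, and Clang with -Rpass=loop-vectorize.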

[Figure: scalar execution, 8 additions issued one after another (t₁ → t₂ → … → t₈), versus vectorized execution (AVX2), 1 instruction adding 8 floats in a 256-bit register at t₁ only]
The same computation: 8 sequential additions, or a single vector instruction

Theoretical gain for 32-bit floats: ×4 to ×16 depending on register width (SSE 128-bit: 4 floats, AVX2 256-bit: 8 floats, AVX-512 512-bit: 16 floats). Memory-bound loops usually see less.

4. Concurrency vs parallelism

            Concurrency                    Parallelism
Goal        hide waiting (I/O, network)    speed up computation
Hardware    possible on a single core      requires several cores
Domain      servers, async I/O             compute-intensive work
[Figure: concurrency on 1 core, tasks A and B take turns over time to hide the waiting; parallelism on 4 cores, computations truly run at the same time (×4)]
Concurrency: structuring the waiting on one core. Parallelism: actually computing at the same time on several cores

Concurrency is a structuring tool; parallelism is the result when the hardware allows it.
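
As an illustration of the parallelism side, here is a minimal sketch that splits a sum across several threads with std::thread. The helper name parallel_sum and the chunking scheme are assumptions made for this example. Whether the chunks really run at the same time depends on how many cores are available.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` using `num_threads` worker threads.
// Each worker reduces its own contiguous chunk into its own slot of
// `partial`, so no two threads ever write the same element.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    // Chunk size rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0.0);
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for every chunk to finish
    }
    // Combine the per-thread results sequentially.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With GCC or Clang on Linux, building this may require the -pthread flag.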

5. Memory-bound and compute-bound

  • Compute-bound — limited by the compute units → adding cores helps.
  • Memory-bound — limited by bandwidth; cores wait for data → adding cores barely helps.

Arithmetic intensity (operations per byte loaded) sets the regime. Stencils have low arithmetic intensity and are usually memory-bound. The reference analysis tool is the roofline model.
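
The roofline bound itself is a one-line formula: attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity. The sketch below writes it out; the function names and the machine figures used in the note after it are hypothetical, not measurements.

```cpp
#include <algorithm>
#include <cassert>

// Roofline bound: attainable performance (Gflop/s) is the lower of the
// compute roof and the memory roof (bandwidth x arithmetic intensity).
double roofline_gflops(double peak_gflops,
                       double bandwidth_gb_per_s,
                       double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gb_per_s > 0.0);
    assert(flops_per_byte >= 0.0);
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}

// Ridge point: the arithmetic intensity (flops/byte) at which the two
// roofs meet. Kernels below it are memory-bound, above it compute-bound.
double ridge_point(double peak_gflops, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return peak_gflops / bandwidth_gb_per_s;
}
```

For an assumed machine with a 1000 Gflop/s peak and 100 GB/s of bandwidth, the ridge point is 10 flops/byte. A kernel at 0.5 flops/byte is then capped at 50 Gflop/s, far below the peak, however many cores are added.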

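[Figure: roofline plot. x-axis: arithmetic intensity (flops/byte, log); y-axis: performance (Gflop/s, log). The sloped memory roof (bandwidth) bounds the memory-bound region, the flat compute roof (peak Gflop/s) bounds the compute-bound region, and the two meet at the ridge point. The Gray-Scott stencil sits in the memory-bound region]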
The roofline: under the sloped roof you wait for memory; under the flat roof you wait for compute. The stencil lives on the left — memory-bound

The memory wall: cost of one access by level

Level       Latency
Register    0.3 ns
L1          1 ns
L2          4 ns
L3          15 ns
RAM         100 ns

Typical orders of magnitude — a RAM access costs ~100× an L1 hit.

6. Purity

A function is pure when its output depends only on its inputs, with no side effects. A pure computation has no hidden dependency: it is parallelizable without data races. A double-buffer scheme (read one array, write another) is pure and trivially parallel.

[Figure: pure (double buffer), U0 is read and U1 is written, safely parallelizable; in place (impure), U is both read and written, possible data race]
Read one array, write another: no hidden dependency. Read and write the same one: threads step on each other

Purity is what makes parallelization safe.
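
The double-buffer scheme described above can be sketched as follows. The stencil is a simplified 1D three-point average, used as a stand-in and not the Gray-Scott update itself; the names stencil_step and run_steps are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One step of a simplified 1D three-point averaging stencil.
// Reads only from `in` and writes only to `out`: every out[i] can be
// computed independently, in any order or in parallel.
void stencil_step(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    const std::size_t n = in.size();
    if (n == 0) return;
    out[0] = in[0];          // boundary values are simply copied here
    out[n - 1] = in[n - 1];
    for (std::size_t i = 1; i + 1 < n; ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}

// Run `steps` iterations, swapping the roles of the two buffers after
// each one so that no step ever reads the array it is writing.
std::vector<double> run_steps(std::vector<double> u0, int steps) {
    std::vector<double> u1(u0.size());
    for (int s = 0; s < steps; ++s) {
        stencil_step(u0, u1);
        std::swap(u0, u1);   // constant time: swaps buffers, not elements
    }
    return u0;
}
```

Updating a single array in place would instead make out[i] depend on whether its neighbors had already been overwritten, which is exactly the hidden dependency the double buffer removes.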

7. Unit tests in scientific computing

Floating-point rounding forbids exact comparison. Three rules guide validation:

1.  Compare with tolerance    |a − b| < ε   (never strict equality)
2.  Check invariants          energy, symmetry, no NaN, analytic case
3.  Seq. / parallel equality  a divergence reveals a data race
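
The three rules can be turned into small helpers, sketched below. The names approx_equal, all_finite and fields_match are hypothetical, and the default tolerances are illustrative assumptions. Note that approx_equal combines the absolute test |a − b| ≤ ε with a relative one, a common variant that behaves better for large values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rule 1: compare with a tolerance. An absolute bound handles values
// near zero, a relative bound handles large values. A NaN on either
// side makes both comparisons false, so the result is false.
bool approx_equal(double a, double b,
                  double abs_tol = 1e-12, double rel_tol = 1e-9) {
    assert(abs_tol >= 0.0 && rel_tol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= abs_tol || diff <= rel_tol * scale;
}

// Rule 2: a cheap invariant, no NaN or infinity anywhere in the field.
bool all_finite(const std::vector<double>& field) {
    return std::all_of(field.begin(), field.end(),
                       [](double x) { return std::isfinite(x); });
}

// Rule 3: element-wise comparison of a sequential and a parallel result.
bool fields_match(const std::vector<double>& seq,
                  const std::vector<double>& par,
                  double abs_tol = 1e-12, double rel_tol = 1e-9) {
    if (seq.size() != par.size()) return false;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        if (!approx_equal(seq[i], par[i], abs_tol, rel_tol)) return false;
    }
    return true;
}
```

When the parallel version performs the same operations per element in the same order, as in a double-buffer stencil, the tolerances can be set to zero to require identical results. A parallel reduction may legitimately differ in the last bits, because floating-point addition is not associative.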
Copyright © 2026