The Gray Scott School

Day 1 — Foundations

Vocabulary and principles: optimizing means understanding how the hardware works and locating where time is lost.

1. The CPU

A processor is made of several independent cores. Clock frequency has stagnated since ~2005 (thermal limit); performance now comes from adding cores, which makes parallelism unavoidable.

Performance comes from the number of cores, not raw frequency.

A processor: independent cores, each with its registers, ALU, vector unit and caches
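As a small illustration of "performance comes from the number of cores", the sketch below asks the C++ standard library how many hardware threads are available. The helper name `worker_count` and its `fallback` parameter are hypothetical; note that `std::thread::hardware_concurrency()` reports hardware threads (which may exceed the number of physical cores) and is allowed to return 0 when the value is unknown.

```cpp
#include <cassert>
#include <thread>

// Number of worker threads to launch: the hardware thread count reported by
// the standard library, or `fallback` when that count is not available
// (hardware_concurrency() may legally return 0).
unsigned worker_count(unsigned fallback = 1) {
    assert(fallback >= 1);
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```

A program would typically call `worker_count()` once at startup and size its thread pool or work decomposition from the result.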

2. Compilation

Compilation translates C++ into machine code ahead of execution, optimizing along the way. It runs in four stages: preprocessing, compilation proper (translation to assembly, with optimization), assembling (assembly to object code), and linking. Optimization and vectorization both happen at the compilation stage.

The four stages of compilation — optimization and vectorization are born at the compile stage

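Recommended flags: -O3 -march=native. Without optimization (-O0), code can be an order of magnitude slower. Note that -march=native targets the instruction set of the build machine, so the binary may not run on an older CPU.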

3. Vectorization (SIMD)

Single Instruction, Multiple Data: one instruction processes several values at once. The compiler can generate SIMD code automatically when the loop is regular (contiguous access, no dependencies between iterations).

The same computation: 8 sequential additions, or a single vector instruction

Typical gain: up to ×4 to ×16 for 32-bit floats, depending on register width (SSE 128-bit: 4 values per instruction, AVX2 256-bit: 8, AVX-512 512-bit: 16).
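To make "regular loop" concrete, here is a minimal sketch contrasting a loop the compiler can usually vectorize with one it cannot vectorize directly. The function names `axpy` and `prefix_sum` are illustrative, not taken from the course material. Whether vectorization actually happens depends on the compiler and flags; with GCC, `-fopt-info-vec` reports the loops that were vectorized, and Clang offers `-Rpass=loop-vectorize`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Regular loop: contiguous access, every iteration independent of the others.
// A good candidate for auto-vectorization at -O3 (a candidate, not a guarantee).
void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Loop-carried dependency: iteration i needs the value written by iteration
// i-1, so the compiler cannot simply process several consecutive i at once.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

Both functions do one addition per element, yet only the first matches the "contiguous access, no dependencies" pattern described above.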

4. Concurrency vs parallelism

	Concurrency	Parallelism
Goal	hide waiting (I/O, network)	speed up computation
Hardware	possible on a single core	requires several cores
Domain	servers, async I/O	compute-intensive work

Concurrency: structuring the waiting on one core. Parallelism: actually computing at the same time on several cores

Concurrency is a structuring tool; parallelism is the result when the hardware allows it.
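The sketch below illustrates that last point with a threaded sum, assuming a hypothetical helper named `parallel_sum`. The code only structures the work into independent tasks; whether those tasks really run at the same time depends on how many cores the machine offers. On a single core the threads merely interleave.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` with `n_threads` threads. Each thread reduces its own contiguous
// chunk into its own slot of `partial`, so no two threads write the same element.
double parallel_sum(const std::vector<double>& data, unsigned n_threads) {
    assert(n_threads >= 1);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceiling division
    std::vector<double> partial(n_threads, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (unsigned t = 0; t < n_threads; ++t) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
        });
    }
    for (auto& w : workers) w.join();  // wait for every chunk to finish
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With GCC or Clang on Linux, threaded code like this is usually built with `-pthread`. The per-thread slots are a simple way to avoid a data race; a production version would also consider false sharing between neighbouring slots.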

5. Memory-bound and compute-bound

Compute-bound — limited by the compute units → adding cores helps.
Memory-bound — limited by bandwidth; cores wait for data → adding cores barely helps.

Arithmetic intensity (operations per byte loaded) sets the regime. Stencils have low arithmetic intensity and are usually memory-bound. The reference analysis tool is the roofline model.

The roofline: under the sloped roof you wait for memory; under the flat roof you wait for compute. The stencil lives on the left — memory-bound
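The roofline bound can be written as attainable performance = min(peak compute, arithmetic intensity × memory bandwidth). The sketch below encodes that formula; the function names and every number in the usage note are illustrative assumptions, not measurements from the course.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity: floating-point operations per byte moved from memory.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Roofline bound in GFLOP/s: the sloped roof is intensity * bandwidth,
// the flat roof is the peak compute rate; the lower of the two applies.
double roofline_gflops(double intensity, double peak_gflops, double bandwidth_gb_s) {
    return std::min(peak_gflops, intensity * bandwidth_gb_s);
}
```

As a hypothetical example, a stencil doing 6 operations per point while moving 16 bytes (one 8-byte read and one 8-byte write, assuming neighbours are served from cache) has an intensity of 0.375 operations per byte. On an assumed machine with 1000 GFLOP/s peak and 100 GB/s bandwidth, `roofline_gflops(0.375, 1000.0, 100.0)` gives 37.5 GFLOP/s: far below peak, hence memory-bound.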

The memory wall: cost of one access by level

Registers   0.3 ns
L1          1 ns
L2          4 ns
L3          15 ns
RAM         100 ns

Typical orders of magnitude — a RAM access costs ~100× an L1 hit.
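Because of this gap, the order in which memory is traversed matters. The sketch below sums the same row-major matrix in two ways; both function names are hypothetical. The results are identical, but the first walks memory contiguously while the second jumps by `cols` elements at each step, which tends to cause far more cache misses on large matrices. No timings are claimed here; measure on your own machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored row-major in a flat vector, walking memory
// contiguously (stride 1): consecutive accesses fall in the same cache lines.
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}

// Same elements, but walking down each column (stride = cols elements):
// each access may touch a different cache line.
double sum_column_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}
```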

6. Purity

A function is pure when its output depends only on its inputs, with no side effects. A pure computation has no hidden dependency: it is parallelizable without data races. A double-buffer scheme (read one array, write another) is pure and trivially parallel.

Read one array, write another: no hidden dependency. Read and write the same one: threads step on each other

Purity is what makes parallelization safe.
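A minimal sketch of the double-buffer scheme, using a 1D three-point average as a stand-in for the real stencil. The names `step` and `run`, the averaging formula and the fixed boundary values are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One step of a 1D three-point average. Reads only `in`, writes only `out`:
// no out[i] depends on any other out[j], so iterations can run in any order.
void step(const std::vector<double>& in, std::vector<double>& out) {
    assert(in.size() == out.size());
    const std::size_t n = in.size();
    if (n == 0) return;
    out[0] = in[0];          // boundary values copied unchanged (an assumption)
    out[n - 1] = in[n - 1];
    for (std::size_t i = 1; i + 1 < n; ++i) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    }
}

// Run `steps` iterations, swapping the roles of the two buffers after each one.
std::vector<double> run(std::vector<double> a, int steps) {
    std::vector<double> b(a.size());
    for (int s = 0; s < steps; ++s) {
        step(a, b);
        std::swap(a, b);     // the freshly written buffer becomes the next input
    }
    return a;
}
```

Updating a single array in place would make `out[i]` depend on values already overwritten in the same step, which is exactly the hidden dependency the double buffer removes.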

7. Unit tests in scientific computing

Floating-point rounding forbids exact comparison. Three rules guide validation:

1.  Compare with tolerance    |a − b| < ε   (never strict equality)
2.  Check invariants          energy, symmetry, no NaN, analytic case
3.  Seq. / parallel equality  a divergence reveals a data race
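The three rules can be sketched as small helper predicates. The names `approx_equal`, `all_finite` and `same_field` are hypothetical, and the absolute tolerance is the simplest possible choice; a relative tolerance is often more appropriate when values span several orders of magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rule 1: compare with a tolerance instead of strict equality.
bool approx_equal(double a, double b, double eps) {
    assert(eps > 0.0);
    return std::fabs(a - b) < eps;
}

// Rule 2 (one invariant among several): no NaN or infinity in the field.
bool all_finite(const std::vector<double>& field) {
    for (double x : field) {
        if (!std::isfinite(x)) return false;
    }
    return true;
}

// Rule 3: the parallel result must match the sequential one element by element,
// within tolerance. A mismatch is a strong hint of a data race.
bool same_field(const std::vector<double>& seq, const std::vector<double>& par, double eps) {
    if (seq.size() != par.size()) return false;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        if (!approx_equal(seq[i], par[i], eps)) return false;
    }
    return true;
}
```

For instance, `0.1 + 0.2 == 0.3` is false in IEEE-754 double precision, while `approx_equal(0.1 + 0.2, 0.3, 1e-12)` is true.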

