Day 1 — Foundations
1. The CPU
A processor is made of several independent cores. Clock frequency has stagnated since ~2005 (thermal limit); performance now comes from adding cores, which makes parallelism unavoidable.
Performance comes from the number of cores, not raw frequency.
2. Compilation
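Compilation translates C++ into machine code ahead of execution and optimizes it along the way. It has four stages: preprocessing, compilation proper (C++ to assembly, where optimization happens), assembling (assembly to object code), and linking. Optimization and vectorization are decided at compile time, not at run time.
Recommended flags: -O3 -march=native. Note that -march=native targets the instruction set of the build machine, so the binary may not run on an older CPU. Without optimization (-O0, the default for GCC and Clang), code can be an order of magnitude slower.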
3. Vectorization (SIMD)
Single Instruction, Multiple Data: one instruction processes several values at once. The compiler generates SIMD code automatically when the loop is regular (contiguous access, no dependencies between iterations).
Peak gain for 32-bit floats: ×4 to ×16 depending on register width (SSE 128-bit, AVX2 256-bit, AVX-512 512-bit). Loops limited by memory bandwidth see less.
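As an illustration of what "regular" means here, the sketch below contrasts a loop the compiler can usually vectorize with one it cannot vectorize directly. The function names `saxpy` and `prefix_sum` are illustrative choices, not taken from the course material.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Regular loop: contiguous access, every iteration independent of the others.
// This is a typical candidate for auto-vectorization at -O3.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Loop-carried dependency: iteration i needs the result of iteration i - 1,
// so the iterations cannot simply be executed several at a time.
void prefix_sum(std::vector<float>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

To check what the compiler actually did, GCC can report vectorized loops with `-fopt-info-vec` and Clang with `-Rpass=loop-vectorize`. Whether a given loop is vectorized still depends on the compiler version and target.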
4. Concurrency vs parallelism
| | Concurrency | Parallelism |
|---|---|---|
| Goal | hide waiting (I/O, network) | speed up computation |
| Hardware | possible on a single core | requires several cores |
| Domain | servers, async I/O | compute-intensive work |
Concurrency is a structuring tool; parallelism is the result when the hardware allows it.
5. Memory-bound and compute-bound
- Compute-bound — limited by the compute units → adding cores helps.
- Memory-bound — limited by bandwidth; cores wait for data → adding cores barely helps.
Arithmetic intensity (operations per byte loaded) sets the regime. Stencils have low arithmetic intensity and are usually memory-bound. The reference analysis tool is the roofline model.
Typical orders of magnitude — a RAM access costs ~100× an L1 hit.
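The roofline bound can be written as a short calculation: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by the arithmetic intensity. The sketch below is a minimal version of that formula; the function names are illustrative, and any hardware figures passed to it are assumptions made by the caller.

```cpp
#include <algorithm>
#include <cassert>

// Arithmetic intensity: floating-point operations per byte moved to or from memory.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Roofline bound: performance is capped either by the compute peak or by
// bandwidth times intensity, whichever is lower.
double attainable_gflops(double peak_gflops, double bandwidth_gb_per_s, double intensity) {
    return std::min(peak_gflops, bandwidth_gb_per_s * intensity);
}

// Ridge point: the intensity above which a kernel stops being memory-bound.
double ridge_intensity(double peak_gflops, double bandwidth_gb_per_s) {
    assert(bandwidth_gb_per_s > 0.0);
    return peak_gflops / bandwidth_gb_per_s;
}
```

For example, a 3-point averaging stencil on doubles performs 3 operations per point and, assuming ideal cache reuse, moves 16 bytes (one read, one write), which gives 0.1875 operations per byte. On a hypothetical machine with a 100 GFLOP/s peak and 20 GB/s of bandwidth, the ridge point is 5 operations per byte, so this stencil sits far on the memory-bound side with a bound of 3.75 GFLOP/s.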
6. Purity
A function is pure when its output depends only on its inputs, with no side effects. A pure computation has no hidden dependency: it is parallelizable without data races. A double-buffer scheme (read one array, write another) is pure and trivially parallel.
Purity is what makes parallelization safe.
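A minimal sketch of the double-buffer idea, assuming a 1D 3-point averaging stencil with fixed boundary values (the stencil, the names `step_range`, `step_sequential`, `step_parallel`, and the chunking strategy are illustrative assumptions). Each thread reads only the input array and writes a disjoint slice of the output array, so no synchronization is needed beyond the final join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One update on the index range [begin, end): reads only `in`, writes only
// out[i] for i in the range, so distinct ranges never interfere.
void step_range(const std::vector<double>& in, std::vector<double>& out,
                std::size_t begin, std::size_t end) {
    const std::size_t n = in.size();
    for (std::size_t i = begin; i < end; ++i) {
        if (i == 0 || i + 1 == n) {
            out[i] = in[i];  // boundary values are copied unchanged
        } else {
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0;  // 3-point average
        }
    }
}

std::vector<double> step_sequential(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    step_range(in, out, 0, in.size());
    return out;
}

std::vector<double> step_parallel(const std::vector<double>& in, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t n = in.size();
    std::vector<double> out(n);
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin < end) {
            workers.emplace_back(step_range, std::cref(in), std::ref(out), begin, end);
        }
    }
    for (auto& w : workers) {
        w.join();  // all slices are complete once every worker has joined
    }
    return out;
}
```

Because every output element is computed by the same expression from the same inputs, `step_parallel(u, k)` is expected to return exactly the same values as `step_sequential(u)` for any thread count `k`. An in-place update of a single array would break this property, since a thread could read a neighbor that another thread has already overwritten.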
7. Unit tests in scientific computing
Floating-point rounding forbids exact comparison. Three rules guide validation:
1. Compare with tolerance: |a − b| < ε (never strict equality).
2. Check invariants: energy, symmetry, no NaN, analytic case.
3. Sequential/parallel equality: a divergence reveals a data race.
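These rules can be sketched as small helper functions. The names `approx_equal`, `all_finite`, and `fields_match`, as well as the default tolerances, are illustrative assumptions; suitable tolerances depend on the problem and on the precision used.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rule 1: compare with a tolerance. Combining an absolute and a relative
// tolerance covers both values near zero and large values.
bool approx_equal(double a, double b, double abs_tol = 1e-12, double rel_tol = 1e-9) {
    assert(abs_tol >= 0.0 && rel_tol >= 0.0);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(abs_tol, rel_tol * scale);
}

// Rule 2 (one invariant among several): no NaN or infinity anywhere in the field.
bool all_finite(const std::vector<double>& v) {
    return std::all_of(v.begin(), v.end(), [](double x) { return std::isfinite(x); });
}

// Rule 3 helper: element-wise comparison of two results, for example the
// output of a sequential run against the output of a parallel run.
bool fields_match(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) {
        return false;
    }
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!approx_equal(a[i], b[i])) {
            return false;
        }
    }
    return true;
}
```

With IEEE 754 doubles, `0.1 + 0.2 == 0.3` is false while `approx_equal(0.1 + 0.2, 0.3)` is true, which is the situation rule 1 is meant to handle. A comparison involving NaN returns false in `approx_equal`, so a corrupted field fails the check rather than passing silently.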
CINERI Presentation
CINERI is presented live to the whole Gray Scott School 2026 in a special session on June 25, broadcast on the official live stream.
Day 2 — C++ on CPU
Two sessions on June 23: C++17/20/23 on CPU in the morning, advanced optimization (blocking & Pyramid) in the afternoon. Goals: measure, understand the stencil, exploit the cache.